means any character except newline characters
(such as carriage return and line-feed, to mention the most common newline characters) one
or more times when using the Unix or the Perl compatible regular expression engine.
So there is a problem with overlapping character set definitions, as the character " is also matched by any character except newline characters. The second issue with this expression is that it begins with .+ and not with a fixed string to find or an anchor like ^ for the beginning of a line (= top of the file or after newline characters). For that reason the regular expression search function can start matching any character except newline characters nearly everywhere in the character stream of a file. That is not good.
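Both issues can be reproduced in a small sketch. It assumes Python's re module as a stand-in for a Perl-style engine (it is not the engine of the editor itself), and the sample line is made up for illustration:

```python
import re

# Assumption: Python's re module stands in for a Perl-style engine.
sample = 'ab"cd"'  # hypothetical sample line

# The dot also matches a double quote, so the two character sets overlap.
dot_matches_quote = re.fullmatch(r'.', '"') is not None

# Without an anchor, a match attempt can succeed from many start offsets.
unanchored = re.compile(r'.+"')
start_offsets = [i for i in range(len(sample)) if unanchored.match(sample, i)]

print(dot_matches_quote)  # True
print(start_offsets)      # [0, 1, 2, 3, 4]
```

Only the last offset fails, because there is no further double quote after the one at the end of the sample.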
The Perl compatible regular expression engine has the concept of greedy and non-greedy expressions. .+" is a greedy expression. With it the Perl compatible regular expression engine starts matching characters at the top of the file or the beginning of a line and first matches everything up to the end of the line or the end of the file, and then steps back to the last " in the line. But there is one more .+" and so it continues matching all characters up to the end of the line / file or the last ", and this time fails to find one more ". So the Perl compatible engine "thinks": oops, I was already too greedy on the first part of the expression, so go back and this time match characters only up to the last but one " for the first part, and then try to still get a positive match for the second part of the expression. This going back is then done a third time on a line with at least three " and at least three other characters except newline characters (which may also be "!), to get a positive match from the beginning of the line to the last occurrence of ", which is the longest possible positive match.
The opposite with a Perl compatible regular expression is .+?" which is a non-greedy expression. Now the Perl regular expression engine tries to find the shortest possible positive match. So with the Perl compatible regular expression engine the following expression could be used: ^.+?".+?".+?"
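The difference between the greedy and the non-greedy form can be seen in a minimal sketch. It assumes Python's re module, which backtracks like a Perl-style engine, and a made-up sample line with four double quotes:

```python
import re

line = 'a"b"c"d"e'  # hypothetical sample line with four double quotes

# Greedy: backtracks just far enough, so the match ends at the LAST quote.
greedy_text = re.search(r'^.+".+".+"', line).group()

# Non-greedy: extends only as far as needed, so it ends at the THIRD quote.
lazy_text = re.search(r'^.+?".+?".+?"', line).group()

print(greedy_text)  # a"b"c"d"
print(lazy_text)    # a"b"c"
```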
The Unix regular expression engine is not as powerful as the Perl compatible regular expression engine. It does not support the concept of greedy and non-greedy expressions, and it does not go forward and backward in the character stream to somehow get a positive match. The Unix regular expression engine also starts the search at the beginning of a line / file and matches everything up to the end of the line / file (negative result) or the last " (positive match for the first part). In the second case it continues matching according to the second .+" and reaches the end of the line / file without a positive match for ". The Unix regular expression engine does not go back to the initial search position in the character stream and try to match fewer characters with the first expression part to get a positive match for the second part as well. And of course it does not do that a third time to finally get a positive match for the entire expression when the first .+ already matches all characters which should also be matched by the rest of the expression.
The problems caused by overlapping character sets and greediness can be avoided by using the following search expression with the Unix or the Perl compatible regular expression engine: ^[^\r\n"]+"[^\r\n"]+"[^\r\n"]+"
The negative character set definition [^\r\n"]+ means find one or more characters NOT being a carriage return, a line-feed or a double quote. So this expression definitely does not match a double quote character, and for that reason it always stops at the next occurrence of ", a carriage return (negative match as the next character is not a double quote, but a carriage return) or a line-feed (negative match as the next character is not a double quote, but a line-feed). For that reason this expression also returns no positive match for a line containing, for example, just """""" as there must be characters other than a double quote between the double quote characters.
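The behavior of the negative character set can be checked with a short sketch, again assuming Python's re module as a stand-in engine and made-up sample lines:

```python
import re

# Same expression as above; no greediness problem because " is excluded.
up_to_third_quote = re.compile(r'^[^\r\n"]+"[^\r\n"]+"[^\r\n"]+"')

# Stops at the third double quote even though the line has four.
matched = up_to_third_quote.search('a"b"c"d"e').group()

# A line of only double quotes has nothing between them: no match.
only_quotes = up_to_third_quote.search('""""""')

print(matched)      # a"b"c"
print(only_quotes)  # None
```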
But the expression using the negative character classes also matches everything from the beginning of a line to the third occurrence of a double quote character, with at least one character other than a double quote in between, on lines having even more double quotes. So if you want to find lines with exactly three double quotes and at least one other character between them, the following search expression must be used: ^[^\r\n"]+"[^\r\n"]+"[^\r\n"]+"[^\r\n"]+$
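As a rough illustration of the "exactly three" expression, the sketch below runs it over a small made-up text. It assumes Python's re module with re.MULTILINE so that ^ and $ work per line, and LF-only line endings, since Python's $ does not match before a carriage return:

```python
import re

exactly_three = re.compile(
    r'^[^\r\n"]+"[^\r\n"]+"[^\r\n"]+"[^\r\n"]+$', re.MULTILINE
)

# Hypothetical sample text with LF line endings.
text = (
    'a"b"c"d\n'         # exactly three quotes: matched
    'x"y"z"w"v\n'       # four quotes: not matched
    'no quotes here\n'  # no quotes: not matched
    '"q"r"s\n'          # starts with a quote: not matched
)

found = exactly_three.findall(text)
print(found)  # ['a"b"c"d']
```

The last sample line is skipped because the expression also requires at least one non-quote character before the first double quote and after the third one.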
Please note that the Unix regular expression engine does not interpret end of file as end of line for the anchor $. So if the file ends with a "line" with exactly three double quotes with other characters between them but without a line termination, this "line" is not matched by the Unix regular expression engine. I call that a "string at end of file" and not a "line" as it has no line ending/termination. The Perl compatible regular expression engine interprets end of file (= end of character stream) also as end of line, with a positive result for $.
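The Perl-style handling of $ at end of file can be sketched as follows. This assumes Python's re module, which treats the end of the string as a valid position for $; the different behavior of the Unix engine cannot be reproduced with it:

```python
import re

exactly_three = re.compile(
    r'^[^\r\n"]+"[^\r\n"]+"[^\r\n"]+"[^\r\n"]+$', re.MULTILINE
)

# Hypothetical text whose last "line" has no line termination.
text = 'first line\na"b"c"d'

last = exactly_three.search(text)
last_text = last.group() if last else None

print(last_text)  # a"b"c"d
```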
I also recommend looking at How to find lines in CSV file with less or more than X tabs within a line?
The tab character used in the expressions there can be replaced by " for a double quote.