Regex for looking for three quotes - " - in one line

mattgreer · Aug 05, 2021 · #1 · 2021-08-05T15:30+00:00

I am having trouble setting this up. I thought this would work: .+".+".+" which is basically any character(s) followed by a quote, followed by any character(s), and so on, but it doesn't.

I would appreciate your help. Thanks!

I am using UltraEdit for Windows 14.00

Mofi · Aug 05, 2021 · #2 · 2021-08-05T18:22+00:00

.+ means any character except newline characters (carriage return and line-feed being the most common ones) one or more times when using the Unix or the Perl compatible regular expression engine.

So there is a problem with overlapping character set definitions, as the character " is also matched by "any character except newline characters". The second issue with this expression is that it begins with .+ and not with a fixed string to find or an anchor like ^ for beginning of line (= top of the file or after newline characters). For that reason the regular expression search function can start matching nearly anywhere in the character stream of a file. That is not good.

The Perl compatible regular expression engine has the concept of greedy and non-greedy expressions.

.+" is a greedy expression. The Perl compatible regular expression engine starts matching characters at top of file or beginning of a line and first matches everything up to the end of the line (or end of file), then gives characters back until the first part ends at the last " in the line. But there is one more .+" and so it continues matching all characters up to end of line / file and this time fails to find one more ". So the Perl compatible engine "thinks": oops, I was too greedy on the first part of the expression, so go back and this time match only up to the last but one " for the first part, and then try to get a positive match for the second part of the expression. The same backing off is done again for the third part. On a line with at least three " and at least one other character (which can also be a "!) before each of them, the result is the longest possible positive match, from the beginning of the line to the last occurrence of ".
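The greedy behavior described above can be tried out with a small sketch. It uses Python's re module as a stand-in for a Perl compatible engine, which is an assumption: UltraEdit uses the Boost library and may differ in details. The sample line is made up, and the capturing groups were added only to show what each .+ ends up matching.

```python
import re

# Hypothetical sample line with four double quotes.
line = 'a "b" c "d" e'

# Same expression as in the question, with a group around each .+
greedy = re.match(r'(.+)"(.+)"(.+)"', line)

whole_match = greedy.group(0)  # text covered by the whole expression
parts = greedy.groups()        # text matched by each .+

print(whole_match)  # a "b" c "d"
print(parts)        # ('a "b', ' c ', 'd')
```

The match runs up to the last " of the line, and the first .+ has swallowed a " itself.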

The opposite with the Perl compatible regular expression engine is .+?" which is a non-greedy expression. Now the engine tries to find the shortest possible positive match. So with the Perl compatible regular expression engine the following search expression could be used: ^.+?".+?".+?"
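For comparison, here is a minimal sketch of greedy versus non-greedy matching on the same made-up line. Python's re module is used as an assumed stand-in for the Perl compatible engine; it is not the Boost engine of UltraEdit.

```python
import re

# Hypothetical sample line with four double quotes.
line = 'a "b" c "d" e'

# Greedy: the longest possible match, up to the last " in the line.
greedy = re.search(r'^.+".+".+"', line).group(0)

# Non-greedy: the shortest possible match, up to the third " in the line.
lazy = re.search(r'^.+?".+?".+?"', line).group(0)

print(greedy)  # a "b" c "d"
print(lazy)    # a "b" c "
```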

The Unix regular expression engine is not as powerful as the Perl compatible regular expression engine. It does not support the concept of greedy and non-greedy expressions, and it does not go forward and backward in the character stream to somehow get a positive match. The Unix regular expression engine also starts the search at the beginning of a line / file and matches everything up to end of line / file (negative result) or the last " (positive match for the first part). In the second case it continues matching according to the second .+" and reaches end of line / file without a positive match for ". It does not go back to the initial search position and try to match fewer characters with the first part of the expression to get a positive match for the second part as well, and of course it does not do that a third time. So there is never a positive match for the entire expression, because the first .+ already matches all characters which should be matched by the rest of the expression.

The problems caused by overlapping character sets and greediness can be avoided by using the following search expression with the Unix or Perl compatible regular expression engine: ^[^\r\n"]+"[^\r\n"]+"[^\r\n"]+"
The negative character set definition [^\r\n"]+ means: find one or more characters that are NOT a carriage return, a line-feed or a double quote. So this part definitely never matches a double quote character and therefore always stops at the next occurrence of ", of a carriage return or of a line-feed. In the last two cases the result is negative, as the next character is not a double quote. For the same reason the expression also returns no positive match for a line containing for example just """""", as there must be at least one character other than a double quote before each double quote.
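A short sketch of the negative character set expression, again with Python's re module as an assumed stand-in (re.MULTILINE makes ^ match at the beginning of every line). The sample text is invented for illustration.

```python
import re

# The expression from the post, unchanged.
PATTERN = re.compile(r'^[^\r\n"]+"[^\r\n"]+"[^\r\n"]+"', re.MULTILINE)

# Hypothetical sample text: four quotes, only quotes, no quotes, three quotes.
text = 'a "b" c "d" e\n""""""\nno quotes here\nx"y"z"\n'

matches = [m.group(0) for m in PATTERN.finditer(text)]

print(matches)  # ['a "b" c "', 'x"y"z"']
```

The line consisting only of double quotes and the line without any double quote are not matched, and on the first line the match stops at the third ".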

But the expression using the negative character classes also matches everything from the beginning of a line to the third occurrence of a double quote character on lines having even more double quotes. So if you want to find lines with exactly three double quotes, the following search expression must be used: ^[^\r\n"]+"[^\r\n"]+"[^\r\n"]+"[^\r\n"]+$ Note that it requires at least one other character before the first double quote, between the double quotes and after the last double quote.
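The "exactly three double quotes" expression can be checked the same way. This is only a sketch with Python's re module as an assumed stand-in; the helper name and the sample lines are hypothetical.

```python
import re

# The expression from the post, unchanged.
EXACTLY_THREE = re.compile(r'^[^\r\n"]+"[^\r\n"]+"[^\r\n"]+"[^\r\n"]+$')

def matches_exactly_three(line: str) -> bool:
    """Return True if the whole line is matched by the expression."""
    return EXACTLY_THREE.search(line) is not None

samples = [
    'a "b" c "d" e',  # four double quotes
    'a "b" c " d',    # three double quotes, text after the last one
    'a "b" c "',      # three double quotes, nothing after the last one
]
results = [matches_exactly_three(s) for s in samples]

print(results)  # [False, True, False]
```

The last sample shows that the final [^\r\n"]+ needs at least one character after the third double quote.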

Please note that the Unix regular expression engine does not interpret end of file as end of line for the anchor $. So if the file ends with a "line" with exactly three double quotes with other characters between them but without a line termination, this "line" is not matched by the Unix regular expression engine. I call that a "string at end of file" and not a "line", as it has no line ending/termination. The Perl compatible regular expression engine interprets end of file (= end of character stream) also as end of line, with a positive result for $.

I also recommend looking at the topic "How to find lines in CSV file with less or more than X tabs within a line?", whereby \t can be replaced by " for a double quote.

mattgreer · Aug 12, 2021 · #3 · 2021-08-12T20:35+00:00

I'm definitely struggling with this; very sorry for being so obtuse.

Using the information above, I'm now working with a new comma delimited file that I want to modify as follows:

First group: The beginning of the line until the second comma.
Second group: The next character after that comma whether it's a space or otherwise to the end of the line.
Surround the second group with double-quote marks. So what I'm trying to do is force Excel to see three columns of information, the info until the first comma, info until the second comma, and everything else as the third column.

I just realized as I write this that a simple recorded macro to find a comma, then another comma, then enter a quote, go to the end of the line and enter another quote, would do the trick. I'd still like to try to do this with Regular Expressions as it would be another good opportunity to learn. I'm using the Perl engine.

I tried this:

Find: ^[^\r\n,]+,[^\r\n,]+,[.+]

But that wouldn't work for me.

Thank you for your continued help!

fleggy · Aug 12, 2021 · #4 · 2021-08-12T21:12+00:00

Hi,

you have to distinguish between [] and ().
[] means a character set
() usually means a capturing group
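To illustrate the difference, here is a tiny sketch with Python's re module (an assumed stand-in for the Perl engine). Inside [] the characters . and + are just literal characters, so [.+] matches exactly one character which is either a dot or a plus sign.

```python
import re

char_set = re.compile(r'[.+]')  # one character: a literal '.' or '+'
group = re.compile(r'(.+)')     # capturing group: one or more of any character

set_matches_plus = char_set.fullmatch('+') is not None
set_matches_text = char_set.fullmatch('third field') is not None
captured = group.fullmatch('third field').group(1)

print(set_matches_plus)  # True
print(set_matches_text)  # False
print(captured)          # third field
```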

So your Find regexp should look like:
^[^\r\n,]+,[^\r\n,]+,\K.+
And the replace expression:
"$&"

Your task does not require any capturing group. In the above find regexp the symbol \K means: show the match (select the matched text) only from here, that is from the position right after the second comma, to the EOL.
$& means the matched (selected) text which you want to put between quotes.
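For readers who want to try the idea outside UltraEdit: Python's built-in re module does not support \K, so the sketch below keeps the part before the third field in a capturing group and writes it back unchanged. That is a swapped-in technique, not the \K expression itself, and the function name and sample text are hypothetical.

```python
import re

# Group 1: everything up to and including the second comma (kept as-is).
# Group 2: the rest of the line (gets surrounded by double quotes).
PATTERN = re.compile(r'^([^\r\n,]+,[^\r\n,]+,)(.+)$', re.MULTILINE)

def quote_third_field(text: str) -> str:
    """Surround everything after the second comma of each line with quotes."""
    return PATTERN.sub(r'\1"\2"', text)

sample = 'one,two,three, four and five\nx,y,z\n'
quoted = quote_third_field(sample)

print(quoted)
```

With the sample above, the first line becomes one,two,"three, four and five" and the second becomes x,y,"z".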

BR, Fleggy

mattgreer · Aug 12, 2021 · #5 · 2021-08-12T21:43+00:00

OK, I was about to post this, that I'm getting closer:
^([^\r\n\,]+)(,)([^\r\n\,]+)(,)(.+)(\r\n)
\1\2\3\4"\5"\6

But once it "fixes" that line, the search engine goes backwards, which of course is undesirable.

So your post, fleggy, thank you for your help! I just discovered, somewhat by accident, that I wasn't looking at the right help section in UltraEdit, so now that I have that taken care of, I'm not seeing anything regarding \K nor using the &. Where would I find info on that?

When I tried your solution it informed me that the search expression wasn't found. I'm running UltraEdit 14.00+2, by the way. Perhaps that's the issue?

-matt

Mofi · Aug 13, 2021 · #6 · 2021-08-13T06:24+00:00

Examples make everything easier to understand. So let's say the active CSV file contains the following lines:

field value 1,field value 2,long third field value
field value 1,,long third field value
,field value 2,long third field value
,,long third field value
field value 1,field value 2,"long third field value"
field value 1,,"long third field value"
,field value 2,"long third field value"
,,"long third field value"

A Unix or Perl regular expression replace with search expression ^([^\r\n,]*,[^\r\n,]*,)([^\r\n"].*)$ and replace expression \1"\2" modifies the lines in the CSV file to:

field value 1,field value 2,"long third field value"
field value 1,,"long third field value"
,field value 2,"long third field value"
,,"long third field value"
field value 1,field value 2,"long third field value"
field value 1,,"long third field value"
,field value 2,"long third field value"
,,"long third field value"

So the first four lines now have " around the third field value, while nothing changed on the last four lines, as the third field value already starts with a double quote. That Unix and Perl regular expression replace was tested by me with UltraEdit for Windows v14.00b+1.
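The same replace can be reproduced outside UltraEdit with a short sketch. Python's re module is used here as an assumed stand-in for the Unix/Perl engines of UltraEdit; the search and replace expressions are the ones from this post and the input lines are the eight example lines.

```python
import re

# Search expression from the post; re.MULTILINE makes ^ and $ work per line.
PATTERN = re.compile(r'^([^\r\n,]*,[^\r\n,]*,)([^\r\n"].*)$', re.MULTILINE)

lines = [
    'field value 1,field value 2,long third field value',
    'field value 1,,long third field value',
    ',field value 2,long third field value',
    ',,long third field value',
    'field value 1,field value 2,"long third field value"',
    'field value 1,,"long third field value"',
    ',field value 2,"long third field value"',
    ',,"long third field value"',
]

# Replace expression \1"\2" from the post.
result = PATTERN.sub(r'\1"\2"', '\n'.join(lines)).split('\n')

for line in result:
    print(line)
```

After the replace the first four lines are identical to the last four lines.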

What Fleggy mentioned is a Perl (not Unix) regular expression replace with something like ^(?:[^\r\n,]*,){2}\K(?!")(.+)$ as search expression and "\1" as replace expression. It uses a non-capturing group with (?:...) and the multiplier {2} to specify that the expression in the non-capturing group must be applied exactly two times for a positive match, \K to keep back (deselect) everything matched up to that point, and a negative look-ahead with (?!...) before matching the rest of the line with .+ in a capturing group. But this Perl regular expression replace cannot be done with UltraEdit for Windows v14.00+2, released in February 2008, which uses the Boost Perl regular expression library of that time. Many enhancements have been made in the last 13 years to the Boost Perl regular expression library and to UltraEdit to support more complex Perl regular expression finds/replaces.
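A rough equivalent of that expression can be tried with Python's re module, with one assumption stated loudly: Python's built-in re does not support \K, so in this sketch the two leading fields are captured in group 1 and written back by the replace expression instead. The non-capturing group, the multiplier {2} and the negative look-ahead are used as described; the function name and sample lines are made up.

```python
import re

# (?:...){2}  two comma-terminated fields, wrapped in group 1 instead of \K
# (?!")       negative look-ahead: third field must not start with a quote
# (.+)        rest of the line, captured as group 2
PATTERN = re.compile(r'^((?:[^\r\n,]*,){2})(?!")(.+)$', re.MULTILINE)

def quote_rest(text: str) -> str:
    """Quote everything after the second comma unless it is quoted already."""
    return PATTERN.sub(r'\1"\2"', text)

sample = 'a,b,c d\n,,"x"\na,b,c,d'
converted = quote_rest(sample)

print(converted)
```

With this sample the first line becomes a,b,"c d", the second line stays unchanged and the third line becomes a,b,"c,d".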

See also the announcement topic "Readme for the Find/Replace/Regular Expressions forum". But when following the links in that topic to pages or entire websites explaining the Perl regular expression capabilities, please take into account that you are using a thirteen-year-old version of UltraEdit and of the Boost Perl regular expression library. So many explained expressions will not work with UltraEdit v14.00+2.

mattgreer · Aug 13, 2021 · #7 · 2021-08-13T15:18+00:00

Mofi wrote: "So many explained expressions will not work with UltraEdit v14.00+2."

Yeah, I was getting that impression. I appreciate you guys trying to help me despite my old version of software.

Being a non-programmer (professionally) I just can't justify updating the software when it still works quite well.

Thanks again!

-Matt

Aug 19, 2021 · #8 · 2021-08-19T20:39+00:00

Hey Mofi,

As I've plugged away at this task over the past week, I've realized how much incredibly useful information is packed into this thread. I cannot thank you enough! This has been a lot of fun as well; I'm learning a lot. Figuring out negative lookbacks over the weekend helped, and most recently what you posted about [^\r\n"]+ was the key to solving the last little puzzle. And, somewhat humorously, I'm doing this all with my super old version of UE. For a couple days I switched to Notepad++ but, honestly, the macro capabilities of UE are far beyond what's available with other offerings. UE is simply the best. But I digress.

Thank you thank you thank you.

-matt