So, you think there's a bug in the Perl RegEx engine of UE 12.00 (no hotfix)? Well, read this post I just wrote elsewhere and see if it helps your understanding. What you may think is a bug might just be because UE isn't quite Perl syntax compliant and you're trying to use it with Perl syntax.
It came about with a user trying to delete blank lines. Here are my comments:
Captain wrote:
With UltraEdit regular expressions I could delete blank lines with the following:
Replace: "^p$" (without the quotes)
With "" (without the quotes - i.e. nothing).
Now using Perl regular expressions I expected the following to work:
Replace: "^\s*$" (without the quotes)
With "" (without the quotes - i.e. nothing).
But it does not. Any suggestions?
You know that ^ and $ match the start and the end of the input, and I THINK that UE therefore splits the document into an array of lines and feeds each one into the matcher as a separate input with its own start and end.
..hence you CANNOT match across multiple lines with ^ and $
Instead you must make UEdit shove the whole file into the matcher by not using ^ $. Just get rid of them.
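To picture why a per-line matcher cannot see across line breaks, here is a minimal Python sketch of that guess. Python's re module only stands in for UE's engine, and the names doc and spans_two_breaks are made up for the illustration; none of this is UE's actual code.

```python
import re

# Hypothetical three-line document; the middle line holds only spaces.
doc = "first\n   \nsecond\n"

# A pattern that needs to see two line breaks to succeed.
spans_two_breaks = re.compile(r"[\r\n]+\s*[\r\n]+")

# Whole document as one input: the blank line is found.
whole_hit = spans_two_breaks.search(doc) is not None

# Assumed per-line feeding, with the line endings stripped off.
stripped_hits = [spans_two_breaks.search(line) is not None
                 for line in doc.splitlines()]

# Assumed per-line feeding, with each line ending kept on its line.
kept_hits = [spans_two_breaks.search(line) is not None
             for line in doc.splitlines(keepends=True)]

print(whole_hit)      # True
print(stripped_hits)  # [False, False, False]
print(kept_hits)      # [False, False, False]
```

Either way of splitting leaves at most one line break per input, so a pattern that needs two can only succeed when the whole document is the input.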
At this point I must mention another little-understood fact, and it's a VERY IMPORTANT thing to know about UE: as far as I can tell, it runs ALL regexes in PESSIMISTIC mode (what most regex documentation calls lazy or non-greedy) by default. This is NOT usual. GREEDY is the normal default, but people don't understand greedy if they are used to MS-DOS * wildcard matching.
FIND foo.*bar IN foothingbarfoothingbar REPLACE WITH foobaz
with greedy:
-> outputs foobaz
with pessimistic:
-> outputs foobazfoobaz
Pessimistic reads the input like a human, one character at a time, and stops as soon as the match succeeds. It reads foo, then starts reading anything while looking for bar, so it stops at the first bar and finds foothingbar.
Greedy, on the other hand, reads the whole line, then spits characters back until it matches. Hence the first foo and the last bar are matched, and the .* matches the thingbarfoothing in between. This is counterintuitive to most humans, and I think that is why UE doesn't operate in this mode. Remember it though!
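The same find/replace can be reproduced in Python as a rough sketch. Python's re module is only a stand-in here (it is not UE's engine): plain .* is greedy in it, and .*? is the pessimistic form.

```python
import re

text = "foothingbarfoothingbar"

# Greedy: .* runs to the end of the input, then backs off to the LAST bar.
greedy = re.sub(r"foo.*bar", "foobaz", text)

# Pessimistic (lazy): .*? grows one character at a time and stops at the FIRST bar.
lazy = re.sub(r"foo.*?bar", "foobaz", text)

print(greedy)  # foobaz
print(lazy)    # foobazfoobaz
```

One greedy match covers the whole input, while the lazy pattern matches twice, which gives the two outputs listed above.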
The pattern [\r\n]+ will match a new line in the whole input (considering all lines as a stream).
Why is it not [\r\n]* ?
Because that means 0 or more newlines, so a position with no newline at all also counts as a match: the "nothing character" matches. What's a nothing character? It's the zero-width gap in between two real characters
ABCDEF <-there are 5 nothings in here between A and B, B and C ...
Try it.. search for [\r\n]* and see the cursor jump between all the letters. The contrast with GREEDY shows up where newlines do exist: a greedy matcher swallows the longest run of \r\n it can and only spits characters back if the rest of the pattern fails, whereas a pessimistic matcher settles for zero every time.
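A small Python sketch makes the "nothing characters" visible. Python's re is an assumption here, not UE's engine; for what it is worth, its default greedy * also reports the empty matches between letters, so treat the cursor jumping as a property of * allowing zero repetitions.

```python
import re

text = "ABCDEF"

# * allows zero repetitions, so every gap (plus the start and the end) matches.
star_spans = [m.span() for m in re.finditer(r"[\r\n]*", text)]

# + demands at least one real newline, and this text has none.
plus_hits = re.findall(r"[\r\n]+", text)

print(star_spans)  # [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)]
print(plus_hits)   # []
```

Five of those seven empty matches are the gaps between the letters; the other two sit before A and after F.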
If you don't understand this, here's an analogy:
You and I are standing in a room and I say: "Shout the word BANG when I say the word zero, one or two, okay? Each time you say BANG I'll restart counting."
In greedy, I count down from 10, 9, 8, 7...
10 9 8 7 6 5 4 3 2 BANG
10 9 8 7 6 5 4 3 2 BANG
10 9 8 7 6 5 4 3 2 BANG
I get to 2 and you say BANG.. for a 10 character input, you matched when I got to 2. That's greedy.. we match 2 characters which is a long input (longer than 0 anyways).
Now go pessimistic:
0 BANG
0 BANG
0 BANG
You're matching instantly, and that's why [\r\n]* in pessimistic mode (UE default) matches the nothing character in between words: it is valid to say that the nothing character satisfies "zero or more occurrences of a newline", i.e. it is not a newline (but not a word character either).
Now you must define what a blank line is. What if there are 10 whitespace characters on that line? I'll assume there are.
So what's our newline matcher? [\r\n]+ (one or more of). Alternatively we could force the matcher into greedy mode, but I have no idea how: in Perl, [abc]* is greedy, [abc]*? is pessimistic and [abc]*+ is possessive (takes as much as it can and never gives anything back; rarely used). Since * does not behave greedily in UE, I have no idea what Mr Mead did to the engine, because that is broken Perl syntax.
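For reference, the three Perl-style quantifier flavours can be compared in Python. This is a sketch under assumptions: Python's re is not UE's engine, and it only understands the possessive *+ form from version 3.11 on, so that part is wrapped in a try/except.

```python
import re

text = "abcabcx"

# Greedy: takes every a/b/c it can reach.
greedy = re.match(r"[abc]*", text).group()

# Pessimistic (lazy): nothing follows the quantifier, so zero characters suffice.
lazy = re.match(r"[abc]*?", text).group()

# Greedy gives a character back so the trailing c can still match.
greedy_then_c = re.match(r"[abc]*c", "abc").group()

# Possessive never gives anything back, so the trailing c has nothing left to match.
try:
    possessive_then_c = re.match(r"[abc]*+c", "abc")  # None on Python 3.11+
except re.error:
    possessive_then_c = "unsupported"  # older Python rejects the *+ syntax

print(repr(greedy))         # 'abcabc'
print(repr(lazy))           # ''
print(repr(greedy_then_c))  # 'abc'
print(possessive_then_c)
```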
So anyways, we have [\r\n]+ for a newline; now we want to match any number of whitespace characters: \s*
Remember that in pessimistic mode this match only runs as far as it must to succeed, so if you don't put anything else onto the end of the expression you'll get 0 spaces matched!
Why?
<NEWLINE> <NEWLINE>
The first newline will be matched and 0 or more whitespace.. so the pessimistic starts from 0 and finds a match! Yes! 0 occurrences of space is an ok match! So it just matches the first newline it finds.
So now we put another newline in our regex:
[\r\n]+\s*[\r\n]+
Now it will keep going, matching up to 10 spaces, until it finds a whole newline.
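The effect of the trailing newline matcher can be sketched in Python. The explicit ? suffixes force Python's re into the pessimistic behaviour described here; that substitution is an assumption made for the demo, not UE syntax, and the sample text is made up.

```python
import re

# Hypothetical input: a line break, three spaces, another line break.
text = "now:\n   \nand the text"

# Nothing after the lazy \s*?, so it is satisfied with zero spaces.
without_tail = re.search(r"[\r\n]+?\s*?", text).group()

# A trailing newline matcher forces \s*? to grow until a newline follows.
with_tail = re.search(r"[\r\n]+?\s*?[\r\n]+?", text).group()

print(repr(without_tail))  # '\n'
print(repr(with_tail))     # '\n   \n'
```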
Now you can see for yourself with this file:
"this is an
example input
text with a
blank line
now:
          
and the text
continues"
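The newline after the colon, any spaces on that blank line, and the newline on the end of it will be found.
Now just replace with one newline.. ^p in UEdit syntax.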
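As a rough cross-check outside UE, the same replacement can be run in Python. Assumptions: Python's re stands in for UE's engine, "\n" plays the role of ^p, and the sample uses LF line endings with ten spaces on the blank line.

```python
import re

# The example file, rebuilt with ten spaces on the blank line.
sample = "\n".join([
    '"this is an',
    "example input",
    "text with a",
    "blank line",
    "now:",
    " " * 10,
    "and the text",
    'continues"',
])

# Newline, any whitespace, newline: collapse the whole run to one newline.
cleaned = re.sub(r"[\r\n]+\s*[\r\n]+", "\n", sample)

print(cleaned)
```

In this Python stand-in, a file with CRLF line endings would also match at every plain \r\n pair, because the \r and the \n can each satisfy one [\r\n]+. That is harmless only when the replacement is the same line ending.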
OK, tutorial over. I hope this kills many questions in this thread, and remember that UE works in pessimistic mode, and that's NOT the Perl default!
--
OK, so to summarise this, I'm making assumptions about the way UE is working because I can't see the code for the app, but here's a summary of my guesses:
1. Using ^$ to match start and end of input causes each line to become input, rather than the whole document. I.e. the doc is split into lines, then each line is matched. These metacharacters represent the "nothing character" before the start of the line and after the end of the line. I'm not sure if UEdit adds the CRLF back onto each line after using it to split the document into lines but before feeding the line into the matcher.
2. UEdit operates in pessimistic mode by default. Normally in Perl you would say foo.*?bar to match fooMatchedInputbar in this string: sampleTextfooMatchedInputbarSampleText but in UE it's sufficient to say foo.*bar
This is NOT Perl syntax! I don't know how to get the matcher out of this mode (maybe in UE Perl .*? means greedy and .* means pessimistic, I don't know).
So, try to work around the second point. Working in pessimistic mode when you're used to greedy can cause WEIRD stuff to happen, as I've discussed in this article. If you're aware of it, you might be able to avoid it!