Bugs in Perl syntax? Maybe Not! Read This..

cjard · Apr 11, 2006 · #1 · 2006-04-11T13:43+00:00

So, you think there's a bug in the Perl RegEx engine of UE 12.00 (no hotfix)? Well, read this post I just wrote elsewhere and see if it helps your understanding. What you may think is a bug might just be because UE isn't quite Perl syntax compliant and you're trying to use it with Perl syntax.

It came about with a user trying to delete blank lines. Here are my comments:

Captain wrote: With UltraEdit regular expressions I could delete blank lines with the following:

Replace: "^p$" (without the quotes)
With "" (without the quotes - i.e. nothing).

Now using Perl regular expressions I expected the following to work:

Replace: "^\s*$" (without the quotes)
With "" (without the quotes - i.e. nothing).

But it does not. Any suggestions?

You know that ^ and $ match the start and the end of the input, and I THINK that UE hence splits the document into an array of lines and feeds them each into the matcher as lines of input with a start and end.

..hence you CANNOT match multiple lines with ^ and $

Instead you must make UEdit shove the whole file into the matcher by not using ^ $. Just get rid of them.

At this point I must mention another little-understood fact, and it's a VERY IMPORTANT thing you need to know about UE: it operates ALL regex in PESSIMISTIC mode by default. This is NOT usual. GREEDY is the default mode, but people don't understand greedy if they are used to MS-DOS * wildcard matching.

FIND foo.*bar IN foothingbarfoothingbar REPLACE WITH foobaz

with greedy:
-> outputs foobaz

with pessimistic:
-> outputs foobazfoobaz

Pessimistic reads the input like a human, one character at a time, and it keeps going while the match is true. It reads foo, then starts reading anything and looking for bar, so it finds FOOTHINGBAR

Greedy, on the other hand, reads the whole line, then gives characters back until it matches. Hence the first foo and the last bar are matched, and the .* matches the thingbarfoothing in between. This is counterintuitive to most humans, and I think that's why UE doesn't operate in this mode. Remember it though!
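For anyone who wants to check the two behaviours outside UE, here is a minimal sketch using Python's `re` module, whose `*` and `*?` follow the usual Perl-style greedy and lazy semantics. It says nothing about what UE's engine actually does; it only reproduces the foo/bar example above.

```python
import re

text = "foothingbarfoothingbar"

# Greedy: .* grabs as much as it can, then backtracks to the LAST "bar".
greedy = re.sub(r"foo.*bar", "foobaz", text)

# Lazy (what this post calls "pessimistic"): .*? stops at the FIRST "bar",
# so the pattern matches twice.
lazy = re.sub(r"foo.*?bar", "foobaz", text)

print(greedy)  # foobaz
print(lazy)    # foobazfoobaz
```

Running the same find and replace in UE and comparing against these two outputs is a quick way to see which mode the editor is really in.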

The pattern [\r\n]+ will match a new line in the whole input (considering all lines as a stream).
Why is it not [\r\n]* ?
Because that means 0 or more newlines. So everything that is not a newline character and is also a nothing character will match. What's a nothing character? It's the character in between two real characters
ABCDEF <-there are 5 nothings in here between A and B, B and C ...

Try it.. search for [\r\n]* and see the cursor jump between all the letters. This wouldn't happen in GREEDY because the whole input would be swallowed, then spat back until a long sequence of \r\n was found, then matching would continue from there.
If you don't understand this, here's an analogy:

You and me are standing in a room and I say "shout the word BANG when I say the word zero, one or two okay? Each time you say bang I'll restart counting"
In greedy, I count down from 10, 9, 8, 7...
10 9 8 7 6 5 4 3 2 BANG
10 9 8 7 6 5 4 3 2 BANG
10 9 8 7 6 5 4 3 2 BANG

I get to 2 and you say BANG.. for a 10 character input, you matched when I got to 2. That's greedy.. we match 2 characters which is a long input (longer than 0 anyways).

Now go pessimistic:
0 BANG
0 BANG
0 BANG

You're matching instantly, and that's why [\r\n]* in pessimistic mode (UE default) matches the nothing character in between words - because it is valid to say that the nothing character is indeed true for "zero or more occurrences of a newline".. i.e. not a newline (but not a word character either).

Now you must define what is a blank line? What if there are 10 whitespace on that line? I'll assume there are.

So what's our newline matcher? [\r\n]+ (one or more of). Or we could force the matcher into greedy mode. Thing is, I have no idea how, because with Perl, [abc]* is greedy, [abc]*? is pessimistic and [abc]*+ is possessive (eats whole input and never spits back, rarely used). So if * behaves greedy in UE, I have no idea what Mr Mead did to the engine, because it's broken Perl syntax.

So anyways, we have [\r\n]+ for a newline, now we want to match any number of whitespace \s*
Remember that in pessimistic this match will run until it succeeds but if you don't put anything else onto the end of the expression then you'll get 0 spaces matched!
Why?

<NEWLINE> <NEWLINE>

The first newline will be matched and 0 or more whitespace.. so the pessimistic starts from 0 and finds a match! Yes! 0 occurrences of space is an ok match! So it just matches the first newline it finds.

So now we put another newline in our regex:

[\r\n]+\s*[\r\n]+

Now it will keep going matching up to 10 spaces before it finds a whole newline.

Now you can see for yourself, with this file:


"this is an
example input
text with a
blank line
now:

and the text
continues"

The newline after colon, and any spaces on that blank line, and the newline on the end of it will be found.

Now just replace with one newline.. ^p in uedit syntax.
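The replace described above can be tried outside UE as well. This is only a sketch in Python's `re` module (standard greedy semantics), with a shortened copy of the sample text and three spaces placed on the blank line as an assumption; it is not a statement about how UE's own engine behaves.

```python
import re

# Shortened sample: the "blank" line holds three spaces (an assumption).
text = "blank line\nnow:\n   \nand the text\ncontinues"

# newline(s), optional whitespace, newline(s) -> collapse to a single newline
collapsed = re.sub(r"[\r\n]+\s*[\r\n]+", "\n", text)

print(collapsed)
# blank line
# now:
# and the text
# continues
```

One caveat: the sample uses LF line ends on purpose. On CRLF text the same pattern also matches an ordinary line end, because \r satisfies the first [\r\n]+ and \n the second, so every "\r\n" would be rewritten too.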

OK, tutorial over. I hope this kills many questions in this thread, and remember that UE works in pessimistic mode, and that's NOT the Perl default!

--

OK, so to summarise this, I'm making assumptions about the way UE is working because I can't see the code for the app, but here's a summary of my guesses:

Using ^$ to match start and end of input causes each line to become input, rather than the whole document. I.e. the doc is split into lines then each line is matched. These metacharacters represent the "nothing character" before the start of the line and after the end of the line. I'm not sure if UEdit adds the CRLF back onto the line after it uses it to split the document into lines, but before it feeds into the matcher.

UEdit operates in pessimistic mode by default. Normally in Perl you would say foo.*?bar to match fooMatchedInputbar in this string: sampleTextfooMatchedInputbarSampleText, but in UE it's sufficient to say foo.*bar
This is NOT Perl syntax! I don't know how to get the matcher out of this mode (maybe in UE perl .*? means greedy and .* means pessimistic, I don't know).

So, try to work around the second point there. Working in pessimistic when you're used to greedy can cause WEIRD stuff to happen, as I've discussed in this article. If you're aware of it, you might be able to avoid it!

Apr 11, 2006 · #2 · 2006-04-11T13:49+00:00

You can also get some more info on ^ and $, on matching newlines, and on why any-character matching like . doesn't match newlines, here:

http://www.perl.com/doc/manual/html/pod/perlre.html

mjcarman · Apr 11, 2006 · #3 · 2006-04-11T19:16+00:00

cjard wrote: What you may think is a bug might just be because UE isn't quite Perl syntax compliant and you're trying to use it with Perl syntax.

Can you blame anyone? The option is named Perl compatible Regular Expressions after all.

cjard wrote: I THINK that UE hence splits the document into an array of lines and feeds them each into the matcher.

I think UE treats the file as a single string. If it didn't you couldn't search on "^p^p" using UE syntax. Either way, it's undocumented.

Another problem is that there's no (exposed) support for modifiers. (e.g. /m changes the behavior of '^' and '$')
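To illustrate what the missing modifier changes, here is a hedged sketch in Python, whose `re.MULTILINE` flag plays the role of Perl's /m. The sample text and the blank-line pattern are made up for the demonstration, and none of this shows what UE does internally.

```python
import re

text = "one\n\ntwo\n"

# Without the flag, ^ and $ anchor to the whole string only
# ($ may also match just before a final newline).
no_flag = re.findall(r"^\w+$", text)

# With MULTILINE (Perl's /m), ^ and $ anchor at every line boundary.
with_flag = re.findall(r"^\w+$", text, flags=re.MULTILINE)

# With line-wise anchors, deleting blank lines becomes a one-liner.
no_blanks = re.sub(r"^[ \t]*\n", "", text, flags=re.MULTILINE)

print(no_flag)          # []
print(with_flag)        # ['one', 'two']
print(repr(no_blanks))  # 'one\ntwo\n'
```

The replace consumes the newline explicitly, which is why the blank line disappears instead of merely being emptied.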

cjard wrote: A VERY IMPORTANT thing you need to know about UE is that it operates ALL regex in PESSIMISTIC mode by default. This is NOT usual. GREEDY is the default mode, but people don't understand greedy if they are used to MS-DOS * wildcard matching.

That's terrifying if true, because it would be very incompatible with Perl. I just tried it though, and ".*" is greedy. (I'm running UE v12.00+3)

cjard wrote: [stuff about quantifiers and greediness]

That patterns like "[\r\n]*" always match has nothing to do with whether or not the "*" quantifier is greedy. The match happens because "zero-or-more" of something can happen anywhere. I.e. you can always get a null (zero-times) match. Greediness determines how much of the string is consumed on a non-null match.
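A small sketch of that distinction, again using Python's `re` as a stand-in for Perl-style semantics (the input strings are invented for the demonstration): where no newline is present, the greedy and the lazy quantifier both succeed with an empty match, and they only differ once there is something to consume.

```python
import re

# No newline at the start: both quantifiers succeed with a null match.
greedy_null = re.match(r"[\r\n]*", "ABC").group()
lazy_null = re.match(r"[\r\n]*?", "ABC").group()

# Newlines present: greedy consumes them all, lazy still takes nothing.
greedy_run = re.match(r"[\r\n]*", "\n\nABC").group()
lazy_run = re.match(r"[\r\n]*?", "\n\nABC").group()

print(repr(greedy_null), repr(lazy_null))  # '' ''
print(repr(greedy_run), repr(lazy_run))    # '\n\n' ''
```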

cjard wrote: [abc]* is greedy, [abc]*? is pessimistic and [abc]*+ is possessive (eats whole input and never spits back, rarely used)

"*+" isn't a valid quantifier. If you try to use it you'll get an error: "Nested quantifiers in regex..." To prevent backtracking, use "(?>pattern)"

urchin · Apr 25, 2006 · #4 · 2006-04-25T18:52+00:00

I am using the registered version of UE32 11.10c+1

I am operating on the following text

<0>
<a>
Some stuff
</a>
<a>
Some more stuff
</a>

</0>

I have "Unix-style regular expressions" checked in Advanced->Config-> Find options tab.

I expect to match each of:
Some stuff

then, finding again, match
Some more stuff

using the expression
([^`]+)

However, and I'm probably using these words wrong, because my expression is greedy it matches from start tag to end skipping all the pairs in between, ie--

Some stuff
</a>

<a>
Some more stuff

Any clues on how to modify the expression to not be greedy? I've tried various versions of the stuff posted here but to no avail.

Mofi · Apr 25, 2006 · #5 · 2006-04-25T23:36+00:00

cjard has written about Perl and you are talking about legacy Unix style.

Yes, the [^`]+ in Unix style or the UltraEdit style equivalent [~`]+ is "greedy". I have written several times in the forum that this expression (and only this NOT expression) is "aggressive", because it does not stop at the end of a line and will act as you described. You either add additional characters inside the square brackets like \r, \n or < or you use the method designed for selecting a block: Search for and if found search for and hold the SHIFT key while pressing the Find button.

InsertMode
ColumnModeOff
HexOff
Find ""
StartSelect
Find Select ""

Do not enable the macro property "Continue if a Find with Replace not found" for this macro, so that it stops immediately when the first Find is not found. Alternatively, use IfSel after the first Find and EndIf at the end of the macro.
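The point about the negated character class running past the end of a line can be reproduced in most regex engines. Below is a minimal sketch in Python's `re` module, not in UE's Unix-style engine, with a shortened sample text chosen for the demonstration: [^`]+ matches everything that is not a backtick, newlines included, and adding \r and \n inside the brackets makes it stop at each line end.

```python
import re

text = "<a>\nSome stuff\n</a>\n"

# A negated class also matches newlines, so the whole text is one match.
across_lines = re.findall(r"[^`]+", text)

# Excluding \r and \n inside the brackets confines each match to one line.
per_line = re.findall(r"[^`\r\n]+", text)

print(across_lines)  # ['<a>\nSome stuff\n</a>\n']
print(per_line)      # ['<a>', 'Some stuff', '</a>']
```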