UE's implementation of RegEx - Improper multiline support

Bracket · Apr 08, 2008 · #1 · 2008-04-08T15:20+00:00

I'm not happy with the way I understand RegEx is being implemented in UltraEdit (currently working with 14.00a).

In dealing with a more specific sub-issue, I was told by IDM support that UE handles RegEx on a line-by-line basis, instead of working on the whole document at once. This is unfortunate, because it means that RegEx is going to work incorrectly in many cases - particularly when it involves multi-line parsing.

For example, if you use the following test text (and Perl RegEx):

--------------------------------------

Here we have some...
multiline

text. Here we explore first.

And here we explore last.

this is a green result

green, this result is not

--------------------------------------

The following RegEx does not work. It should match the word "this" if it's at the beginning of a line, but instead, UE says that nothing can be found.

(?<=\r\n)this

The following RegEx also does not work. It should match the word "green" if it's not at the beginning of a line, but instead, UE says that nothing can be found.

(?<!\r\n)green

And if we apply the following RegEx:

(?s)we.*explore

...the way this RegEx is *supposed* to work is that everything from the word "we" on line 1 to the *second* occurrence of the word "explore" on line 7 should be selected. Instead, UE only selects up to the *first* occurrence. Deliberate use of a greedy repetition operator is thwarted by UE's line-by-line parsing.
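
For comparison, here is a minimal sketch of how an engine that sees the whole buffer treats these three expressions. It uses Python's `re` module rather than UE's engine, and the `TEXT` constant is an assumed reconstruction of the sample text with CRLF line breaks, so treat it as an illustration of the expected behaviour only.

```python
import re

# Assumed reconstruction of the sample text, joined with CRLF line breaks.
TEXT = "\r\n".join([
    "Here we have some...",
    "multiline",
    "",
    "text. Here we explore first.",
    "",
    "And here we explore last.",
    "",
    "this is a green result",
    "",
    "green, this result is not",
])

# "this" only when it is directly preceded by a line break.
this_at_line_start = [m.start() for m in re.finditer(r"(?<=\r\n)this", TEXT)]

# "green" only when it is NOT directly preceded by a line break.
green_mid_line = [m.start() for m in re.finditer(r"(?<!\r\n)green", TEXT)]

# Greedy dot-all match: from the first "we" to the last "explore".
greedy = re.search(r"(?s)we.*explore", TEXT)

print(len(this_at_line_start), len(green_mid_line))
print(greedy.group(0).count("explore"))
```

On this sample each look-behind finds exactly one hit, and the greedy match contains both occurrences of "explore".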

What's the point of implementing RegEx, if it's not implemented fully? The question is - is this hardcoded into UE, or is there a way to configure UE to handle RegEx properly?

mjcarman · Apr 08, 2008 · #2 · 2008-04-08T16:01+00:00

I don't believe that UE does line-by-line matching. If that were true, it wouldn't be possible to match text that spans lines. That said, UE does have some issues with lookaround adjacent to newlines. Send a bug report to IDM, and don't use them (for now).

There are better ways to anchor at/after the beginning of a line:

Use the "^" anchor to match at the beginning of a line. e.g. "^this"
To match after (but not at) the beginning of a line, add ".+" to require additional characters before the word. e.g. "^.+this"
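
A small sketch of the two anchoring idioms, written for Python's `re` module as a stand-in (an assumption; UE's engine is not involved). In Python the `re.MULTILINE` flag is needed for "^" to match at every line start, and the two-line `TEXT` sample is made up from the thread's test text.

```python
import re

# Hypothetical two-line sample built from the thread's test text.
TEXT = "this is a green result\r\ngreen, this result is not\r\n"

# re.MULTILINE makes "^" match at the start of every line, not only the buffer.
at_start = re.findall(r"^this", TEXT, flags=re.MULTILINE)

# ".+" demands at least one character before "this" on the same line.
after_start = re.findall(r"^.+this", TEXT, flags=re.MULTILINE)

print(at_start)
print(after_start)
```

The first pattern only hits the "this" that opens a line; the second only hits the line where "this" appears later on.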

A "." matches any character except newline. That's not a bug; it's Perl regex syntax. Perl contains a modifier (/s) to make it match newlines too but UE doesn't provide a way to set it. The simplest way to work around this is to use a negated character set with a character that isn't in your file. Tim's favorite is "[^ÿ]". To be more robust (and more strictly correct) you could specify an alternation: "(?:.|\r\n)"
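
To make the workarounds concrete, here is a hedged sketch in Python's `re` module (used as a stand-in, not UE's engine) comparing the plain dot, the negated character set, the alternation, and the inline `(?s)` modifier. The `TEXT` sample is an assumed excerpt of the test text with CRLF line breaks.

```python
import re

# Assumed excerpt of the test text with CRLF line breaks.
TEXT = "Here we have some...\r\nmultiline\r\ntext. Here we explore first.\r\n"

# Plain dot: cannot cross "\n", so only the single-line hit is found.
plain = re.search(r"we.*explore", TEXT).group(0)

# Negated set with a character that is not in the text.
negated = re.search(r"we[^ÿ]*explore", TEXT).group(0)

# Alternation: any character, or an explicit CRLF pair.
alternation = re.search(r"we(?:.|\r\n)*explore", TEXT).group(0)

# Inline modifier that lets the dot match line breaks.
inline = re.search(r"(?s)we.*explore", TEXT).group(0)

print(repr(plain))
print(negated == alternation == inline)
```

In a whole-buffer engine the three multi-line variants select the same span, so the choice between them is mostly a matter of readability.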

Bracket · Apr 08, 2008 · #3 · 2008-04-08T16:33+00:00

mjcarman wrote:I don't believe that UE does line-by-line matching. If that were true, it wouldn't be possible to match text that spans lines. That said, UE does have some issues with lookaround adjacent to newlines. Send a bug report to IDM, and don't use them (for now).

I got that info from IDM support directly, and it's proven true by the test I showed above. And it is possible to match text that spans lines with line-by-line parsing - the RegEx is only applied to one string at a time, but UE is managing the handling, so they're basically jury rigging multi line support.

mjcarman wrote:There are better ways to anchor at/after the beginning of a line:

I was only using those as an example. They came up because I originally was looking for a way to anchor at the beginning of the document. The negative look behind doesn't work because UE is applying regex on a line by line basis. And "\A" anchors to the beginning of EVERY line, instead of only at the beginning of the document for the exact same reason - the regex engine is being fed only one string at a time, so the beginning of every line is what regex sees as the beginning of the buffer every time it's applied.
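
The "\A" effect described here can be sketched as follows. This is Python's `re` module emulating the explanation (each line handed to the engine as its own string); it is an assumption about the mechanism, not UE's actual code, and the `TEXT` sample is made up.

```python
import re

# Made-up three-line sample.
TEXT = "first line\r\nsecond line\r\nthird line\r\n"

# Whole buffer: "\A" can only match at offset 0.
whole = re.findall(r"\A\w+", TEXT)

# One line at a time: every line is its own buffer,
# so "\A" succeeds at the start of each of them.
per_line = [
    m.group(0)
    for line in TEXT.splitlines(keepends=True)
    for m in re.finditer(r"\A\w+", line)
]

print(whole)
print(per_line)
```

Fed the whole text, the engine reports a single hit; fed line by line, it reports one hit per line, which is the behaviour "^" would normally have.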

mjcarman wrote:A "." matches any character except newline. That's not a bug; it's Perl regex syntax. Perl contains a modifier (/s) to make it match newlines too but UE doesn't provide a way to set it.

That's not correct. Perl Regex does indeed have a way to set that option within a regex itself, UE supports it, and you saw me use it.

"(?s)" is what enables that option for the regex that follows it, and the regex I provided proves it. If you apply "we.*explore" to the test text, you'll only get results that occur on the same line, because the "." is not matching line breaks. But if you try "(?s)we.*explore", you'll see that you do get a multi line match (just not the correct one), because the "." IS being allowed to match line breaks.

mjcarman · Apr 08, 2008 · #4 · 2008-04-08T18:51+00:00

Bracket wrote:I got [the line-by-line] info from IDM support directly, and it's proven true by the test I showed above. And it is possible to match text that spans lines with line-by-line parsing - the RegEx is only applied to one string at a time, but UE is managing the handling, so they're basically jury rigging multi line support.

Then UE's behavior is really a single/multi line hybrid. It must be starting with a single line and expanding the search text until the regex matches. I'm guessing that it's for performance reasons but I agree that it's ugly.
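
A rough model of that guess, as a sketch only: `expanding_search` is a hypothetical helper written in Python, not anything taken from UE, and the `TEXT` constant is an assumed reconstruction of the test text. It starts with one line and appends further lines until the regex matches.

```python
import re

# Assumed reconstruction of the sample text, joined with CRLF line breaks.
TEXT = "\r\n".join([
    "Here we have some...",
    "multiline",
    "",
    "text. Here we explore first.",
    "",
    "And here we explore last.",
    "",
    "this is a green result",
    "",
    "green, this result is not",
])


def expanding_search(pattern, text):
    """Hypothetical line-first search; a guess at the behaviour, not UE's code.

    Starting at each line in turn, the window grows by one line at a time
    and the first window that yields a match wins.
    """
    rx = re.compile(pattern)
    lines = text.splitlines(keepends=True)
    for start in range(len(lines)):
        window = ""
        for line in lines[start:]:
            window += line
            m = rx.search(window)
            # Accept only matches that begin in the first line of the window.
            if m and m.start() < len(lines[start]):
                return m.group(0)
    return None


hybrid = expanding_search(r"(?s)we.*explore", TEXT)
whole = re.search(r"(?s)we.*explore", TEXT).group(0)

print(hybrid.count("explore"), whole.count("explore"))
```

Under this model the greedy pattern stops at the first "explore", because the window never grows far enough to contain the second one, while the whole-buffer search reaches the second. The model does not reproduce every observation in the thread, so it should be read as one possible explanation.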

Bracket wrote:I was only using [the look-around anchors] as an example.

Sorry, I saw a silly way of doing something and had to "fix" it.

Bracket wrote:"(?s)" is what enables [the "." matches newline] option for the regex that follows it

Ack, I read too quickly. I never use the embedded modifiers when I'm actually writing Perl and didn't realize that the library used by UE supports them. That's good to know. Maybe I should try "(?{ code })"

It appears that you've gotten to the root of your problem. Unfortunately, I think this is part of the implementation and there's nothing you can do about it (other than filing a bug report).

pietzcker · Apr 08, 2008 · #5 · 2008-04-08T19:54+00:00

Hi Bracket,

thanks for clearing that up. Especially the bit with (?s). Obviously, UE is kludging around here, only expanding over a line break if it hasn't found a match yet. Which is ugly - and dangerous if you are depending on the results' correctness. So for now it's workarounds for us (like the infamous [^ÿ] or mjcarman's alternation example). This also explains why IDM support has been sceptical about whether some of the bugs I have reported can be fixed soon - if UE has to mash up some kind of newline handling with a third-party regex engine that never sees more than a line. I guess it could get quite hairy, though, trying to give a regex engine access to a multi-GB file...

mjcarman · Apr 08, 2008 · #6 · 2008-04-08T21:05+00:00

The "[^ÿ]" and "(?:.|\r\n)" were workarounds for the lack of a way to set the /s modifier. As Bracket figured out, we can use the inline (?s) modifier; the workarounds aren't necessary.

There doesn't appear to be any way to get UE to apply the regex to the entire file at once. This means that:

\A and \Z won't work
Look-around by the start/end of lines won't work
Greedy quantifiers become non-greedy when the quantified item can span multiple lines. (e.g. "(?s).*" effectively becomes "(?s).*?")

I doubt that this is a limitation of the Boost library API. I suspect it's a deliberate decision to improve performance and make behavior more intuitive for typical end users. Bracket won't like me saying this, but it was probably the right decision. The problems we're seeing are corner cases.

pietzcker · Apr 09, 2008 · #7 · 2008-04-09T07:38+00:00

mjcarman wrote:The "[^ÿ]" and "(?:.|\r\n)" were workarounds for the lack of a way to set the /s modifier. As Bracket figured out, we can use the inline (?s) modifier; the workarounds aren't necessary.

You're right, of course (D'oh!). Must've been confused yesterday evening.

mjcarman wrote:Greedy quantifiers become non-greedy when the quantified item can span multiple lines. (e.g. "(?s).*" effectively becomes "(?s).*?")

Nearly, but not quite.

Applying (?s).*match to

blah blah match match blah
blah blah match

will match only the bracketed part of the first line:

[blah blah match match] blah
blah blah match

So you could say that greedy quantifiers are still greedy BUT will avoid crossing a newline if it is possible to match on the current line.
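
The distinction can be sketched with Python's `re` module (a stand-in, not UE's engine). The "line-first" result below simply runs the greedy pattern on the first line alone, which is an assumption about what UE effectively does.

```python
import re

TEXT = "blah blah match match blah\nblah blah match"
first_line = TEXT.splitlines()[0]

# Whole buffer, greedy: runs to the last "match" in the text.
whole_greedy = re.search(r"(?s).*match", TEXT).group(0)

# Whole buffer, lazy: stops at the very first "match".
whole_lazy = re.search(r"(?s).*?match", TEXT).group(0)

# Line-first (assumed UE-like): greedy, but confined to the first line.
line_first = re.search(r"(?s).*match", first_line).group(0)

print(repr(whole_greedy))
print(repr(whole_lazy))
print(repr(line_first))
```

The line-first result is longer than the lazy one but shorter than the true greedy one, which is why "effectively non-greedy" is close but not exact.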

This doesn't sound very nice, and maybe someone can formulate this better.

As mjcarman said, these really are corner cases. If you often need regexes that have to be applied to the entire file, RegexBuddy or (if that's not enough) PowerGREP might be an alternative...

Bracket · Apr 10, 2008 · #8 · 2008-04-10T06:04+00:00

mjcarman wrote:There doesn't appear to be any way to get UE to apply the regex to the entire file at once. This means that:

\A and \Z won't work

Well, not quite true. "\z" actually does work (lowercase "z"). I'm guessing they hardcoded that eventuality. "\A" and "\Z" work exactly the same way that "^" and "$" do.

mjcarman wrote:Look-around by the start/end of lines won't work

Also not quite. It's specifically look *behinds* that don't work, because they would need to see the line breaks of the line that came before them. But look *aheads* work fine, because the line breaks at the end of the current line they're parsing are still visible to them.
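
A sketch of why the two directions could differ if the engine only ever sees one line, including that line's own line break. This uses Python's `re` module to emulate the explanation; it is an assumption about the mechanism, not UE's code, and the `TEXT` sample is made up.

```python
import re

# Made-up two-line sample with CRLF line breaks.
TEXT = "this is a green result\r\nthis result is green\r\n"
lines = TEXT.splitlines(keepends=True)

LOOKBEHIND = r"(?<=\r\n)this"   # needs the PREVIOUS line's break
LOOKAHEAD = r"result(?=\r\n)"   # needs the CURRENT line's break

behind_whole = len(re.findall(LOOKBEHIND, TEXT))
behind_per_line = sum(len(re.findall(LOOKBEHIND, line)) for line in lines)

ahead_whole = len(re.findall(LOOKAHEAD, TEXT))
ahead_per_line = sum(len(re.findall(LOOKAHEAD, line)) for line in lines)

print(behind_whole, behind_per_line)
print(ahead_whole, ahead_per_line)
```

The look-behind loses its hit when the text is fed line by line, because the preceding line break is no longer part of the string; the look-ahead keeps its hit, because the trailing break still is.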

mjcarman wrote:I doubt that this is a limitation of the Boost library API. I suspect it's a deliberate decision to improve performance and make behavior more intuitive for typical end users. Bracket won't like me saying this, but it was probably the right decision. The problems we're seeing are corner cases.

I understand what you're saying, but I disagree simply because if they were going to bother implementing RegEx, they should have done it properly. Now, I understand the case for higher performance, however:

1. They could have simply put in a configuration option to choose between full RegEx support and high-performance support with some limits, and let each user decide for themselves.

2. From what I understand, EditPad Pro has complete Regex support and does not suffer at all as a result of it. That at least proves that it's doable.

mjcarman · Apr 10, 2008 · #9 · 2008-04-10T14:28+00:00

Bracket wrote:From what I understand, EditPad Pro has complete Regex support and does not suffer at all as a result of it.

Is EditPad Pro disk-based? UltraEdit is. That's why it can handle massive files but I can see how it would make it hard to apply regexes to the whole file at once.

pietzcker · Apr 11, 2008 · #10 · 2008-04-11T03:35+00:00

Hi,

I thought I could work around this problem by using negative lookahead, i.e. apply the search string

(?s)Start.*match(?!.*match)

to the test text

Some text
Start matching from here
Next possible (and correct) match
More text

Alas, it looks like lookaheads don't cross newlines in UE, even when (?s) is specified, so that's too bad. It will always match "Start match". Not even the workarounds

Start[^ÿ]*match(?![^ÿ]*match)

or

Start(.|\r\n)*match(?!(.|\r\n)*match)

will work. Unless someone has a better idea, but I'm slowly running out of those...
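
For reference, a hedged sketch of what these three expressions select in an engine that sees the whole buffer. Python's `re` module is used as a stand-in, and the test text is assumed to use CRLF line breaks.

```python
import re

# The test text from this post, assumed to use CRLF line breaks.
TEXT = "\r\n".join([
    "Some text",
    "Start matching from here",
    "Next possible (and correct) match",
    "More text",
    "",
])

PATTERNS = [
    r"(?s)Start.*match(?!.*match)",
    r"Start[^ÿ]*match(?![^ÿ]*match)",
    r"Start(.|\r\n)*match(?!(.|\r\n)*match)",
]

# Each pattern should run from "Start" to the last "match" in the text.
results = [re.search(p, TEXT).group(0) for p in PATTERNS]

for r in results:
    print(repr(r))
```

All three select the text from "Start" through the "match" at the end of the third line, rather than stopping at "Start match".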

Tim

Bracket · Apr 21, 2008 · #11 · 2008-04-21T21:46+00:00

Well, I don't know if it's disk based, but I copied this from the features section of the developer's site:

----------------------------

Edit huge files without breaking a sweat. EditPad Pro will instantly open files up to 2 GB in size, even if your PC has less than 2 GB of RAM. Also, the maximum length of a single line is not limited, which is a problem with many editors claiming to support "unlimited" file sizes.

----------------------------

If it can open and work with a 2 GB text file without performance issues, I'd say that's enough to form a proof-of-concept.

pietzcker · Apr 22, 2008 · #12 · 2008-04-22T06:49+00:00

Well, I switched from UE to EditPad Pro last week, and I'm not looking back.

The regex-based syntax highlighting and code folding/parsing work a lot better for me than anything you can do with a UE wordfile. Especially support for Python - excellent code folding (based on indentation) that UE can't do, function lists that consider the structure of classes and functions within classes etc... and of course a top-notch regex engine.

There are a few UE features that EPP doesn't have, and for others these might be deal-breakers. For example, EPP doesn't have a built-in scripting language, but it has a good interface to any other scripting language, so I can use my favourite Python if I need to. So YMMV, and to each his own. I have now found that EPP just suits my personal needs better. I still think UE is a very good editor and unrivalled in many areas. Just not the areas I care most about...

Best regards,
Tim

Update by Mofi: Code folding based on indent level, needed especially for Python files, is also supported by UltraEdit v18.10 and later versions. A hierarchical function list view with classes on the top level and functions within a class on the second level can also be achieved since UltraEdit v16.00.

ridgerunner · Jul 19, 2008 · #13 · 2008-07-19T04:41+00:00

I'm also experiencing some peculiarities with UE's handling of multi-line regexes. Will report back later...

sklad2 · Aug 08, 2008 · #14 · 2008-08-08T19:17+00:00

I am also playing with EditPad Pro. It is missing some features I must have, like SFTP, but the regex stuff is amazing, even on really big!!! files. Maybe UltraEdit + PowerGREP will do the trick...