Help on advanced usage of lookbehind and lookahead Perl regular expressions

fvgfvg · Feb 02, 2014 · #1 · 2014-02-02T14:11+00:00

The following regex:

UltraEdit.activeDocument.findReplace.replace("((?<=\\R)|(?<=\\s)|(?<=>)|(?<=\\))|(?<=F)|(?<=\\())('|‘)(.*?)('|’)(?!s)(?!r)(?!t)(?!m)","xxx\\3xxx");

is meant to match pairs of single quotation marks (as in the test block below), with positive lookbehinds for newline, blank, ( and >. The negative lookaheads are there so that it skips words containing apostrophes, such as I'm, they're, it's. In RegexBuddy it works just fine, and I've been careful to copy it back to UE in 'C-style String' format, so that each backslash \ is doubled to '\\'. (Thanks for that, Mofi!) Mostly that works just fine in UE, at least with the Perl regex engine.

This is text. This is text. This is text. 'This sentence is in quotation marks'. This is text. This is text. This is text. 'This is a sentence containing apostrophy contructions: I'm they're it's won't'  This is text. This is text. This is text. This is text. This is text. 'This sentence has a mixture of ordinary apostrophes and hex91, hex92: I’m they’re it's won’t’ This is text. This is text.  This is text.

But the regex above produces the error "You have entered an invalid regular expression". The hex91 / hex92 characters don't seem to be the problem (I've checked), though I'm also looking for a way of avoiding those. (If I switch from the Perl to the UE regex engine there's no error message; it says 'script succeeded', but in fact nothing happens.)
The problem persists even if I simplify things, throwing out the dubious hex 91/92 characters along with the 'or' operator '|' that joined them to the straight quote, as in:

UltraEdit.activeDocument.findReplace.replace("((?<=\\R)|(?<=\\s)|(?<=F)|(?<=\\())'(.*?)'(?!s)(?!r)(?!t)(?!m)","xxx\\2xxx");

Where have I gone wrong here?

best,
fvg

Mofi · Feb 02, 2014 · #2 · 2014-02-02T20:54+00:00

\R is supported in a Perl regular expression, but not inside a lookbehind. (?<=\R) makes the search string an invalid expression because \R matches either a single carriage return, a single line-feed, or a carriage return + line-feed pair. Its match length is therefore not fixed, and a lookbehind must be of fixed length.

The solution is to use (?<=\r)|(?<=\n), although that expression is not necessary, as \s matches all whitespace characters, including carriage returns and line-feeds.
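
The point about \s can be checked with a small sketch. It uses plain JavaScript regular expressions as a stand-in for UltraEdit's Perl engine, which is an assumption: the two engines differ in details (JavaScript has no \R, for example), and the sample string and variable names are made up for illustration.

```javascript
// Plain JavaScript RegExp, used only as a stand-in for UltraEdit's Perl engine.
// Lookbehind with the single fixed-length alternatives suggested above.
const afterLineBreak = /(?:(?<=\r)|(?<=\n))'/g;
// Lookbehind with \s, which also covers carriage returns and line-feeds.
const afterWhitespace = /(?<=\s)'/g;

// Three quotes: one after CR+LF, one after a bare LF, one after a space.
const sample = "a\r\n'b\n'c 'd";

// Collect the offset of every quote each pattern finds.
const lineBreakHits = [...sample.matchAll(afterLineBreak)].map(m => m.index);
const whitespaceHits = [...sample.matchAll(afterWhitespace)].map(m => m.index);

console.log(lineBreakHits);  // quotes that directly follow a line break
console.log(whitespaceHits); // the same quotes plus the one after the space
```

Every quote found by the \r / \n version is also found by the \s version, which is why the separate line-break lookbehinds can be dropped.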

Some more hints:

Use a non marking group if you group something like several arguments of an OR expression and do not want to back reference the string matched by the expression. (...) is a marking group, (?:...) is a non marking group.

So the Perl regular expression string could be also

(?:(?<=\s)|(?<=>)|(?<=\))|(?<=F)|(?<=\())(?:'|‘)(.*?)(?:'|’)(?!s)(?!r)(?!t)(?!m)
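
The effect of a non marking group on back reference numbers can be sketched as follows. This is plain JavaScript rather than an UltraEdit script, the pattern is shortened to two lookbehinds, and the sample text is invented. JavaScript writes the back reference as $1 / $2 where the UltraEdit replace string uses \\1 / \\2.

```javascript
const text = "say 'hello' now";

// Marking group around the lookbehinds: it becomes group 1 (always empty),
// so the quoted text ends up in group 2.
const marking = /((?<=\s)|(?<=>))'(.*?)'/;
// Non marking group: the quoted text is now group 1.
const nonMarking = /(?:(?<=\s)|(?<=>))'(.*?)'/;

const m1 = text.match(marking);
const m2 = text.match(nonMarking);

console.log(JSON.stringify(m1[1]), JSON.stringify(m1[2]));
console.log(JSON.stringify(m2[1]));

// Both replacements give the same result, only the group number differs.
const viaGroup2 = text.replace(marking, "xxx$2xxx");
const viaGroup1 = text.replace(nonMarking, "xxx$1xxx");
console.log(viaGroup2 === viaGroup1, viaGroup1);
```

This is why the replace string in the shortened expressions refers to a lower group number than the original one did.
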

Use a character set definition [...] instead of (x|y) if only an OR of single characters is needed.

Well, given your very complex search requirements, the second (?:'|’) must be kept so that the third example in your text also matches, because a straight single quote has to be favored over a right single quote. I know the exact reason why ['’] can't be used after the capturing group for your third example, but it is very hard to explain to somebody who is not already an expert in Perl regular expressions.

But the first one could be replaced. So the Perl regular expression string could be also

(?:(?<=\s)|(?<=>)|(?<=\))|(?<=F)|(?<=\())['‘](.*?)(?:'|’)(?!s)(?!r)(?!t)(?!m)

A lookbehind string must be of fixed length. But that does not mean you cannot use a character set. As long as no multiplier is used, it is also possible to use a character set in a lookbehind or lookahead expression.
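
As a rough check that a character set inside one lookbehind does the same job as the chain of OR'ed lookbehinds, here is a sketch in plain JavaScript (not an UltraEdit script; the sample string is made up so that each listed character precedes one quote):

```javascript
// Five separate single-character lookbehinds joined with OR.
const alternated = /(?:(?<=\s)|(?<=>)|(?<=\))|(?<=F)|(?<=\())'/g;
// One lookbehind holding a character set: still exactly one character long.
const charSet = /(?<=[\s>()F])'/g;

// Quotes after a space, >, ), F and ( should match; the one after x should not.
const sample = "a 'b >'c )'d F'e ('f x'g";

const hitsAlternated = [...sample.matchAll(alternated)].map(m => m.index);
const hitsCharSet = [...sample.matchAll(charSet)].map(m => m.index);

console.log(hitsAlternated);
console.log(hitsCharSet);
```

Both patterns report the same offsets for this sample, and the quote after x is skipped by both.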

This makes your expression much easier as it can be now

(?<=[\s>()F])['‘](.*?)(?:'|’)(?![srtm])

The command in the script is now:

UltraEdit.activeDocument.findReplace.replace("(?<=[\\s>()F])['‘](.*?)(?:'|’)(?![srtm])","xxx\\1xxx");

Is that string not much easier than what RegexBuddy created?
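
To try the final expression outside UltraEdit, a minimal sketch in plain JavaScript could look like this. It assumes that JavaScript's regex engine treats this particular pattern the same way as UltraEdit's Perl engine, which may not hold in every detail. The function name markQuotes and the sample sentences are made up for illustration.

```javascript
// The pattern as a string literal: every regex backslash is doubled,
// just as in the first argument of the UltraEdit replace call.
const source = "(?<=[\\s>()F])['‘](.*?)(?:'|’)(?![srtm])";
const pattern = new RegExp(source, "g");

// JavaScript writes the back reference as $1 where UltraEdit uses \\1.
function markQuotes(text) {
  return text.replace(pattern, "xxx$1xxx");
}

// A quoted sentence, followed by an apostrophe that must be left alone.
const plain = markQuotes("This is text. 'This sentence is in quotation marks'. It's fine.");
// Apostrophes inside the quoted sentence are skipped by the negative lookahead.
const nested = markQuotes("He said 'I'm sure they're right' today.");
// Typographic left and right single quotes.
const curly = markQuotes("x ‘curly’ y");

console.log(plain);
console.log(nested);
console.log(curly);
```

The lazy (.*?) stops at the first closing quote that is not followed by s, r, t or m, which is what lets I'm and they're pass through inside a quoted sentence.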

fvgfvg · Feb 03, 2014 · #3 · 2014-02-03T10:51+00:00

Wow... thank you, Mofi. Up to now my faith in RegexBuddy has been absolute... It seems that what works there doesn't always work under UE.
What I've just learned is that character set arguments [...] will work within lookaround constructions. I'll be using that a lot. The (?:'|’) I'll have to play around with, as I don't think I understand it yet. But what I've decided is to standardize all these apostrophes first (replacing all ´`‘’ with a single '), which will make the 'or' construction '|' unnecessary. (And I'll beware of that (?<=\R).)
best, and thanks again
fvg
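
The normalization step fvg describes could be sketched like this in plain JavaScript. This is not an UltraEdit script; the helper names normalizeApostrophes and markQuotes are hypothetical, and the simplified pattern is an assumption about what the expression would look like once only the straight quote remains.

```javascript
// Step 1: map acute accent, backtick, and left/right single quotes to '.
function normalizeApostrophes(text) {
  return text.replace(/[´`‘’]/g, "'");
}

// Step 2: with only straight quotes left, no OR construction is needed.
const simplified = /(?<=[\s>()F])'(.*?)'(?![srtm])/g;

function markQuotes(text) {
  return normalizeApostrophes(text).replace(simplified, "xxx$1xxx");
}

const normalized = normalizeApostrophes("I’m ‘here’ `x´");
const marked = markQuotes("I’m sure ‘it’s fine’ ok");

console.log(normalized);
console.log(marked);
```

One trade-off of this approach is that the typographic quotes are lost in the output, so it only fits when plain apostrophes are acceptable in the final text.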

TXWizard · Mar 19, 2014 · #4 · 2014-03-19T06:11+00:00

This is quite a discussion about regular expressions in the context of UltraEdit. I continue to be amazed at how faithful the PCRE engine is to Perl. I first used regular expressions in Perl scripts, using the ActiveState distribution, and more recently in the Microsoft .NET Framework. I have found that the behavior of the .NET regular expression object mirrors the Perl engine quite faithfully, and it is a huge improvement over the anemic implementation in the Visual Basic Scripting Runtime (scrrun.dll).

I first acquired a copy of UltraEdit in September 2009, and began using it largely on the strength of its support for Perl compatible regular expressions. While I found a few nits in the first version or two, all of them seem to be resolved. The present-day engine seems very solid, and I use it daily.