Problem with NOT (^) in character set

SteveLig · Oct 09, 2015, 15:11 UTC · #1

My Perl regular expression is matching much more than it should in UE 21 and 22.

<cc>(.*Mass[^<(]*Mass)\.( .*)</cc>

is matching this in my input file:

<cc>O. Ahlborg & Sons v. Massachusetts Heavy Indus., Inc., 65 Mass. App. Ct. 385, 392 (2006)</cc>. See <readme cite_65><cc>Harrison v. Roncone, 447 Mass. 1001 (2006)</cc> (where case involves multiple claims and multiple parties, judgment dismissing fewer than all claims or parties is interlocutory and is not immediately appealable absent determination by trial judge that there is no just reason for delay and upon express direction for the entry of final judgment). See also <readme cite_66><cc>Trenz v. Family Dollar Stores of Mass., Inc., 73 Mass. App. Ct. 610, 613-615 (2009)</cc>

I'm trying to match only the content between <cc></cc> that has two occurrences of "Mass" contained within it, so only the first <cc> above should be matched. I would have expected the [^<(] to have stopped the match.

The ^ in the character set doesn't seem to have an effect.

Removing the parentheses doesn't help nor does using the non-greedy ? character.

Am I misunderstanding something or is this a bug of some type?

This behavior occurs in UE 21.30.0.124 and 22.20.0.34

Thanks for any help!

Steve

Mofi · Oct 09, 2015, 20:11 UTC · #2

This is not a bug. Your expression is not designed to limit the match to the text between a <cc> and the next </cc>, with no other <cc> or </cc> in between.

Perl regular expressions are greedy by design, which means a quantifier always tries to match as much as possible while still producing a positive match. Even using <cc>(.*?Mass[^<(]*?Mass)\.( .*?)</cc> does not reliably help here. Why?

Well, let us look at [^<(]*Mass or [^<(]*?Mass and why this expression matches more than you expect.

You assume that this matches any characters, including newline characters, up to the string Mass, but excluding < and (. That is only part of the picture.

The greedy [^<(]* matches 0 or more characters being neither < nor ( and therefore it first matches anything, including Mass, up to a left angle bracket or an opening parenthesis. Why should [^<(]* stop on Mass? The Mass part of the expression is evaluated only later, after matching characters other than < or ( has stopped because one of those two characters was found. Only then does the engine backtrack to look for Mass.

The fixed string Mass starts with neither < nor (, and all of its characters are members of the character set [^<(], which is the main problem here. Therefore a greedy [^<(]* always runs to the next < or ( first, independent of how many Mass it passes on the way. And when no second Mass can be found before a < or (, the engine does not give up: it lets the leading .* (or .*?) consume more text instead. As the dot also matches < and /, that part can run across </cc> into a later <cc> element, so the overall match becomes much larger than expected, even with ?.
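The over-matching can be reproduced outside UltraEdit. The following is only a rough sketch in Python, on the assumption that Python's re module treats these particular constructs the same way as UltraEdit's Perl engine; the sample line is the one from the question with the long parenthetical abbreviated, and the variable names are made up for illustration.

```python
import re

# Sample line from the question; the long parenthetical is abbreviated here.
text = (
    "<cc>O. Ahlborg & Sons v. Massachusetts Heavy Indus., Inc., "
    "65 Mass. App. Ct. 385, 392 (2006)</cc>. See <readme cite_65>"
    "<cc>Harrison v. Roncone, 447 Mass. 1001 (2006)</cc> "
    "(where case involves multiple claims ...). See also <readme cite_66>"
    "<cc>Trenz v. Family Dollar Stores of Mass., Inc., "
    "73 Mass. App. Ct. 610, 613-615 (2009)</cc>"
)

# The expression from the question and the non-greedy attempt.
greedy = re.compile(r"<cc>(.*Mass[^<(]*Mass)\.( .*)</cc>")
lazy = re.compile(r"<cc>(.*?Mass[^<(]*?Mass)\.( .*?)</cc>")

greedy_matches = [m.group(0) for m in greedy.finditer(text)]
lazy_matches = [m.group(0) for m in lazy.finditer(text)]

# Count how many closing tags each match swallowed.
for label, found in (("greedy", greedy_matches), ("lazy", lazy_matches)):
    for item in found:
        print(label, "match contains", item.count("</cc>"), "closing tag(s)")
```

In this sketch the greedy expression returns a single match covering the whole line, while the non-greedy one gets the first element right but then produces a second match that starts at the Harrison element and ends at the closing tag of the Trenz element. Nothing in either expression forbids the dot from stepping over </cc>.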

Okay, let us talk about the solution for this tricky search task.

It looks like cc elements are never nested, so it is enough to always stop matching at the first closing tag </cc> found after the opening tag <cc>.

What we need here is a negative lookahead to get the matching behavior you want: (?s)<cc>((?:(?:.(?!</cc>))*?Mass){2})\.( .*?)</cc>

Explanation of this expression:

(?s) ... lets the dot also match newline characters, see the topic "." (dot) in Perl regular expressions doesn't include CRLFs? for details.

<cc> ... the opening tag, simply a fixed string.

(...) ... first marking group for the expression matching everything after <cc> up to the end of the second Mass.

(?:...){2} ... outer non-marking group to define an expression which must be applied exactly two times for a positive match.

(?:.(?!</cc>))*? ... this is the important part. It matches any character 0 or more times, non-greedy, as long as the character just matched is NOT followed by </cc>. It also stops as soon as Mass is found, because Mass is the next part of the expression.

Mass ... the fixed string which is always a stop condition for previous expression matching any character.

\. ... after second Mass there must be a dot character.

(...) ... second marking group for the expression matching everything after the dot following the second Mass, up to the character before the next </cc> found in the file.

.*? ... any character including newline characters 0 or more times up to closing tag </cc>.

</cc> ... the closing tag is the stop condition for matching any character.

I don't know why you also specified an opening parenthesis in the negative character set.

However, for your example this expression matches:

<cc>O. Ahlborg & Sons v. Massachusetts Heavy Indus., Inc., 65 Mass. App. Ct. 385, 392 (2006)</cc>

and

<cc>Trenz v. Family Dollar Stores of Mass., Inc., 73 Mass. App. Ct. 610, 613-615 (2009)</cc>
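The two matches listed above can be checked with a small Python sketch, again assuming Python's re module behaves like UltraEdit's Perl engine for these constructs. The sample line is the one from the question with the parenthetical abbreviated. The second pattern, named variant, is not from this thread: it is a hypothetical alternative that places the lookahead before the dot, shown only for comparison on a made-up edge case.

```python
import re

# Sample line from the question; the long parenthetical is abbreviated here.
text = (
    "<cc>O. Ahlborg & Sons v. Massachusetts Heavy Indus., Inc., "
    "65 Mass. App. Ct. 385, 392 (2006)</cc>. See <readme cite_65>"
    "<cc>Harrison v. Roncone, 447 Mass. 1001 (2006)</cc> "
    "(where case involves multiple claims ...). See also <readme cite_66>"
    "<cc>Trenz v. Family Dollar Stores of Mass., Inc., "
    "73 Mass. App. Ct. 610, 613-615 (2009)</cc>"
)

# Expression from the post above, unchanged.
pattern = re.compile(r"(?s)<cc>((?:(?:.(?!</cc>))*?Mass){2})\.( .*?)</cc>")

# Hypothetical variant (not from the thread): lookahead placed before the dot.
variant = re.compile(r"(?s)<cc>((?:(?:(?!</cc>).)*?Mass){2})\.( .*?)</cc>")

matches = [m.group(0) for m in pattern.finditer(text)]
variant_matches = [m.group(0) for m in variant.finditer(text)]

for found in matches:
    print(found)

# Made-up edge case: the first Mass sits directly in front of </cc>.
edge = "<cc>a Mass</cc> x <cc>b Mass. c</cc>"
edge_pattern = [m.group(0) for m in pattern.finditer(edge)]
edge_variant = [m.group(0) for m in variant.finditer(edge)]
```

On the sample line both patterns find the Ahlborg and Trenz elements and skip the Harrison element. On the made-up edge case the expression from the post still matches across </cc>, because the lookahead is only tested after a consumed character and the literal Mass is not guarded; the variant finds nothing there. Whether such input can occur in the real file is not known from the thread.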

The . after the second Mass is removed when using \1\2 as the replace string.
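A short Python sketch of that replacement, assuming re.sub backreferences behave like UltraEdit's \1 and \2. Note that in the expression the tags sit outside both marking groups, so \1\2 alone drops them as well; the second replace string, which writes the tags back, is an assumption about what is wanted and is not from the thread.

```python
import re

pattern = re.compile(r"(?s)<cc>((?:(?:.(?!</cc>))*?Mass){2})\.( .*?)</cc>")

# First matching element from the example.
sample = (
    "<cc>O. Ahlborg & Sons v. Massachusetts Heavy Indus., Inc., "
    "65 Mass. App. Ct. 385, 392 (2006)</cc>"
)

# Replace string from the post: the dot after the second Mass disappears,
# and so do the tags, because they are not part of either group.
without_tags = pattern.sub(r"\1\2", sample)

# Assumed alternative: put the tags back around the two groups.
with_tags = pattern.sub(r"<cc>\1\2</cc>", sample)

print(without_tags)
print(with_tags)
```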

SteveLig · Nov 10, 2015, 19:10 UTC · #3

Thanks so much for the detailed response, Mofi!

I'm going to spend some time going through it. Apparently I don't know regular expressions as well as I thought I did.