Which RegExp finds a block of several lines, based on criteria for the first and last line of the block?

LaurentG · Jan 22, 2018, 09:03 UTC · #1

Hi everybody,

Since I'm not sure the description of my request is clear ;-) let me explain a little:

I'd like to find, with one regexp (in UE, Unix or Perl syntax), the next block of several lines (and have it selected), where:
- the first line contains a given string (let's say "Blablabla"),
- the last line contains another string (let's say "Blublublu"),
- the intermediate lines (there may be none, one, or several) do not contain either of the two strings (neither 'Blablabla' nor 'Blublublu' in any of these lines)

I tried a lot of things, but never succeeded with such a search.

If anyone can help, it would be much appreciated.
Thanks !

Mofi · Jan 22, 2018, 11:24 UTC · #2

The description reads like searching for an XML block. This can be done only with a Perl regular expression, using the search string: (?s)Blablabla(?:.(?!Blablabla|Blublublu))*.?Blublublu

(?s) ... . (dot) matches also newline characters.

(?:...) ... non-capturing group. The multiplier * should be applied on the entire expression inside this group.

. ... any character; usually with the exception of newline characters, but here including newline characters because of (?s) at the beginning of the search string.

(?!...) ... a negative lookahead. The current character should be matched by . only if the strings of fixed length inside the negative lookahead are not found next after current character.

Blablabla|Blublublu ... a simple OR expression of two strings with fixed length. A positive/negative lookahead/lookbehind works only if Perl regular expression can determine from search expression how many characters to look ahead or behind.

* ... the expression in the non-capturing group 0 or more times. The last character to the left of Blublublu is not matched by the non-capturing group, because Blublublu follows it directly.

.? ... one more character optionally, i.e. 0 or 1 character.

This Perl regular expression finds also BlablablaBlublublu with no character between the two strings (most likely tags).

(?s)Blablabla(?:.(?!Blablabla|Blublublu))*.Blublublu without ? after second . requires that there is at least one character between Blablabla and Blublublu.

(?s)Blablabla(?:.(?!Blablabla|Blublublu))+.Blublublu with multiplier + instead of * requires that there are at least two characters between Blablabla and Blublublu.
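
The three variants can be compared quickly outside UltraEdit. The following is only a minimal sketch: it assumes Python's `re` module as a stand-in for UltraEdit's Perl engine (it accepts the same syntax for these particular patterns), and the sample strings and the helper name `matches` are made up for illustration.

```python
import re

# The three variants from the post; (?s) lets the dot match newlines.
OPTIONAL = r"(?s)Blablabla(?:.(?!Blablabla|Blublublu))*.?Blublublu"   # 0 or more characters between
AT_LEAST_1 = r"(?s)Blablabla(?:.(?!Blablabla|Blublublu))*.Blublublu"  # at least 1 character between
AT_LEAST_2 = r"(?s)Blablabla(?:.(?!Blablabla|Blublublu))+.Blublublu"  # at least 2 characters between


def matches(pattern, text):
    """Return True if the pattern matches the whole text as one block."""
    return re.fullmatch(pattern, text) is not None


# Hypothetical sample inputs.
adjacent = "BlablablaBlublublu"                                  # nothing between the markers
one_char = "Blablabla\nBlublublu"                                # a single newline between them
multi_line = "Blablabla first\nmiddle line\nlast Blublublu"      # several lines

for name, pattern in [("optional", OPTIONAL),
                      ("at least 1", AT_LEAST_1),
                      ("at least 2", AT_LEAST_2)]:
    print(name, [matches(pattern, t) for t in (adjacent, one_char, multi_line)])
```

Each printed row shows which of the three sample strings the variant accepts, which makes the effect of `.?`, `.` and `+` visible side by side.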

See also forum topic "." (dot) in Perl regular expressions doesn't include newline characters CRLF? and take a look on the articles about

LaurentG · Jan 22, 2018, 12:52 UTC · #3

Hi,

thanks a lot for such a detailed answer!
And you're right, it's not like searching for an XML block, it is exactly searching for an XML block. Well spotted!

BTW: Is it feasible to do the same in UE or UNIX syntax?
Actually, the main thing that blocked me was being unable to make . (dot) also match newlines. I didn't know about (?s).

Mofi · Jan 22, 2018, 15:55 UTC · #4

Such a search for an XML block is not possible with the UltraEdit or Unix regular expression engine. Only the Perl regular expression engine, with its negative lookahead capability, makes this search possible.

There is a simple solution for matching any character without using a dot and mode flag (?s): [\s\S] (any whitespace or any non whitespace character) or [\w\W] (any word or any non word character).
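
A small sketch of the difference, assuming Python's `re` module purely for illustration (the thread targets UltraEdit's Perl engine); the sample text is made up.

```python
import re

# Hypothetical three-line sample.
text = "Blablabla\nsome line\nBlublublu"

# Without (?s) the dot stops at a newline, so nothing is found.
dot_only = re.search(r"Blablabla.*Blublublu", text)

# [\s\S] matches any character, newline included, with no mode flag.
char_class = re.search(r"Blablabla[\s\S]*Blublublu", text)

# The lookahead-guarded pattern from earlier in the thread, with each dot
# replaced by [\s\S] and the (?s) flag dropped.
guarded = re.search(
    r"Blablabla(?:[\s\S](?!Blablabla|Blublublu))*[\s\S]?Blublublu", text
)

print(dot_only)
print(repr(char_class.group(0)))
print(repr(guarded.group(0)))
```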

fleggy · Jan 22, 2018, 16:43 UTC · #5

Hello,

this pattern selects everything between Blablabla and Blublublu excluding these words:

(?s)(?<=Blablabla)(?:(?!Blablabla|Blublublu).)*(?=Blublublu)

When you put the lookahead test before the dot, you don't need the optional match .?
And a little technical note: a lookahead can contain variable-length patterns, thank God :)
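
As a rough illustration, here is a sketch assuming Python's `re` module (not UltraEdit itself); the sample strings and the variable names are made up. It runs the pattern above, then compares a lookahead-first variant that includes both markers with the dot-first form from earlier in the thread.

```python
import re

# Pattern from this post: lookbehind and lookahead keep both markers out of the match.
INNER = r"(?s)(?<=Blablabla)(?:(?!Blablabla|Blublublu).)*(?=Blublublu)"

text = "Blablabla\nline 1\nline 2\nBlublublu"
inner = re.search(INNER, text).group(0)
print(repr(inner))

# Variant that includes both markers, with the lookahead before the dot ...
LOOKAHEAD_FIRST = r"(?s)Blablabla(?:(?!Blablabla|Blublublu).)*Blublublu"
# ... and the earlier form with the dot first, which needs the extra optional .?
DOT_FIRST = r"(?s)Blablabla(?:.(?!Blablabla|Blublublu))*.?Blublublu"

# Hypothetical edge case: the end marker directly follows the start marker,
# and a second end marker appears later.
tricky = "BlablablaBlublublu x Blublublu"
first = re.search(LOOKAHEAD_FIRST, tricky).group(0)
second = re.search(DOT_FIRST, tricky).group(0)
print(repr(first))
print(repr(second))
```

In this sketch the two forms differ on the edge case: the lookahead-first form stops at the first Blublublu, while the dot-first form can step over the first character of that marker and run on to the later one.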

Personally, I would use some recursive pattern for XML parsing. But the request is too vague.

BR, Fleggy

Mofi · Jan 22, 2018, 17:34 UTC · #6

Thanks, Fleggy. I learned again something new about Perl regular expressions from your reply.

LaurentG · Jan 22, 2018, 19:50 UTC · #7

fleggy wrote: Personally, I would use some recursive pattern for XML parsing. But the request is too vague.

Hi and many thanks to both of you.

@Fleggy: could you be a little more specific about what you call a "recursive pattern"? Sounds "interesting" to me ;-)

About my request, to be less "vague", the need is to be able to easily select any XML block, from beginning tag to end tag.
And an option (not purely theoretical, I have a need), would be to select an XML block:

- only if it contains at least once a given sub-block (or a given data)
- and / or only if it does not contain another sub-block

Example:
be able to select, in following XML file, XXX blocks

- including at least one ZZZ block,
- and avoiding any block VVV

This should select the second and the fourth block, but neither the first one (it contains a VVV sub-block) nor the third (it does not include a ZZZ sub-block).

<XXX>
    <ZZZ>Blue</ZZZ>
    <TTT>
        <VVV>
            High
        </VVV>
        <WWW>0</WWW>
    </TTT>
    <AAA>3.14</AAA>
</XXX>
<XXX>
    <ZZZ>Red</ZZZ>
    <TTT>
        <BBB>18</BBB>
    </TTT>
</XXX>
<XXX>
    <TTT>
        Never
    </TTT>
    <CCC>14</CCC>
</XXX>
<XXX>
    <ZZZ>White</ZZZ>
    <AAA>3.14159</AAA>
</XXX>
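
For the simple case where XXX blocks are never nested, the two conditions can be expressed with plain lookaheads. The following is only a sketch under that assumption, written with Python's `re` module for illustration; it is not the approach used in the replies further down, and the names `SELECT`, `sample` and `selected` are made up.

```python
import re

# Assumption: XXX blocks are NOT nested inside each other.
SELECT = re.compile(
    r"(?s)<XXX\b[^>]*>"               # opening tag; (?s) lets the dot match newlines
    r"(?=(?:(?!</XXX>).)*<ZZZ\b)"     # a <ZZZ tag must occur before the closing tag
    r"(?:(?!<VVV\b|</?XXX\b).)*"      # body: no <VVV tag and no further XXX tag
    r"</XXX>"                         # closing tag
)

# The four blocks from the example above, compacted.
sample = """<XXX>
    <ZZZ>Blue</ZZZ>
    <TTT><VVV>High</VVV><WWW>0</WWW></TTT>
    <AAA>3.14</AAA>
</XXX>
<XXX>
    <ZZZ>Red</ZZZ>
    <TTT><BBB>18</BBB></TTT>
</XXX>
<XXX>
    <TTT>Never</TTT>
    <CCC>14</CCC>
</XXX>
<XXX>
    <ZZZ>White</ZZZ>
    <AAA>3.14159</AAA>
</XXX>
"""

selected = SELECT.findall(sample)
for block in selected:
    print(block)
    print("-----")
```

The first lookahead checks the positive condition without consuming anything, so the body can then be matched while rejecting `<VVV` at every position.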

fleggy · Jan 23, 2018, 07:16 UTC · #8

Hi LaurentG,

Can XXX blocks be nested? And if yes, how to handle them?

Thanks, Fleggy

fleggy · Jan 23, 2018, 10:20 UTC · #9

Well, here is the recursive attempt. I added another two XML blocks to make your sample more interesting :)

<XXX>
    <ZZZ>Blue</ZZZ>
    <TTT>
        <VVV>
            High
        </VVV>
        <WWW>0</WWW>
    </TTT>
    <AAA>3.14</AAA>
</XXX>
<XXX>
    <ZZZ>Red</ZZZ>
    <TTT>
        <BBB>18</BBB>
    </TTT>
</XXX>
<XXX>
    <TTT>
        Never
    </TTT>
    <CCC>14</CCC>
</XXX>
<XXX>
    <ZZZ>White</ZZZ>
    <AAA>3.14159</AAA>
</XXX>
<XXX>
    <YYY>White</YYY>
    <XXX>3.14159
        <ZZZ></ZZZ>
    </XXX>
</XXX>
<XXX>
    <YYY>White</YYY>
    <XXX>3.14159</XXX>
    <ZZZ></ZZZ>
</XXX>

And this is the Perl regex:

(?s)(<(XXX)\b[^>]*+>(?>(?:(?!<\2\b)(?!</\2\b)(?!<VVV\b)(?!</VVV\b)(?=(<ZZZ\b)?).)++|(?1))*+</\2>)(?(3)(?<=>)|(?<=ZZZ NOT FOUND))

Unfortunately, currently I have no access to UE so I used Notepad++. I really hope it will work in UE as well.
I don't want to look immodest, but I am really proud of how I solved the condition "including at least one ZZZ block" :)

(?s)                   the dot also matches CR/LF
(                      group #1, referenced in recursion
 <(XXX)\b[^>]*+>       match the starting tag XXX and store its name in group #2
 (?>                     the body of the tag is an atomic group
  (?:                    inside the body can be: either any characters
   (?!<\2\b)               it must not be the tag <XXX
   (?!</\2\b)              it must not be the closing tag </XXX
   (?!<VVV\b)              it must not be the tag <VVV
   (?!</VVV\b)             it must not be the closing tag </VVV
   (?=(<ZZZ\b)?)           TRICK: check if there is the tag <ZZZ and store it in group #3 if found
                                  the lookahead must be optional
                                  we will check the group #3 at the very end of this pattern
   .                     if all lookarounds passed -> match a character
  )++                    repeat possessively
  |                      or
  (?1)                   try to match a nested tag recursively
 )*+                     the body can contain zero or more previously defined parts (possessive quantifier used for better performance)
 </\2>                 match the closing tag </XXX>
)                      end of group #1
                       now the whole XML block is matched
                       but we must check if ZZZ tag has been found
(?(3)                  TRICK: here we test if the group #3 has been defined
 (?<=>)                       YES: we need to pass - any pattern which never fails can be used here
 |
 (?<=ZZZ NOT FOUND)            NO: we need to fail - any pattern which always fails can be used here
)

BR, Fleggy

fleggy · Jan 24, 2018, 20:12 UTC · #10

Hello,

the regex in my previous post matches the innermost XXX block containing ZZZ. This is by design, because it is not possible to capture a group on level N and work with it on level N-1 (at least not in Perl). Thus the final test of group #3 works only after the XXX block where ZZZ has been captured. But what if LaurentG needs to match the outermost XXX block? Well, here is my approach:
- check if there is an XXX block not containing the string <ZZZ on any level. I used a conditional lookahead for this.
- YES, there is such a block: we are not interested in this block, so we use the always-failing token (?!) and the whole regex fails.
- NO, because either the block contains <ZZZ or there is no valid XXX block at all. Either way, we can now try to match an XXX block again, because if there is an XXX block then it must contain <ZZZ somewhere inside its body!

This time I used named groups for better debugging.

(?s)(?(?=(?<LAMAIN><(?<LATAG>XXX)\b[^>]*+>(?>(?:(?!<\k<LATAG>\b)(?!</\k<LATAG>\b)(?!<ZZZ\b).)++|(?&LAMAIN))*+</\k<LATAG>>))(?!)|(?<MAIN><(?<TAG>XXX)\b[^>]*+>(?>(?:(?!<\k<TAG>\b)(?!</\k<TAG>\b)(?!<VVV\b).)++|(?&MAIN))*+</\k<TAG>>))

You can see the difference between the old regex and the new one in this text:

<XXX>
    <YYY>White</YYY>
    <XXX>3.14159
    </XXX>
    <ZZZ></ZZZ>
</XXX>

<XXX>
    <YYY>White</YYY>
    <XXX>3.14159
        <ZZZ></ZZZ>
    </XXX>
</XXX>

And here is the briefly commented version:

(?s)                # dot matches anything
(?                  # start of the condition
                    # use lookahead to check if there is XXX block not containing <ZZZ
  (?=(?<LAMAIN><(?<LATAG>XXX)\b[^>]*+>(?>(?:(?!<\k<LATAG>\b)(?!</\k<LATAG>\b)(?!<ZZZ\b).)++|(?&LAMAIN))*+</\k<LATAG>>))
  (?!)              # XXX block without ZZZ found -> the pattern must fail
  |                 # else try to match XXX block not containing <VVV
  (?<MAIN><(?<TAG>XXX)\b[^>]*+>(?>(?:(?!<\k<TAG>\b)(?!</\k<TAG>\b)(?!<VVV\b).)++|(?&MAIN))*+</\k<TAG>>)
)                   # end of the condition
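
For readers who want to cross-check the result outside a regex engine with recursion support: Python's standard `re` module has no recursion such as (?&MAIN), so the sketch below swaps in a different technique, a simple depth counter over the XXX tags, and then applies the two conditions as plain searches. It assumes well-formed input without self-closing `<XXX/>` tags; the function names and the last two sample blocks are made up for illustration.

```python
import re

# Matches an opening or closing XXX tag; group 1 is "/" for a closing tag.
TAG = re.compile(r"<(/?)XXX\b[^>]*>")


def outermost_blocks(text):
    """Yield each outermost <XXX>...</XXX> block by tracking nesting depth."""
    depth = 0
    start = None
    for m in TAG.finditer(text):
        if m.group(1) == "":        # opening tag
            if depth == 0:
                start = m.start()
            depth += 1
        elif depth > 0:             # closing tag with a matching opener
            depth -= 1
            if depth == 0:
                yield text[start:m.end()]


def select(text):
    """Keep outermost blocks that contain <ZZZ on any level and no <VVV."""
    return [b for b in outermost_blocks(text)
            if re.search(r"<ZZZ\b", b) and not re.search(r"<VVV\b", b)]


# The two nested blocks from this post, plus two made-up blocks that should be rejected.
sample = """<XXX>
    <YYY>White</YYY>
    <XXX>3.14159
    </XXX>
    <ZZZ></ZZZ>
</XXX>

<XXX>
    <YYY>White</YYY>
    <XXX>3.14159
        <ZZZ></ZZZ>
    </XXX>
</XXX>
<XXX>
    <TTT>Never</TTT>
</XXX>
<XXX>
    <ZZZ>Blue</ZZZ>
    <VVV>High</VVV>
</XXX>
"""

selected = select(sample)
for block in selected:
    print(block)
    print("-----")
```

This is not a replacement for the regex when the goal is to select text inside the editor; it is only a way to check which outermost blocks should qualify.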

BR, Fleggy

LaurentG · Jan 27, 2018, 18:44 UTC · #11

Hi Fleggy

thank you for all your answers.
I'm afraid it is getting a little too complex for me...

Nevertheless, really interesting, thanks to your detailed explanations.

Regards
Laurent

fleggy · Jan 28, 2018, 08:17 UTC · #12

Hi LaurentG,

I must thank you because I do love regexp challenges and the problem of "the positive condition in nested blocks" was really challenging :)

Thanks, Fleggy