Searching a regex pattern excluding contents inside a tag?

    Sep 24, 2016 #1

    I'm trying to find lines in a file that start without a tag (i.e. lines that have for some reason been broken into two or more lines), using something like "^\w+". The search should ignore anything inside "<math><LaTeX>...</LaTeX></math>". Sample:

    <para>The EPUB specification does not enforce or suggest a particular DRM scheme.</para>
    <para>An ePub publication is delivered as a single file. This file is an unencrypted zipped archive containing
    a set of interrelated resources.</para>
    <para>Books with synchronized audio narration are created in EPUB 3 by using media overlay documents to describe SMIL.</para>
    <math><LaTeX>\begin{align*}
    whatever is written\\
    a=ba+g
    \end{align*}</LaTeX></math>
    <para>Anything goes....</para>
    <caption>MHTML – a webpage archive format used to combine resources 
    in a single document</caption>
    <para>Some random stuff.</para>
    <math><LaTeX>\begin{equation*}
    0=a\ fs
    \end{equation*}</LaTeX></math>

    The search result should find:
    "a" from the line a set of interrelated resources.</para>
    "in" from the line in a single document</caption>
    and not
    "whatever" from the line whatever is written\\
    "a" from the line a=ba+g
    "0" from the line 0=a\ fs

    Can this be done somehow using lookaheads and lookbehinds?

      Sep 25, 2016 #2

      Hi,

      As long as no nested tag exists between <LaTeX>...</LaTeX> this pattern should be OK:

      (?s)^[^<](?=[^<]++(?!</LaTeX>))

      Or, if a tag other than </math> can follow the tag </LaTeX>:

      (?s)^[^<](?=[^<]++(?!</LaTeX></math>))

      Or, if you want to select the whole text and not just the first character:

      (?s)^[^<]++(?!</LaTeX></math>)

      BR, Fleggy

        Sep 25, 2016 #3

        Thanks, Fleggy.
        Would you mind explaining how the regex works?

          Sep 25, 2016 #4

          Hi,

          I'll try to explain the simplest pattern :)

          (?s)^[^<]++(?!</LaTeX></math>)
          • (?s)
            '.' also matches CR/LF (not needed here, since this pattern contains no '.')
          • ^
            match the beginning of a line
          • [^<]++
            match possessively all characters which are not '<'
            the first + means one or more characters
            the second + means that the previous quantifier is possessive (keep the match even when any following tokens fail)
          • (?!</LaTeX></math>)
            a negative lookahead: the text that follows must not be the closing tags </LaTeX></math>
          And the variant:

          (?s)^[^<](?=[^<]++(?!</LaTeX></math>))

          is almost the same. Only the first character is matched (selected); the rest is checked inside a positive lookahead.

          BR, Fleggy

            Sep 25, 2016 #5

            Hi Don,

            If you expect operators such as <, >, <=, <>, >> or << in the text, then try this pattern:

            (?s)^(?!<[^>]+>)(?:.(?!<[^ <=>]+>))++.(?!</LaTeX></math>$)

            Or if you really want to select just the first character:

            (?s)^(?!<[^>]+>).(?=(?:.(?!<[^ <=>]+>))++.(?!</LaTeX></math>$))

            I added $ to check EOL. Maybe you won't need it.

            BR, Fleggy

              Sep 25, 2016 #6

              Thanks a lot again. :mrgreen:

                Jan 14, 2017 #7

                Hi fleggy,

                Can you tell me how this expression can be used in UltraEdit v14.10? The possessive quantifier "++" is not supported by the Perl regex engine of this version, so "(?s)^[^<]++(?!</LaTeX></math>)" produces an invalid regex message.

                  Jan 14, 2017 #8

                  Hi Don,

                  Sorry, I don't have UE14 to play with. Perhaps this will work:

                  (?s)^(?!<[^>]+>)(?>(?:.(?!<[^ <=>]+>))+).(?!</LaTeX></math>$)

                  I replaced the possessive quantifier with an atomic group, and it works on your short sample.
                  Use the following rule to convert any other Perl search expression containing ++:
                  X++ -> (?>X+)
                  e.g.
                  (?s)^[^<]++(?!</LaTeX></math>)
                  ->
                  (?s)^(?>[^<]+)(?!</LaTeX></math>)

                  BR, Fleggy