Finding particular number ranges inside all paragraphs

Advanced User

    Jul 07, 2016#1

    I want to find all matches of [0-9]+-[0-9]+ inside all <p>...</p> elements.
    Sample:

    Code:

    <PUI>1236-2000</PUI>
    <DOI>14-2330</DOI>
    .....
    ....
    <p>A free encyclopedia built collaboratively using wiki software. (Creative Commons Attribution-ShareAlike License) pp.26-35.</p>
    <fn><p>Wikipedia is owned by an American organization, Wikimedia Foundation, which is in ... Wikipedia has a standard page layout for all pages in the encyclopedia. 203-209 dsfds sdfdsfdsf ds</p></fn>
    I'm looking for a search expression that finds 26-35 and 203-209 inside the paragraphs of the sample above.
    I tried "<p>*[0-9]+-[0-9]+" but it's not working. What am I doing wrong?

    Master

      Jul 07, 2016#2

      Hi, try this Perl regex:

      <p>\D*+\K\d+-\d+(?=\D*</p>)

      BR, Tom
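A minimal sketch of how the pattern above behaves, using Python's `re` module as a stand-in for a Perl-style engine (an assumption; this is not UltraEdit's engine). Python's `re` has no `\K`, so the number range is captured in a group instead, and the possessive `\D*+` is written as a plain `\D*`. The name `find_single_range` and the shortened sample text are illustrative only.

```python
import re

# Python's re module has no \K, so the range is captured in group 1 instead.
# The possessive \D*+ from the post is written as a plain \D* here.
PATTERN = re.compile(r"<p>\D*(\d+-\d+)(?=\D*</p>)")


def find_single_range(text):
    """Return number ranges that are the only digits inside a <p>...</p>."""
    return PATTERN.findall(text)


# Shortened version of the sample from the first post (hypothetical wording).
sample = (
    "<PUI>1236-2000</PUI>\n"
    "<DOI>14-2330</DOI>\n"
    "<p>A free encyclopedia built using wiki software. pp.26-35.</p>\n"
    "<fn><p>Wikipedia has a standard page layout. 203-209 dsfds</p></fn>\n"
)

print(find_single_range(sample))
```

Because `\D*` cannot cross a digit, a paragraph that contains further digits before or after the range is skipped in this sketch, so it suits only paragraphs with a single number range.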

      Grand Master

        Jul 07, 2016#3

        It looks like you are using the legacy UltraEdit regular expression engine. For a search string like the one you tried, this engine is not as clearly defined as the modern Perl regular expression engine. So what is the problem?

        * matches ANY character EXCEPT carriage return and line-feed 0 or more times.

        [0-9]+ matches ANY digit 1 or more times.

        The problem: the character range ANY digit is also contained in the character range ANY character.

        So we have a character range with an undetermined number of repeats, followed by a second character range that is contained in the first one and also has an undetermined number of repeats.

        How should the expression engine know where to stop matching characters for the first character range and start matching characters for the second one, when every character of the second range also belongs to the first?

        A software programmer calls such a situation undefined behavior, which means the result is unpredictable.

        There is another way to match ANY character EXCEPT carriage return and line-feed 0 or more times with the UE regular expression engine: ?++

        So what happens when using <p>?++[0-9]+-[0-9]+ instead?

        Well, it looks like the still undefined behavior works for this example. But does it really work?

        What about two or even more number ranges within a paragraph, i.e. something like:

        Code:

        <p>See the pages 20-35, the tables 1-3 and the figures 3-5.</p><p>And take note of the comments 5-9.</p>
        Most users would expect a match of everything from the paragraph element to 20-35. But this search string matches everything from the first paragraph element to 5-9 in the second paragraph. Perl regular expression documentation describes this matching behavior as greedy: ?++ matches as much as possible while still producing a positive match.
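The greedy behavior described here can be reproduced with a Perl-style engine. The following is a rough sketch using Python's `re` module, where `.*` plays the role of the greedy UE expression `?++` and `.*?` is the non-greedy counterpart; the mapping between the two engines is an assumption made for illustration, and the variable names are hypothetical.

```python
import re

line = (
    "<p>See the pages 20-35, the tables 1-3 and the figures 3-5.</p>"
    "<p>And take note of the comments 5-9.</p>"
)

# Greedy: .* grabs as much as possible and gives back only what is needed
# for the number range to match, so the match ends at the last range.
greedy = re.search(r"<p>.*[0-9]+-[0-9]+", line).group(0)

# Non-greedy: .*? grabs as little as possible, so the match ends at the
# first number range after the opening tag.
lazy = re.search(r"<p>.*?[0-9]+-[0-9]+", line).group(0)

print(greedy)
print(lazy)
```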

        Let us look at a simple example of the difference between greedy and non-greedy matching with the UltraEdit regular expression engine, where greedy and non-greedy matching cannot really be controlled as in Perl. In a file there is a line with a file name with full path.

        Code:

        C:\Temp\Test\Example.txt
        The UE regex search string [A-Z]:\*\ matches just C:\Temp\ which is non-greedy, whereas [A-Z]:\?++\ matches C:\Temp\Test\ which is greedy.

        But here there is a fixed string consisting of a single character, the backslash, which defines where to stop matching any character except newline characters 0 or more times. Therefore both expressions work and match something.

        The UE regex search string <p>*[0-9]+-[0-9]+ is different, as there is no fixed string after * which determines the stop condition for matching any character except newline characters 0 or more times non-greedy.
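For comparison, the same path example can be sketched with a Perl-style engine, here Python's `re`, where greediness is chosen explicitly with `.*?` and `.*`. These are Perl-style analogs of the two UE expressions, not the UE expressions themselves, and the variable names are illustrative.

```python
import re

path = r"C:\Temp\Test\Example.txt"

# Non-greedy: stop at the first backslash after the drive prefix.
lazy = re.search(r"[A-Z]:\\.*?\\", path).group(0)

# Greedy: run to the last backslash on the line.
greedy = re.search(r"[A-Z]:\\.*\\", path).group(0)

print(lazy)
print(greedy)
```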

        But let us look at the example above with the two paragraphs containing 4 number ranges in total. What should be matched?
        • Everything from the beginning of each paragraph to the first number range in each paragraph,
        • everything from the beginning of each paragraph to the last number range in each paragraph, or
        • everything from the beginning of the first paragraph to the last number range in any paragraph on the line in the file?
        Well, most likely it would be best to match just each number range within a paragraph, but this is very tricky, as can be seen in this example.

          Master

          Jul 08, 2016#4

          Well, I didn't think about it that much. If I assume that the format is correct (<p> and </p> always come in pairs), then this Perl regex could work:

          \d+-\d+(?=.*?</p>)

          EDIT:
          I know it is not correct. Research in progress... :)

          EDIT2:
          I think the ideal solution needs a variable-length lookbehind, which UE does not support. But this pattern should be quite usable:

          It finds a number-number that is followed by </p>, with no <p> between the number range and that </p>:

          \d+-\d+(?=(?:.(?<!<p>))*?</p>)

          or maybe better
          \d+-\d+(?=(?>.(?<!<p>))*?</p>)
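As a quick check of the first of these two patterns, here is a small sketch in Python's `re` module, used only as a stand-in for a Perl-style engine (an assumption; UltraEdit's Perl engine is not being tested here). The atomic-group variant is left out because `re` supports `(?>...)` only from Python 3.11 on. The function name `ranges_in_paragraphs` and the test strings are hypothetical, and the pattern still assumes that `<p>` and `</p>` always come in pairs.

```python
import re

# After each character consumed by the lookahead, (?<!<p>) rejects the path
# if that character completed a "<p>" tag, so the lookahead cannot run past
# the start of a new paragraph before it reaches "</p>".
IN_PARAGRAPH = re.compile(r"\d+-\d+(?=(?:.(?<!<p>))*?</p>)")


def ranges_in_paragraphs(text):
    """Return every number range that is followed by </p> with no <p> between."""
    return IN_PARAGRAPH.findall(text)


two_paragraphs = (
    "<p>See the pages 20-35, the tables 1-3 and the figures 3-5.</p>"
    "<p>And take note of the comments 5-9.</p>"
)
one_line = "<DOI>14-2330</DOI><p>pp.26-35.</p> 10-20 <p>b 3-4</p>"

print(ranges_in_paragraphs(two_paragraphs))
print(ranges_in_paragraphs(one_line))
```

In this sketch every range inside a paragraph is reported individually, while ranges before the first `<p>` or between two paragraphs are ignored.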