Searching whether a tag is within another tag in files?

    Sep 05, 2017 #1

    I have some files that each contain one or more <aff> tags. Each <aff> can contain one or more other tags such as <institution type="institution">, <institution type="department">, <city>, <country>, <sup>, etc. The rule for these files is that each group of <institution type="..."> tags in an <aff> must be inside another tag, namely <institution-wrap>. Some files have this wrapper tag and some don't. How do I find the places where it is missing?
    For example, here is a sample text:

    Code:

    <aff id="aff1">
    <institution type="institution">NSF</institution>
    <institution type="department">Dept. of History</institution>
    <city>New York</city>
    <country>USA</country>
    </aff>
    
    <aff id="aff2"><sup>&#x2020;</sup>
    <institution-wrap>
    <institution type="institution">NSF</institution>
    <institution type="department">Dept. of History</institution>
    </institution-wrap>
    <city>New York</city>
    <country>USA</country>
    </aff>
    
    <aff id="aff3">
    <institution-wrap>
    <institution type="division">NASA</institution>
    </institution-wrap>
    <city>New York</city>
    <country>USA</country>
    </aff>
    
    <aff id="aff4">
    <sup>1</sup>
    <institution type="division">Caltech</institution>
    </aff>
    
    For the sample file above, the search should find <aff id="aff1"> and <aff id="aff4">, or something similar.

    NOTE: Some <aff> tags might not contain any <institution type="..."> at all; those should be ignored. Other tags may also contain <institution type="...">, but we only want to look inside the <aff> tags.
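
To make the requirement concrete, the same check can be sketched outside UltraEdit. The Python below is only a rough illustration, not an UltraEdit feature. The function name affs_missing_wrap is made up for this example, and the sketch assumes that every <aff> has an id attribute, that <aff> elements are not nested, and that every <institution-wrap> is properly closed.

```python
import re

# One <aff ...>...</aff> block; group 1 is the id, group 2 the body.
AFF_RE = re.compile(r'<aff\b[^>]*\bid="([^"]*)"[^>]*>(.*?)</aff>', re.DOTALL)
# A complete <institution-wrap>...</institution-wrap> group.
WRAP_RE = re.compile(r'<institution-wrap\b[^>]*>.*?</institution-wrap>', re.DOTALL)
# An <institution type=...> opening tag (does not match <institution-wrap>).
INST_RE = re.compile(r'<institution\s+type=')


def affs_missing_wrap(text):
    """Return the ids of <aff> blocks that hold an unwrapped <institution type=...>."""
    missing = []
    for aff_id, body in AFF_RE.findall(text):
        # Remove the wrapped groups; any institution tag left over is unwrapped.
        unwrapped = WRAP_RE.sub("", body)
        if INST_RE.search(unwrapped):
            missing.append(aff_id)
    return missing


# Shortened version of the sample from this post.
sample = """<aff id="aff1">
<institution type="institution">NSF</institution>
<city>New York</city>
</aff>
<aff id="aff2">
<institution-wrap>
<institution type="institution">NSF</institution>
</institution-wrap>
</aff>
<aff id="aff3">
<institution-wrap>
<institution type="division">NASA</institution>
</institution-wrap>
</aff>
<aff id="aff4">
<sup>1</sup>
<institution type="division">Caltech</institution>
</aff>
"""

print(affs_missing_wrap(sample))  # ['aff1', 'aff4']
```

An <aff> without any <institution type="..."> is skipped automatically, and text outside <aff> is never inspected, which matches the NOTE above.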

      Sep 05, 2017 #2

      Hi Don,

      I see a very simple pattern in your sample. No recursion this time :)

      (?<!</institution>\r\n)(?<!<institution-wrap>\r\n)<institution type

      BR, Fleggy

      EDIT:
      If your files are as clean as your sample then you can "fix" missing tags with this replace:

      F: (?<!</institution>\r\n)(?<!<institution-wrap>\r\n)((?:<institution type.+</institution>\r\n)+)
      R: <institution-wrap>\r\n\1</institution-wrap>\r\n

      I would suggest adding some "salt" (a unique marker string) to the replace expression so that you can easily find the added tags and check them. Then you can remove the "salt" with another simple replace.
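
For readers who want to try the "salt" idea outside UltraEdit, here is a minimal Python sketch of the find/replace above. It is an illustration under assumptions, not the UltraEdit procedure itself: the marker <!--ADDED--> and the function names add_wraps and remove_salt are invented for this example, and the input is assumed to use CR/LF line terminators, as the pattern requires. The marker is placed in front of the opening tag so that the lookbehind on <institution-wrap>\r\n still recognizes the salted wrapper.

```python
import re

# Same search expression as above; Python's re accepts it because both
# lookbehinds have a fixed length.
FIND = re.compile(
    r'(?<!</institution>\r\n)(?<!<institution-wrap>\r\n)'
    r'((?:<institution type.+</institution>\r\n)+)'
)

SALT = "<!--ADDED-->"  # hypothetical marker, choose any string not in the files


def add_wraps(text, salt=SALT):
    """Wrap each unwrapped group of institution lines and mark it with salt."""
    def wrap(match):
        return (salt + "<institution-wrap>\r\n"
                + match.group(1)
                + "</institution-wrap>\r\n")
    return FIND.sub(wrap, text)


def remove_salt(text, salt=SALT):
    """Second pass: strip the marker once the added tags have been checked."""
    return text.replace(salt, "")


text = ('<aff id="aff1">\r\n'
        '<institution type="institution">NSF</institution>\r\n'
        '<city>New York</city>\r\n'
        '</aff>\r\n')

salted = add_wraps(text)
print(salted.count(SALT))   # number of wrappers that were added
print(remove_salt(salted))
```

Searching for the marker lists every added wrapper for review before the second pass removes it.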

        Sep 06, 2017 #3

        Hi fleggy, but the pattern does not work. I think you did not understand the problem.
        It simply finds all <institution type="..."> tags, even those that are inside <institution-wrap>....</institution-wrap>.
        I only want to find those <institution type="..."> tags that are not inside an <institution-wrap>....</institution-wrap> in an <aff>....</aff>.

          Sep 06, 2017 #4

          Hi Don,

          It works on your sample when the file uses CR/LF line terminators. Perhaps your real files contain only \r line terminators. I used the fixed-length terminator \r\n because a lookbehind must have a fixed length.
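
One way to check which line terminators a file really contains is to count them in the raw bytes. The following Python is a small sketch for that purpose; the function name count_line_endings is made up for this example and is not part of UltraEdit.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, lone CR and lone LF terminators in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "CR": data.count(b"\r") - crlf,  # \r not followed by \n
        "LF": data.count(b"\n") - crlf,  # \n not preceded by \r
    }


# Example with mixed terminators (made-up data).
print(count_line_endings(b"a\r\nb\nc\rd\r\n"))  # {'CRLF': 2, 'CR': 1, 'LF': 1}
```

For a real file, read it in binary mode (for example open(path, "rb").read()) so that Python does not translate the line terminators before they are counted.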

          BR, Fleggy

            Sep 06, 2017 #5

            Hi fleggy,

            It seems that your pattern works as desired in the newer versions of UE. However, in the older versions 14.10 and 18.20 that I use, it does not, and I'm not quite sure why. Negative lookbehinds are supported in the versions I'm using, but for some reason they do not work as expected in this case.
            Are there any alternative methods to do it? :)

            Thanks.

              Sep 06, 2017 #6

              Hi Don,

              I don't have such an old version. BTW I got an idea: try replacing \r\n in the lookbehinds with a real line terminator typed into the Find what: input field. In other words, use this multi-line pattern
              (?<!</institution>
              )(?<!<institution-wrap>
              )((?:<institution type.+</institution>\r\n)+)

              instead of
              (?<!</institution>\r\n)(?<!<institution-wrap>\r\n)((?:<institution type.+</institution>\r\n)+)

              BR, Fleggy

                Sep 06, 2017 #7

                Next variant without CR/LF in lookbehinds:

                F: (?<!</institution>)(?<!<institution-wrap>)((?:\r\n<institution type.+</institution>)+)
                R: \r\n<institution-wrap>\1\r\n</institution-wrap>

                This should work even in UE18. I cannot find a simpler pattern. Only overcomplicated... :/

                BTW I overlooked that some <institution type> tags might not be inside <aff>....</aff>. My pattern doesn't test for that. I am afraid that the "complete" pattern would be too complex for UE18.
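
Outside UltraEdit, the missing <aff> test can be added by applying the replace only to the text between <aff ...> and </aff>. The Python below is a sketch of that two-step idea using the variant without CR/LF in the lookbehinds; it is not something UE18 can run. The function name wrap_in_affs and the <contrib> tag in the example are hypothetical, and the sketch assumes <aff> elements are not nested and the text uses CR/LF line terminators.

```python
import re

# Opening tag, body and closing tag of one <aff> element.
AFF_RE = re.compile(r'(<aff\b[^>]*>)(.*?)(</aff>)', re.DOTALL)

# Variant from this post: the line break is matched in front of each tag.
GROUP_RE = re.compile(
    r'(?<!</institution>)(?<!<institution-wrap>)'
    r'((?:\r\n<institution type.+</institution>)+)'
)


def wrap_in_affs(text):
    """Add <institution-wrap> around unwrapped groups, but only inside <aff>."""
    def fix_aff(aff):
        open_tag, body, close_tag = aff.groups()
        body = GROUP_RE.sub(
            lambda g: "\r\n<institution-wrap>" + g.group(1) + "\r\n</institution-wrap>",
            body,
        )
        return open_tag + body + close_tag
    return AFF_RE.sub(fix_aff, text)


text = ('<contrib>\r\n'
        '<institution type="institution">MIT</institution>\r\n'
        '</contrib>\r\n'
        '<aff id="aff4">\r\n'
        '<sup>1</sup>\r\n'
        '<institution type="division">Caltech</institution>\r\n'
        '</aff>\r\n')

print(wrap_in_affs(text))
```

In this example the institution inside the hypothetical <contrib> element is left alone, while the one inside <aff id="aff4"> gets its wrapper.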

                  Sep 07, 2017 #8

                  This is a reply to Fleggy's first post and to the post directly above this one:

                  UltraEdit v14.10 and v18.20 do not support lookbehind and lookahead across line boundaries. So the search string

                  (?<!</institution>\r\n)(?<!<institution-wrap>\r\n)<institution type

                  must be modified to

                  (?<!</institution>)(?<!<institution-wrap>)\r\n<institution type

                  Of course, the line ending before <institution is then also matched by the search expression, but this can easily be compensated for in the replace string by adding \r\n at the beginning.

                  You had already recognized that, Fleggy, even though you could not test your last suggested expression with those old versions of UltraEdit. However, your last suggestion also does not work, because of the \r\n at the beginning of the non-marking group inside the marking group. UE v14.10 and v18.20 process text files mainly line by line, so the Perl regular expressions used must also search line by line. Only multi-line expressions that match something in a line followed by its line termination work correctly; expressions that match a line termination followed by something (optional) on the next line do not.

                  Perl regular expression search and replace strings that also work with UE v14.10 and v18.20:

                  F: (?<!</institution>)(?<!<institution-wrap>)\r\n((?:<institution type.+</institution>\r\n)+)
                  R: \r\n<institution-wrap>\r\n\1</institution-wrap>\r\n
                  Best regards from an UC/UE/UES for Windows user from Austria
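
The modified search string from this post can also be tried in Python to report the line numbers of unwrapped tags without changing the text. This is only a sketch for experimenting with the expression: the function name unwrapped_line_numbers is made up, the text is assumed to use CR/LF line terminators, and a tag on the very first line is never reported because no line break precedes it.

```python
import re

# Search string from this post: the line break is matched after the lookbehinds.
SEARCH = re.compile(
    r'(?<!</institution>)(?<!<institution-wrap>)\r\n<institution type'
)


def unwrapped_line_numbers(text):
    """Return 1-based line numbers of lines that start an unwrapped group."""
    numbers = []
    for match in SEARCH.finditer(text):
        # The match starts at the line break, so the tag is on the next line.
        numbers.append(text.count("\n", 0, match.start()) + 2)
    return numbers


text = "\r\n".join([
    '<aff id="aff1">',
    '<institution type="institution">NSF</institution>',
    '<institution type="department">Dept. of History</institution>',
    '</aff>',
    '<aff id="aff2">',
    '<institution-wrap>',
    '<institution type="institution">NSF</institution>',
    '</institution-wrap>',
    '</aff>',
    '',
])

print(unwrapped_line_numbers(text))  # [2]
```

Only line 2 is reported: line 3 follows </institution> and line 7 follows <institution-wrap>, so both lookbehinds reject them.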

                    Sep 07, 2017 #9

                    Thank you both, fleggy and Mofi :)

                      Sep 07, 2017 #10

                      Hi Don,

                      I am surprised by the limits UE 18 has regarding Perl regexes. If I may, I would suggest that you update to at least UE 19. Since that version, the Perl regex engine has been much more powerful.

                      BR, Fleggy

                      PeM
                      Thank you, Mofi, for your analysis.

                        Sep 08, 2017 #11

                        Hi fleggy,

                        I want to upgrade to the newest version, but I'm facing some major issues (for my work at least) in the newer versions.
                        1. Opening a file takes 2-3 seconds and sometimes even longer.
                        2. When searching with "List lines containing string", the list appears only 2-3 seconds after I hit Enter.
                        3. The tag list does not remember the last tag I used and always jumps to the top of the list.
                        That is why I'm using such an old version of UE (v14.10) for my work, and I know it does not support a lot of Perl regex features. Even UE v18.20 has the issues listed above.
                        Thanks for your suggestion though. :)