User to user discussion and support for UltraEdit, UEStudio, UltraCompare, and other IDM applications.

Find, replace, find in files, replace in files, regular expressions
11 posts Page 1 of 1
I have some files that contain one or more <aff> tags. Each <aff> can contain one or more other tags such as <institution type="institution">, <institution type="department">, <city>, <country>, <sup>, etc. The rule for these files is that each group of <institution type="..."> tags in an <aff> must be inside another tag, namely <institution-wrap>. Some files have this tag and some don't. How do I find the groups that are missing it?
For example, here is a sample text:

<aff id="aff1">
<institution type="institution">NSF</institution>
<institution type="department">Dept. of History</institution>
<city>New York</city>
<country>USA</country>
</aff>

<aff id="aff2"><sup>&#x2020;</sup>
<institution-wrap>
<institution type="institution">NSF</institution>
<institution type="department">Dept. of History</institution>
</institution-wrap>
<city>New York</city>
<country>USA</country>
</aff>

<aff id="aff3">
<institution-wrap>
<institution type="division">NASA</institution>
</institution-wrap>
<city>New York</city>
<country>USA</country>
</aff>

<aff id="aff4">
<sup>1</sup>
<institution type="division">Caltech</institution>
</aff>

The search should find <aff id="aff1"> and <aff id="aff4">, or something similar, in the sample above.

NOTE: Some <aff> tags might not contain any <institution type="..."> at all; those should be ignored. Other tags may also contain <institution type="...">, but we only want to look inside the <aff> tags.
Hi Don,

I see a very simple pattern in your sample. No recursion this time :)

(?<!</institution>\r\n)(?<!<institution-wrap>\r\n)<institution type
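
As a quick cross-check outside UltraEdit, the same expression can be run with Python's re module, which also supports fixed-width negative lookbehinds. This is only a sketch: the shortened sample, the CR/LF line endings, and the helper name unwrapped_lines are assumptions made for the illustration, and Python's engine is not the one UltraEdit uses.

```python
import re

# Shortened version of the sample from the first post, joined with CR/LF
# because the lookbehinds expect exactly \r\n before each tag.
SAMPLE = "\r\n".join([
    '<aff id="aff1">',
    '<institution type="institution">NSF</institution>',
    '<institution type="department">Dept. of History</institution>',
    '<city>New York</city>',
    '</aff>',
    '<aff id="aff2">',
    '<institution-wrap>',
    '<institution type="institution">NSF</institution>',
    '<institution type="department">Dept. of History</institution>',
    '</institution-wrap>',
    '</aff>',
    '<aff id="aff4">',
    '<sup>1</sup>',
    '<institution type="division">Caltech</institution>',
    '</aff>',
]) + "\r\n"

# A tag is reported only if the previous line is neither another
# </institution> line nor the opening <institution-wrap> line.
PATTERN = re.compile(
    r'(?<!</institution>\r\n)(?<!<institution-wrap>\r\n)<institution type'
)

def unwrapped_lines(text):
    """Return the full line of every match (hypothetical helper)."""
    hits = []
    for m in PATTERN.finditer(text):
        end = text.find("\r\n", m.start())
        hits.append(text[m.start():end])
    return hits

hits = unwrapped_lines(SAMPLE)
for line in hits:
    print(line)
```

With this input only the first <institution type> line of aff1 and the single one in aff4 are reported; the wrapped group in aff2 is skipped.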

BR, Fleggy

EDIT:
If your files are as clean as your sample then you can "fix" missing tags with this replace:

F: (?<!</institution>\r\n)(?<!<institution-wrap>\r\n)((?:<institution type.+</institution>\r\n)+)
R: <institution-wrap>\r\n\1</institution-wrap>\r\n

I would suggest adding some "salt" to the replace expression so that you can easily find the added tags and check them. You can then remove the "salt" with another simple replace.
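
The "salt" idea can be sketched in Python as follows. The marker @@ is a hypothetical choice (any string that never occurs in the files would do), and the sample text is a reduced version of aff1 with CR/LF line endings.

```python
import re

FIND = re.compile(
    r'(?<!</institution>\r\n)(?<!<institution-wrap>\r\n)'
    r'((?:<institution type.+</institution>\r\n)+)'
)

# Hypothetical salt: "@@" is assumed not to occur anywhere in the files.
SALT = "@@"
SALTED_REPLACE = (
    SALT + r'<institution-wrap>\r\n\1' + SALT + r'</institution-wrap>\r\n'
)

text = "\r\n".join([
    '<aff id="aff1">',
    '<institution type="institution">NSF</institution>',
    '<institution type="department">Dept. of History</institution>',
    '<city>New York</city>',
    '</aff>',
]) + "\r\n"

# Step 1: add the wrapper tags, each marked with the salt for review.
salted, groups_wrapped = FIND.subn(SALTED_REPLACE, text)

# Step 2: after checking the marked places, strip the salt again.
cleaned = salted.replace(SALT, "")

# Running the search again on the cleaned text should change nothing.
_, second_pass = FIND.subn(SALTED_REPLACE, cleaned)
```

Searching for the salt string after step 1 lists exactly the places where a tag was added, which makes a manual review practical before step 2.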
Hi fleggy, the pattern does not work for me. I think you misunderstood the problem.
It simply finds all <institution type="..."> tags, even those that are inside <institution-wrap>....</institution-wrap>.
I only want to find those <institution type="..."> tags that are not inside an <institution-wrap>....</institution-wrap> within an <aff>....</aff>.
Hi Don,

It works on your sample with CR/LF line delimiters. Perhaps your real files use only \r as the line delimiter. I used the fixed-length delimiter \r\n because a lookbehind must have a fixed length.
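
The effect of a different line delimiter can be reproduced with a small Python sketch. It assumes a single, already wrapped <aff> block, so a correct search should report nothing; the helper names count_hits and to_crlf are made up for this illustration.

```python
import re

PATTERN = re.compile(
    r'(?<!</institution>\r\n)(?<!<institution-wrap>\r\n)<institution type'
)

# One correctly wrapped block: the expected number of hits is zero.
WRAPPED_LINES = [
    '<aff id="aff3">',
    '<institution-wrap>',
    '<institution type="division">NASA</institution>',
    '</institution-wrap>',
    '</aff>',
]

def count_hits(eol):
    """Join the sample with the given line delimiter and count matches."""
    text = eol.join(WRAPPED_LINES) + eol
    return len(PATTERN.findall(text))

def to_crlf(text):
    """Normalize CR-only or LF-only line endings to CR/LF."""
    return re.sub(r'\r\n|\r|\n', '\r\n', text)

crlf_hits = count_hits("\r\n")  # lookbehind sees <institution-wrap>\r\n
lf_hits = count_hits("\n")      # lookbehind text never ends in \r\n
cr_hits = count_hits("\r")      # same problem with CR-only files
fixed_hits = len(PATTERN.findall(to_crlf("\n".join(WRAPPED_LINES) + "\n")))

print(crlf_hits, lf_hits, cr_hits, fixed_hits)
```

With LF-only or CR-only text neither lookbehind can ever match, so every <institution type> tag is reported, including wrapped ones. Converting the file to CR/LF first restores the intended behavior.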

BR, Fleggy
Hi fleggy,

It seems that your pattern works as desired in newer versions of UE. However, in the older versions 14.10 and 18.20 that I use it does not, and I'm not quite sure why. Negative lookbehinds are supported in these versions, but they do not work as expected in this case.
Are there any alternative methods to do it? :)

Thanks.
Hi Don,

I don't have such an old version. BTW I got an idea: try replacing \r\n in the lookbehinds with a real line terminator in the Find what: input field. In other words, use this multi-line pattern
(?<!</institution>
)(?<!<institution-wrap>
)((?:<institution type.+</institution>\r\n)+)

instead of
(?<!</institution>\r\n)(?<!<institution-wrap>\r\n)((?:<institution type.+</institution>\r\n)+)

BR, Fleggy
Next variant without CR/LF in lookbehinds:

F: (?<!</institution>)(?<!<institution-wrap>)((?:\r\n<institution type.+</institution>)+)
R: \r\n<institution-wrap>\1\r\n</institution-wrap>

This should work even in UE18. I cannot find a simpler pattern, only more complicated ones... :/

BTW I overlooked that some <institution type> tags might not be inside <aff>....</aff>. My pattern doesn't test for that. I am afraid that a "complete" pattern would be too complex for UE18.
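
If a single regex becomes too complex, the <aff> restriction can be handled in two steps by a small script instead. The following Python sketch is not an UltraEdit feature. It assumes that every <aff> has an id attribute, that <aff> and <institution-wrap> elements are never nested inside themselves, and the function name affs_with_unwrapped is hypothetical.

```python
import re

# Step 1: isolate each <aff ...>...</aff> block and capture its id.
AFF = re.compile(r'<aff\b[^>]*\bid="([^"]*)"[^>]*>(.*?)</aff>', re.DOTALL)
# Step 2: within a block, drop everything that is already wrapped.
WRAP = re.compile(r'<institution-wrap>.*?</institution-wrap>', re.DOTALL)

def affs_with_unwrapped(text):
    """Return ids of <aff> blocks with an <institution> outside a wrapper."""
    ids = []
    for m in AFF.finditer(text):
        remainder = WRAP.sub("", m.group(2))
        if "<institution type" in remainder:
            ids.append(m.group(1))
    return ids

sample = "\n".join([
    '<contrib><institution type="institution">Ignored</institution></contrib>',
    '<aff id="aff1">',
    '<institution type="institution">NSF</institution>',
    '<city>New York</city>',
    '</aff>',
    '<aff id="aff3">',
    '<institution-wrap>',
    '<institution type="division">NASA</institution>',
    '</institution-wrap>',
    '</aff>',
    '<aff id="aff4">',
    '<sup>1</sup>',
    '<institution type="division">Caltech</institution>',
    '</aff>',
    '<aff id="aff5"><city>Paris</city></aff>',
])

result = affs_with_unwrapped(sample)
print(result)
```

Because the check works on whole blocks rather than on neighboring lines, it does not depend on the line delimiter and ignores <institution type> tags outside <aff>.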
This is a reply to Fleggy's first post and to the post directly above:

UltraEdit v14.10 and v18.20 do not support lookbehind and lookahead across lines. So the search string

(?<!</institution>\r\n)(?<!<institution-wrap>\r\n)<institution type

must be modified to

(?<!</institution>)(?<!<institution-wrap>)\r\n<institution type

Of course, the line ending before <institution is then also matched by the search expression, but this is easily compensated for in the replace string by adding \r\n at the beginning.

You had already recognized that, Fleggy, without being able to test your last suggested expression in those old versions of UltraEdit. However, your last suggestion also does not work there, because of the \r\n at the beginning of the non-marking group within the marking group. UE v14.10 and v18.20 process text files mainly line by line, so Perl regular expressions must also search line by line. Only multi-line expressions that match something in a line followed by its line termination work correctly, not expressions that match a line termination followed by something (optional) on the next line.

The following Perl regular expression search and replace strings also work with UE v14.10 and v18.20:

F: (?<!</institution>)(?<!<institution-wrap>)\r\n((?:<institution type.+</institution>\r\n)+)
R: \r\n<institution-wrap>\r\n\1</institution-wrap>\r\n
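
Python cannot reproduce the line-based behavior of the old UltraEdit versions, but it can at least confirm that moving \r\n out of the lookbehinds produces the same result as the original expression. A minimal sketch, assuming CR/LF input and a shortened sample:

```python
import re

# Original variant: line ending inside the lookbehinds.
ORIGINAL_F = (r'(?<!</institution>\r\n)(?<!<institution-wrap>\r\n)'
              r'((?:<institution type.+</institution>\r\n)+)')
ORIGINAL_R = r'<institution-wrap>\r\n\1</institution-wrap>\r\n'

# Line-based variant: the preceding \r\n is matched and written back.
LINE_BASED_F = (r'(?<!</institution>)(?<!<institution-wrap>)\r\n'
                r'((?:<institution type.+</institution>\r\n)+)')
LINE_BASED_R = r'\r\n<institution-wrap>\r\n\1</institution-wrap>\r\n'

text = "\r\n".join([
    '<aff id="aff1">',
    '<institution type="institution">NSF</institution>',
    '<institution type="department">Dept. of History</institution>',
    '</aff>',
    '<aff id="aff2">',
    '<institution-wrap>',
    '<institution type="institution">NSF</institution>',
    '</institution-wrap>',
    '</aff>',
    '<aff id="aff4">',
    '<sup>1</sup>',
    '<institution type="division">Caltech</institution>',
    '</aff>',
]) + "\r\n"

out_original = re.sub(ORIGINAL_F, ORIGINAL_R, text)
out_line_based = re.sub(LINE_BASED_F, LINE_BASED_R, text)

print(out_original == out_line_based)
print(out_line_based.count("<institution-wrap>"))
```

One difference remains: the line-based variant needs a line ending before the first <institution type> line, so a group on the very first line of a file would not be matched.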
Best regards from Austria
Thank you both, fleggy and Mofi :)
Hi Don,

I am surprised by the limits of UE 18 regarding Perl regexes. If I may, I would suggest updating to at least UE 19. Since that version the Perl regex engine has been much more powerful.

BR, Fleggy

PeM
Thank you, Mofi, for your analysis.
Hi fleggy,

I want to upgrade to the newest version, but I'm facing some major issues (for my work at least) in the newer versions.

  1. Opening a file takes 2-3 seconds and sometimes even longer.
  2. When searching with "List Lines Containing String", the list appears only 2-3 seconds after I hit Enter.
  3. The tag list does not remember the last tag I used and always jumps to the top of the list.
That is why I'm using such an old version of UE (v14.10) at work, even though I know it does not support many Perl regex features. Even UE v18.20 has the issues above.
Thanks for your suggestion though. :)