XML search for a key tag, then within that node, find another

UltraNewbie1105 · Post #1 · May 02, 2025, 13:31 +00:00

I'm looking for help with finding results across many files (and then extracting them into an output file) as follows:

Identify a particular XML element by a specific content value like <pet>Dog</pet>
Then, within the parent element that contains it, look for another element like <breed>DogBreed</breed>, where I just want to pull out the actual content of that element, regardless of what the content value is.

The XML structure would look like this:

Code:

<vetPatient>
  <pet>Dog</pet>
  <breed>AnyBreed</breed>
</vetPatient>

Ideally, I'd like to do this over hundreds of XML files and output to a .csv. Any help from the hive?

Mofi · Post #2 · May 02, 2025, 17:51 +00:00

Let us assume that

the element pet exists always only within the element vetPatient and
the element pet exists always just once within the element vetPatient and
the element breed also exists always only within the element vetPatient and
the element breed also exists always just once within the element vetPatient and
the element breed is always below element pet within the element vetPatient.

In this case, first create a copy of the directory containing all the XML files to search for the data.

Next, run a Perl regular expression Replace in Files with the option Match case checked on the copy of the directory, searching in all *.xml files for a string matching the regular expression (<pet>Dog</pet>)(?:[\s\S](?!</vetPatient>))*?(<breed>.+?</breed>) and using the expression \1\2 as the replace expression.

The case-sensitive Perl regular expression Replace in Files changes XML content like

Code:

<vetPatient>
  <pet>Dog</pet>
  <breed>AnyBreed</breed>
</vetPatient>
<vetPatient>
  <pet>Other</pet>
  <breed>AnyBreed</breed>
</vetPatient>
<vetPatient>
  <pet>Dog</pet><breed>AnyBreed 2</breed>
</vetPatient>
<vetPatient>
  <pet>Dog</pet>
  <breed></breed>
</vetPatient>
<vetPatient>
  <pet>Dog</pet>
</vetPatient>
<vetPatient>
  <pet>Dog</pet>
  <other>whatever</other>
  <breed>AnyBreed 3</breed>
</vetPatient>

to the following XML content:

Code:

<vetPatient>
  <pet>Dog</pet><breed>AnyBreed</breed>
</vetPatient>
<vetPatient>
  <pet>Other</pet>
  <breed>AnyBreed</breed>
</vetPatient>
<vetPatient>
  <pet>Dog</pet><breed>AnyBreed 2</breed>
</vetPatient>
<vetPatient>
  <pet>Dog</pet>
  <breed></breed>
</vetPatient>
<vetPatient>
  <pet>Dog</pet>
</vetPatient>
<vetPatient>
  <pet>Dog</pet><breed>AnyBreed 3</breed>
</vetPatient>
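
For anyone who wants to try the search expression outside UltraEdit first, here is a minimal sketch using Python's re module. It only illustrates how the expression behaves on a shortened version of the example above. Python's regex engine is not the same as UltraEdit's Perl engine, so treat it as an approximation. The names PATTERN, SAMPLE and join_pet_and_breed are made up for this sketch.

```python
import re

# Same search expression as in the Replace in Files step.
# "." does not match a newline here, so <breed></breed> (empty) is skipped.
PATTERN = re.compile(
    r"(<pet>Dog</pet>)(?:[\s\S](?!</vetPatient>))*?(<breed>.+?</breed>)"
)

SAMPLE = """<vetPatient>
  <pet>Dog</pet>
  <breed>AnyBreed</breed>
</vetPatient>
<vetPatient>
  <pet>Dog</pet>
  <breed></breed>
</vetPatient>
<vetPatient>
  <pet>Dog</pet>
</vetPatient>
<vetPatient>
  <pet>Dog</pet>
  <other>whatever</other>
  <breed>AnyBreed 3</breed>
</vetPatient>
"""


def join_pet_and_breed(xml_text):
    """Put <pet>Dog</pet> and its non-empty <breed> on one line."""
    # \1\2 keeps only the two captured elements and drops what is between them.
    return PATTERN.sub(r"\1\2", xml_text)


if __name__ == "__main__":
    print(join_pet_and_breed(SAMPLE))
```

The negative lookahead stops the lazy group right before </vetPatient>, which is why a Dog without a breed does not pick up the breed of the next patient.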

Now open Advanced - Settings or Configuration - Search - Find output format and uncheck the options Header, File summary and Find summary, as only the lines with the two elements are of interest for the final output file.

Next, run a case-sensitive Perl regular expression Find in Files with the option Results to edit window checked and with the search expression <pet>Dog</pet><breed>.+?</breed>. UltraEdit creates a new UTF-16 encoded text file with the lines found in all XML files in the copy of the directory.

The created file could look like this:

Code:

C:\Temp\Test\Test1.xml(2):   <pet>Dog</pet><breed>AnyBreed</breed>
C:\Temp\Test\Test1.xml(11):   <pet>Dog</pet><breed>AnyBreed 2</breed>
C:\Temp\Test\Test2.xml(2):   <pet>Dog</pet><breed>AnyBreed</breed>
C:\Temp\Test\Test2.xml(9):   <pet>Dog</pet><breed>AnyBreed 2</breed>
C:\Temp\Test\Test2.xml(19):   <pet>Dog</pet><breed>AnyBreed 3</breed>

A case-sensitive Perl regular expression Replace All executed from the top of the file ** Find Results ** with the search expression ^(?:.+?\\)+(.+\.xml)\([0-9]+\):.+?<breed>(.+?)</breed>.*$ and the replace expression "\1","\2" reformats the results output into a valid CSV file, as long as no breed value contains the character ", with the following lines for the example above:

Code:

"Test1.xml","AnyBreed"
"Test1.xml","AnyBreed 2"
"Test2.xml","AnyBreed"
"Test2.xml","AnyBreed 2"
"Test2.xml","AnyBreed 3"

The ** Find Results ** file should now be saved as a *.csv file, with or without conversion from UTF-16 to UTF-8 or ANSI. Finally, the copy of the directory should be deleted.
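
The reformatting of the find results can also be sketched in Python, in case a breed value might contain the character ". This is only a rough illustration, not part of the UltraEdit procedure. It assumes the results lines have the form path(line): text as in the example above, and that the line-number part is matched with \([0-9]+\):. The csv module doubles embedded quotes, so that restriction goes away. The names LINE, RESULTS and find_results_to_csv are made up for this sketch.

```python
import csv
import io
import re

# File name without path, line number in parentheses, then the breed value.
LINE = re.compile(
    r"^(?:.+?\\)+(.+\.xml)\([0-9]+\):.+?<breed>(.+?)</breed>.*$"
)

RESULTS = r"""C:\Temp\Test\Test1.xml(2):   <pet>Dog</pet><breed>AnyBreed</breed>
C:\Temp\Test\Test2.xml(19):   <pet>Dog</pet><breed>AnyBreed 3</breed>"""


def find_results_to_csv(find_results):
    """Turn find-results lines into quoted CSV rows: file name, breed."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in find_results.splitlines():
        match = LINE.match(line)
        if match:  # lines in any other format are ignored
            writer.writerow(match.groups())
    return out.getvalue()


if __name__ == "__main__":
    print(find_results_to_csv(RESULTS), end="")
```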

fleggy · Post #3 · May 04, 2025, 22:49 +00:00

Hi,

here is a single-regexp attempt (still in progress on my side) for inspiration. It replaces whole XML elements with the parsable text <pet if found>:<breed value if found>
The desired results look like pet:some_breed. Any other forms can be thrown away in later processing.

The regexp always matches a whole XML element and checks if there are PET and BREED subelements on the 1st nested level only.

Find what: (this is a single line regexp, just a little longer)
(?s)<(?<PARNT>vetPat\w+)[^>]*+>(?:[^<]*+(?<MAIN><(?<TAG>(?=(?<PET>pet)\b|(?<BREED>breed)\b|.)\w+)\b[^>]*+(?:(?<=/)>|>(?>(?(<BREED>)(?(<BREEDVALUE>)(?:(?!<\k<TAG>\b)(?!</\k<TAG>\b).)++|(?<BREEDVALUE>(?:(?!<\k<TAG>\b)(?!</\k<TAG>\b).)++))|(?:(?!<\k<TAG>\b)(?!</\k<TAG>\b).)++)|(?&MAIN))*+</\k<TAG>>)))*[^<]*+</\k<PARNT>>

Replace with:
$+{PET}:$+{BREEDVALUE}

I have not installed UE on my new system yet so I tested this regexp in Notepad++. And I think I can simplify it but now I must go to bed :)

BR, Fleggy

EDIT: successfully tested in UE 2024.2.0.44 64-bit
Some notes - The regexp is quite generic because I was not sure what is a MUST for the match. PET and BREED can be in any order and they are ignored in nested elements. Any other XML elements (including nested ones) are allowed in the parent XML element. The parent element can have any name (not only vetPatient).
For example:

Code:

<vetPet>
  <pet>Dog</pet>
  <other>whatever
    <other2>whatever</other2>
  </other>
  <breed>AnyBreed 3</breed>
</vetPet>
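
To make the matching rules from the notes above easier to follow, here is a rough sketch of the same rules using Python's xml.etree.ElementTree. This is an XML parser, not the regexp, so it is a different technique and only mirrors the rules: pet and breed must be direct children, in any order, and other elements (also nested ones) are allowed. It also filters on the pet text, which the original question asks for. The root element clinic, the parent name prefix and the function name breeds_for_pet are assumptions for this sketch.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<clinic>
  <vetPatient>
    <breed>AnyBreed 3</breed>
    <other>whatever
      <other2><breed>Nested</breed></other2>
    </other>
    <pet>Dog</pet>
  </vetPatient>
  <vetPatient>
    <pet>Cat</pet>
    <breed>Siamese</breed>
  </vetPatient>
</clinic>"""


def breeds_for_pet(xml_text, pet_value="Dog", parent_prefix="vetPat"):
    """Return breed texts of parents whose direct child pet equals pet_value."""
    root = ET.fromstring(xml_text)
    breeds = []
    for parent in root.iter():
        if not parent.tag.startswith(parent_prefix):
            continue
        # find() with a plain tag name looks at direct children only,
        # so pet/breed inside nested elements are ignored.
        pet = parent.find("pet")
        breed = parent.find("breed")
        if pet is None or breed is None:
            continue
        breed_text = (breed.text or "").strip()
        if (pet.text or "").strip() == pet_value and breed_text:
            breeds.append(breed_text)
    return breeds


if __name__ == "__main__":
    print(breeds_for_pet(SAMPLE))
```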

EDIT 2:
This solution is too generic and needs a pattern for all possible parent elements (vetPatient and maybe more); otherwise the regexp matches the whole root element and never tries to match the inner elements. Therefore I changed the original regexp to match the parent element only if its name begins with vetPat (vetPat\w+).

fleggy · Post #4 · May 06, 2025, 06:38 +00:00

Hi UltraNewbie1105,

if some of Mofi's assumptions are not valid and the more generic regexp is needed, then proceed as follows:
- create a copy of the folder with your XML files
- do Replace in Files on this copied folder using the Perl regexp above
- change your Search settings as Mofi described
- do Find in Files
Find what: (?<=pet:).+
- save the Find Results as csv

BR, Fleggy
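
A small illustration of what the last Find expression picks up, sketched with Python's re module. The replaced text below is made up for the example. It assumes the replace step leaves lines like pet:value, :value (no pet found) or pet: (no breed value found).

```python
import re

# Hypothetical text after the Replace in Files step.
replaced = "pet:AnyBreed 3\n:Siamese\npet:\npet:Poodle\n"

# The lookbehind requires "pet:" directly before the match, and ".+" needs
# at least one character, so ":Siamese" and the bare "pet:" are skipped.
breeds = re.findall(r"(?<=pet:).+", replaced)

if __name__ == "__main__":
    print(breeds)
```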
