Search and Replace to remove node from XML file

    Oct 17, 2008#1

    Hi all,

    I have some large XML files, from 80 MB to 250 MB, that I need to extract some data from. Because of their size, I cannot open the files successfully to convert the data.
    A number of nodes are not required for this 'current' exercise, so I am thinking that a way around the problem is to remove these nodes.

    The basic format is as follows:

    Code:

    <adviser>
    <clients>
    <client>
    <title>
        <key>MR</key>
        <description>Mr</description>
    </title>
    <firstName>Anthony</firstName>
    <surname>Winkler</surname>
    <preferredName>Tony</preferredName>
    <dateOfBirth>1982-11-01 12:00:00.0</dateOfBirth>
    <gender>M</gender>
    <notes>
       <note>
       <id>250664</id>
       <noteTopic>
           <refType>
               <fieldLocation>
                  <key>CLIENT_NOTES</key>
                  <description>Notes</description>
               </fieldLocation>
               <key>NOTE_TOPIC</key>
               <description>Topic</description>
            </refType>
            <key>ADVICE</key>
            <description>Advice</description>
            <id>4417</id>
        </noteTopic>
        <description>Advice</description>
        <creationDate>2008-04-17 12:00:00.0</creationDate>
        <noteText>Advice scopes and strategies: <br />1. Income protection / Salary continuance - Income protection: ACCEPTED<br />2. Life insurance - Consolidate your Debts: ACCEPTED<br />3. TPD - Consolidate your Debts: ACCEPTED<br />4. Trauma - Consolidate your Debts: ACCEPTED<br /><br />Advice recommendations: ACCEPTED ALL RECOMMENDATIONS<br /><br />Notes: 
    </noteText>
       <isStandard>true</isStandard>
       <attachments>
           <attachmentsItem>
               <id>161159</id>
               <attachmentType>
                   <key>SERVER</key>
                   <description>Server file system</description>
               </attachmentType>
               <fileName>Std-SoA-17-Dec-07_5667_27001_205336.doc</fileName>
               <fileSize>190</fileSize>
               <creationDate>2008-04-17 02:27:25.0</creationDate>
           </attachmentsItem>
       </attachments>
      </note>
    </notes>
    </client>
    </clients>
    </adviser>

    What I would like to do is a search and replace that removes every notes node, from the <notes> to the </notes>, so that the result looks like this:

    Code:

    <adviser>
    <clients>
    <client>
    <title>
        <key>MR</key>
        <description>Mr</description>
    </title>
    <firstName>Anthony</firstName>
    <surname>Winkler</surname>
    <preferredName>Tony</preferredName>
    <dateOfBirth>1982-11-01 12:00:00.0</dateOfBirth>
    <gender>M</gender>
    </client>
    </clients>
    </adviser>
    
    Any ideas?

    Thanks in advance, Steve

      Oct 18, 2008#2

      Using the Perl regex engine
      Search for:

      (?s)<notes>.+</notes>\r\n

      replace with:
      nothing

      should do the trick.
      Normally, if there were more than one set of <notes> </notes>, this would be greedy and span the whole range from the first <notes> to the last </notes>, but because of a bug in UltraEdit's multiline support it acts lazy and gives the result you want.

      Works for me using UE ver 13.20+2.
      There may be multiline support in ver 14, but you have not indicated what version you are using.
      Jane

        Oct 18, 2008#3

        Hi Jane,

        Thanks heaps for that!
        I am using 14.20 and it is picking up the selected node.

        Are you saying it 'could' just pick up from the first <notes> to the last </notes> in the file?

        There are multiple sets of the <notes> node in the file, and within the <notes> node there can be carriage returns e.g.:

        Code:

        <adviser>
          <clients>
            <client>
              <notes>
                <note>stuff in here
                           can be many lines
                           <br>can be any HTML source code
        
                </note>
                <note>
                </note>
              </notes>
            </client>
            <client>
              <notes>
                <note>
                </note>
                <note>
                </note>
              </notes>
            </client>
            <client>
              <notes>
                <note>
                </note>
              </notes>
            </client>
          </clients>
        </adviser>
        
        Thanks again
        Steve

          Oct 19, 2008#4

          Theoretically, the way the regex is now, it should pick up everything from the first <notes> to the last </notes> because + is a greedy quantifier. Because of a bug in UE's regex engine, the + loses its greediness when multiple lines are involved. So at the moment, it should work, but if IDM (or Boost, who provide the regex library) fix this bug, then it won't work anymore. The "lazy" version of the search regex would be (?s)<notes>.+?</notes>\r\n - this should always work but might be a little slower.
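
The difference between the two patterns can be checked outside UltraEdit. The following is a minimal sketch in Python, whose `re` module follows the standard greedy and lazy semantics rather than the UE behaviour described above. The sample string and the variable names are made up for illustration.

```python
import re

# Two <client> entries, each with its own <notes> node and \r\n line endings.
sample = (
    "<client>\r\n"
    "<notes>\r\n<note>first</note>\r\n</notes>\r\n"
    "</client>\r\n"
    "<client>\r\n"
    "<notes>\r\n<note>second</note>\r\n</notes>\r\n"
    "</client>\r\n"
)

# (?s) lets the dot match line terminators as well.
greedy = re.compile(r"(?s)<notes>.+</notes>\r\n")
lazy = re.compile(r"(?s)<notes>.+?</notes>\r\n")

# Greedy: one match from the first <notes> to the last </notes>,
# which also swallows the </client> and <client> lines in between.
greedy_result = greedy.sub("", sample)

# Lazy: one match per <notes> node, so both clients survive.
lazy_result = lazy.sub("", sample)

print(repr(greedy_result))
print(repr(lazy_result))
```

With standard semantics the greedy version leaves a single empty client, while the lazy version leaves both clients with their notes removed.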

          The moment you really run into trouble is if <notes> tags can be nested. Regular expressions are not able to deal with arbitrarily nested structures.
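
To illustrate the nesting problem, here is a small Python sketch with a hypothetical nested input (the real files in this thread do not nest <notes>). The lazy pattern stops at the first </notes> it meets, which belongs to the inner node, so the outer closing tags are left behind.

```python
import re

# Hypothetical input: a <notes> node nested inside another <notes> node.
nested = "<notes><note><notes>inner</notes></note></notes>"

lazy = re.compile(r"(?s)<notes>.+?</notes>")

# The match runs from the outer <notes> to the inner </notes> only.
result = lazy.sub("", nested)

print(result)  # leftover fragment: the markup is now broken
```

The greedy pattern would handle this single case, but it fails as soon as there are several separate <notes> nodes, so neither quantifier copes with both situations.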

            Oct 19, 2008#5

            Thanks for the explanation.

            The <notes></notes> can't be nested (luckily :-) )

              Oct 23, 2008#6

              Thanks for explaining, Tim. I should have included the lazy .+? but I find that in long searches it tends to be a bit slower due to backtracking. However, my advice, which depends on a bug in UltraEdit to get faster results, is probably not the best long-term advice.
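
If backtracking cost is a concern on files of this size, one alternative to a multiline regex is a single pass over the lines. This is a sketch in Python, not something UltraEdit offers: the function name `strip_notes` is hypothetical, and it assumes the <notes> and </notes> tags each sit on their own line as in the samples above, since anything else on those lines is dropped too.

```python
import io


def strip_notes(src, dst):
    """Copy lines from src to dst, skipping each <notes> ... </notes> range.

    Assumption: the opening and closing tags are on lines of their own
    and <notes> nodes are never nested.
    """
    skipping = False
    for line in src:
        if not skipping and "<notes>" in line:
            skipping = True          # start of a notes node
        if not skipping:
            dst.write(line)          # ordinary line: keep it
        if skipping and "</notes>" in line:
            skipping = False         # end of the notes node


# Small in-memory demonstration.
src = io.StringIO(
    "<client>\n"
    "<gender>M</gender>\n"
    "<notes>\n"
    "<note>x</note>\n"
    "</notes>\n"
    "</client>\n"
)
dst = io.StringIO()
strip_notes(src, dst)
cleaned = dst.getvalue()
print(cleaned)
```

For a real file, `src` and `dst` would be file objects, for example opened with `open(path, newline="")` so that existing \r\n line endings pass through unchanged. Only one line is held in memory at a time, so the size of the file should not matter.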