XML text cleaning by regular expressions

    Dec 21, 2007 #1

    Hello!

    Please help me solve this problem!
    How can I remove all the data outside the tags from an XML file?
    I am sure there are thousands of ways to do it, but so far I have failed.

    Let's see how it can be done using regular expressions in UltraEdit.

    Here is an example of the XML file structure I have:

    BEFORE THE CLEANING:

    text to clean
    <wordA>
    <wordB>

    text to clean
    <wordX>useful text X</wordX> no need to clean this
    text to clean

    text to clean
    text to clean
    <wordY>useful text Y</wordY>
    no need to clean this <wordZ>useful text Z</wordZ>
    text to clean
    </wordC>


    DESIRED RESULT (after cleaning):

    <wordA>
    <wordB>
    <wordX>useful text X</wordX>
    no need to clean this
    <wordY>useful text Y</wordY>
    no need to clean this <wordZ>useful text Z</wordZ>
    </wordC>



    In the example above:
    * wordA, wordB, wordC, wordX, wordY, wordZ are any words.
    * "useful text X" is any text
    * "no need to clean this" is any text

    I am not sure that every useful line begins with "<". There may be leading spaces or even junk text before the tag, which I do not need to remove.


    Here is one idea:
    Remove all the lines which do not contain <*>

    The following example will remove all the lines containing tags. I want to do exactly the opposite:
    Find: "%*<*>*^p"
    Replace with: ""

    Is it possible with regular expressions?




    P.S. I use UltraEdit 11.10+1, but if you have suggestions for other methods to solve this, I would be interested to see them. Maybe this kind of cleaning is a common feature in some other software? (Suggestions for macros are also welcome and appreciated.)

    P.S. 2: It is not my priority, but I'm curious - is it somehow possible to obtain this with regular expressions:
    <wordA>
    <wordB>
    <wordX>useful text X</wordX>
    <wordY>useful text Y</wordY>
    <wordZ>useful text Z</wordZ>
    </wordC>


    Thank you!


      Dec 21, 2007 #2

      For your first need, deleting all lines which do not contain <*>, I suggest the following macro. It works only for files with DOS line endings because it relies on ^p. It also deletes all blank lines (with a dirty trick).

      The macro property "Continue if a Find with Replace not found" must be checked for this macro.

      InsertMode
      ColumnModeOff
      HexOff
      UnixReOff
      Bottom
      IfColNumGt 1
      "
      "
      EndIf
      Top
      Find RegExp "%#"
      Replace All "MaRkErChAr"
      Find RegExp "%^(*<*>^)"
      Replace All "#^1"
      Loop
      Find RegExp "%[~#]*^p"
      Replace All ""
      IfNotFound
      ExitLoop
      EndIf
      EndLoop
      Find RegExp "%#"
      Replace All ""
      Find MatchCase RegExp "%MaRkErChAr"
      Replace All "#"
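
For readers who want to try the same idea outside the macro language, the line filter above can be sketched in plain JavaScript (run with Node.js, not inside UltraEdit). This is only an approximation under stated assumptions: the name keepTagLines is made up for this example, and /<.*>/ is used as a rough stand-in for the UltraEdit pattern <*>. Unlike the macro, it needs no marker character and accepts both DOS and Unix line endings.

```javascript
// Sketch only: keepTagLines is a hypothetical helper, not part of UltraEdit.
// It keeps every line that contains "<", any characters, then ">",
// and drops all other lines (including blank ones), like the macro above.
function keepTagLines(text) {
  const tagPattern = /<.*>/; // rough JavaScript stand-in for UltraEdit's <*>
  return text
    .split(/\r?\n/) // accept DOS and Unix line endings
    .filter((line) => tagPattern.test(line))
    .join("\n");
}

// A shortened version of the example from the first post.
const sample = [
  "text to clean",
  "<wordA>",
  "",
  "<wordX>useful text X</wordX> no need to clean this",
  "text to clean",
].join("\r\n");

console.log(keepTagLines(sample));
```

Splitting into lines first keeps the pattern simple, at the cost of normalising the line endings of the result to "\n".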

      But you also want the text before the first tag and the text after the last tag on each line deleted. That's no problem: simply append the following 4 lines to the macro above.

      Find RegExp "%[~<^p]+<"
      Replace All "<"
      Find RegExp ">[~<>^p]+$"
      Replace All ">"
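
The two extra replaces can be sketched in plain JavaScript as well (again for Node.js, not for UltraEdit itself). The helper name trimOutsideTags is hypothetical, and the two patterns are my translation of the UltraEdit expressions above into JavaScript syntax, so treat this as an illustration and not as an exact equivalent.

```javascript
// Sketch only: trimOutsideTags is a hypothetical helper, not part of UltraEdit.
// For each line it removes the text before the first "<" and the text
// after the last ">". Lines without any tag are left unchanged.
function trimOutsideTags(text) {
  return text
    .split(/\r?\n/) // accept DOS and Unix line endings
    .map((line) =>
      line
        .replace(/^[^<]+</, "<") // like Find "%[~<^p]+<" with Replace "<"
        .replace(/>[^<>]+$/, ">") // like Find ">[~<>^p]+$" with Replace ">"
    )
    .join("\n");
}

console.log(trimOutsideTags("no need to clean this <wordZ>useful text Z</wordZ>"));
console.log(trimOutsideTags("<wordX>useful text X</wordX> no need to clean this"));
```

Because lines without a tag pass through untouched, this step is meant to run after the lines without tags have already been deleted.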
      Best regards from an UC/UE/UES for Windows user from Austria


        Dec 21, 2007 #3

        I will start by apologizing because I suggest a solution that will only work for UE13 and above. But you did write "...but please if you have any other suggestions about different methods to solve this, it will be interesting to see..." :-)

        In UE 13 the JavaScript environment supports ECMAScript for XML (E4X), which makes it possible to work with the XML tree. But first the original example must at least have balanced tags:

        <wordA/>
        <wordB>
          text to clean
          <wordX>useful text X</wordX> no need to clean this
        text to clean
          
        text to clean
        text to clean
        <wordY>useful text Y</wordY>
        no need to clean this <wordZ>useful text Z</wordZ>
          text to clean
        </wordB>

        And now the script. I hope I have put enough comments in the script to explain what happens:

        // Misc options for global XML object:
        // http://developer.mozilla.org/en/docs/E4X_Tutorial:The_global_XML_object
        XML.ignoreComments = false;
        XML.ignoreProcessingInstructions = false;
        XML.ignoreWhitespace = true;
        XML.prettyPrinting = true;
        XML.prettyIndent = 2;
        
        // Select the entire XML document
        UltraEdit.activeDocument.selectAll();
        
        // Assuming no root tag, we supply one:
        var dirtyXML = "<cleanXMLroot>"+UltraEdit.activeDocument.selection+"</cleanXMLroot>";
        
        // Try and create a XML object
        try {
          var xml = new XML( dirtyXML );
          
          // run through all xml nodes at this level:
          traverseSubnodes(xml);
          
          // Write xml back with the now deleted text nodes
          // Note: toXMLString is invoked on the XMLList just below the
          //       artificial root tag (cleanXMLroot): xml.*
          UltraEdit.activeDocument.write( xml.*.toXMLString() );
        
        }
        catch (exc) {
          // Unselect text
          UltraEdit.activeDocument.top();
        
          // Write XML error text
          UltraEdit.messageBox(exc.toString(),"XML error");
        }
        
        
        function traverseSubnodes(xmlNode) {
        
          // Obtain xmlNodes just below the input node as a XMLList object
          var subNodes = xmlNode.*;
        
          // First run through all nodes and delete text nodes at this level.
          // (i is declared with var so that recursive calls do not share one global i)
          for (var i in subNodes) {
            if(subNodes[i].nodeKind()=="text") {
              delete subNodes[i];
            }
          }
        
          // Next: Go deeper in the xml tree for nodes that are complex type:
          for (i in subNodes) {
            if(subNodes[i].nodeKind()=="element") {
        
              // Yup: This one is complex = children
              if(subNodes[i].hasComplexContent()) {
                // go deeper
                traverseSubnodes(subNodes[i]);
              }
            }
          }
        }

        The script will produce the following output:

        <wordA/>
        <wordB>
          <wordX>useful text X</wordX>
          <wordY>useful text Y</wordY>
          <wordZ>useful text Z</wordZ>
        </wordB>
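
For anyone who wants to follow the traversal without an E4X-capable engine, here is a rough sketch of the same idea in plain JavaScript (Node.js). It does not parse XML at all: the tree is built by hand, and the node shapes, the text/element helpers and the small serializer are all assumptions made for this illustration. Only the recursion mirrors the script above: delete the text nodes at one level, then go deeper into elements that contain other elements.

```javascript
// Sketch only. The node shapes below are hypothetical, used purely for illustration:
//   { kind: "text", value: "..." }
//   { kind: "element", name: "...", children: [ ...nodes ] }

// True when the element has at least one child element
// (roughly what hasComplexContent() reports in E4X).
function hasElementChildren(node) {
  return node.children.some((child) => child.kind === "element");
}

// Delete the text nodes directly below `node`, then recurse into
// child elements that themselves contain elements.
function traverseSubnodes(node) {
  node.children = node.children.filter((child) => child.kind !== "text");
  for (const child of node.children) {
    if (child.kind === "element" && hasElementChildren(child)) {
      traverseSubnodes(child);
    }
  }
  return node;
}

// Minimal serializer with a two-space indent.
function serialize(node, indent = "") {
  if (node.kind === "text") return indent + node.value;
  if (node.children.length === 0) return `${indent}<${node.name}/>`;
  if (!hasElementChildren(node)) {
    const content = node.children.map((child) => child.value).join("");
    return `${indent}<${node.name}>${content}</${node.name}>`;
  }
  const inner = node.children
    .map((child) => serialize(child, indent + "  "))
    .join("\n");
  return `${indent}<${node.name}>\n${inner}\n${indent}</${node.name}>`;
}

// Hand-built tree for the balanced example, wrapped in the artificial root.
const text = (value) => ({ kind: "text", value });
const element = (name, ...children) => ({ kind: "element", name, children });

const root = element(
  "cleanXMLroot",
  element("wordA"),
  element(
    "wordB",
    text("text to clean"),
    element("wordX", text("useful text X")),
    text("no need to clean this text to clean"),
    element("wordY", text("useful text Y")),
    text("no need to clean this"),
    element("wordZ", text("useful text Z")),
    text("text to clean")
  )
);

// Serialize only the nodes below the artificial root, like xml.* in the script.
const cleaned = traverseSubnodes(root)
  .children.map((child) => serialize(child))
  .join("\n");
console.log(cleaned);
```

Text inside elements such as wordX survives because those elements have no child elements, so the function never descends into them.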