XML text cleaning by regular expressions

KOFN · Dec 21, 2007 · #1 · 2007-12-21T02:13+00:00

Hello!

Please help me solve this problem: how can I clean an XML file of all the data outside the tags?
I am sure there are thousands of ways to do it, but so far I have failed.

Let's see how it can be done using regular expressions in UltraEdit.

Here is an example of the XML file structure I have:

BEFORE THE CLEANING:

text to clean
<wordA>
<wordB>
text to clean
<wordX>useful text X</wordX> no need to clean this
text to clean

text to clean
text to clean
<wordY>useful text Y</wordY>
no need to clean this <wordZ>useful text Z</wordZ>
text to clean
</wordC>

DESIRED RESULT (after cleaning):

<wordA>
<wordB>
<wordX>useful text X</wordX> no need to clean this
<wordY>useful text Y</wordY>
no need to clean this <wordZ>useful text Z</wordZ>
</wordC>

In the example above:
* wordA, wordB, wordC, wordX, wordY, wordZ are any words.
* "useful text X" is any text
* "no need to clean this" is any text

I am not sure that every useful line begins with "<". There may be leading spaces or even junk text before the tag, which I do not need to clean out.

Here is one idea:
Remove all the lines which do not contain: <*>

The following example will remove all the lines containing tags. I want to do exactly the opposite:
Find: "%*<*>*^p"
Replace with: ""

Is it possible with regular expressions?

P.S. I use UltraEdit 11.10+1, but if you have suggestions for other ways to solve this, I would be interested to see them. Maybe this kind of cleaning is a common feature in some other software? (Suggestions for macros are also welcome and appreciated.)

P.S. 2: It is not my priority, but I'm curious - is it somehow possible to obtain this with regular expressions:
<wordA>
<wordB>
<wordX>useful text X</wordX>
<wordY>useful text Y</wordY>
<wordZ>useful text Z</wordZ>
</wordC>

Thank you!

Mofi · Dec 21, 2007 · #2 · 2007-12-21T14:12+00:00

For your first need, deleting all lines which do not contain <*>, I suggest the following macro. It works only for files with DOS line endings because of ^p. It also deletes all blank lines (with a dirty trick).

The macro property "Continue if a Find with Replace not found" must be checked for this macro.

InsertMode
ColumnModeOff
HexOff
UnixReOff
Bottom
IfColNumGt 1
"
"
EndIf
Top
Find RegExp "%#"
Replace All "MaRkErChAr"
Find RegExp "%^(*<*>^)"
Replace All "#^1"
Loop
Find RegExp "%[~#]*^p"
Replace All ""
IfNotFound
ExitLoop
EndIf
EndLoop
Find RegExp "%#"
Replace All ""
Find MatchCase RegExp "%MaRkErChAr"
Replace All "#"
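
For readers without UltraEdit, the net effect of the macro above can be sketched in plain JavaScript. This is only an illustration under assumptions: the function name keepTagLines is hypothetical, and the pattern /<[^<>]*>/ is a slightly stricter stand-in for UltraEdit's <*>. Like the macro, it also drops blank lines, since they contain no tag.

```javascript
// Hypothetical helper, not part of UltraEdit: keep only the lines that contain a tag.
function keepTagLines(text) {
  // "<", then any run of characters other than "<" or ">", then ">".
  const tagPattern = /<[^<>]*>/;
  return text
    .split(/\r?\n/)                          // accept DOS and Unix line endings
    .filter((line) => tagPattern.test(line)) // blank and tag-free lines are dropped
    .join("\n");
}

const sample = [
  "text to clean",
  "<wordA>",
  "",
  "<wordX>useful text X</wordX> no need to clean this",
  "text to clean",
].join("\n");

console.log(keepTagLines(sample));
// <wordA>
// <wordX>useful text X</wordX> no need to clean this
```

No marker character is needed here because the lines are filtered directly instead of being deleted by a second search.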

But you also want the text at the start of the line before the tag and the text at the end of the line after the tag deleted. That's no problem. Simply append the following 4 lines to the macro above and you will get it.

Find RegExp "%[~<^p]+<"
Replace All "<"
Find RegExp ">[~<>^p]+$"
Replace All ">"
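
The two replaces above translate fairly directly into JavaScript regular expressions. The sketch below is an assumption-laden illustration, not UltraEdit code: trimOutsideTags is a hypothetical name, and it expects a single line without its line ending.

```javascript
// Hypothetical helper, not an UltraEdit function.
// Expects one line of text without its line ending.
function trimOutsideTags(line) {
  return line
    .replace(/^[^<]+</, "<")   // counterpart of "%[~<^p]+<"  ->  "<"
    .replace(/>[^<>]+$/, ">"); // counterpart of ">[~<>^p]+$" ->  ">"
}

const taggedLines = [
  "<wordA>",
  "<wordX>useful text X</wordX> no need to clean this",
  "no need to clean this <wordZ>useful text Z</wordZ>",
];

console.log(taggedLines.map(trimOutsideTags).join("\n"));
// <wordA>
// <wordX>useful text X</wordX>
// <wordZ>useful text Z</wordZ>
```

Both patterns only touch text before the first tag and after the last tag of a line. Text sitting between two elements on the same line, as in "<a>x</a> junk <b>y</b>", is left unchanged.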

jorrasdk · Dec 21, 2007 · #3 · 2007-12-21T14:53+00:00

I will start by apologizing because I suggest a solution that will only work for UE13 and above. But you did write "...but please if you have any other suggestions about different methods to solve this, it will be interesting to see..."

In UE 13 the JavaScript environment supports ECMAScript for XML (E4X), which makes it possible to work with the XML tree. But first, the original example must at least have balanced tags:

<wordA/>
<wordB>
  text to clean
  <wordX>useful text X</wordX> no need to clean this
text to clean
  
text to clean
text to clean
<wordY>useful text Y</wordY>
no need to clean this <wordZ>useful text Z</wordZ>
  text to clean
</wordB>

And now the script. I hope I have put enough comments in the script to explain what happens:

// Misc options for global XML object:
// http://developer.mozilla.org/en/docs/E4X_Tutorial:The_global_XML_object
XML.ignoreComments = false;
XML.ignoreProcessingInstructions = false;
XML.ignoreWhitespace = true;
XML.prettyPrinting = true;
XML.prettyIndent = 2;

// Select the entire XML document
UltraEdit.activeDocument.selectAll();

// Assuming no root tag, we supply one:
var dirtyXML = "<cleanXMLroot>"+UltraEdit.activeDocument.selection+"</cleanXMLroot>";

// Try to create an XML object
try {
  var xml = new XML( dirtyXML );
  
  // run through all xml nodes at this level:
  traverseSubnodes(xml);
  
  // Write xml back with the now deleted text nodes
  // Note: toXMLString is invoked on the XMLList just below the
  //       artificial root tag (cleanXMLroot), i.e. on xml.*
  UltraEdit.activeDocument.write( xml.*.toXMLString() );

}
catch (exc) {
  // Unselect text
  UltraEdit.activeDocument.top();

  // Write XML error text
  UltraEdit.messageBox(exc.toString(),"XML error");
}


function traverseSubnodes(xmlNode) {

  // Obtain xmlNodes just below the input node as a XMLList object
  var subNodes = xmlNode.*;

  // First run through all nodes and delete text nodes at this level.
  for (var i in subNodes) {
    if(subNodes[i].nodeKind()=="text") {
      delete subNodes[i];
    }
  }

  // Next: Go deeper in the xml tree for nodes that are complex type:
  for (var i in subNodes) {
    if(subNodes[i].nodeKind()=="element") {

      // Yup: This one is complex = children
      if(subNodes[i].hasComplexContent()) {
        // go deeper
        traverseSubnodes(subNodes[i]);
      }
    }
  }
}

The script will produce the following output:

<wordA/>
<wordB>
  <wordX>useful text X</wordX>
  <wordY>useful text Y</wordY>
  <wordZ>useful text Z</wordZ>
</wordB>
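
E4X is not available in current JavaScript engines such as Node.js, so the pruning rule of the script can only be illustrated there on a hand-built tree. The sketch below is not E4X and not an UltraEdit API: the node shape ({ kind, name, children } and { kind, value }) and the helpers element, text, pruneText and serialize are all hypothetical, and no XML parsing is attempted.

```javascript
// Hypothetical plain-object tree standing in for the parsed XML.
const text = (value) => ({ kind: "text", value });
const element = (name, children = []) => ({ kind: "element", name, children });

// Rough equivalent of hasComplexContent(): does the element contain child elements?
const hasElementChildren = (node) =>
  node.children.some((child) => child.kind === "element");

// Drop the text nodes directly below `node`, then recurse into the children
// that contain elements themselves. Elements holding only text are left alone.
function pruneText(node) {
  node.children = node.children.filter((child) => child.kind !== "text");
  for (const child of node.children) {
    if (hasElementChildren(child)) {
      pruneText(child);
    }
  }
  return node;
}

// Minimal pretty-printer with a two-space indent, enough for this example.
function serialize(node, indent = "") {
  if (node.kind === "text") return indent + node.value;
  if (node.children.length === 0) return `${indent}<${node.name}/>`;
  if (!hasElementChildren(node)) {
    const inner = node.children.map((c) => c.value).join("");
    return `${indent}<${node.name}>${inner}</${node.name}>`;
  }
  const inner = node.children.map((c) => serialize(c, indent + "  ")).join("\n");
  return `${indent}<${node.name}>\n${inner}\n${indent}</${node.name}>`;
}

// The balanced example from above, under an artificial root.
const root = element("cleanXMLroot", [
  element("wordA"),
  element("wordB", [
    text("text to clean"),
    element("wordX", [text("useful text X")]),
    text("no need to clean this"),
    element("wordY", [text("useful text Y")]),
    text("no need to clean this"),
    element("wordZ", [text("useful text Z")]),
    text("text to clean"),
  ]),
]);

pruneText(root);
const cleaned = root.children.map((child) => serialize(child)).join("\n");
console.log(cleaned);
```

On this tree the printed result matches the output shown above: wordX, wordY and wordZ keep their text because they contain no child elements, while every text node sitting next to an element is removed.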