Remove all XML blocks with an ID not listed in another file

Exoskeletor · Jun 12, 2012, 12:56 UTC · #1

I would like something like this:

From a text file containing many products with this structure:

<product id="02-000580">
<name><![CDATA[IQ IMF-30 ανεμιστήρας νερού]]></name>
<link><![CDATA[http://www.site.gr/index.php?790]]></link>
<price_with_vat>593.00</price_with_vat>
<category id="790"><![CDATA[Κλιματισμός/Ανεμιστήρες/Δαπέδου]]></category>
<image>http://www.site.gr/components/com_virtuemart/shop_image/product/02-000580.jpg</image>
<thumbnail>http://www.site.gr/components/com_virtuemart/shop_image/product/resized/02-000580.jpg</thumbnail>
<manufacturer><![CDATA[IQ]]></manufacturer>
<shipping type="accurate" currency="euro"></shipping>
<description><![CDATA[]]></description>
<instock>N</instock>
<availability>2 έως 4 ημέρες</availability>
</product>

I want to keep only those products whose identification string in the tag <product id="..."> is listed in a CSV file. For example, if I have the IDs 3,4 then only these blocks must be kept:

<product id="3">
.
.
</product>

<product id="4">
.
.
</product>

Can anyone help me to do this?

Mofi · Jun 14, 2012, 14:15 UTC · #2

Well, it would be possible to use the method from the topic Script to find multiple items and output them, with the small difference that the find string changes on every loop run and the caret must be moved back to the top of the file after every find and copy append. But for 1000 ID strings this method would be slow because of the 2000 display updates.

Another possibility would be to jump from one ID string in the input file to the next and evaluate whether the current ID string is in the list from the CSV file. The block is copied if this is the case. But if the input file has 20000 products, this method would result in 20000 display updates, making the script very slow.
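For readers working outside UltraEdit, the lookup idea described above can be sketched in plain JavaScript on a string, where no display updates are involved. This is only an illustration, assuming a modern runtime such as Node.js; the function name keepListed and the sample data are hypothetical and not part of the posted script.

```javascript
// Sketch only: plain JavaScript, not UltraEdit script commands.
// keepListed and the sample data are hypothetical.
function keepListed(xml, ids) {
  const wanted = new Set(ids);
  // Match each complete product block together with its id attribute value.
  const block = /<product id="([^"]*)">[\s\S]*?<\/product>/g;
  const keptBlocks = [];
  for (const match of xml.matchAll(block)) {
    // Copy the block only if its id is in the list.
    if (wanted.has(match[1])) keptBlocks.push(match[0]);
  }
  return keptBlocks;
}

const data =
  '<product id="3">\n<name>A</name>\n</product>\n' +
  '<product id="5">\n<name>B</name>\n</product>\n' +
  '<product id="4">\n<name>C</name>\n</product>\n';

const blocks = keepListed(data, ["3", "4"]);
console.log(blocks.join("\n\n"));
```

The Set makes each membership check cheap, so the cost is dominated by one pass over the data.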

But I have another idea. A very simple tagged regular expression replace is used to mark the products of interest. To reduce the number of replaces, up to 50 product ID strings are combined in an OR expression. Once the entire list of ID strings has been applied to the file, the remaining products without a marker are removed. Finally, the markers are removed from the remaining products.
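The three steps of this idea (mark, remove unmarked, remove markers) can be sketched on a plain string as follows. This is a minimal illustration, assuming a modern JavaScript runtime such as Node.js and the id="..." variant of the data; filterProducts, chunkSize and the sample data are hypothetical names, and the IDs are assumed to contain no regular expression special characters.

```javascript
// Sketch only: plain JavaScript on a string, not UltraEdit script commands.
// filterProducts, chunkSize and the sample data are hypothetical.
function filterProducts(xml, ids, chunkSize = 50) {
  let marked = xml;
  // Step 1: mark the wanted products, combining up to chunkSize IDs per replace.
  for (let i = 0; i < ids.length; i += chunkSize) {
    const alternation = ids.slice(i, i + chunkSize).join("|");
    const mark = new RegExp('^([ \\t]*<product)( id=")(' + alternation + ')(")', "gm");
    marked = marked.replace(mark, "$1#$2$3$4");
  }
  // Step 2: remove every product block without the marker # after <product.
  marked = marked.replace(/^[ \t]*<product[^#s][\s\S]+?<\/product>\s+/gm, "");
  // Step 3: remove the marker from the remaining products.
  return marked.replace(/<product#/g, "<product");
}

const sample =
  '<product id="3">\n<name>A</name>\n</product>\n\n' +
  '<product id="5">\n<name>B</name>\n</product>\n\n' +
  '<product id="4">\n<name>C</name>\n</product>\n';

const kept = filterProducts(sample, ["3", "4"]);
console.log(kept);
```

With 1000 IDs and a chunk size of 50, step 1 runs 20 replaces instead of 1000.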

Please note that the script below works only if

the first file - the leftmost file on the open file tabs bar - is the file containing the ID strings, either separated by commas with no other characters like line terminators, or as a simple text file with one ID string per DOS terminated line;
the second file - right of the leftmost file on the open file tabs bar - is the file containing the product data.

The third file can be the script file executed with Scripting - Run Active Script.

According to your request, the script as posted below now checks whether the third character in the first file is a hyphen. If this is the case, the first file is interpreted as the file containing the identification strings and the second file as the file containing the product data. Otherwise the first file is interpreted as the file with the product data and the second file as the list of identification strings. The identification strings must not contain characters which have a special meaning in Perl regular expression search strings, as the script would then not work. Strings with just digits, letters, underscores and hyphens are no problem.
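If ID strings with special characters cannot be ruled out, one possible precaution is to escape them before building the OR expression. The sketch below is not part of the posted script; escapeForRegExp is a hypothetical helper written in plain JavaScript, and the sample IDs other than 02-000580 are made up.

```javascript
// Sketch only: escapeForRegExp is a hypothetical helper, not part of the posted script.
function escapeForRegExp(id) {
  // Prefix every character with a special meaning in a regular expression with a backslash.
  return id.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const rawIds = ["02-000580", "A.1", "X+Y"];
const escapedIds = rawIds.map(escapeForRegExp);

// The escaped strings can be joined safely into one OR expression.
const anyId = new RegExp("^(" + escapedIds.join("|") + ")$");
console.log(escapedIds.join("|"));
console.log(anyId.test("A.1"), anyId.test("AB1"));
```

Without the escaping, the dot in A.1 would match any character, so AB1 would be treated as a listed ID.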

Update on 2012-10-22: The script now supports a second variant of input XML file as requested.

Code:

// Are at least 2 files opened in UltraEdit?
if (UltraEdit.document.length > 1) {

   UltraEdit.insertMode();
   UltraEdit.columnModeOff();
   // Get all id strings separated by a comma from first file if third
   // character is a hyphen or from second file into an array of strings.
   var asIDs;
   var nDataFile = 0;
   var nListFile = 1;
   UltraEdit.document[0].gotoLine(1,3);
   if (UltraEdit.document[0].isChar("-")) {
      nDataFile = 1;
      nListFile = 0;
   }
   UltraEdit.document[nListFile].selectAll();
   if (UltraEdit.document[nListFile].selection.indexOf(",") > 0) {
      asIDs = UltraEdit.document[nListFile].selection.split(",");
   } else {
      asIDs = UltraEdit.document[nListFile].selection.split("\r\n");
      // Remove empty string at end of array if list file ended with a line termination.
      if (asIDs[asIDs.length-1].length == 0) asIDs.pop();
   }
   // And copy entire contents of data file via clipboard 9 into a
   // new file. The input files should not be modified for security.
   UltraEdit.selectClipboard(9);
   UltraEdit.document[nDataFile].selectAll();
   UltraEdit.document[nDataFile].copy();
   // Discard the selections in both files.
   UltraEdit.document[nListFile].top();
   UltraEdit.document[nDataFile].top();
   // Paste the copied data into a new file.
   UltraEdit.newFile();
   UltraEdit.activeDocument.paste();
   // Make sure the last line in new file ends with a line termination.
   if (UltraEdit.activeDocument.isColNumGt(1)) {
      UltraEdit.activeDocument.insertLine();
      if (UltraEdit.activeDocument.isColNumGt(1)) {
         UltraEdit.activeDocument.deleteToStartOfLine();
      }
   }
   UltraEdit.clearClipboard();
   UltraEdit.selectClipboard(0);
   UltraEdit.activeDocument.top();

   // Define the parameters for the case-sensitive Perl regular expression
   // Replace All executions to mark the products of interest.
   UltraEdit.perlReOn();
   UltraEdit.activeDocument.findReplace.mode=0;
   UltraEdit.activeDocument.findReplace.matchCase=true;
   UltraEdit.activeDocument.findReplace.matchWord=false;
   UltraEdit.activeDocument.findReplace.regExp=false;
   UltraEdit.activeDocument.findReplace.searchDown=true;
   UltraEdit.activeDocument.findReplace.searchInColumn=false;
   UltraEdit.activeDocument.findReplace.preserveCase=false;
   UltraEdit.activeDocument.findReplace.replaceAll=true;
   UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;

   // Run this loop until all id strings have been used in a Perl
   // regular expression replace to mark the products of interest.
   var nIdNum = 0;
   var nMaxPerReplace = 50;
   if (UltraEdit.activeDocument.findReplace.find("<kwdikos>")) {
      var sSearchBegin = "^([ \\t]*<product)(>\\s+<kwdikos>)(";
      var sSearchEnd   = ")(<)";
      UltraEdit.activeDocument.top();
   } else {
      var sSearchBegin = "^([ \\t]*<product)( id=\")(";
      var sSearchEnd   = ")(\")";
   }
   UltraEdit.activeDocument.findReplace.regExp=true;
   while (nIdNum < asIDs.length)
   {
      var sSearchExp = sSearchBegin + asIDs[nIdNum];
      var nToDoCount = asIDs.length - nIdNum;
      if (nToDoCount > nMaxPerReplace) nToDoCount = nMaxPerReplace;
      while (--nToDoCount) sSearchExp += '|' + asIDs[++nIdNum];
      sSearchExp += sSearchEnd;
      // The regular expression search string is now
      // either ^([ \t]*<product)( id=")(id1|id2|id3|...|idn)(")
      //     or ^([ \t]*<product)(>\s+<kwdikos>)(id1|id2|id3|...|idn)(<)
      // depending on existence of <kwdikos> anywhere in file.
      UltraEdit.activeDocument.findReplace.replace(sSearchExp,"\\1#\\2\\3\\4");
      nIdNum++;
   }
   // Remove now all products without # after tag string <product.
   UltraEdit.activeDocument.findReplace.replace("(?s)^[ \\t]*<product[^#s].+?</product>\\s+","");
   // Remove marker character # after <product in remaining content.
   UltraEdit.activeDocument.findReplace.regExp=false;
   UltraEdit.activeDocument.findReplace.replace("<product#","<product");
}

Exoskeletor · Jun 15, 2012, 09:45 UTC · #3

Works like a charm.
Thank you very much.

Is there any way to modify the script so that it checks whether the third character of the first file is a "-" (which means that the first file is the file with the IDs) and, if it isn't, still works properly with the first file being the product data?

Also, I noticed that if I want to copy and paste IDs from Excel into a new file in UltraEdit, all I have to do to make this work is change the string in the split function from "," to "\r\n".
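Both list formats can also be accepted by one split, so the split string does not need to be changed by hand. The following is a sketch in plain JavaScript under that assumption; parseIdList is a hypothetical helper and not the code used in the posted script.

```javascript
// Sketch only: parseIdList is a hypothetical helper.
function parseIdList(text) {
  // Split on commas or on DOS/Unix line terminations, then drop empty entries
  // such as the one produced by a line termination at the end of the list.
  return text
    .split(/,|\r?\n/)
    .map((id) => id.trim())
    .filter((id) => id.length > 0);
}

const fromCsv = parseIdList("01-000010,01-000012");
const fromExcel = parseIdList("01-000010\r\n01-000012\r\n");
console.log(fromCsv, fromExcel);
```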

Lastly, is there any way to create the new file in UTF-8 format by default?

Mofi · Jun 15, 2012, 14:39 UTC · #4

I made the 2 enhancements as requested in the updated script above.

At Advanced - Configuration - Editor - New File Creation there is the configuration option Create new files as UTF-8. If you select this option for Encoding Type, all new files are created as UTF-8 encoded files by default.

If you use UltraEdit v17.30 or later, you can also use the following script code below the line with UltraEdit.newFile();

Code:

// Is encoding of new file not already UTF-8?
if (UltraEdit.activeDocument.encoding != 65001) {
   // Convert empty file encoded in ASCII/ANSI or UTF-16 little endian (LE) to UTF-8.
   UltraEdit.activeDocument.ASCIIToUTF8();
}

If you use UltraEdit prior to v17.30, you can only convert the new file to Unicode (UTF-16 LE) using the command UltraEdit.activeDocument.ASCIIToUnicode(); below the line with UltraEdit.newFile();
Later, after the script has finished, you convert the file from UTF-16 LE to UTF-8 on Save As or via the command File - Conversions - UNICODE/UTF-8 to UTF-8 (Unicode Editing).

Alternatively, for this script you could always make the data file the active file with the command
UltraEdit.document[nDataFile].setActive();
use the command
UltraEdit.saveAs("NewFileName")
and run the replaces directly on the data file, which is now saved under a different name.

A note for other users interested in conversions from/to UTF-8 using a script:

Converting a UTF-16 encoded file to UTF-8 with the command UltraEdit.activeDocument.ASCIIToUTF8() always works.

But it is not possible to convert a UTF-8 encoded file to UTF-16 without a temporary conversion to ASCII/ANSI. As this would most likely result in damaged text, the workaround is the following script code for the conversion from UTF-8 to UTF-16.

Code:

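// Cut the entire file content to user clipboard 9.
UltraEdit.selectClipboard(9);
UltraEdit.activeDocument.selectAll();
UltraEdit.activeDocument.cut();
// The UTF-8 encoded file is now empty, so the conversions cannot damage any text.
UltraEdit.activeDocument.UTF8ToASCII();
UltraEdit.activeDocument.ASCIIToUnicode();
// Paste the content back into the file which is now UTF-16 LE encoded.
UltraEdit.activeDocument.paste();
// Clear user clipboard 9 and switch back to the Windows clipboard.
UltraEdit.clearClipboard();
UltraEdit.selectClipboard(0);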

Exoskeletor · Oct 11, 2012, 11:13 UTC · #5

I would like another script that does the same job, but instead of

Code:

<products>
<product id="ID">
....
</product>
</products>

it should work with

Code:

<products>
<product>
<kwdikos>ID</kwdikos>
....
</product>
</products>

Thanks for your time.

Mofi · Oct 13, 2012, 16:32 UTC · #6

I modified the script code posted in my first post to support this second variant of input XML file.
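How the search expression can depend on the variant of the input file is sketched below in plain JavaScript on a string. The two expressions follow the ones in the script comments; buildMarkExpression and the sample data are hypothetical, and the IDs are assumed to contain no regular expression special characters.

```javascript
// Sketch only: buildMarkExpression is a hypothetical helper working on a plain string.
function buildMarkExpression(xml, ids) {
  const alternation = ids.join("|");
  if (xml.includes("<kwdikos>")) {
    // Second variant: the ID is the content of the element <kwdikos>.
    return new RegExp("^([ \\t]*<product)(>\\s+<kwdikos>)(" + alternation + ")(<)", "gm");
  }
  // First variant: the ID is the value of the attribute id.
  return new RegExp('^([ \\t]*<product)( id=")(' + alternation + ')(")', "gm");
}

const variant2 =
  "<products>\n<product>\n<kwdikos>01-000010</kwdikos>\n</product>\n" +
  "<product>\n<kwdikos>01-000011</kwdikos>\n</product>\n</products>\n";

// Insert the marker # after <product for the products of interest.
const markedVariant2 = variant2.replace(
  buildMarkExpression(variant2, ["01-000010"]),
  "$1#$2$3$4"
);
console.log(markedVariant2);
```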

Exoskeletor · Oct 22, 2012, 11:55 UTC · #7

Mofi wrote: I modified the script code posted in my first post to support this second variant of input XML file.

Thank you, but it ignores the first ID. For example, if I use
01-000010
01-000012
I get

<?xml version="1.0" encoding="UTF-8" ?>
<product>
<kwdikos>01-000012</kwdikos>
.....
</product>

</products>
but I don't get the 01-000010 product (the <products> tag is also missing).

I think it handles the <products> tag as if it were <product>. I want it to ignore <products>, or to ignore the first <product it finds.
(I have worked around this by using
UltraEdit.activeDocument.top();
UltraEdit.activeDocument.findReplace.replace("<products>","<<>>");
at the beginning and
UltraEdit.activeDocument.findReplace.replace("<<>>","<products>");
at the end.)

Mofi · Oct 22, 2012, 14:21 UTC · #8

I modified the script once again by adding 1 character to ignore <products on replace.
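The effect of that one character can be reproduced on a plain string. The sketch below assumes that the removal expression previously used the character class [^#] and now uses [^#s]; the earlier form is an inference from the reported symptom, not something quoted from the old script. All names and the sample data are hypothetical.

```javascript
// Sketch only: exprBefore is an assumption about the script before the fix.
const exprBefore = /^[ \t]*<product[^#][\s\S]+?<\/product>\s+/gm;
const exprAfter = /^[ \t]*<product[^#s][\s\S]+?<\/product>\s+/gm;

// Data after the marking step: products 01-000010 and 01-000012 are marked with #.
const markedFile =
  "<products>\n" +
  "<product#>\n<kwdikos>01-000010</kwdikos>\n</product>\n" +
  "<product>\n<kwdikos>01-000011</kwdikos>\n</product>\n" +
  "<product#>\n<kwdikos>01-000012</kwdikos>\n</product>\n" +
  "</products>\n";

// Remove the markers from the remaining products.
const finish = (text) => text.replace(/<product#/g, "<product");

const brokenResult = finish(markedFile.replace(exprBefore, ""));
const fixedResult = finish(markedFile.replace(exprAfter, ""));
console.log(brokenResult);
console.log(fixedResult);
```

With [^#] the match starts at <products> and swallows everything up to the first </product>, which removes the root tag and the first wanted product. With [^#s] the letter s after <product prevents that match.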

It would be easier for me to have:

both variants of input XML files,
the list file with the IDs,
both variants of output XML files according to the input files.

I had to create all the input files myself and could only guess what the output files should look like.

Exoskeletor · Nov 30, 2012, 16:35 UTC · #9

Works great. Thank you very much for your help.