How to split an XML file into different files based on x number of nodes?

ftpaccess · Aug 23, 2017 · #1 · 2017-08-23T19:34+00:00

Hi,

I have this kind of file (with indenting tabs):

<?xml version="1.0" encoding="utf-8"?>
<products>
	<product>
		<product_id>62455</product_id>
		<model>76171570018</model>
		<image><![CDATA[http://s7d9.scene7.com/is/image/JCPenney/DP0929201617164410M?wid=2000&hei=2000&op_sharpen=1]]></image>
		<price>69.9900</price>
		<category>Rugs</category>
		<brand>KALEEN</brand>
		<brand2>KALEEN </brand2>
		<title>Kaleen Brisa Tiles Negative Rectangular Rugs</title>
		
		<productpageurl><![CDATA[http://www.appliance.com/index.php?route=product/product&modelnumber=76171570018&path=1&product_id=62455]]></productpageurl>
	</product>

	<product>
		<product_id>62456</product_id>
		<model>76171450026</model>
		<image><![CDATA[http://s7d9.scene7.com/is/image/JCPenney/DP0929201617163413M?wid=2000&hei=2000&op_sharpen=1]]></image>
		<price>189.9900</price>
		<category>Rugs</category>
		<brand>KALEEN</brand>
		<brand2>KALEEN </brand2>
		<title>Kaleen Brisa Tiles Positive Rectangular Rugs</title>
		
		<productpageurl><![CDATA[http://www.appliance.com/index.php?route=product/product&modelnumber=76171450026&path=1&product_id=62456]]></productpageurl>
	</product>
</products>

I want it to be split into different files with a given number of nodes in each file.

So file 1 would be:

<?xml version="1.0" encoding="utf-8"?>
<products>
	<product>
		<product_id>62455</product_id>
		<model>76171570018</model>
		<image><![CDATA[http://s7d9.scene7.com/is/image/JCPenney/DP0929201617164410M?wid=2000&hei=2000&op_sharpen=1]]></image>
		<price>69.9900</price>
		<category>Rugs</category>
		<brand>KALEEN</brand>
		<brand2>KALEEN </brand2>
		<title>Kaleen Brisa Tiles Negative Rectangular Rugs</title>
		
		<productpageurl><![CDATA[http://www.appliance.com/index.php?route=product/product&modelnumber=76171570018&path=1&product_id=62455]]></productpageurl>
	</product>
</products>

And file 2 would be:

<?xml version="1.0" encoding="utf-8"?>
<products>
	<product>
		<product_id>62456</product_id>
		<model>76171450026</model>
		<image><![CDATA[http://s7d9.scene7.com/is/image/JCPenney/DP0929201617163413M?wid=2000&hei=2000&op_sharpen=1]]></image>
		<price>189.9900</price>
		<category>Rugs</category>
		<brand>KALEEN</brand>
		<brand2>KALEEN </brand2>
		<title>Kaleen Brisa Tiles Positive Rectangular Rugs</title>
		
		<productpageurl><![CDATA[http://www.appliance.com/index.php?route=product/product&modelnumber=76171450026&path=1&product_id=62456]]></productpageurl>
	</product>
</products>

Any idea if there is a way of doing this in UltraEdit? I looked through the forum but couldn't find anything. I found ways of doing it online using other software, but I am not familiar with those tools.

MickRC3 · Aug 24, 2017 · #2 · 2017-08-24T17:45+00:00

1) To split an XML file into two or more files based on the contents of inner tags, you have to be able to store a lot of temporary information. Macros are not the right tool for this because they have no variable support. A scripting solution would be better.

2) As you discovered, there are tools already that perform this functionality. Often the best tool is one written to do only the one thing well.

3) If it were absolutely necessary to do this in UE, a script could be written for the task. Understand that each additional level of tag nesting in the document increases the complexity of the script, so a one-size-fits-all script would be much more work. Your example, which only requires replicating the <products> level tags and then populating each new file with <product> level tags, can be done. If, on the other hand, we had something along the lines of <country> <subdivision> <vendor> <products> <product> </product> </products> </vendor> </subdivision> </country>, the level of complexity would be much, much higher.

ftpaccess · Aug 24, 2017 · #3 · 2017-08-24T20:20+00:00

I used Mofi's script from "Split large text files with UltraEdit" and made it work.

Thanks

Mofi · Aug 25, 2017 · #4 · 2017-08-25T05:15+00:00

Here is a script for this task that does as much as possible in memory for maximum performance:

if (UltraEdit.document.length > 0)  // Is any file opened?
{
   // Define environment for this script.
   UltraEdit.insertMode();
   if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
   else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();

   // Select everything in the active file and process this content unless the
   // active file is empty, in which case nothing is selected after the command.
   UltraEdit.activeDocument.selectAll();
   if (UltraEdit.activeDocument.isSel())
   {
      // Find blocks with 1 to N product elements. The maximum number of
      // products N within a block is determined by second value in {1,N}.
      var asBlocks = UltraEdit.activeDocument.selection.match(/(?:[\t ]*<product\b[\s\S]+?<\/product>(?:[\t ]*\r?\n)+){1,2}/g);

      UltraEdit.activeDocument.top();  // Cancel the selection in active file.

      if (asBlocks != null)   // Was any block found at all.
      {
         // Load name of file with full path of active file into a variable.
         var sFileName = UltraEdit.activeDocument.path;
         // Determine line termination type of active file (DOS/Windows or UNIX).
         var sLineTerm = (UltraEdit.activeDocument.lineTerminator <= 0) ? "\r\n" : "\n";

         // Determine amount of leading zeros required for number in
         // file names for equal file name length of all created files.
         var sLeadingZeros = (asBlocks.length+10).toString(10);
         sLeadingZeros = sLeadingZeros.replace(/./g,"0");

         for (var nBlock = 0; nBlock < asBlocks.length; nBlock++)
         {
            // Create the block to write into the file, trimming
            // all trailing whitespace from the end of the found block.
            var sBlock = '<?xml version="1.0" encoding="utf-8"?>' + sLineTerm + '<products>' + sLineTerm;
            sBlock += asBlocks[nBlock].replace(/\s+$/,"");
            sBlock += sLineTerm + '</products>' + sLineTerm;

            UltraEdit.newFile();

            // Set the right line termination type for new file according to
            // line termination type (DOS/Windows or UNIX) of active file.
            UltraEdit.activeDocument.unixMacToDos();
            if (sLineTerm == "\n") UltraEdit.activeDocument.dosToUnix();

            // Set UTF-8 as encoding for new file as the source file has UTF-8
            // encoding declaration at top and is therefore also UTF-8 encoded.
            UltraEdit.activeDocument.unicodeToASCII();
            UltraEdit.activeDocument.ASCIIToUTF8();

            // Write the block into the new file.
            UltraEdit.activeDocument.write(sBlock);

            // Determine file number string for active block.
            var sFileNumber = (nBlock+1).toString(10);
            sFileNumber = sLeadingZeros.substr(sFileNumber.length) + sFileNumber;

            // Insert an underscore and the file number before file extension
            // of active file name to determine name of file with full path
            // of active block for saving the new file with that name.
            var sSaveName = sFileName.replace(/(\.?[^.\\]*)$/,"_" + sFileNumber + "$1");

            // Save and close the file with active block.
            UltraEdit.saveAs(sSaveName);
            UltraEdit.closeFile(UltraEdit.activeDocument.path,2);
         }
      }
   }
}

It is not designed for very large files, but it should work for files of up to 20 MiB.
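
The matching and file naming logic of the script can also be tried outside UltraEdit. The following is only a sketch in plain JavaScript under stated assumptions: the function names splitProducts and numberedName are hypothetical helpers and not part of the UltraEdit scripting API, the XML is passed in as a string, and the maximum number of products per file is a parameter instead of the fixed 2 in {1,2}.

```javascript
// Hypothetical helper (not an UltraEdit function): split an XML string into
// complete documents with at most nMaxPerFile <product> elements each.
function splitProducts(sXml, nMaxPerFile, sLineTerm) {
   // Same pattern as in the script above, but the upper bound of the
   // quantifier {1,N} is built from the parameter nMaxPerFile.
   var rBlocks = new RegExp(
      "(?:[\\t ]*<product\\b[\\s\\S]+?<\\/product>(?:[\\t ]*\\r?\\n)+){1," +
      nMaxPerFile + "}", "g");
   var asBlocks = sXml.match(rBlocks) || [];

   // Wrap every block with the XML declaration and the <products> root tag.
   return asBlocks.map(function (sBlock) {
      return '<?xml version="1.0" encoding="utf-8"?>' + sLineTerm +
             '<products>' + sLineTerm +
             sBlock.replace(/\s+$/, "") + sLineTerm +
             '</products>' + sLineTerm;
   });
}

// Hypothetical helper: insert an underscore and a zero padded file number
// before the file extension, using the same padding rule as the script.
function numberedName(sFileName, nIndex, nTotal) {
   var sLeadingZeros = (nTotal + 10).toString(10).replace(/./g, "0");
   var sFileNumber = (nIndex + 1).toString(10);
   sFileNumber = sLeadingZeros.slice(sFileNumber.length) + sFileNumber;
   return sFileName.replace(/(\.?[^.\\]*)$/, "_" + sFileNumber + "$1");
}
```

The word boundary \b after <product is what keeps the pattern from matching <products>, <product_id> or <productpageurl>, because in those tags the next character is a word character.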

ftpaccess · Aug 25, 2017 · #5 · 2017-08-25T11:05+00:00

Thanks, Mofi. But your previous script worked like a charm: I set the blocks and lines for each file and it completed the job in an hour or so. My file was 330 MB. The only other thing I had to do separately, in addition to the split, was to add text at the beginning and end of each file.

So I ran two separate regular expression replaces on all the files in the folder.

Regular expression to add a text in the beginning of a file:
Search : \A
Replace : <?xml version="1.0" encoding="utf-8"?>\r\n<products>\r\n

Regular expression to add text at end of file:
Search : \Z
Replace : </products>
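
The same wrapping step could also be done inside a script instead of with two replaces afterwards. This is a minimal sketch in plain JavaScript, which has no \A and \Z anchors, so it uses string concatenation; the function name wrapProducts is a hypothetical helper, and trimming trailing whitespace of the fragment before adding the closing tag is an assumption about the desired result.

```javascript
// Hypothetical helper: add the XML declaration and the opening <products>
// tag before a fragment of <product> elements and the closing tag after it.
function wrapProducts(sFragment, sLineTerm) {
   var sHeader = '<?xml version="1.0" encoding="utf-8"?>' + sLineTerm +
                 '<products>' + sLineTerm;
   var sFooter = '</products>' + sLineTerm;
   // Assumption: trailing whitespace of the fragment is normalized to
   // exactly one line terminator before the closing tag.
   var sBody = sFragment.replace(/\s+$/, "") + sLineTerm;
   return sHeader + sBody + sFooter;
}
```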

MickRC3 · Aug 25, 2017 · #6 · 2017-08-25T22:03+00:00

Well, that large file split script works for your specific case, along with some REGEX.

How did you get the script to cut exactly on a product boundary? Or did you manually move a split product record so it was all in one slice?

ftpaccess · Aug 28, 2017 · #7 · 2017-08-28T00:58+00:00

I needed to make separate files with 1000 products in each file. Every product had 10 lines in the source file, which means 10000 lines per file had to be copied to a new file. So I changed the values like in the image below and that did the trick.

MickRC3 · Aug 28, 2017 · #8 · 2017-08-28T12:48+00:00

10000 should not have worked. Your original file had two lines at the top, the XML declaration and the outer <products> tag, which are not part of a 10 line product entry. So the last product entry in each file should have lost its last two lines, which would then be at the top of the next file. Now it is possible that you removed those lines manually before you split the file, along with the trailing </products> tag. Then 10000 would do the task without cutting any entry into two pieces. You did not mention doing so in your explanation of how you used the script. I am only bringing this up so that anyone else who decides to use the file split script understands that they must account for lines that are not part of the repeating pattern when they select the size of a slice.
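
The line arithmetic can be checked with a few lines of JavaScript. This is only a sketch under the assumptions stated in this thread: every record has exactly the same number of lines, there are no blank lines between records, and the only extra lines are at the top of the file. The function name spillAtSliceStart is hypothetical.

```javascript
// Hypothetical helper: returns how many lines of a cut record end up at the
// top of slice nSlice (0-based). 0 means the slice starts on a record boundary.
function spillAtSliceStart(nHeaderLines, nLinesPerRecord, nLinesPerSlice, nSlice) {
   // Number of record lines located before the first line of this slice.
   var nRecordLinesBefore = nSlice * nLinesPerSlice - nHeaderLines;
   if (nRecordLinesBefore <= 0) return 0;
   // Number of lines of the last record that still fit into the previous slice.
   var nIntoRecord = nRecordLinesBefore % nLinesPerRecord;
   return (nIntoRecord === 0) ? 0 : nLinesPerRecord - nIntoRecord;
}

// With the 2 extra lines at the top, 2 lines of a product spill over.
var nWithHeader = spillAtSliceStart(2, 10, 10000, 1);
// With the extra lines removed before the split, nothing spills over.
var nWithoutHeader = spillAtSliceStart(0, 10, 10000, 1);
```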

ftpaccess · Aug 28, 2017 · #9 · 2017-08-28T16:03+00:00

That's why I removed those 2 lines at top and also the bottom tag and then re-added them using regex in all files in the folder.