XML files: convert to CSV, sort by <ID>, sort the tags

XML files: convert to CSV, sort by <ID>, sort the tags

1581
Power UserPower User
1581

    Apr 25, 2019#1

    This is not really an important question, just curious about possible solutions.

    I know - this should be done with XSLT, but if there are already solutions in UE, it would be fine.

    Example of an existing XML file:

    Code: Select all

    <PERSON>
        <NAME>John</NAME>
        <ID>25</ID>
        <CATEGORY>S</CATEGORY>
    </PERSON>
    
    <PERSON>
        <NAME>Susan</NAME>
        <ID>9</ID>
        <CATEGORY>M</CATEGORY>
    </PERSON>
    
    a) Define a tag and convert the content to CSV like format

    Code: Select all

    <PERSON><NAME>John</NAME><ID>25</ID><CATEGORY>S</CATEGORY></PERSON>
    <PERSON><NAME>Susan</NAME><ID>9</ID><CATEGORY>M</CATEGORY></PERSON>
    
    b) Define a tag like PERSON and sort all persons based on a subtag like ID

    Code: Select all

    <PERSON>
        <NAME>Susan</NAME>
        <ID>9</ID>
        <CATEGORY>M</CATEGORY>
    </PERSON>
    <PERSON>
        <NAME>John</NAME>
        <ID>25</ID>
        <CATEGORY>S</CATEGORY>
    </PERSON>
    
    c) Define a tag like PERSON and sort the subtags alphabetically (C - I - N)

    Code: Select all

    <PERSON>
        <CATEGORY>S</CATEGORY>
        <ID>25</ID>
        <NAME>John</NAME>
    </PERSON>
    
    <PERSON>
        <CATEGORY>M</CATEGORY>
        <ID>9</ID>
        <NAME>Susan</NAME>
    </PERSON>
    UE 26.20.0.74 German / Win 10 x 64 Pro

    6,603548
    Grand MasterGrand Master
    6,603548

      Script to convert XML file to a line based format

      Apr 27, 2019#2

      Here is an UltraEdit / UEStudio script to convert an entire XML file to a line based format.

      Code: Select all

      var g_sElementName = "";   // Define fixed element name here.
      
      function getElementName ()       // Function to get element name from current
      {                                // selection or input by the script user.
         // Is no element defined and something is selected in active file?
         if ((!g_sElementName.length) && (UltraEdit.activeDocument.isSel()))
         {
            // Get element name which is first word of current selection.
            g_sElementName = UltraEdit.activeDocument.selection.replace(/\W*(\w*).*$/,"$1");
         }
         if (!g_sElementName.length)   // Is element name still not defined?
         {
            g_sElementName = UltraEdit.getString("Please enter element name:",1);
         }
      }
      
      if (UltraEdit.document.length > 0)  // Is any file opened?
      {
         // Define environment for this script.
         UltraEdit.insertMode();
         if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
         else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
      
         getElementName();
      
         if (g_sElementName.length)   // Is element name defined now?
         {
            UltraEdit.perlReOn();
            UltraEdit.activeDocument.findReplace.mode=0;
            UltraEdit.activeDocument.findReplace.matchCase=false;
            UltraEdit.activeDocument.findReplace.matchWord=false;
            UltraEdit.activeDocument.findReplace.regExp=true;
            UltraEdit.activeDocument.findReplace.searchDown=true;
            if (typeof(UltraEdit.activeDocument.findReplace.searchInColumn) == "boolean")
            {
               UltraEdit.activeDocument.findReplace.searchInColumn=false;
            }
      
            // Does the active XML file contain the element at all?
            UltraEdit.activeDocument.top();
            if (UltraEdit.activeDocument.findReplace.find("<" + g_sElementName + "\\>"))
            {
               // Get element name as written in file because XML is case-sensitive.
               // The additional condition should not be necessary, but with some
               // very old versions of UltraEdit for Windows the element name would
               // be defined as empty string without this additional condition.
               var sElementName = UltraEdit.activeDocument.selection.substr(1);
               if (sElementName.length == g_sElementName.length)
               {
                  g_sElementName = sElementName;
                  UltraEdit.activeDocument.findReplace.matchCase=true;
               }
      
               // Get line termination from active file, default is CR+LF.
               var sLineTerm = "\r\n";
               UltraEdit.activeDocument.top();
               if (UltraEdit.activeDocument.findReplace.find("\\r?\\n|\\r"))
               {
                  sLineTerm = UltraEdit.activeDocument.selection;
               }
      
               // Remove the XML element at top if the active XML file has one
               // at all and also all whitespace characters at top of file.
               UltraEdit.activeDocument.top();
               UltraEdit.activeDocument.findReplace.preserveCase=false;
               UltraEdit.activeDocument.findReplace.replaceAll=true;
               UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
               UltraEdit.activeDocument.findReplace.replace("(?:\\s*<\\?xml\\b[^>]+>)?\\s*","");
      
               // Remove all horizontal tabs, line-feeds, carriage returns and
               // normal spaces between angle brackets as well as end of file.
               UltraEdit.activeDocument.findReplace.replaceAll=true;
               UltraEdit.activeDocument.findReplace.replace("(?<=>)[\\t\\n\\r ]+(?:(?=<)|(?!.))","");
               UltraEdit.activeDocument.top();
      
               // Insert line termination after each end tag of element.
               UltraEdit.activeDocument.findReplace.replace("(</" + g_sElementName + ">)","$1" + sLineTerm);
               UltraEdit.activeDocument.top();
      
               // Insert line termination after each end of inline tag of element.
               UltraEdit.activeDocument.findReplace.replace("(<" + g_sElementName + "\\>.*?/>)","$1" + sLineTerm);
               UltraEdit.activeDocument.top();
            }
         }
      }
      
      Some additional notes:
      1. It is possible to define an element name at top of the script to run the script without user prompt for element name.
      2. The first word of an existing selection on starting the script is interpreted as element name on empty element name string defined in script.
      3. The line terminations are inserted after end tag of chosen element or after /> of the tag on being an inline element.
      4. The XML element <?xml ... > at top of XML file is removed automatically if present in active file.
      PS: This script works even with UltraEdit for Windows v13.00 for ANSI encoded XML files on auto-indent setting not enabled in configuration.

        Script to sort level one elements in an XML file based on value of a subelement

        Apr 27, 2019#3

        Here is an UltraEdit / UEStudio script to sort level one elements in an XML file based on value of a subelement.

        Code: Select all

        var g_sElementName = "";   // Define fixed main element name here.
        var g_sSortElement = "";   // Define fixed sort element name here.
        
        function getElementName ()       // Function to get element name from current
        {                                // selection or input by the script user.
           // Is no element defined and something is selected in active file?
           if ((!g_sElementName.length) && (UltraEdit.activeDocument.isSel()))
           {
              // Get element name which is first word of current selection.
              g_sElementName = UltraEdit.activeDocument.selection.replace(/\W*(\w*).*$/,"$1");
           }
           if (!g_sElementName.length)  // Is element name still not defined?
           {
              g_sElementName = UltraEdit.getString("Please enter element name:",1);
           }
        }
        
        function sortByNumber (aSortValueA, aSortValueB)
        {
           return aSortValueA[1] - aSortValueB[1];
        }
        
        function sortByString (aSortValueA, aSortValueB)
        {
           return aSortValueA[1] > aSortValueB[1];
        }
        
        function sortXmlElements ()
        {
           UltraEdit.perlReOn();
           UltraEdit.activeDocument.findReplace.mode=0;
           UltraEdit.activeDocument.findReplace.matchCase=false;
           UltraEdit.activeDocument.findReplace.matchWord=false;
           UltraEdit.activeDocument.findReplace.regExp=true;
           UltraEdit.activeDocument.findReplace.searchDown=true;
           if (typeof(UltraEdit.activeDocument.findReplace.searchInColumn) == "boolean")
           {
              UltraEdit.activeDocument.findReplace.searchInColumn=false;
           }
           // Does the active XML file not contain the main element?
           UltraEdit.activeDocument.top();
           if (!UltraEdit.activeDocument.findReplace.find("<" + g_sElementName + "\\>"))
           {
              UltraEdit.messageBox("ERROR: No end tag of element '" + g_sElementName + "' found in file.");
              return;
           }
        
           // Get main element name as written in file.
           // The additional condition should not be necessary, but with some
           // very old versions of UltraEdit for Windows the element name would
           // be defined as empty string without this additional condition.
           var sElementName = UltraEdit.activeDocument.selection.substr(1);
           if (sElementName.length == g_sElementName.length)
           {
              g_sElementName = sElementName;
           }
        
           // Is the sorting element not defined at top of the script?
           if (!g_sSortElement.length)
           {
              g_sSortElement = UltraEdit.getString("Please enter name of sorting element:",1);
           }
        
           // Has the user not entered the name of sorting element?
           if (!g_sSortElement.length) return;
        
           // Does the active XML file not contain the sorting element?
           UltraEdit.activeDocument.top();
           if (!UltraEdit.activeDocument.findReplace.find("<" + g_sSortElement + "\\>"))
           {
              UltraEdit.messageBox("ERROR: Sorting element '" + g_sSortElement + "' not found in file.");
              return;
           }
        
           // Get sorting element name as written in file.
           var sSortElement = UltraEdit.activeDocument.selection.substr(1);
           if (sSortElement.length == g_sSortElement.length)
           {
              g_sSortElement = sSortElement;
           }
        
           // Get all values of sorting elements.
           UltraEdit.activeDocument.selectAll();
           var sSearch = "<" + g_sSortElement + "\\b.*?>[^<]*";
           var rRegSearch = new RegExp(sSearch,"g");
           var asSortValues = UltraEdit.activeDocument.selection.match(rRegSearch);
        
           if(asSortValues === null)
           {
              UltraEdit.messageBox("ERROR: Failed to find sorting values with regular expression '" + sSearch + "' in file.");
              return;
           }
        
           // Load all XML blocks according to end tag of main element.
           var sMainEndTag = "</" + g_sElementName + ">";
           var asXmlBlocks = UltraEdit.activeDocument.selection.split(sMainEndTag);
        
           if ((asXmlBlocks.length - 1) != asSortValues.length)
           {
              UltraEdit.messageBox("ERROR: The number of '" + g_sElementName +
                                   "' elements (" + (asXmlBlocks.length - 1).toString(10) +
                                   ") is different to the number of '" +  g_sSortElement +
                                   "' elements (" +  asSortValues.length.toString(10) + ").");
              return;
           }
        
           // Remove the tag left to the element values and determine if all element
           // values consist of only one or more digits, i.e. are decimal numbers.
           var bAllNumbers = true;
           for (var nIndex = 0; nIndex < asSortValues.length; nIndex++)
           {
              asSortValues[nIndex] = asSortValues[nIndex].replace(/^.+>/,"");
              if (asSortValues[nIndex].search(/^\d+$/) < 0)
              {
                 bAllNumbers = false;
              }
           }
        
           // Sort the sorting values either by number or lexicographically by
           // string with keeping array index number before sort used below.
           var aaSortValues = [];
           if (bAllNumbers)
           {
              for (nIndex = 0; nIndex < asSortValues.length; nIndex++)
              {
                 var nSortValue = parseInt(asSortValues[nIndex],10);
                 aaSortValues[nIndex] = [ nIndex, nSortValue ];
              }
              aaSortValues.sort(sortByNumber);
           }
           else
           {
              for (nIndex = 0; nIndex < asSortValues.length; nIndex++)
              {
                 aaSortValues[nIndex] = [ nIndex, asSortValues[nIndex] ];
              }
              aaSortValues.sort(sortByString);
           }
        
           // Get everything up to first element from first XML block like XML header.
           var sFileContent = "";
           sSearch = "^([\\s\\S]*?)<" + g_sElementName + "\\b[\\s\\S]*$";
           rRegSearch = new RegExp(sSearch,"i");
           var sHeader = asXmlBlocks[0].replace(rRegSearch,"$1");
           if (sHeader.length != asXmlBlocks[0].length)
           {
              sFileContent = sHeader;
              asXmlBlocks[0] = asXmlBlocks[0].substr(sHeader.length);
           }
        
           // Is there more than one XML element block and first XML element
           // after sort is not the first element read from file before?
           var nFirstSortIndex = aaSortValues[0][0];
           if ((asXmlBlocks.length > 1) && (nFirstSortIndex > 0))
           {
              // Move everything up to start tag of main element in first XML block
              // after sort to first XML block before sort like line termination.
              var sFirstBegin = asXmlBlocks[nFirstSortIndex].replace(/^(\s+)[\w\W]*$/,"$1");
              if (sFirstBegin.length != asXmlBlocks[nFirstSortIndex].length)
              {
                 asXmlBlocks[0] = sFirstBegin + asXmlBlocks[0];
                 asXmlBlocks[nFirstSortIndex] = asXmlBlocks[nFirstSortIndex].substr(sFirstBegin.length);
              }
           }
        
           // Build the file content with sorted XML blocks in memory.
           for (nIndex = 0; nIndex < aaSortValues.length; nIndex++)
           {
              sFileContent += asXmlBlocks[aaSortValues[nIndex][0]] + sMainEndTag;
           }
           sFileContent += asXmlBlocks[asXmlBlocks.length-1];
        
           UltraEdit.activeDocument.write(sFileContent);
        }
        
        if (UltraEdit.document.length > 0)  // Is any file opened?
        {
           // Define environment for this script.
           UltraEdit.insertMode();
           if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
           else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
           getElementName();
           if (g_sElementName)
           {
              sortXmlElements();
           }
        }
        
        Some additional notes:
        1. It is possible to define a main element name at top of the script to run the script without user prompt for main element name.
        2. The first word of an existing selection on starting the script is interpreted as main element name on empty element name string defined in script.
        3. The name of the element of which values are used for sorting can be defined in script at top. The user is prompted for sorting element name if not defined in script.
        4. The sorting is done numeric if all values of sorting element are positive decimal integer numbers. Otherwise the sort is done lexicographically on the string values.
        5. The script does not take care about positions of other first level elements in XML file. They are moved together with the defined/selected/input main elements.
        6. The script is not written for large XML files. The entire file content must fit twice into memory of JavaScript interpreter on running the script.
        7. The XML element <?xml ... > at top of XML file is kept at top if present in active file at all.
        The script is not written for sorting elements on other level in XML structure than level one.

        PS: This script works even with UltraEdit for Windows v13.00 for small ANSI encoded XML files on auto-indent setting not enabled in configuration. UltraEdit for Windows v13.10 is required on an error condition with error message output with UltraEdit.messageBox().
        Best regards from an UC/UE/UES for Windows user from Austria

        1581
        Power UserPower User
        1581

          Apr 29, 2019#4

          Thanks, Mofi - great.

          For the first code xml2csv I have to think about the results which are out of definition, e.g. the element "Person" covers only 20% of the XML content.
          UE 26.20.0.74 German / Win 10 x 64 Pro

          6,603548
          Grand MasterGrand Master
          6,603548

            Script to convert user defined XML element to a line based format

            Jun 03, 2019#5

            I thought about the requirement to convert just a user defined XML element to a line based format, i.e. all occurrences of a user defined element with all its subelements is reformatted to be on a single line without changing the rest of the XML file.

            Here is the script for this task:

            Code: Select all

            var g_sElementName = "";   // Define fixed element name here.
            
            function getElementName ()       // Function to get element name from current
            {                                // selection or input by the script user.
               // Is no element defined and something is selected in active file?
               if ((!g_sElementName.length) && (UltraEdit.activeDocument.isSel()))
               {
                  // Get element name which is first word of current selection.
                  g_sElementName = UltraEdit.activeDocument.selection.replace(/\W*(\w*).*$/,"$1");
               }
               if (!g_sElementName.length)   // Is element name still not defined?
               {
                  g_sElementName = UltraEdit.getString("Please enter element name:",1);
               }
            }
            
            if (UltraEdit.document.length > 0)  // Is any file opened?
            {
               // Define environment for this script.
               UltraEdit.insertMode();
               if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
               else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
            
               getElementName();
            
               if (g_sElementName.length)   // Is element name defined now?
               {
                  UltraEdit.perlReOn();
                  UltraEdit.activeDocument.findReplace.mode=0;
                  UltraEdit.activeDocument.findReplace.matchCase=false;
                  UltraEdit.activeDocument.findReplace.matchWord=false;
                  UltraEdit.activeDocument.findReplace.regExp=true;
                  UltraEdit.activeDocument.findReplace.searchDown=true;
                  if (typeof(UltraEdit.activeDocument.findReplace.searchInColumn) == "boolean")
                  {
                     UltraEdit.activeDocument.findReplace.searchInColumn=false;
                  }
            
                  // Does the active XML file contain the element at all?
                  UltraEdit.activeDocument.top();
            
                  if (UltraEdit.activeDocument.findReplace.find("<" + g_sElementName + "\\>"))
                  {
                     // Get element name as written in file because XML is case-sensitive.
                     // The additional condition should not be necessary, but with some
                     // very old versions of UltraEdit for Windows the element name would
                     // be defined as empty string without this additional condition.
                     var sElementName = UltraEdit.activeDocument.selection.substr(1);
                     if (sElementName.length == g_sElementName.length)
                     {
                        g_sElementName = sElementName;
                        UltraEdit.activeDocument.findReplace.matchCase=true;
                     }
                     UltraEdit.activeDocument.findReplace.preserveCase=false;
                     UltraEdit.activeDocument.findReplace.replaceAll=true;
                     UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
            
                     UltraEdit.activeDocument.top();
                     var sFindStartTag = "<" + g_sElementName + "\\>[^>]*";
                     var sFindEndTag = "</" + g_sElementName + ">";
                     var nColumnOffset = (typeof(UltraEdit.activeDocumentIdx) == "undefined") ? 1 : 0;
            
                     while (UltraEdit.activeDocument.findReplace.find(sFindStartTag))
                     {
                        // Caret is left to > of start tag of element.
                        // Get current line and column position in file.
                        var nLine = UltraEdit.activeDocument.currentLineNum;
                        var nColumn = UltraEdit.activeDocument.currentColumnNum + nColumnOffset;
            
                        // Find the end tag and set caret to begin of end tag.
                        UltraEdit.activeDocument.findReplace.regExp=false;
                        if (UltraEdit.activeDocument.findReplace.find(sFindEndTag))
                        {
                           var nEndColumn = UltraEdit.activeDocument.currentColumnNum;
                           nEndColumn += nColumnOffset - sFindEndTag.length;
                           UltraEdit.activeDocument.gotoLine(0,nEndColumn);
            
                           // Select everything from begin of end tag to end of start tag.
                           UltraEdit.activeDocument.gotoLineSelect(nLine,nColumn);
            
                           // Remove all normal spaces, horizontal tabs, carriage returns and
                           // line-feeds between angle brackets as well as end of selection.
                           UltraEdit.activeDocument.findReplace.regExp=true;
                           UltraEdit.activeDocument.findReplace.mode=1;
                           UltraEdit.activeDocument.findReplace.replace(">[\\r\\n\\t ]+(<|(?!.))",">$1");
                           UltraEdit.activeDocument.findReplace.mode=0;
                        }
                        UltraEdit.activeDocument.findReplace.regExp=true;
                     }
            
                     // In some versions of UltraEdit for Windows it can happen that the
                     // whitespaces left to end tag of element are not completely removed
                     // during the replaces on the selections. For that reason run one more
                     // Perl regular expression replace all to definitely remove them now.
                     UltraEdit.activeDocument.top();
                     UltraEdit.activeDocument.findReplace.replace("[\\r\\n\\t ]+(?=" + sFindEndTag + ")","");
                     UltraEdit.activeDocument.top();
                  }
               }
            }
            
            Some additional notes:
            1. It is possible to define an element name at top of the script to run the script without user prompt for element name.
            2. The first word of an existing selection on starting the script is interpreted as element name on empty element name string defined in script.
            PS: This script works even with UltraEdit for Windows v14.10.

            PPS: The first two scripts were updated by me, too. Both were updated for better compatibility with older versions of UltraEdit. The second script for sorting elements based on a subelement value handles now much better the use case that XML file contains no XML header or there is an XML header and the first element is moved down in file and another one like the last one is moved up in file to first position.
            Best regards from an UC/UE/UES for Windows user from Austria

            1581
            Power UserPower User
            1581

              Jul 15, 2019#6

              Hi Mofi,

              Sorry for my late reply. I had problem with the new account and other stuff.

              Sorting feature:
              In the file I want to sort I always get the
              ERROR: The number of 'Parameter' elements (70) is different to the number of 'Description' elements.
              XML compare:
              There are some XML-DLL from Microsoft, and there are some GUIs for it. Due to some feature (e.g. "Ignore Child Order") it works great for me.

              Geert Bellekens XmlDiff
              The XML Diff and Patch GUI Tool
              XMLDiff in CodePlex archive
              ...

              So, for the current situation I'm happy with my tools.

              Peter
              UE 26.20.0.74 German / Win 10 x 64 Pro

              6,603548
              Grand MasterGrand Master
              6,603548

                Jul 15, 2019#7

                The script searches simply for all Parameter and all Description elements in entire file before it does the sort of the Parameter elements based on the value of their Description elements. So if other elements have also a Description element or there are Parameter elements with no or more than one Description element, the script exits without doing anything. It is of course possible to rewrite the script to be more restrictive on where a Description element is found and be less restrictive for Parameter elements with no or more than one Description element.

                This is a typical example for a not good list of requirements for a coding task resulting in customer not getting what was expected as presented wonderfully by project cartoon.
                Best regards from an UC/UE/UES for Windows user from Austria

                1581
                Power UserPower User
                1581

                  Jul 15, 2019#8

                  Sorry, Mofi, that I was not precisely enough. I promise improvement.

                  I saw the cartoon the first time more than 30 ago (for planning and construction), but now I see it improved and expanded (billing, documentation, ..).
                  And yes - it is perfect since more than 30 years.
                  UE 26.20.0.74 German / Win 10 x 64 Pro