Script for deletion of unnecessary tags in an HTML file

    8:37 - May 02 #1

    An HTML file contains many unnecessary tags that need to be deleted, and this should be done with an UltraEdit script.

    The example HTML file below shows which tags to delete and which to keep.

    <html>
    <head>
    <title>EXAMPLE</title>
    </head>
    <body>
    <p class="indent">EXAMPLE FILE</p>
    
    <p>Below are IDs. These references should not be deleted.</p>
    
    <p class="indent">Example <a id="rref1" href="#ref1">1</a></p>
    <p class="indent">Example <a id="rref2" href="#ref2">2</a></p>
    <p class="indent">Example <a id="rref3" href="#ref3">3</a></p>
    <p class="indent">Example <a id="rref4" href="#ref4">4</a></p>
    <p class="indent">Example <a id="rref5" href="#ref5">5</a></p>
    
    <p>These are useless tags. Opening and closing tags should be deleted together.</p>
    
    <p><b>Aaa</b>
    should be just
    Aaa</p>
    
    <p><i>Bbb</i>
    should be just
    Bbb</p>
    
    <p><u>Ccc</u>
    should be just
    Ccc</p>
    
    <p class="indent">Example <a id="rref6" href="#ref6">6</a>
    should be just
    Example 6
    because there is no matching ID.</p>
    
    <p>There are many types of tags like figure link table link that will be deleted.</p>
    
    <p><a href="#fig4">figure 4</a>44
    should be
    figure 444</p>
    
    <p><a href="#tab6">Table 678</a>85
    should be
    Table 67885</p>
    
    <p class="indent">Example <a id="rref7" href="#ref7">7</a></p>
    <p class="indent">Example <a id="rref8" href="#ref8">8</a></p>
    <p class="indent">Example <a id="rref9" href="#ref9">9</a></p>
    
    <h2>References</h2>
    <ul class="reflist">
    <li id="ref1">Link 1</li>
    <li id="ref2">Link 2</li>
    <li id="ref3">Link 3</li>
    <li id="ref4">Link 4</li>
    <li id="ref5">Link 5</li>
    </ul>
    </body>
    </html>
    

      6:25 - May 03 #2

      Here is a commented UltraEdit script for this task.

      if (UltraEdit.document.length > 0)  // Is any file opened?
      {
         // Define environment for this script.
         UltraEdit.insertMode();
         if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
         else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
      
         // Move caret to top of the active file.
         UltraEdit.activeDocument.top();
      
         // Define all parameters for a case-insensitive Perl regular expression replace all.
         UltraEdit.perlReOn();
         UltraEdit.activeDocument.findReplace.mode=0;
         UltraEdit.activeDocument.findReplace.matchCase=false;
         UltraEdit.activeDocument.findReplace.matchWord=false;
         UltraEdit.activeDocument.findReplace.regExp=true;
         UltraEdit.activeDocument.findReplace.searchDown=true;
         if (typeof(UltraEdit.activeDocument.findReplace.searchInColumn) == "boolean")
         {
            UltraEdit.activeDocument.findReplace.searchInColumn=false;
         }
         UltraEdit.activeDocument.findReplace.preserveCase=false;
         UltraEdit.activeDocument.findReplace.replaceAll=true;
         UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
      
         // Remove all B, I and U tags.
         UltraEdit.activeDocument.findReplace.replace("</?(?:b|i|u)>","");
      
         // Remove all figure and tab hypertext reference tags.
         UltraEdit.activeDocument.findReplace.replace('<a[\\s]+href="#(?:fig|tab)[0-9]+">(.*?)</a>',"\\1");
      
         // Select the entire file and get all list item identifiers
         // with "ref" and a number loaded into an array of strings.
         UltraEdit.activeDocument.selectAll();
         var asListIdNumbers = UltraEdit.activeDocument.selection.match(/<li\sid="ref([0-9]+)/gi);
         // Move caret to top of the active file, which also cancels the selection.
         UltraEdit.activeDocument.top();
      
         // If there is at least one list item identifier, reduce each element
         // in the array to just its reference number.
         if (asListIdNumbers)
         {
            var iListIndex;
            for (iListIndex = 0; iListIndex < asListIdNumbers.length; ++iListIndex)
            {
               asListIdNumbers[iListIndex] = asListIdNumbers[iListIndex].replace(/^.*?ref/,"");
            }
            // Search for each hypertext reference which references a list identifier.
            // Get the reference number and look it up in the list of identifier numbers.
            // If the number is present, keep the reference as is and continue the search.
            // Otherwise remove the A tags and keep just the text.
            while(UltraEdit.activeDocument.findReplace.find('<a\\sid="rref[0-9]+"\\shref="#ref[0-9]+">.*?</a>'))
            {
               var sNumber = UltraEdit.activeDocument.selection.replace(/^.*?#ref([0-9]+).*$/,"$1");
               for (iListIndex = 0; iListIndex < asListIdNumbers.length; ++iListIndex)
               {
                  if (asListIdNumbers[iListIndex] == sNumber)
                  {
                     // Cancel the selection by moving the caret one character to the
                     // left, which places it left of the closing angle bracket of </a>.
                     UltraEdit.activeDocument.key("LEFT ARROW");
                     break;
                  }
               }
               // Is the referenced number not found in the list of numbers?
               if (iListIndex == asListIdNumbers.length)
               {
                  // Remove the A tags and keep just the text.
                  UltraEdit.activeDocument.write(UltraEdit.activeDocument.selection.replace(/^.*?>(.*?)<\/a>/,"$1"));
               }
            }
         }
         else
         {
            // Remove all hypertext reference tags for non-existing list identifiers.
            UltraEdit.activeDocument.findReplace.replace('<a\\sid="rref[0-9]+"\\shref="#ref[0-9]+">(.*?)</a>',"\\1");
         }
         UltraEdit.activeDocument.top();
      }
      
      Please let me know whether this works with your version of UltraEdit or whether something needs to be changed.

      Shortly before finishing the script, I realized that it would be more efficient to load the entire HTML file into the memory of the JavaScript engine inside UltraEdit as one large string, run all replaces on that string, and write the modified string back over the entire contents of the active file. I did not do that because I do not know whether the HTML file contains Unicode characters (it may be UTF-8 encoded) or which version of UltraEdit you use.

      Working on one string in memory would make the script faster, as there would be just two document window refreshes (one on selecting all, the second on writing the new file contents over the selected text). It would also produce just one undo record instead of multiple undo records.
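
The in-memory approach described above could look roughly like the following sketch. It is not tested in UltraEdit and only illustrates the idea. The function name cleanHtml is a hypothetical helper, not part of the UltraEdit API. The UltraEdit calls at the end (selectAll, selection, write, top) are the same ones used in the script above. The Unicode/UTF-8 concern is not addressed here, and the code deliberately uses only old-style JavaScript (var, no arrow functions) on the assumption that the script engine of an older UltraEdit does not support newer syntax.

```javascript
// Hypothetical helper (not an UltraEdit function): applies all
// replaces to the file contents passed in as one string.
function cleanHtml(sHtml)
{
   // Remove all B, I and U tags.
   sHtml = sHtml.replace(/<\/?(?:b|i|u)>/gi, "");

   // Remove all figure and table hypertext reference tags, keep the text.
   sHtml = sHtml.replace(/<a\s+href="#(?:fig|tab)[0-9]+">(.*?)<\/a>/gi, "$1");

   // Collect the numbers of all existing list item identifiers
   // as property names of an object for a fast lookup.
   var oListIds = {};
   var rListId = /<li\sid="ref([0-9]+)/gi;
   var asMatch;
   while ((asMatch = rListId.exec(sHtml)) !== null)
   {
      oListIds[asMatch[1]] = true;
   }

   // Keep each reference with an existing list identifier as is.
   // Replace every other reference by just its text.
   return sHtml.replace(/<a\sid="rref[0-9]+"\shref="#ref([0-9]+)">(.*?)<\/a>/gi,
      function (sTag, sNumber, sText)
      {
         return (oListIds[sNumber] === true) ? sTag : sText;
      });
}

// Run on the active file only when executed inside UltraEdit.
if (typeof UltraEdit !== "undefined" && UltraEdit.document.length > 0)
{
   UltraEdit.activeDocument.selectAll();
   UltraEdit.activeDocument.write(cleanHtml(UltraEdit.activeDocument.selection));
   UltraEdit.activeDocument.top();
}
```

The object lookup replaces the inner loop over the array of identifier numbers, and the replace callback decides per reference whether the A tags are kept or removed, so no caret movement is needed.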
      Best regards from an UC/UE/UES for Windows user from Austria

        6:24 - May 04 #3