Delete duplicate lines within each block


    Sep 14, 2012#1

    Hi
    I need some help with a large file. My data contains many blocks of messages; each block starts with two ==== lines and ends with two ==== lines, as shown below. I want to delete any duplicate lines within each block.
    For example, in the 1st block below the value 3 appears 2 times and I want to delete the duplicate. The 2nd block has no duplicates, the 3rd block again has a duplicated value (2), and the 4th block has no duplicates.

    Can someone help me urgently?

    ====
    ====
    1
    3
    5
    3
    6
    7
    ====
    ====
    4
    5
    6
    7
    8
    ====
    ====
    1
    2
    3
    4
    5
    2
    ====
    ====
    2
    3
    4
    5
    6
    ====
    ====


      Sep 14, 2012#2

      Here is version 2 of the script written for this task. Copy and paste the script code into a new ASCII file, save the file, for example as RemoveDuplicatesInBlocks.js, add the script file to the list of scripts via Scripting - Scripts, open the file on which the script should be executed, and run the script from the Scripting menu.

      Code:

      if (UltraEdit.document.length > 0)  // Is any file opened in UltraEdit?
      {
      
         // Define environment for this script.
         UltraEdit.insertMode();
         if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
         else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
      
         // Select entire file and load all into memory as blocks.
         UltraEdit.activeDocument.selectAll();
         var sBlockSeparator = "====\r\n";
      
         // The string above is the block separator. The command below creates
         // an array of strings. Everything between 2 block separating strings
         // is loaded into the array of strings as a string. As there is nothing
         // between two successive block separating lines, there are also empty
         // strings added to the array.
         var asBlocks = UltraEdit.activeDocument.selection.split(sBlockSeparator);
      
         if (asBlocks.length > 1 )   // Any block found?
         {
            var nDuplicateLines = 0; // Variable for counting the number of duplicate lines removed.
            var nLineNum = 0;        // Variable holding current original line number.
      
            // Prepare output window for process information.
            UltraEdit.outputWindow.showStatus=false;
            UltraEdit.outputWindow.clear();
            if (UltraEdit.outputWindow.visible == false)
            {
               UltraEdit.outputWindow.showWindow(true);
            }
      
            // Evaluate each block for duplicate lines to remove.
            for (var nBlockIndex = 0; nBlockIndex < asBlocks.length; nBlockIndex++)
            {
               // If the block string is empty, ignore it: it is the gap between 2 block separating lines.
               if (!asBlocks[nBlockIndex].length)
               {
                  nLineNum++;
                  continue;
               }
      
               // Split a block string into an array of strings each
               // string containing one line of the current block.
               var asLines = asBlocks[nBlockIndex].split("\r\n");
      
               // Build an array with the line numbers for the lines of this block. This
               // is necessary as duplicate lines removed dynamically reduce the total
               // number of lines within the block and therefore it is not possible to
               // calculate original line numbers when a duplicate line is removed.
               nLineNum++;
               var anLineNumbers = new Array(asLines.length);
               for (var nLineIndex = 0; nLineIndex < asLines.length; nLineIndex++)
               {
                   anLineNumbers[nLineIndex] = nLineNum + nLineIndex;
               }
               nLineNum += asLines.length - 1;
      
               // Check every line of the block except the last line against
               // all other lines of the block below. The last string in the
               // array is always an empty string and is therefore ignored too.
               var bLinesRemoved = false;
               for (var nLineIndex = 0; nLineIndex < (asLines.length-2); nLineIndex++)
               {
                  // Start with the compares on next line.
                  var nCompareIndex = nLineIndex+1;
                  while (nCompareIndex < (asLines.length-1))
                  {
                     // Current line within current block identical with line to compare?
                     if (asLines[nLineIndex] != asLines[nCompareIndex]) nCompareIndex++;
                     else
                     {
                        nDuplicateLines++;
                        bLinesRemoved = true;
                        UltraEdit.outputWindow.write("Line "+anLineNumbers[nCompareIndex]+" removed as identical with line "+anLineNumbers[nLineIndex]);
                        asLines.splice(nCompareIndex,1);
                        anLineNumbers.splice(nCompareIndex,1);
                     }
                  }
               }
               // If any line was removed in this block, rebuild the block string from
               // the array of strings each containing one line of the block.
               if (bLinesRemoved) asBlocks[nBlockIndex] = asLines.join("\r\n");
            }
      
            var sFinalReport;
            if (nDuplicateLines)    // Were any duplicate lines removed?
            {
               // Replace entire file content by rebuilt file content.
               UltraEdit.selectClipboard(9);
               UltraEdit.clipboardContent = asBlocks.join(sBlockSeparator);
               UltraEdit.activeDocument.paste();
               UltraEdit.clearClipboard();
               UltraEdit.selectClipboard(0);
               UltraEdit.activeDocument.top();
               sFinalReport = nDuplicateLines.toString() + " duplicate line" + ((nDuplicateLines > 1) ? "s" : "") + " removed.";
            }
            else
            {
               UltraEdit.activeDocument.top();
               sFinalReport = "No duplicate lines found.";
            }
            UltraEdit.outputWindow.write(sFinalReport);
         }
      }
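The split, compare and join steps of the script can also be tried outside UltraEdit. The following is only a minimal sketch in plain JavaScript, not the script above: the function name `removeDuplicatesInBlocks` and its parameters are hypothetical, CRLF line endings are assumed as in the script, and a `Set` is swapped in for the nested compare loop to detect repeated lines.

```javascript
// Hypothetical standalone helper; not part of the UltraEdit script.
// Assumes "====" separator lines and CRLF line endings, as in the thread.
function removeDuplicatesInBlocks(text, separator = "====\r\n", eol = "\r\n") {
  let removed = 0;
  const blocks = text.split(separator).map((block) => {
    // Empty strings are the gaps between two successive separator lines.
    if (block.length === 0) return block;

    const lines = block.split(eol);
    // Like the script, leave the last array element (normally "") untouched.
    const last = lines.pop();

    const seen = new Set(); // assumption: Set lookup instead of nested loops
    const kept = lines.filter((line) => {
      if (seen.has(line)) {
        removed++;
        return false; // drop the repeated line, keep the first occurrence
      }
      seen.add(line);
      return true;
    });

    kept.push(last);
    return kept.join(eol);
  });
  return { text: blocks.join(separator), removed };
}

// Sample data from the first post of this topic.
const sample = [
  "====", "====", "1", "3", "5", "3", "6", "7",
  "====", "====", "4", "5", "6", "7", "8",
  "====", "====", "1", "2", "3", "4", "5", "2",
  "====", "====", "2", "3", "4", "5", "6",
  "====", "====", ""
].join("\r\n");

const result = removeDuplicatesInBlocks(sample);
console.log(result.removed + " duplicate lines removed."); // prints "2 duplicate lines removed."
```

The first occurrence of a line is always kept and the order of the remaining lines is unchanged, which matches the behavior of the script.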


        Sep 14, 2012#3

        Thank you so much for your quick reply.

        Your script works perfectly with small data (like five blocks). But when I run it on huge data (200000 lines), it runs for a long time; it has not finished after 30 minutes. I will wait some more time.

        Can I ask another favor: on top of removing duplicates, can you display the row numbers it deleted?


          Sep 15, 2012#4

          It would have been good if you had written how large your file is. For files larger than 20 MB I would have coded the script completely differently. The script is written for maximum performance by doing everything in memory. But this works only for small files. This script is not written for files of several dozen MB or even GB.

          UltraEdit versions prior to v18.20.0.1017 have a problem with writing large strings from memory back to a file: it takes very long. You have not written which version of UltraEdit you have, and therefore I assumed that you are using the currently latest version, v18.20.0.1017, which does not have this problem. For previous versions of UE I would have written, instead of

          Code:

                   UltraEdit.activeDocument.write(asBlocks.join(sBlockSeparator));
          in the script, the following:

          Code:

                   UltraEdit.selectClipboard(9);
                   UltraEdit.clipboardContent = asBlocks.join(sBlockSeparator);
                   UltraEdit.activeDocument.paste();
                   UltraEdit.clearClipboard();
                   UltraEdit.selectClipboard(0);
          which is a workaround for the slow speed of the write command when writing large strings into a file.
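The choice between the two variants can be wrapped in one function. This is only a sketch: the name `replaceSelectedText`, its parameters, and the `fakeUltraEdit` stand-in are hypothetical and exist only so the code can run outside UltraEdit. The calls made on the object are the ones used in the two snippets above; inside a real script the `UltraEdit` object itself would be passed.

```javascript
// Hypothetical wrapper around the two variants quoted in this post.
// `ue` is the UltraEdit scripting object, or a stand-in with the same members.
function replaceSelectedText(ue, text, useClipboardWorkaround) {
  if (!useClipboardWorkaround) {
    // Direct variant: reported as slow for large strings in older versions.
    ue.activeDocument.write(text);
    return;
  }
  // Workaround variant: route the text through user clipboard 9.
  ue.selectClipboard(9);
  ue.clipboardContent = text;
  ue.activeDocument.paste();
  ue.clearClipboard();
  ue.selectClipboard(0); // switch back to the clipboard selected by default
}

// Stand-in object (an assumption for demonstration) that records the calls.
const callLog = [];
const fakeUltraEdit = {
  clipboardContent: "",
  selectClipboard(n) { callLog.push("selectClipboard(" + n + ")"); },
  clearClipboard() { this.clipboardContent = ""; callLog.push("clearClipboard()"); },
  activeDocument: {
    write(text) { callLog.push("write(" + text.length + " chars)"); },
    paste() { callLog.push("paste(" + fakeUltraEdit.clipboardContent.length + " chars)"); }
  }
};

replaceSelectedText(fakeUltraEdit, "1\r\n3\r\n5\r\n", true);
console.log(callLog.join(" -> "));
```

In a real script the call would look like `replaceSelectedText(UltraEdit, asBlocks.join(sBlockSeparator), true)` for an older version of UltraEdit.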
          Malini wrote: Can I ask another favor like, on top of removing duplicates, can you display the row number it deleted?
          With the script as it was, this was not possible. The line number information was not present in memory. But I rewrote the script to also keep line number information. This makes the script slower, but the output window now shows which lines of the original file content have been removed to build the new file content with fewer lines. See the modified script in my previous post, which now also contains comments.
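Why the extra bookkeeping is needed can be seen on a single block. The sketch below is a reduced illustration of the idea, not the script itself: the function name `dedupeTrackingLines` and its parameters are hypothetical. A second array holds the original line number of every line and is spliced in step with the array of lines, so the reported numbers stay correct after earlier removals have shifted the indexes.

```javascript
// Hypothetical helper showing the parallel line number array on one block.
// `lines` holds the lines of the block, `firstLineNumber` is the number of
// its first line in the original file.
function dedupeTrackingLines(lines, firstLineNumber) {
  const texts = lines.slice();
  const numbers = texts.map((_, index) => firstLineNumber + index);
  const log = [];

  for (let i = 0; i < texts.length - 1; i++) {
    let j = i + 1;
    while (j < texts.length) {
      if (texts[i] !== texts[j]) {
        j++;
        continue;
      }
      log.push("Line " + numbers[j] + " removed as identical with line " + numbers[i]);
      // Remove from both arrays so text and original number stay paired.
      texts.splice(j, 1);
      numbers.splice(j, 1);
    }
  }
  return { lines: texts, log };
}

const report = dedupeTrackingLines(["a", "a", "b", "b"], 1);
report.log.forEach((message) => console.log(message));
// prints "Line 2 removed as identical with line 1"
// prints "Line 4 removed as identical with line 3"
```

Without the second array, the second message would have to be built from the shifted indexes and would wrongly name lines 3 and 2.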


            Sep 16, 2012#5

            After changing to your latest code, it works fine and very fast with up to 200K lines. I am cutting the file into chunks of about 200K lines and running your script on each chunk.

            I am using an old version of UE.

            Thank you so much for your help.