How do I remove duplicate lines?

    Dec 05, 2005#1

    I have a problem that I hope someone can help me with :)
    I have a large file in which some lines are duplicated (some of them more than once) for the same "primary key". The key is always at the same position in each line.
    Is it possible to write a macro that removes the lines with duplicate values, and if so, how?
    My file has more than 100,000 lines and I would hate to do this manually :evil:
    I also have to keep the file as it is, so I can't import it into Excel.

    I have UltraEdit version 11.00b.

      Dec 05, 2005#2

      The following macro should do the job, but only if no line contains UltraEdit style regular expression characters such as +[]^%$ (see the UltraEdit help on regular expressions in UltraEdit style). Unix style cannot be used here because ^c is not available in a Unix style search.

      InsertMode
      ColumnModeOff
      HexOff
      UnixReOff
      Bottom
      IfColNum 1
      Else
      "
      "
      EndIf
      Top
      Clipboard 9
      Loop
      IfEof
      ExitLoop
      EndIf
      Key END
      IfColNumGt 1
      StartSelect
      Key HOME
      Cut
      EndSelect
      Find MatchCase RegExp "%^c^p"
      Replace All ""
      Paste
      EndIf
      Key DOWN ARROW
      EndLoop
      ClearClipboard
      Clipboard 0
      Top
      UnixReOn

      Remove the last command (UnixReOn) if you use UltraEdit style regular expressions by default instead of Unix style.
      For UltraEdit v11.10c and earlier, see the setting Unix style Regular Expressions at Advanced - Configuration - Find.
      For UltraEdit v11.20 and later, see the same setting at Advanced - Configuration - Searching.
      The macro commands UnixReOn and UnixReOff modify this setting.
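
For readers who want to see the idea behind the macro outside of UltraEdit, here is a minimal sketch in plain JavaScript. It is not macro code and does not use any UltraEdit API; the helper name removeDuplicateLines and the sample data are assumptions made for illustration. Like the macro, it walks the lines from top to bottom, skips empty lines, and removes every later line that is exactly identical (case-sensitive) to the current one.

```javascript
// Plain JavaScript illustration of the macro's approach (hypothetical helper,
// not UltraEdit macro code).
function removeDuplicateLines(lines) {
  const result = lines.slice();            // work on a copy
  for (let i = 0; i < result.length; i++) {
    const current = result[i];
    if (current === "") continue;          // like IfColNumGt 1: empty lines are skipped
    // Like "Replace All" with the cut line: drop every later identical line.
    for (let j = result.length - 1; j > i; j--) {
      if (result[j] === current) result.splice(j, 1);
    }
  }
  return result;
}

const sampleLines = ["key1;A", "key2;B", "key1;A", "", "key2;B", "key3;C"];
console.log(removeDuplicateLines(sampleLines).join("\n"));
```

The first occurrence of each line stays where it is, which matches what the macro does when it pastes the cut line back.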

      I have an idea how to do it without a regular expression search, but it is much trickier and I have no time right now to develop this macro set (it cannot be done with a single macro).
      Best regards from an UC/UE/UES for Windows user from Austria

        Dec 05, 2005#3

        Thank you :D

        I don't see how the macro works, but it actually does :D

        Thanks from Norway.

          Dec 05, 2005#4

          Hi Norwegian guy, hello Mofi,

          I added some functionality to Mofi's good macro:
          It opens a new tab and lists the duplicated lines there, because I want to know which lines were in the file more than once.
          Then it sorts them. (You can remove the sort line if you want.)

          Check it out.  :D

          Regards, Bego

          Edited by Mofi: Macro source code removed - see below for the improved version.
          Normally using all newest english version incl. each hotfix. Win 10 64 bit

            Dec 05, 2005#5

            This macro gets better and better, I'm glad there are some helpful people out there :)

            Thank you.

              Dec 09, 2005#6

              Thanks, Bego, for the idea to collect the duplicate lines as additional info and for pointing out that IfFound and IfNotFound can also be used after a replace. That was new to me although I have written dozens of UltraEdit macros. Even an experienced user like me can learn from others. Thanks again.

              I have modified the macro again. It now also works for files whose lines contain UltraEdit style regular expression characters, and it does not need a second macro as I first thought would be necessary. It could now also be converted to a macro with Unix style regular expressions instead of UltraEdit style; only 5 simple regular expressions must be changed for Unix style.
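
Why line text must not be used directly as a search pattern can be shown with a small analogy in plain JavaScript. Note the assumptions: JavaScript regular expression syntax differs from UltraEdit style, and escapeRegExp and the sample line are made up for this illustration. The macro itself sidesteps the problem by searching for the cut line as a literal string rather than by escaping it.

```javascript
// Analogy only: JavaScript regex syntax, not UltraEdit style.
// escapeRegExp is a hypothetical helper that makes a string literal-safe.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const line = "a+b";
const asPattern = new RegExp("^" + line + "$");                // "+" acts as a quantifier
const asLiteral = new RegExp("^" + escapeRegExp(line) + "$");  // "+" is a plain character

console.log(asPattern.test("a+b"));  // false: the line does not even match itself
console.log(asPattern.test("aab"));  // true: a different line would be removed
console.log(asLiteral.test("a+b"));  // true
console.log(asLiteral.test("aab"));  // false
```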

              The replace command that removes the duplicate lines is now case-sensitive. Remove the MatchCase parameter if it should ignore case.

              The duplicate lines are now collected in clipboard 8, which improves execution speed a lot. The duplicate lines are sorted. If you want this macro without the duplicate line info, remove the commands that collect and report the duplicates (the clipboard 8 handling and the NewFile block at the end), which were colored red in the original post.

              This macro is now part of my private collection of useful macros - see the sticky forum topic Macro examples and reference for beginners and experts, which contains a macro file with the macros DelDupInfo+ (macro below) and DelDupInfo- (macro below without the duplicate info collection).

              The macro property Continue if a Find with Replace not found or Continue if search string not found must be checked for this macro.


              InsertMode
              ColumnModeOff
              HexOff
              UnixReOff
              Bottom
              IfColNum 1
              Else
              "
              "
              EndIf
              Top
              Find MatchCase RegExp "%^([~^p]^)"
              Replace All "#MOFI_RULES#^1"
              Clipboard 8
              ClearClipboard

              Clipboard 9
              Loop
              Find MatchCase RegExp "%#MOFI_RULES#*$"
              IfNotFound
              ExitLoop
              EndIf
              Cut
              Find MatchCase "^c^p"
              Replace All ""
              IfFound
              Paste
              Find MatchCase Up "#MOFI_RULES#"
              Key HOME
              Clipboard 8
              Find MatchCase RegExp "%#MOFI_RULES#*^p"
              CopyAppend
              EndSelect
              Key HOME
              Clipboard 9
              Else

              Paste
              Key DOWN ARROW
              Key HOME
              EndIf
              EndLoop
              ClearClipboard
              Top
              Find MatchCase RegExp "%#MOFI_RULES#"
              Replace All ""
              NewFile
              Clipboard 8
              Paste
              ClearClipboard
              Top
              Find MatchCase RegExp "%#MOFI_RULES#"
              Replace All ""
              IfNotFound
              "NO DUPLICATES :-)
              "
              Else
              SortAsc 1 -1 0 0 0 0 0 0
              EndIf
              NextWindow

              Clipboard 0


              Add UnixReOn or PerlReOn (UE v12+) at the end of the macro if you do not use UltraEdit style regular expressions by default - see the search configuration. The macro command UnixReOff sets the regular expression option to UltraEdit style.


              Edit info: Some comments added - see below!

              This macro does not work for Unix files opened in Unix mode, because ^p matches CR+LF. Convert such a file to DOS before running the macro, either temporarily (on file load) or permanently.

              The macro is designed to remove duplicate lines only if a line matches another line 100%. If two lines look identical but differ in their trailing spaces, they are not treated as duplicates, so they are neither removed nor reported. Use the command TrimTrailingSpaces at the top of the macro after the command Top if you want to ignore trailing spaces and they can be deleted.
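
The behavior described above (keep the first occurrence, report each duplicated line once in a sorted list, leave blank lines alone, optionally ignore case or trailing spaces) can be sketched in plain JavaScript as follows. This is an illustration under assumptions, not the macro: dedupeWithInfo and its option names are hypothetical, and the sort uses JavaScript's default string order, which may differ from UltraEdit's SortAsc.

```javascript
// Hypothetical helper illustrating the DelDupInfo+ behavior in plain JavaScript.
function dedupeWithInfo(lines, options = {}) {
  const { ignoreCase = false, trimTrailing = false } = options;
  const firstSeen = new Map();   // comparison key -> first occurrence of the line
  const reported = new Set();    // keys already added to the duplicates list
  const kept = [];
  const duplicates = [];
  for (const raw of lines) {
    // Optional step comparable to TrimTrailingSpaces.
    const line = trimTrailing ? raw.replace(/[ \t]+$/, "") : raw;
    if (line === "") { kept.push(line); continue; }   // blank lines are left alone
    const key = ignoreCase ? line.toLowerCase() : line;
    if (!firstSeen.has(key)) {
      firstSeen.set(key, line);
      kept.push(line);
    } else if (!reported.has(key)) {
      reported.add(key);
      duplicates.push(firstSeen.get(key));            // report each duplicate once
    }
  }
  duplicates.sort();
  return { kept, duplicates };
}

const info = dedupeWithInfo(["b", "a", "b", "a ", "a", "b"]);
console.log(info.kept);        // "a " is kept because of its trailing space
console.log(info.duplicates);
```

Calling it with `{ trimTrailing: true }` treats "a " and "a" as the same line, which mirrors adding TrimTrailingSpaces to the macro.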

              2007-11-01: The macro has been completely rewritten because it damaged the file when it contained soft-wrapped lines. The new macro now also works for a file with soft-wrapped lines. IfEof has also been eliminated so that the macro works on Unicode files too, independent of the version of UltraEdit. IfEof works for Unicode files only since UE v13.20.

                Dec 09, 2005#7

                Bego: Thanks again for the interesting information (deleted). I will take it into consideration for future macros (see the improved version of the macro above).

                The "#MOFI_RULES#" string is used as a replacement for the regular expression character % (start of line) so that lines like the following, which differ only in leading and trailing spaces, are handled correctly without a regular expression:

                Line example
                 Line example
                 Line example 
                Another Line example

                Nothing should be changed when running the macro on those 4 lines. The third line contains a trailing space, the second line does not! Select lines 2 and 3 and you will see the difference.
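
The effect of the marker on these 4 lines can be checked with a short plain JavaScript sketch. It only imitates a non-regular-expression search with indexOf; the helper names markLines and countOccurrences are assumptions for this illustration, and CR+LF line endings are used because ^p matches CR+LF.

```javascript
const MARKER = "#MOFI_RULES#";

// Prefix every non-empty line with the marker and terminate lines with CR+LF.
function markLines(lines) {
  return lines.map(l => (l === "" ? l : MARKER + l) + "\r\n").join("");
}

// Count literal (non-regex) occurrences of needle in haystack.
function countOccurrences(haystack, needle) {
  let count = 0;
  for (let pos = haystack.indexOf(needle); pos !== -1;
       pos = haystack.indexOf(needle, pos + 1)) {
    count++;
  }
  return count;
}

const exampleLines = ["Line example", " Line example", " Line example ", "Another Line example"];
const plain = exampleLines.map(l => l + "\r\n").join("");
const marked = markLines(exampleLines);

// Without the marker, line 1 is also found at the end of lines 2 and 4.
console.log(countOccurrences(plain, "Line example\r\n"));            // 3
// With the marker, only the real first line matches.
console.log(countOccurrences(marked, MARKER + "Line example\r\n"));  // 1
```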

                  Nov 01, 2007#8

                  Hi sas2000,

                  thanks for your uedit32.ini. With your configuration I was able to reproduce the problem and find the reason why the macro failed and produced wrong output (= damaged file).

                  The problem was the soft-wrapping you have enabled. I normally do not have soft-wrapping of lines active. I use it only when editing HTML files, and I turn it off when running macros or scripts.

                  I did not know how many macro commands depend on whether wrapping is on or off. Key HOME, Key END and SelectLine, which I used before in this macro to select a line with or without its line ending, always act on the currently displayed line, which is not the entire real line if the line is soft-wrapped. As a result, the previous macro worked perfectly until it reached the first soft-wrapped line.

                  Additionally, you have the option Replace All is From Top of File active, as you can see in the Replace dialog, which makes the output even worse.

                  I have completely rewritten the macro to get correct output also when soft-wrapped lines exist.

                  I have already deleted all of our previous posts. You can now delete the zip archives and files on your website.

                  Having turned my attention to what happens when a macro that is not designed for active word-wrap mode is run on soft-wrapped lines, I now also have to update the macros DelDupInfo- and DelDupInfo+ in my macro collection and add many notes to my macro reference. But first I have to find out which macro commands work differently depending on whether word-wrap mode is on or off. That will take some time.

                  sas2000, many thanks!

                    Nov 01, 2007#9

                    To Mofi:

                    It works fine now  :D.

                    Having read your macro, I suspect they don't exist, but do you know of any macro commands to switch soft-wrapping and Replace All is From Top of File on and off? My knowledge of macros is quite limited, and this would help me avoid problems in my own macros. I've tried:

                    SoftWrapOff
                    WrapOff
                    WrapWordOff
                    WordWrapOff

                    but none worked. Can you help me?

                    Thanks.  :!:

                      Nov 02, 2007#10

                      Except for the active regular expression engine, none of the configuration settings can be changed by a macro or script. I have already written twice to IDM support that the replace option Replace All is From Top of File should be disabled internally while a macro or script is running to make the output predictable. For scripts this is the case since UE v13.20, for macros since UE v13.20a.

                        Jun 09, 2014#11

                        The macros do not work on Unicode files containing characters that are not included in the system code page, because ^c works only with ASCII/ANSI strings in UltraEdit for Windows < v24.00 and UEStudio < v17.00.

                        Here is the macro DelDupInfo- converted to an UltraEdit script. It converts the search string from UTF-16 Little Endian to UTF-8, which the Find and Replace commands of UltraEdit support for a search/replace string. The script can also be used for ASCII/ANSI files.

                        The function IsUnicode must additionally be included for UE v14.20 to v15.20 when the script is executed on Unicode files.

                        // Please include here the function IsUnicode for UE v14.20 to v15.20.
                        // See https://forums.ultraedit.com/viewtopic.php?f=52&t=5441
                        
                        function utf16to8(str)
                        {
                           /* Copyright (C) 1999 Masanao Izumo <[email protected]>
                           * Version: 1.0
                           * LastModified: Dec 25 1999
                           * This library is free.  You can redistribute it and/or modify it.
                           * http://www.onicos.com/staff/iz/amuse/javascript/expert/utf.txt */
                        
                           var out, i, len, c;
                        
                           out = "";
                           len = str.length;
                           for(i = 0; i < len; i++)
                           {
                              c = str.charCodeAt(i);
                              if ((c >= 0x0001) && (c <= 0x007F))
                              {
                                 out += str.charAt(i);
                              }
                              else if (c > 0x07FF)
                              {
                                 out += String.fromCharCode(0xE0 | ((c >> 12) & 0x0F));
                                 out += String.fromCharCode(0x80 | ((c >>  6) & 0x3F));
                                 out += String.fromCharCode(0x80 | ((c >>  0) & 0x3F));
                              }
                              else
                              {
                                 out += String.fromCharCode(0xC0 | ((c >>  6) & 0x1F));
                                 out += String.fromCharCode(0x80 | ((c >>  0) & 0x3F));
                              }
                           }
                           return out;
                        }
                        
                        if (UltraEdit.document.length > 0)  // Is any file opened?
                        {
                           // Define environment for this script.
                           UltraEdit.insertMode();
                           if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
                           else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
                           UltraEdit.ueReOn();
                        
                           var sSearch = "";
                           var bUnicodeFile = false;
                           if (typeof(UltraEdit.activeDocument.encoding) == "number")
                           {
                              // Is the file encoded in UTF-16 Little Endian or Big Endian or UTF-8?
                              if (UltraEdit.activeDocument.encoding == 1200) bUnicodeFile = true;
                              else if (UltraEdit.activeDocument.encoding == 1201) bUnicodeFile = true;
                              else if (UltraEdit.activeDocument.encoding == 65001) bUnicodeFile = true;
                           }
                           else
                           {
                              if(typeof(IsUnicode) == "function")
                              {
                                 bUnicodeFile = IsUnicode();
                              }
                              else
                              {
                                 if (UltraEdit.outputWindow.visible == false) UltraEdit.outputWindow.showWindow(true);
                                 UltraEdit.outputWindow.write("Function IsUnicode not included. Please add this function to the script.");
                                 UltraEdit.outputWindow.write("See https://forums.ultraedit.com/viewtopic.php?f=52&t=5441");
                              }
                           }
                        
                           // First go to end of the file and check if the last line of the file has a
                           // line termination. If not insert it because the script must compare whole
                           // lines. After inserting the line termination, verify if the cursor is now
                           // really at column 1. With auto indent enabled and the last line in the
                           // file has preceding whitespace, UE/UES has inserted those whitespace
                           // also on the last line of the file and the cursor is therefore not at
                           // column 1.
                        
                           UltraEdit.activeDocument.bottom();
                           if (UltraEdit.activeDocument.isColNumGt(1))
                           {
                              UltraEdit.activeDocument.insertLine();
                              if (UltraEdit.activeDocument.isColNumGt(1))
                              {
                                 UltraEdit.activeDocument.deleteLine();
                              }
                           }
                        
                           // Insert at start of every line a special marker string. This is needed
                           // because the script should find only whole duplicate lines without using
                           // a regular expression search. Without this marker string at start of the
                           // line a shorter line could completely match also with a longer line which
                           // has additional characters at start of the line.
                        
                           UltraEdit.activeDocument.top();
                           // UltraEdit.activeDocument.trimTrailingSpaces();
                        
                           UltraEdit.activeDocument.findReplace.mode=0;
                           UltraEdit.activeDocument.findReplace.matchCase=true;
                           UltraEdit.activeDocument.findReplace.matchWord=false;
                           UltraEdit.activeDocument.findReplace.regExp=true;
                           UltraEdit.activeDocument.findReplace.searchDown=true;
                           if (typeof(UltraEdit.activeDocument.findReplace.searchInColumn) == "boolean")
                           {
                              UltraEdit.activeDocument.findReplace.searchInColumn=false;
                           }
                           UltraEdit.activeDocument.findReplace.preserveCase=false;
                           UltraEdit.activeDocument.findReplace.replaceAll=true;
                           UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
                           UltraEdit.activeDocument.findReplace.replace("%^([~^p]^)","#MOFI_RULES#^1");
                        
                           // User clipboard 9 always contains the current line including the marker
                           // string whose duplicates are searched for in the file below the line.
                           UltraEdit.selectClipboard(9);
                        
                           // A regular expression find is used to select entire next line without
                           // the line ending characters. This method is better than the method used
                           // in previous versions of the script with Key END and Key HOME because it
                           // works also for long lines which are currently wrapped. Further the
                           // regular expression search automatically ignores blank lines. If this
                           // regular expression find does not find something, the end of the file
                           // is reached and therefore the loop must be exited.
                        
                           while (UltraEdit.activeDocument.findReplace.find("%#MOFI_RULES#*$"))
                           {
                              // Cut the entire line without the line ending to clipboard 9.
                              // Then replace all duplicates with a case-sensitive search and
                              // replace all. Remove the find option MatchCase to ignore the case.
                              UltraEdit.activeDocument.cut();
                              UltraEdit.activeDocument.findReplace.regExp=false;
                              if (bUnicodeFile)
                              {
                                 // Convert the UTF-16 string in clipboard to UTF-8.
                                 sSearch = utf16to8(UltraEdit.clipboardContent) + "^p";
                              }
                              else
                              {  // Get the ASCII/ANSI string directly from clipboard.
                                 sSearch = UltraEdit.clipboardContent + "^p";
                              }
                              UltraEdit.activeDocument.findReplace.replace(sSearch,"");
                              UltraEdit.activeDocument.findReplace.regExp=true;
                              // Use below command Delete instead of Paste and Key DOWN ARROW if you also
                              // want the line itself which has duplicates deleted from the source file.
                              UltraEdit.activeDocument.paste();
                              UltraEdit.activeDocument.key("DOWN ARROW");
                              UltraEdit.activeDocument.key("HOME");
                           }
                        
                           // Clipboard 9 is not needed anymore and so it can be cleared
                           // to free the RAM used for the current content of clipboard 9.
                           UltraEdit.clearClipboard();
                           // Back at top of the file remove all marker strings inserted
                           // at start of the script to mark start of a line.
                           UltraEdit.activeDocument.top();
                           UltraEdit.activeDocument.findReplace.replace("%#MOFI_RULES#","");
                           // Last switch back to the Windows clipboard.
                           UltraEdit.selectClipboard(0);
                        }
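
To see what the conversion produces, the following stand-alone sketch repeats the logic of utf16to8 from the script above and prints the resulting character codes in hex. The helper toHex and the sample string are assumptions added for illustration; nothing here calls the UltraEdit API. Each character of the returned string stands for one UTF-8 byte.

```javascript
// Same conversion logic as utf16to8 in the script above, repeated here so the
// example runs on its own.
function utf16to8(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    const c = str.charCodeAt(i);
    if (c >= 0x0001 && c <= 0x007F) {
      out += str.charAt(i);                                   // 1 byte (ASCII)
    } else if (c > 0x07FF) {
      out += String.fromCharCode(0xE0 | ((c >> 12) & 0x0F));  // 3 bytes
      out += String.fromCharCode(0x80 | ((c >> 6) & 0x3F));
      out += String.fromCharCode(0x80 | (c & 0x3F));
    } else {
      out += String.fromCharCode(0xC0 | ((c >> 6) & 0x1F));   // 2 bytes
      out += String.fromCharCode(0x80 | (c & 0x3F));
    }
  }
  return out;
}

// Hypothetical display helper: character codes as two-digit hex values.
function toHex(s) {
  return Array.from(s, ch => ch.charCodeAt(0).toString(16).padStart(2, "0")).join(" ");
}

// U+00E4 (a with diaeresis), U+20AC (euro sign) and "A".
console.log(toHex(utf16to8("\u00E4\u20ACA")));  // c3 a4 e2 82 ac 41
```

Because the function converts one UTF-16 code unit at a time, characters outside the Basic Multilingual Plane (stored as surrogate pairs) are not turned into standard 4-byte UTF-8 sequences, so lines containing such characters may not be matched.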
                        

                          Jan 01, 2018#12

                          Hi guys, the macro works great. I would like to do something similar, but instead of removing only one of the duplicate lines, is it possible to remove both, please?

                          For example, if I copy and paste the text from file-2.txt into file-1.txt, I get duplicate lines wherever lines from file-2.txt are also in file-1.txt.
                          I would like to remove both copies of each duplicate so that file-1.txt finally contains no lines that are already in file-2.txt.
                          I hope that makes sense.

                          TIA from Aus

                            Jan 01, 2018#13

                            vengab, there are two macros and one script posted here, so it is not really clear which one you would like to have modified to keep only the unique lines in the active file. I modified the first macro.

                            InsertMode
                            ColumnModeOff
                            HexOff
                            UnixReOff
                            Bottom
                            IfColNum 1
                            Else
                            "
                            "
                            EndIf
                            Top
                            Clipboard 9
                            Loop
                            IfEof
                            ExitLoop
                            EndIf
                            Key END
                            IfColNumGt 1
                            StartSelect
                            Key HOME
                            Cut
                            EndSelect
                            Find MatchCase RegExp "%^c^p"
                            Replace All ""
                            IfFound
                            DeleteLine
                            Else
                            Paste
                            Key DOWN ARROW
                            EndIf

                            Else
                            DeleteLine

                            EndIf
                            EndLoop
                            ClearClipboard
                            Clipboard 0
                            Top
                            UnixReOn

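                            The changed code (the IfFound ... EndIf block after Replace All and the Else / DeleteLine branch, formatted green in the original post) is responsible for keeping in the active file only the lines that do not exist more than once.

                            This macro additionally removes all empty lines.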
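
As a cross-check of what this variant is meant to produce, here is a minimal plain JavaScript sketch. It is not the macro and uses no UltraEdit API; keepOnlyUniqueLines and linesNotIn are hypothetical helper names. The first function mirrors the macro (drop every line that occurs more than once, and drop empty lines). The second is a different technique, a simple set difference, shown only because it matches the stated goal of keeping the lines of file-1.txt that are not in file-2.txt.

```javascript
// Keep only lines that occur exactly once; empty lines are removed as well.
function keepOnlyUniqueLines(lines) {
  const counts = new Map();
  for (const line of lines) counts.set(line, (counts.get(line) || 0) + 1);
  return lines.filter(line => line !== "" && counts.get(line) === 1);
}

// Alternative (set difference): lines of a that do not occur in b.
function linesNotIn(a, b) {
  const exclude = new Set(b);
  return a.filter(line => !exclude.has(line));
}

const file1 = ["a", "b", "c"];
const file2 = ["b", "d"];

// Pasting file-2 into file-1 and keeping unique lines also keeps "d".
console.log(keepOnlyUniqueLines(file1.concat(file2)));  // [ 'a', 'c', 'd' ]
console.log(linesNotIn(file1, file2));                  // [ 'a', 'c' ]
```

Note the difference in the results: after pasting, lines that exist only in file-2.txt remain in the combined file, so the paste-and-remove approach fully meets the goal only if every line of file-2.txt is also in file-1.txt.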