Eliminating duplicates based on strings left of the equal sign

    Oct 21, 2009 #1

    Hello,
    I have checked through the duplicates macros but cannot find one that meets my requirements.
    I have a huge file whose data consists of two columns of irregular length separated by an equal sign (=). The left-hand column contains duplicates, but the right-hand entries corresponding to these duplicates differ.
    An example will explain:
    ravi=rvI
    ravi=roI
    What I would like to do is separate out the duplicates and store them in a separate file without deleting them. In other words, the macro should create two files: non-dupes and dupes.
    Is there any chance of a macro for this? I have huge files running to over a hundred thousand entries. Please help.
    Doc

      Oct 21, 2009 #2

      If I understand you correctly, you want the data in your file split into 2 other files. The first one should contain the lines with the first occurrence of each string left of the equal sign, and the second one should contain all other lines whose string left of the equal sign has already been found once.
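
      As a cross-check outside UltraEdit, the intended split can be sketched in a few lines of Python. This is only an illustration of the expected result, not what the macros do internally. The function name is hypothetical, and the sketch assumes that the key is everything before the first equal sign, that the comparison is case-sensitive, and that lines without an equal sign are skipped.

```python
def split_first_and_dupes(lines):
    """Return (first_occurrences, duplicates), keyed on the text left of '='."""
    seen = set()             # keys that have already been encountered
    firsts, dupes = [], []
    for line in lines:
        key, sep, _ = line.partition("=")
        if not sep:
            continue         # assumption: lines without "=" are ignored
        if key in seen:
            dupes.append(line)    # key seen before: goes to the dupes list
        else:
            seen.add(key)
            firsts.append(line)   # first occurrence of this key
    return firsts, dupes


sample = ["ravi=rvI", "ravi=roI", "test1=test", "test2=1", "test2=3", "test2=2"]
firsts, dupes = split_first_and_dupes(sample)
print(firsts)
print(dupes)
```

      The sketch keeps duplicates in file order. The order of the duplicates produced by the macros may differ on files where the duplicates of different keys are interleaved.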

      For this task you need 2 macros. The macros are based on the macro I posted at "How do I remove duplicate lines?". The macro property Continue if search string not found must be checked for both macros. The macro property for the Cancel dialog should be unchecked for both macros.

      The macros are designed to work on lines with DOS line endings because ^p is used in the search strings. If your file is a Unix file opened without conversion to DOS, you have to replace every ^p with ^n in the macro sources for the macros to work correctly.
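
      If it is unclear which of the two cases applies, the raw bytes of the file can be inspected before editing the macro source. The following Python sketch is a hypothetical helper (the function name is made up for illustration). It only distinguishes the two cases described above and reports anything else as unknown; a file with mixed line endings is reported as DOS.

```python
def line_ending_token(data: bytes) -> str:
    """Suggest the line-ending token for the macro search strings."""
    if b"\r\n" in data:
        return "^p"      # DOS line endings (CR+LF)
    if b"\n" in data:
        return "^n"      # Unix line endings (LF only)
    return "unknown"     # no line feed found, decide manually


print(line_ending_token(b"ravi=rvI\r\nravi=roI\r\n"))
print(line_ending_token(b"ravi=rvI\nravi=roI\n"))
```

      To check a real file, read it in binary mode, for example with open(path, "rb"), so that Python does not translate the line endings.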

      The first macro uses the command SelectLine, which is faster but requires that the source file is opened with word wrap disabled.

      It is important that your huge source file is opened with use of a temporary file, because the macro modifies the source file and then closes it without saving the changes (and reopens it). This works only if the modifications are not permanent, which is the case only when the file is opened with use of a temporary file.

      First click on Macro - Edit Macro, then on button New Macro, and enter FindDuplicates as the macro name. The name of the first macro is important, including the case of the letters, because the second macro calls it by this name. After setting the properties as written above, click OK and replace the existing lines with the following macro code:

      Loop
      Clipboard 9
      Find MatchCase "^c"
      IfFound
      SelectLine
      Clipboard 8
      CutAppend
      Else
      ExitLoop
      EndIf
      EndLoop

      Next click on button New Macro again and confirm that you want to save the modifications of the macro just created. For the second macro the name does not matter; use, for example, SplitDupsFile. After setting the properties as written above, click OK and replace the existing lines with the following macro code:

      InsertMode
      ColumnModeOff
      HexOff
      UnixReOff
      Bottom
      IfColNum 1
      Else
      "
      "
      EndIf
      Top
      Find RegExp "%^([~^p]^)"
      Replace All "#MOFI_RULES#^1"
      Clipboard 7
      ClearClipboard
      Clipboard 8
      ClearClipboard
      Clipboard 9
      Loop
      Find MatchCase RegExp "%#MOFI_RULES#*="
      IfNotFound
      ExitLoop
      EndIf
      Copy
      Clipboard 7
      CutAppend
      Find RegExp "?++^p"
      CutAppend
      PlayMacro 1 "FindDuplicates"
      Top
      EndLoop
      CopyFilePath
      CloseFile NoSave
      Open "^c"
      ClearClipboard
      NewFile
      Clipboard 7
      Paste
      ClearClipboard
      Top
      Find MatchCase RegExp "%#MOFI_RULES#"
      Replace All ""
      NewFile
      Clipboard 8
      Paste
      ClearClipboard
      Top
      Find MatchCase RegExp "%#MOFI_RULES#"
      Replace All ""
      IfNotFound
      "NO DUPLICATES :-)
      "
      EndIf
      Clipboard 0

      After closing the Edit Macro dialog with button Close and confirming that you want to save the modifications of the second macro, open your file if it is not already open and run the second macro once. It will take a very long time on your huge file, but when it finishes, the source file has been closed without saving the changes and reopened, and you should have 2 new files. The first (left) one contains the first occurrence of each line with a unique string left of the equal sign, and the second (right) one contains the lines with the duplicates.

      For example the source file contains:

      ravi=rvI
      ravi=roI
      test1=test
      test2=1
      test2=3
      test2=2


      The first new file contains after macro execution:

      ravi=rvI
      test1=test
      test2=1


      The second new file contains after macro execution:

      ravi=roI
      test2=3
      test2=2

      Best regards from a UC/UE/UES for Windows user from Austria

        Oct 21, 2009 #3

        Mofi,
        You have saved my life. I revisited the forum not expecting a reply, and to my surprise I found the answer. Many thanks. I tested it on a file of 200 words; it runs fast and is accurate.
        The actual file has around 264663 records, so I'll leave it running tonight and check the result tomorrow.
        Many thanks once more.

        Boromir