Finding duplicate values in a record, listing them in new tab

Finding duplicate values in a record, listing them in new tab

7
NewbieNewbie
7

    Aug 03, 2005#1

    Hi,
    I have read the other finding and removing duplicate posts on here but they all seem to be about removing duplicates. I am just trying to find them in large file with over 100,000 records in it. The file is already sorted on the proper key order and so doing a sort I don't see as helping me.

    Here is a small sample data set. (though it doesn't look like it.. all records are the same length.)
    00560000000012345600004196155000000FP JENS0200508010808 00040004
    00560000000022345600004224262000000FP JENS0200508010808 00040004
    00560000000038764200001122222012000FP BOBS0200508010808 00040004
    00560000000048764200001122342012000FP BOBS0200508010808 00040004

    What I want to do is compare the bolded columns (13-29) in each record, to the same columns in the following record to verify that it does not the same record. If it does I want all instances of the original and its duplicate listed out into a new tab, or even the usual 'find dialog' box would be cool.

    Another acceptable solution would be to compare those columns to those columns in the rest of the file, but I think that would be a lot of unneccesary work that would need to be completed.

    I have never created a Macro in UltraEdit so I was hoping someone might be able to help me out with this one.

    Thanks,
    Christopher

    6,675585
    Grand MasterGrand Master
    6,675585

      May 27, 2006#2

      The bolded columns are 14-29, if first column has the number 1 !

      For this job 2 macros are needed because nesting of loops is not possible. Both macros need the macro property Continue if a Find with Replace not found checked.

      It's important that you first create the submacro with the case-sensitive name FindDupValue. It appends to existing content of clipboard 8
      the current line which contains the string in clipboard 9 until the string is not found anymore.

      Submacro FindDupValue:

      Loop
      Clipboard 9
      Find "^c"
      IfNotFound
      ExitLoop
      Else
      SelectLine
      Clipboard 8
      CopyAppend
      EndIf
      EndLoop


      The main macro which I have named FindDupRecords is much more complicated. The macro is designed to run on an ASCII DOS file which is already saved before the macro is executed.

      First the macro checks if the last line is terminated with CR/LF and if not adds it. Next it sorts the file according to the content in column 14-29. This is important because the duplicate records must be in a consecutive order.

      There is no option to find something only in a specified range. So I have used a regular expression replace to mark the columns of interest with the »« characters. I hope these 2 characters do not exist in your file.

      I don't like window switching during a macro execution. This slows down the execution speed. So the macro uses clipboard 8 as buffer for the duplicate line information. The clipboard 8 is cleared and the loop starts.

      Inside the loop first the value of columns 14-29 of the current line is selected and copied to user clipboard 9. If there is no more a string surrounded with »«, the end of file is reached and the loop is breaked.

      After copying the string to clipboard 9, the cursor is moved to begin of the next line. This must also work on last line or an endless loop is created. Now you know why the macro starts with the EOL check of the last line of the file.

      From this position it searches for the string in clipboard 9. If it is not found, the cursor is still at begin of the line and the value of the line above is not duplicate.

      If the value is found, it has to been found in the current line because the file was sorted according to the values in column 14-29. Well, the macro has found a duplicate record. So the cursor is positioned at begin of the line above (= the first line with this value) and 2 lines are selected and appended to clipboard 8.

      It's possible that the current value exists in more than 2 lines. The submacro FindDupValue finds now all other duplicate lines with this value and appends it also to clipboard 8.

      After all lines with this value are stored in the buffer, the main macro appends an additional DOS line break to create blocks in the result file for better viewing later.

      After all lines of the file are processed, the code after command EndLoop is reached. The source macro was modified. The macro copies now it's file name with path to clipboard 9, closes the modified source file without saving it and opens it again. The macro should not destroy the content of the source file. Clipboard 9 is not used anymore and so the macro clears it.

      Next the new file is created and the content in clipboard 8 is pasted. Clearing the maybe big amount of data in clipboard 8 and switching back to the windows clipboard are the next steps of the macro. If there are duplicate records, we have an unnecessary blank line at the end of the result which is deleted now.

      Back at top of the file the cursor is moved to the end of the first line. If the cursor is now at column 1, no duplicate records has been found by the macro and you will get an appropriate info.

      If duplicate records has been found, the macro sets the cursor back to top of the file and removes the marker characters inserted.


      Macro FindDupRecords:

      InsertMode
      ColumnModeOff
      HexOff
      UnixReOff
      Bottom
      IfColNum 1
      Else
      "
      "
      EndIf
      Top
      SortAsc 14 29 1 13 30 -1 0 0
      Find RegExp "%^(?????????????^)^(????????????????^)"
      Replace All "^1»^2«"
      Clipboard 8
      ClearClipboard
      Clipboard 9
      Loop
      Find RegExp "»*«"
      IfNotFound
      ExitLoop
      EndIf
      Copy
      Key HOME
      Key DOWN ARROW
      Find "^c"
      IfFound
      Clipboard 8
      Key HOME
      Key UP ARROW
      StartSelect
      Key DOWN ARROW
      Key DOWN ARROW
      CopyAppend
      EndSelect
      PlayMacro 1 "FindDupValue"
      Clipboard 8
      Key UP ARROW
      Find "^p"
      CopyAppend
      Clipboard 9
      EndIf
      EndLoop
      CopyFilePath
      CloseFile NoSave
      Open "^c"
      ClearClipboard
      NewFile
      Clipboard 8
      Paste
      ClearClipboard
      Clipboard 0
      Key BACKSPACE
      Top
      Key END
      IfColNum 1
      "No duplicate values found !!!"
      Else
      Key HOME
      Find RegExp "[»«]"
      Replace All ""
      EndIf

      Add UnixReOn or PerlReOn (v12+ of UE) at the end of the macro if you do not use UltraEdit style regular expressions by default - see search configuration. Macro command UnixReOff sets the regular expression option to UltraEdit style.
      Best regards from an UC/UE/UES for Windows user from Austria