Mark lines with duplicates in first 60 columns...

    Oct 30, 2009#1

    I can achieve this with the new sort option "remove duplicates" up to column 60. But I don't want to sort; I need to leave the file as it is.


    I have a file with an average line length of 180 columns and 26,000 lines.

    I want to mark every line that has a duplicate within the first 60 columns.


    Here is an example.

    Imagine the letters in my example are within the first 60 columns, and the numbers are column 61 and up.

    The duplicate lines in my example are the 3rd and 6th lines, both the same as the 1st line. I want them marked with <<>> or something similar.


    before....

    aaaa 1111 2222 3333 4444
    bbbb 1111 2222 3333 4444
    aaaa 1111 2222 3333 4444
    cccc 1111 2222 3333 4444
    dddd 1111 2222 3333 4444
    aaaa 1111 2222 3333 4444


    after...

    aaaa 1111 2222 3333 4444
    bbbb 1111 2222 3333 4444
    <<>>aaaa 1111 2222 3333 4444
    cccc 1111 2222 3333 4444
    dddd 1111 2222 3333 4444
    <<>>aaaa 1111 2222 3333 4444


    I hope that makes sense.

    Is there a way to do this? I've searched the forums for a bit and haven't found the answer yet so sorry if I missed it.

    I'm using UE v15.20.0.1020.


      Oct 31, 2009#2

      This is no problem using a macro. The macro is a simplified version of what I have posted in the topic "How do I remove duplicate lines?"

      The macro first inserts the string #MOFI_RULES# at the start of every non-blank line as a marker for the start of the line.

      Then a loop is executed. Inside the loop, a Perl regular expression search finds the next line starting with the marker string #MOFI_RULES# followed by up to 60 characters other than line terminating characters. If no such string is found between the current cursor position and the bottom of the file, the loop is exited.

      If a string is found, it is copied to clipboard 9 and the cursor is moved to the start of the next line.

      A simple, non regular expression replace all command is now used to insert your marker string <<<>>> at the start of all lines which also start with the marker string #MOFI_RULES# followed by the same up to 60 characters as the line above the cursor. The marker string #MOFI_RULES# prevents this replace command from finding the up to 60 characters anywhere else than at the start of a line. A regular expression replace can't be used here for two reasons: ^c (clipboard content) is only supported by the UltraEdit regular expression engine, and the clipboard content would itself be interpreted as an UltraEdit regular expression if it contained UltraEdit regular expression characters.
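The same pitfall exists in most regular expression engines. As a rough illustration in Python (not UltraEdit's engine; the function names and the sample text are made up for this sketch), the following compares a plain text prefix test with a regular expression built directly from the copied text:

```python
import re


def literal_prefix(line: str, key: str) -> bool:
    """Plain text comparison: every character of key is taken literally."""
    return line.startswith(key)


def regex_prefix(line: str, key: str) -> bool:
    """key is used as a pattern, so characters like ( ) + * act as operators."""
    return re.match(key, line) is not None


def escaped_regex_prefix(line: str, key: str) -> bool:
    """Escaping the key first makes the regex behave like the plain comparison."""
    return re.match(re.escape(key), line) is not None


# Hypothetical line start that happens to contain regex operators.
key = "total (a+b)*2"
line = "total (a+b)*2 = 10"

print(literal_prefix(line, key))        # True
print(regex_prefix(line, key))          # False
print(escaped_regex_prefix(line, key))  # True
```

The unescaped pattern fails on the very text it was copied from, which is why the macro relies on a non regular expression replace for this step.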

      Your marker string <<<>>>, inserted at the start of lines whose first 60 characters duplicate another line, prevents such lines from being taken into account on further searches.

      Finally, the marker string #MOFI_RULES# is removed from all lines and the Windows clipboard is made active again.

      The macro property Continue if search string not found must be checked for this macro.

      InsertMode
      ColumnModeOff
      HexOff
      PerlReOn
      Bottom
      IfColNumGt 1
      "
      "
      EndIf
      Top
      Find RegExp "^([^\r\n])"
      Replace All "#MOFI_RULES#\1"
      Clipboard 9
      Loop
      Find RegExp "^#MOFI_RULES#.{1,60}"
      IfNotFound
      ExitLoop
      EndIf
      Copy
      Key DOWN ARROW
      Key HOME
      Find MatchCase "^c"
      Replace All "<<<>>>^c"
      IfFound
      Find MatchCase "<<<>>>#MOFI_RULES#"
      Replace All "<<<>>>"
      EndIf

      EndLoop
      Top
      Find MatchCase "#MOFI_RULES#"
      Replace All ""
      ClearClipboard
      Clipboard 0
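For checking the expected result outside UltraEdit, the effect of the macro on a file whose lines all have at least 60 characters can be sketched in Python. This is not a translation of the macro: it remembers the keys already seen in a set instead of running a replace all per line. The function name and the `width` parameter are assumptions of this sketch.

```python
MARK = "<<<>>>"


def mark_later_duplicates(lines, width=60):
    """Prefix MARK to every line whose first `width` characters already
    occurred on an earlier line. Line order is left unchanged."""
    seen = set()
    result = []
    for line in lines:
        key = line[:width]
        if key and key in seen:
            result.append(MARK + line)   # 2nd, 3rd, ... occurrence
        else:
            seen.add(key)                # first occurrence stays unmarked
            result.append(line)
    return result


# Small demonstration with width=4 standing in for the 60 columns.
sample = ["aaaa 1", "bbbb 2", "aaaa 3", "cccc 4", "aaaa 5"]
for out_line in mark_later_duplicates(sample, width=4):
    print(out_line)
```

Blank lines are never marked, matching the macro, which only puts its start marker on non-blank lines.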

      The red marked code (the IfFound ... EndIf block) is only needed if the file also has lines with fewer than 60 characters and those lines start with the same characters as a longer line located above them. For example, for a file content like the following

      aaaa 1111 2222 3333 4444
      bbbb 1111 2222 3333 4444
      aaaa 1111 2222
      cccc 1111 2222 3333 4444
      dddd 1111 2222 3333 4444
      aaaa 1111 2222 3333 4444
      aaaa 1111 2222

      the macro above without the red marked code would produce

      aaaa 1111 2222 3333 4444
      bbbb 1111 2222 3333 4444
      aaaa 1111 2222
      cccc 1111 2222 3333 4444
      dddd 1111 2222 3333 4444
      <<<>>><<<>>>aaaa 1111 2222 3333 4444
      <<<>>>aaaa 1111 2222

      As you can see, the 6th line is marked twice as a duplicate, which is not correct. The red marked code removes the marker string #MOFI_RULES# from lines already marked with <<<>>> to prevent such lines from being marked more than once. If your file does not contain lines with fewer than 60 characters, you can omit the red marked code, which makes the macro faster.
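The double marking can be reproduced with a small Python model of the loop. This is only a sketch of the behavior as described in this post, not of UltraEdit itself: each unmarked, non-blank line supplies a key of up to 60 characters, and every line below which starts with that key receives a marker. The `skip_marked` flag is a made-up name standing in for the optional IfFound ... EndIf block.

```python
MARK = "<<<>>>"


def mark_like_macro(lines, width=60, skip_marked=True):
    """Model of the macro loop: a key matches any later line that starts with it."""
    marks = [0] * len(lines)            # number of markers each line received
    for i, line in enumerate(lines):
        if not line or marks[i]:
            continue                    # blank or marked lines never supply a key
        key = line[:width]
        for j in range(i + 1, len(lines)):
            if not lines[j].startswith(key):
                continue
            if skip_marked and marks[j]:
                continue                # already marked: leave it alone
            marks[j] += 1
    return [MARK * count + line for count, line in zip(marks, lines)]


sample = [
    "aaaa 1111 2222 3333 4444",
    "bbbb 1111 2222 3333 4444",
    "aaaa 1111 2222",
    "cccc 1111 2222 3333 4444",
    "dddd 1111 2222 3333 4444",
    "aaaa 1111 2222 3333 4444",
    "aaaa 1111 2222",
]
print(mark_like_macro(sample, skip_marked=False)[5])  # marked twice
print(mark_like_macro(sample, skip_marked=True)[5])   # marked once
```

Like the macro, this model compares every key against all lines below it, so its running time grows roughly with the square of the number of lines.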
      Best regards from an UC/UE/UES for Windows user from Austria


        Nov 01, 2009#3

        MOFI, Thanks so much. It is exactly what I asked for.

        BUT, I made a mistake when asking. I must not have been thinking. I'm really sorry, but I actually need to mark the duplicate along with the original occurrence.

        Would it be too much to ask if you could modify it so it will do that?

        So just like you did, but also marking the original...


        before....

        aaaa 1111 2222 3333 4444
        bbbb 1111 2222 3333 4444
        aaaa 1111 2222 3333 4444
        cccc 1111 2222 3333 4444
        dddd 1111 2222 3333 4444
        aaaa 1111 2222 3333 4444


        after...

        <<>>aaaa 1111 2222 3333 4444
        bbbb 1111 2222 3333 4444
        <<>>aaaa 1111 2222 3333 4444
        cccc 1111 2222 3333 4444
        dddd 1111 2222 3333 4444
        <<>>aaaa 1111 2222 3333 4444


        Thanks again MOFI!!


          Nov 01, 2009#4

          No problem. The 2 green marked lines (Find MatchCase Up "#MOFI_RULES#" and Replace "<<<>>>") must be inserted to also mark the first line for which duplicates are found. The entire IfFound ... EndIf condition block is no longer optional; only the first find and replace all command within the condition block is.

          InsertMode
          ColumnModeOff
          HexOff
          PerlReOn
          Bottom
          IfColNumGt 1
          "
          "
          EndIf
          Top
          Find RegExp "^([^\r\n])"
          Replace All "#MOFI_RULES#\1"
          Clipboard 9
          Loop
          Find RegExp "^#MOFI_RULES#.{1,60}"
          IfNotFound
          ExitLoop
          EndIf
          Copy
          Key DOWN ARROW
          Key HOME
          Find MatchCase "^c"
          Replace All "<<<>>>^c"
          IfFound
          Find MatchCase "<<<>>>#MOFI_RULES#"
          Replace All "<<<>>>"

          Find MatchCase Up "#MOFI_RULES#"
          Replace "<<<>>>"

          EndIf
          EndLoop
          Top
          Find MatchCase "#MOFI_RULES#"
          Replace All ""
          ClearClipboard
          Clipboard 0
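The requested result, with the original occurrence marked as well, can be sketched in Python with two passes: count the keys first, then mark every line whose key occurs more than once. This sketch assumes exact equality of the first 60 characters, so it does not reproduce how the macro treats lines shorter than 60 characters. The function name is hypothetical.

```python
from collections import Counter

MARK = "<<<>>>"


def mark_all_occurrences(lines, width=60):
    """Prefix MARK to every non-blank line whose first `width` characters
    occur on more than one line, including the first occurrence."""
    counts = Counter(line[:width] for line in lines if line)
    return [
        MARK + line if line and counts[line[:width]] > 1 else line
        for line in lines
    ]


# Small demonstration with width=4 standing in for the 60 columns.
sample = ["aaaa 1", "bbbb 2", "aaaa 3", "cccc 4", "aaaa 5"]
for out_line in mark_all_occurrences(sample, width=4):
    print(out_line)
```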


            Nov 02, 2009#5

            Perfect!

            Thanks so much for your time again Mofi.

            You're one of the best parts of Ultraedit. :D


              Oct 18, 2010#6

              This macro works exactly as advertised in UltraEdit 15 and 16, but I can't get it to work in UltraEdit 12.10. Is there something different that needs to be done in that version, or is it "user error" that's making it not run? I see that it is adding the marker string and then deleting it, but it is not marking any duplicates.

              TIA,
              scoobman3


                Oct 19, 2010#7

                The problem with this macro and UE v12.10 is a bug with the Perl regular expression engine introduced with UE v12.00 and used in this macro. Because of this bug, the macro as posted above requires at least UE v13.00. Simply switching to another engine does not work either, because the expression {1,60} is not available in the UltraEdit or Unix regular expression engine. A search string with a fixed length of 60 characters must be used instead, so all lines must have at least 60 characters. However, that can be guaranteed with additional macro code, and then the UltraEdit regular expression engine can be used. Here is the macro code which worked with UE v12.10 for the example. The blue colored parts are the changes made in comparison to the macro above to get the same result using the UltraEdit regular expression engine: the selection of the regular expression engine, the padding of all lines with spaces, the search strings in UltraEdit syntax, and the removal of the padding before ClearClipboard.

                InsertMode
                ColumnModeOff
                HexOff
                UnixReOff
                Bottom
                IfColNumGt 1
                "
                "
                EndIf
                Top
                Key END
                Loop
                IfColNumGt 60
                ExitLoop
                Else
                " "
                EndIf
                EndLoop
                ColumnModeOn
                ColumnInsert " "
                ColumnModeOff
                Top

                Find RegExp "%^([~^p]^)"
                Replace All "#MOFI_RULES#^1"
                Clipboard 9
                Loop
                Find RegExp "%#MOFI_RULES#????????????????????????????????????????????????????????????"
                IfNotFound
                ExitLoop
                EndIf
                Copy
                Key DOWN ARROW
                Key HOME
                Find MatchCase "^c"
                Replace All "<<<>>>^c"
                IfFound
                Find MatchCase "<<<>>>#MOFI_RULES#"
                Replace All "<<<>>>"
                Find MatchCase Up "#MOFI_RULES#"
                Replace "<<<>>>"
                EndIf
                EndLoop
                Top
                Find MatchCase "#MOFI_RULES#"
                Replace All ""
                Find MatchCase RegExp "%<<<>>>^(*^)$"
                Replace All "^1#!#"
                Key END
                Key LEFT ARROW
                IfCharIs "#"
                Key LEFT ARROW
                Key LEFT ARROW
                Key LEFT ARROW
                EndIf
                ColumnModeOn
                ColumnDelete 1
                ColumnModeOff
                Top
                Find MatchCase RegExp "%^(*^)#!#$"
                Replace All "<<<>>>^1"
                TrimTrailingSpaces

                ClearClipboard
                Clipboard 0
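The idea behind the padding can be sketched in Python. This is not a translation of the macro above: it only illustrates that once every line is padded with spaces to a fixed width, the comparison uses exactly 60 characters per line, so a short line no longer matches a longer line that merely starts with the same characters. The function name is made up, and skipping blank lines is an assumption of this sketch.

```python
from collections import Counter

MARK = "<<<>>>"


def mark_with_padding(lines, width=60):
    """Pad each line to `width` columns, mark all lines sharing the same
    padded key, then trim the trailing spaces again."""
    padded = [line.ljust(width) for line in lines]
    counts = Counter(p[:width] for p in padded if p.strip())
    marked = [
        MARK + p if p.strip() and counts[p[:width]] > 1 else p
        for p in padded
    ]
    # Like TrimTrailingSpaces, this also removes trailing spaces that were
    # already present in the original lines.
    return [m.rstrip(" ") for m in marked]


sample = [
    "aaaa 1111 2222 3333 4444",
    "bbbb 1111 2222 3333 4444",
    "aaaa 1111 2222",
    "aaaa 1111 2222 3333 4444",
    "aaaa 1111 2222",
]
for out_line in mark_with_padding(sample):
    print(out_line)
```

In this sample the two long "aaaa" lines are marked as one group and the two short ones as another; the short lines are not treated as duplicates of the long ones.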


                  Jan 06, 2012#8

                  This is a useful macro, but a bit slow on big files. Has anyone written the same thing as a script?


                    Jan 07, 2012#9

                    A script is much faster when everything can be done in RAM instead of in the active file. So if a file has only some 100 KB or a few MB, it is no problem to do this faster with a script in RAM. But with a large file of several 100 MB or even GB, a script would also have a problem because loading all lines into RAM could fail. That said, a script which works like this macro while the maximized window of a new file is active would also be faster than the macro, because UltraEdit would not need to refresh the display all the time during execution.
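Outside UltraEdit, one way to avoid loading all lines into RAM is to read the file twice and keep only the 60 character keys in memory. The following Python sketch is an alternative technique, not the script discussed here. It assumes that the distinct keys fit into memory even when the whole file does not, and the function name, the paths and the `encoding` parameter are placeholders.

```python
from collections import Counter

MARK = "<<<>>>"


def mark_file_streaming(src_path, dst_path, width=60, encoding="utf-8"):
    """Two passes over src_path: count the keys, then write the marked copy.
    Only the keys are held in memory, never the full lines."""
    counts = Counter()
    # newline="" keeps the original line terminators (CR LF or LF) untouched.
    with open(src_path, encoding=encoding, newline="") as src:
        for line in src:
            key = line.rstrip("\r\n")[:width]
            if key:
                counts[key] += 1

    with open(src_path, encoding=encoding, newline="") as src, \
            open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            key = line.rstrip("\r\n")[:width]
            if key and counts[key] > 1:
                dst.write(MARK + line)   # key occurs on several lines
            else:
                dst.write(line)
```

The result is written to a second file, so the original stays untouched and the line order is preserved.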

                    Marking lines with duplicates in the first 60 columns could be done much faster when the lines are sorted alphabetically. The macro as written here does not require sorted lines and therefore compares every line against all lines below it in the file. That can take a long time with hundreds of thousands of lines. If the lines are sorted, a single Perl regular expression Replace could do the job too and would "compare" the first 60 characters of a line only against the first 60 characters of the line below (and the next but one line if two lines start with the same 60 characters). In other words, with a sorted file the number of "compares" could be reduced dramatically.
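The saving from sorted input can be sketched in Python rather than as a Perl regular expression Replace. When the lines are sorted, all lines with the same first 60 characters are adjacent, so each line only needs to be compared with its direct neighbors. The function name is hypothetical, and the sketch assumes the input is already sorted.

```python
MARK = "<<<>>>"


def mark_sorted(lines, width=60):
    """For lines in sorted order: mark every non-blank line whose first
    `width` characters equal those of the line above or below it."""
    result = []
    last = len(lines) - 1
    for i, line in enumerate(lines):
        key = line[:width]
        same_as_prev = i > 0 and lines[i - 1][:width] == key
        same_as_next = i < last and lines[i + 1][:width] == key
        if key and (same_as_prev or same_as_next):
            result.append(MARK + line)
        else:
            result.append(line)
    return result


# Small demonstration with width=4 standing in for the 60 columns.
sample = sorted(["aaaa 1", "bbbb 2", "aaaa 3", "cccc 4", "aaaa 5"])
for out_line in mark_sorted(sample, width=4):
    print(out_line)
```

Each line takes part in at most two comparisons, so the work grows linearly with the number of lines instead of with its square.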