Sort on duplicate value contained in following row

    Jan 16, 2008#1

    I'm not sure how to easily describe this request but I'm trying to find an easy way to identify/highlight duplicate values that appear after/before the current value.

    In the example below, both 62500388 and 62500394 appear more than once in the first column, so I would like every line containing those values to be highlighted (perhaps with a prefix, so the data could be re-sorted).

    62500386;62500386
    62500387;62500387
    62500388;62500217
    62500388;62500388
    62500388;62500587
    62500391;62500389
    62500391;62500391
    62500392;62500392
    62500393;62500393
    62500394;62073500
    62500394;62500394
    62500395;62073600

    Notes:
    - The values in both the first and second columns are not fixed length
    - A check for duplicates is only needed based on the values in the first column

    If anyone has any suggestions, I'd appreciate the help. Thanks in advance.

      Jan 17, 2008#2

      There's no way to make UE highlight them (as in syntax highlighting). You could use find or find/replace to detect/mark them though. The following uses Perl regular expressions. (So make sure you have the "Regular Expressions" box checked in the Find dialog and have selected Perl-compatible regexps in the configuration.)

      To find them search for ^(\d+);\d+\s+\1

      To mark them search for ^((\d+);\d+\s+)\2
      and replace with *$1*$2

      Replace the '*' chars with whatever (literal) text you want to use to mark them.
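For readers who want to try this search/replace pair outside UltraEdit, here is a minimal sketch using Python's `re` module. It assumes `\n` line endings and uses Python's `\1`/`\2` replacement syntax in place of `$1`/`$2`; the sample lines are taken from the question. Note that in a run of three identical keys only the first two lines get marked, because the search resumes after the text already consumed on the second line.

```python
import re

# The search pattern from this post; re.MULTILINE lets ^ match at each line start.
PAIR = re.compile(r"^((\d+);\d+\s+)\2", re.MULTILINE)

sample = (
    "62500387;62500387\n"
    "62500388;62500217\n"
    "62500388;62500388\n"
    "62500388;62500587\n"
    "62500391;62500389\n"
)

# Python writes backreferences in the replacement as \1 and \2.
marked = PAIR.sub(r"*\1*\2", sample)
print(marked)
```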

        Jan 17, 2008#3

        It would have been a good idea to read the sticky first and answer the questions there. What I'd need to know: What do you want to do with the results? Do you want to delete lines that start with identical numbers? What UE version are you using?

        The following Perl style regex (UE >= V12) will find all adjacent lines that start with the same characters (up to the first ; ):

        ^([^;]+);([^\r\n]+\r\n)(?:\1;[^\r\n]+\r\n)+

        It looks a little strange, but it works in UE (it's a workaround for the bug described in viewtopic.php?f=8&t=4683).

        You could then replace with \1;\2 in order to remove the duplicates.

        But maybe you want something else done?
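As a rough illustration of what the regex above does, the following Python sketch applies it to a few lines from the question. It assumes the text uses `\r\n` line endings, including after the last line, and that Python's `re` engine behaves like UltraEdit's Perl engine for this pattern; the variable names are made up for the example.

```python
import re

# One or more following lines that start with the same key (text before the first ';').
RUN = re.compile(r"^([^;]+);([^\r\n]+\r\n)(?:\1;[^\r\n]+\r\n)+", re.MULTILINE)

sample = (
    "62500387;62500387\r\n"
    "62500388;62500217\r\n"
    "62500388;62500388\r\n"
    "62500388;62500587\r\n"
    "62500391;62500389\r\n"
    "62500391;62500391\r\n"
    "62500392;62500392\r\n"
)

# Keys that occur on two or more adjacent lines.
duplicate_keys = [m.group(1) for m in RUN.finditer(sample)]

# Replacing each run with \1;\2 keeps only its first line.
deduped = RUN.sub(r"\1;\2", sample)

print(duplicate_keys)
print(deduped)
```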

          Jan 17, 2008#4

          mjcarman and pietzcker: Many thanks for both of your responses. I'm at home right now but will try your suggestions when I get to work tomorrow.

          pietzcker: Apologies for not reading the sticky; I should have done so. Ultimately the duplicate values just need to be found among the 400,000 lines of data and then used elsewhere. So finding the duplicates and adding a prefix of some kind would be a big help.

          Again, I will try both suggestions tomorrow. Thanks for your help, I appreciate it.

            Jan 17, 2008#5

            pietzcker's use of "\r\n" is more robust than my use of "\s" (though it doesn't appear to matter for your data).

            pietzcker, I had missed that other topic. I'm glad to see a way to match a newline. It would be nice if "\n" just worked, but it doesn't surprise me that it doesn't. I had tried "\r\f" and even "\015\012", but oddly neither works. I hadn't thought to try "\r\n".

              Jan 18, 2008#6

              pjoyce wrote: So finding the duplicates and adding a prefix of some kind would be a big help.

              OK. I guess this is something for a macro.

              I'm not very good at UE macros, and their behavior often puzzles me. What I have found to work (in UE V13) is:

              InsertMode
              ColumnModeOff
              HexOff
              PerlReOn
              Top
              Find RegExp "^([^;#]+);([^\r\n]+\r\n)(?:\1;[^\r\n]+\r\n)+"
              Find RegExp "^"
              Replace All SelectText "###"

              The macro property "Continue with macro after search and replace not found" must be unchecked, and the macro must be "run multiple times", checking the option "run until end of file".

              I had first tried to write a macro that would only have to be run once, using a loop. However, it didn't work. What I had written was:

              InsertMode
              ColumnModeOff
              HexOff
              PerlReOn
              Top
              Loop
              Find RegExp "^([^;#]+);([^\r\n]+\r\n)(?:\1;[^\r\n]+\r\n)+"
              Find RegExp "^"
              Replace All SelectText "###"
              Key HOME
              EndLoop

              But somehow, the loop only runs once, and I have no idea why.
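The marking step of the macro in this post (select each run of lines sharing a key, then prefix every line in the selection with ###) can be approximated in Python as below. This is only a sketch, not a translation of UltraEdit's macro engine: `mark_runs` is a hypothetical helper, and `\r\n` line endings are assumed as in the macro's regex.

```python
import re

# Same pattern as in the macro. Excluding '#' from the key lets the macro skip
# lines it has already marked; a single pass here does not strictly need it.
RUN = re.compile(r"^([^;#]+);([^\r\n]+\r\n)(?:\1;[^\r\n]+\r\n)+", re.MULTILINE)

def mark_runs(text, prefix="###"):
    """Prefix every line that belongs to a run of adjacent lines with the same key."""
    def prefix_lines(match):
        lines = match.group(0).splitlines(keepends=True)
        return "".join(prefix + line for line in lines)
    return RUN.sub(prefix_lines, text)

sample = (
    "62500387;62500387\r\n"
    "62500388;62500217\r\n"
    "62500388;62500388\r\n"
    "62500388;62500587\r\n"
    "62500391;62500389\r\n"
)

marked = mark_runs(sample)
print(marked)
```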

                Jan 18, 2008#7

                Very interesting. Your loop macro should work with the macro property "Continue if search string not found" unchecked. I added 3 lines to make the macro independent of this property, and suddenly the loop works. I will run further tests. Perhaps IDM changed UltraEdit so that a loop without a loop number and without an ExitLoop command runs only once, to avoid an endless loop when the macro property "Continue if search string not found" is set. I will email IDM and ask for clarification on this issue.

                InsertMode
                ColumnModeOff
                HexOff
                PerlReOn
                Top
                Loop
                Find RegExp "^([^;#]+);([^\r\n]+\r\n)(?:\1;[^\r\n]+\r\n)+"
                IfNotFound
                ExitLoop
                EndIf

                Find RegExp "^"
                Replace All SelectText "###"
                Key HOME
                EndLoop

                I received the answer from IDM on 2008-01-18 (but edited this post 2 days later). There really is a simple protection mechanism against endless loops: a loop without a loop number and without the ExitLoop command is always executed only once. I have added this new information to my macro reference file.

                Best regards from a UC/UE/UES for Windows user from Austria

                  Jan 18, 2008#8

                  Thanks Mofi! Sounds like a reasonable explanation.

                    Jan 18, 2008#9

                    pietzcker wrote: I guess this is something for a macro.

                    You're making it more complicated than it needs to be. The search/replace pair in my first post already adds a marker prefix.

                      Jan 18, 2008#10

                      Well, yes, but it only works for duplicate lines, not for triplicates and higher repetitions.
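If the data can be processed outside UltraEdit, runs of any length (duplicates, triplicates and higher) can also be handled without a regex. The sketch below swaps in a different technique, `itertools.groupby`, and assumes the file is already sorted so that equal first-column values sit on adjacent lines, as in the example data; the function name and the `###` prefix are illustrative choices.

```python
from itertools import groupby

def mark_duplicate_runs(lines, prefix="###"):
    """Yield each line, prefixed if its first column repeats on adjacent lines."""
    # groupby collects consecutive lines that share the text before the first ';'.
    for _key, group in groupby(lines, key=lambda line: line.split(";", 1)[0]):
        run = list(group)
        for line in run:
            yield prefix + line if len(run) > 1 else line

sample = [
    "62500387;62500387",
    "62500388;62500217",
    "62500388;62500388",
    "62500388;62500587",
    "62500391;62500389",
]

marked = list(mark_duplicate_runs(sample))
print("\n".join(marked))
```

Because only one run is held in memory at a time, this approach should cope with a file of 400,000 lines when the lines are read lazily from a file object.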