regular expression to remove duplicates and keep the order

Newbie

Jan 07, 2008 #1

I do not want to use the sort / advanced sort function of UltraEdit because I do not want to lose the original order. Is there a regular expression to remove duplicate lines, or any other way to remove duplicates while keeping the order/sequence the same?

Grand Master

Jan 07, 2008 #2

What about the topics How do I remove duplicate lines? or Special Case: Remove Duplicates TOTALLY, not just one?

The macro property Continue if a Find with Replace not found or Continue if search string not found must be checked for this macro.
Best regards from a UC/UE/UES for Windows user from Austria

Newbie

Jan 07, 2008 #3

I do not want to remove every occurrence; I would like to keep one of each duplicate. Also, the lines containing the duplicates contain $.
How would I amend the macros to get them to do what I would like?

Grand Master

Jan 08, 2008 #4

You have not read How do I remove duplicate lines?, or at least you have not read it carefully enough. The first occurrence of a duplicate line will remain, and it does not matter if the lines contain regular expression characters such as $, because the improved macro uses non-regular-expression searches/replaces.

The final macro version, as posted in the linked topic, is also ready to use in the macro file inside the ZIP archive you can download from Macro examples and reference for beginners and experts. Macro DelDupLineInfo+ deletes the duplicates and creates a report; macro DelDupLineInfo- deletes just the duplicate lines without creating a report.

Please read more carefully next time. I don't like writing the same thing twice.
Best regards from a UC/UE/UES for Windows user from Austria

Newbie

Jan 08, 2008 #5

I have tried both DelDupLineInfo+ and DelDupLineInfo-, and they are exceptionally slow: after an hour they had only worked through 1,000 lines, and the file has over 300,000 lines!

Is there a way to simply randomise the order again after an ascending/descending sort? It seems I may be forced to use that option.

Master

Jan 08, 2008 #6

This sounds more like a job for a Perl or Python script. After all, what you are asking for is conceptually simple but computationally demanding: every single line has to be checked against all following lines (up to 300,000 of them), and then every duplicate has to be removed. How long are your lines?
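For comparison, here is a minimal sketch of that script route in Python, assuming the whole file fits in memory and is plain (non-Unicode) text; the file names test.txt and output.txt are only placeholders. Instead of comparing every line against all following lines, it remembers the lines already seen in a set, so it keeps the first occurrence of each line, preserves the original order, and runs in roughly linear time.

Code:

# Order-preserving duplicate removal: keep only the first occurrence of each line.
seen = set()          # lines encountered so far
unique_lines = []     # first occurrences, in original order

in_file = open("test.txt", "r")
for line in in_file:
    if line not in seen:
        seen.add(line)
        unique_lines.append(line)
in_file.close()

out_file = open("output.txt", "w")  # overwrites output.txt if it exists
out_file.writelines(unique_lines)
out_file.close()

With 300,000 lines of around 70 characters each, an approach like this should finish in seconds rather than hours.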

Newbie

Jan 08, 2008 #7

An average of 70 characters.

Master

Jan 08, 2008 #8

OK, this is a quick-and-dirty program with hardly any error checks, and it won't handle Unicode files. But I've tried it on a 66,000-line XML file, and it was reduced to 12,000 lines within one minute on my laptop. The more duplicates there are, the longer it takes. It works with Python 2.5; I haven't tested it with other versions.

Code:

# -*- coding: iso-8859-1 -*-

# Put the input file (here called test.txt - rename as required) in the same directory as the script.
in_file = open("test.txt", "r").readlines()

counter = 0
while True:
    try:
        testline = in_file[counter]
    except IndexError:
        break  # past the last line - done
    # Remove every later occurrence of the current line.
    # The slice and .index() rescan the rest of the list for each line,
    # so the runtime grows quadratically with the file size.
    while True:
        try:
            x = in_file[counter + 1:].index(testline)
        except ValueError:
            break  # no further duplicate of this line
        in_file.pop(counter + x + 1)
    counter += 1

out_file = open("output.txt", "w")  # overwrites output.txt if it exists
for zeile in in_file:  # "zeile" is German for "line"
    out_file.write(zeile)
out_file.close()
                  
HTH,
Tim