Removing lines with partial duplicate content with spaces

Removing lines with partial duplicate content with spaces

3
NewbieNewbie
3

    May 09, 2007#1

    I feel like I am making a duplicate topic on the subject but I have done extensive searching and I cannot figure out how to get this to work.

    The closest topic I could find that is related (and has a macro posted by Mofi) that doesn't work for me is: Delete line with partially duplicate content

    I'm working with a city/state CSV that has multiple duplicate cities that need to be removed. Here's a sampling:

    Hamilton,AK,Wade Hampton (CA),62.8961111,-163.8941667
    Hamilton Acres,AK,Fairbanks North Star,64.8469444,-147.6705556
    Hammerhead,AK,Aleutians West (CA),51.8486111,-176.6416667
    Happy Valley,AK,Kenai Peninsula,59.9472222,-151.7322222
    Happy Valley,AK,Aleutians West (CA),51.8472222,-176.6541667
    Harding Lake,AK,Fairbanks North Star,64.4216667,-146.8508333
    Hawk Inlet,AK,Skagway-Hoonah-Angoon (CA),58.1258333,-134.7530556
    Haycock,AK,Nome (CA),65.2097222,-161.1655556
    Healy,AK,Denali,63.8569444,-148.9661111
    Herendeen Bay,AK,Aleutians East,55.7680556,-160.6858333
    Herring Cove,AK,Ketchikan Gateway,0,0
    Herring Cove,AK,Ketchikan Gateway,55.325,-131.5222222
    Highland Park,AK,Fairbanks North Star,64.7569444,-147.3605556
    Hobart Bay,AK,Skagway-Hoonah-Angoon (CA),57.4658333,-133.3533333

    In this case, I need these removed:
    Happy Valley,AK,Kenai Peninsula,59.9472222,-151.7322222
    Herring Cove,AK,Ketchikan Gateway,0,0

    They are duplicates. We aren't worried so much about county and geo locations. But anything with city,state that matches exactly, I'd like removed so that there is only one.

    Thanks.

    236
    MasterMaster
    236

      Re: Removing lines with partial duplicate content with space

      May 09, 2007#2

      Again, this seems to call for regular expressions. First, you need to activate Perl style regexps in the options dialog (search/regular expressions). Then use the replace command (activating regular expressions in the replace dialog)

      Code: Select all

      Search string:
      ^([\w\s]+?,\w+?)(,.+)(\r?\n\1.*?)+$
      
      Replace string:
      \1\2
      This should find two or more adjacent lines that start with the same city/state combination. It works on your example; if the rest of the data are consistent, it should work with them, too (hopefully).

      HTH,
      Tim

      3
      NewbieNewbie
      3

        May 09, 2007#3

        This code consistently errors out my application after setting the perl regexp option then running a replace all on this large 163,000 line file.

        344
        MasterMaster
        344

          May 09, 2007#4

          Scott: Did you update to 13.00a+2 ?

          Very well done Tim

          Just 2 minor comments:
          - the data has to be sorted
          - if there are more than 2 "same lines", you have to run the replace more than 1 time.

          One question: Why is the os-neutral \n not enough? I see we need \r?\n to get it going
          but I don't really see why \n alone does not work.

          rds Bego
          Normally using all newest english version incl. each hotfix. Win 10 64 bit

          236
          MasterMaster
          236

            Re: Removing lines with partial duplicate content with space

            May 09, 2007#5

            You're right:

            The data has to be sorted. Or at least duplicate lines have to be adjacent.

            About running it more than once: Yes, I see this behavior in UE, but it shouldn't be that way. The + in (\r?\n\1.*?)+$ should catch any multiple adjacent lines that start with the same two CSV fields. It seems that in UE the $ overrides the greediness of the + :( In RegexBuddy (where I "crafted" the regex) and in Python I get the expected behavior.

            About \r? --> I use that because newline handling differs between applications. UE regex seems to require \r\n, Python only accepts \n (or \r?\n). I have even found now that (in UE) I have to use \r\n without the ? to make it work right. Otherwise, the replace will introduce a surplus \r (0D 0D 0A). I don't know why that should happen...seems like a(nother) bug of UE's regex engine.

            I'm still waiting for them to fix positive lookaround...

            If UE still crashes on your large file after update to 13.00a+2 I'd suggest Python (six line program):

            Code: Select all

            # -*- coding: iso-8859-1 -*-
            import re
            result = re.sub(r"(?m)^([\w\s]+?,\w+?)(,.+)(\n\1.*?)+$", r"\1\2", open("input.csv","r").read())
            output = open("output.csv","w")
            output.write(result)
            output.close()
            In Perl, you could probably do that in one line :)

              Re: Removing lines with partial duplicate content with space

              May 09, 2007#6

              I've just checked a few things. First, Perl regexes in UE do require \r\n if you want to match a newline in a DOS file. I don't think that that's an error; from what I have gathered, other regex implementations, especially Perl, will handle newlines the same way.

              Then, I think I've located the bug that leads to \r being duplicated on a replace. Normally, . (dot) doesn't match a newline, so under Windows, it also shouldn't match \r. In UE, it does match \r under certain circumstances.

              If the regex ^(\w+)(.+)(\r?\n\1.*)+$ (find all adjacent lines that start with the same word) is applied to

              Code: Select all

              One line
              One more line
              One line again
              Another line
              Another line too
              Another different line
              and told to replace with \1\2 (i.e., replace with just the first of those lines), it should result in

              Code: Select all

              One line
              Another line
              but instead, you get:

              Code: Select all

              One line\r
              One line again
              Another line\r
              Another different line
              So \2 contains the \r that the underlying expression (.+) shouldn't have matched and it finds only pairs of lines even though it should have found all three.

              The first bug can be worked around by using \r\n instead of \r?\n, but the second one remains. Well, I'll write to IDM...

              Tim

              3
              NewbieNewbie
              3

                Re: Removing lines with partial duplicate content with space

                May 11, 2007#7

                Thanks guys, I upgraded my software and pietzcker's code worked great (after heavy pause, lol).

                Thankfully, I had already alphabetized the entire list by state then city in excel (after splitting it to get around the 655++ line limit). I also do not see any duplicates listed more than twice so it all worked out.

                This was a huge huge huge time saver. Thanks for helping me remove 10,000+ junk lines.

                Update: I had to run this through the CSV like 4 times before I finally got the search string not found error.

                Final tally: 14,410 lines of junk removed dumping 0.77MB of junk text.

                Thanks again!

                344
                MasterMaster
                344

                  Re: Removing lines with partial duplicate content with space

                  May 16, 2007#8

                  Hi Tim !

                  Good job. Pls keep me/us up to date on that issue.

                  :-) Bego
                  Normally using all newest english version incl. each hotfix. Win 10 64 bit

                  119
                  Power UserPower User
                  119

                    Re: Removing lines with partial duplicate content with space

                    Jan 17, 2008#9

                    pietzcker wrote:Perl regexes in UE do require \r\n if you want to match a newline in a DOS file. I don't think that that's an error; from what I have gathered, other regex implementations, especially Perl, will handle newlines the same way.
                    Well, yes and no. In Perl, "\n" is a little bit magical. It's a placeholder for a logical (platform) newline. Most of the time -- including in regexes -- you can pretend that "\n" is LF on Unix and CRLF on Windows. Internally "\n" is really a single character (which happens to be LF). The magic is provided by a layer that translates between internal and platform notation when reading from or writing to a file.

                    UE embeds a third-party engine to provide the Perl-compatible regexes. Without the IO layer magic -- which UE isn't providing -- "\n" only matches LF. I'm going to send UE a request to make "\n" match the newline character(s) for the file being searched. That will make Perl-compatible regexes more Perl compatible. :D

                    236
                    MasterMaster
                    236

                      Apr 29, 2008#10

                      Good news: I have just received a test version of UE 14.00a+2, and the bug that would introduce erroneous \r characters in the above replace operation seems to have been fixed. Also, positive lookaround is working correctly in that version. The limitation of greedy quantifiers that won't span newlines if they can avoid it is still there, though. This means that if you have three or more lines in a row to be replaced, you'll have to run the replace operation a couple of times (until you get the error message "search string not found!"). As of April 29th, V14.00a+2 isn't available for download yet, but it should be soon.