Check for similar lines and mark or output them for manual review

Haxxo · Post #1 · Aug 08, 2013 07:50 UTC

Hi!

I have a list of single lines that I want to check for duplicates, removing any I find. The lines are not identical, though: they have minor grammar/formatting differences. The lines are full English sentences.

I think the best way to approach this is to check whether any two lines have more than X words in common.

It should:

Exclude any words with three or fewer characters.
Ignore grammar/punctuation such as , . : ? / \ ' !.
Compare the words case-insensitively.

Any help would be greatly appreciated.

hugov · Post #2 · Aug 08, 2013 13:15 UTC

If you search the forum, you'll find some remove-duplicate-lines scripts which you can use after you do one or two find/replace actions:

Remove all words of 1, 2 or 3 characters (an easy regular expression)
Remove all specific characters (another easy regular expression)

Running one of the remove-duplicate-lines scripts afterwards should produce good results.

Haxxo · Post #3 · Aug 08, 2013 14:27 UTC

Wouldn't that make my sentences unreadable?

Here is an example of what I want to do. Let's say I have these two lines among a list of hundreds of other lines.

"The frog, then decided to jump the fence!"
"so then the frog decided to jump the fence"

With this example, the two lines should be flagged as duplicates, and the words "frog", "decided", "jump" and "fence" would be the tagged words. One of the lines must be removed. I have no preference as to which one.

I do not want these lines edited in any way, just duplicates removed.
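To illustrate the concern, here is a minimal Python sketch (not an UltraEdit script) that applies the two suggested find/replace steps to a copy of each example line, so the original lines stay untouched. The function name `normalize` and both regular expressions are assumptions made for this illustration. Note that "then" has four characters, so it survives the short-word filter as well.

```python
import re

def normalize(line: str) -> str:
    """Return a lowercase comparison key; the original line is not modified."""
    # Replace every run of non-alphanumeric (ASCII) characters with a space.
    text = re.sub(r"[^0-9A-Za-z]+", " ", line).lower()
    # Remove whole words of one to three characters.
    text = re.sub(r"\b\w{1,3}\b", " ", text)
    # Collapse the leftover whitespace.
    return " ".join(text.split())

a = normalize('"The frog, then decided to jump the fence!"')
b = normalize('"so then the frog decided to jump the fence"')
print(a)       # frog then decided jump fence
print(b)       # then frog decided jump fence
print(a == b)  # False
```

The two keys contain the same words but in a different order, so a script that removes only exact duplicates would not catch this pair. That is why counting the words two lines have in common looks like the more suitable approach here.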

Mofi · Post #4 · Aug 08, 2013 16:32 UTC

First, how large is the file to check for similar lines - 50 KB, 5 MB or 500 MB?

Checking lines for similarity is not a trivial task. It looks like the script has to check every line against all lines below it, because similar lines cannot be expected to start with the same character, so sorting the lines first would not help.

On each line, every sequence of non-alphanumeric characters must be replaced by a space. The line, now containing only words made of letters and digits, must be split into a list of words. Next, all words with fewer than 4 characters must be removed from the list. Once this has been done for the current line and for the line to compare it with, a loop counts how many whole words in the current line match, case-insensitively, whole words in the other line. If the number of equal words exceeds a threshold value, the other line is treated as similar, and the two similar lines are written to the output window or a new file, or appended to the clipboard. This is repeated for the current line against every line below it.
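The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the UltraEdit script discussed in this thread: the names `MIN_WORD_LENGTH`, `THRESHOLD`, `significant_words` and `find_similar_lines` are hypothetical, the threshold value is arbitrary, only ASCII letters and digits count as word characters, and each distinct word is counted once per line (a set intersection) rather than once per occurrence.

```python
import re

MIN_WORD_LENGTH = 4  # hypothetical setting: words shorter than this are ignored
THRESHOLD = 3        # hypothetical setting: flag pairs sharing more words than this

def significant_words(line):
    """Lowercase the line, drop punctuation, and keep only the longer words."""
    words = re.sub(r"[^0-9A-Za-z]+", " ", line).lower().split()
    return {word for word in words if len(word) >= MIN_WORD_LENGTH}

def find_similar_lines(lines, threshold=THRESHOLD):
    """Compare every line with all lines below it; return the similar pairs."""
    word_sets = [significant_words(line) for line in lines]
    pairs = []
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            common = word_sets[i] & word_sets[j]
            if len(common) > threshold:
                # Report 1-based line numbers, as an editor would show them.
                pairs.append((i + 1, j + 1, sorted(common)))
    return pairs

sample = [
    "The frog, then decided to jump the fence!",
    "A completely unrelated sentence about cooking pasta.",
    "so then the frog decided to jump the fence",
]
for first, second, common in find_similar_lines(sample):
    print(f"Line {first} is similar to line {second}: {', '.join(common)}")
# Line 1 is similar to line 3: decided, fence, frog, jump, then
```

Computing the word set of each line once up front avoids repeating the replace and split work for every comparison. The number of comparisons still grows with the square of the number of lines.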

Okay, that's how the task could be done. It will take quite long to run, as lots of memory allocations/releases, regular expression replaces and string comparisons must be done. It can be done efficiently only if the number and length of the lines are small enough to do as much as possible in memory without accessing the file.

Haxxo · Post #5 · Aug 09, 2013 00:57 UTC

The largest file I want to check will be about 0.5 MB. I don't think there will be any limitation on computational power or RAM, as the files are < 0.5 MB and I have 16 GB of RAM and an overclocked i7. Creating an efficient script does not seem necessary; even a brute-force method would work on such small files.

That was a great explanation of what the script should be, but I would have one hell of a time trying to create it from scratch. I think you can help me start it off (or create it fully).

Thank you Mofi.

Mofi · Post #6 · Aug 11, 2013 14:29 UTC

Here is the script which finds similar lines and outputs them with file name and line number in the output window.

Because the output includes the full file name and line number, you can use Ctrl+Shift+Up / Down from within the document window to jump to the next line listed in the output window and quickly delete one of the two similar lines. The output format full file name(line number): can be changed to Line line number: by changing the value of a boolean variable at the top of the script.

There are two other variables at the top, with long comments above them explaining their meaning, for fine-tuning the similar lines detection.

I tested the script first on a very small file with just 25 lines and later executed it on a file with 4800 lines and more than 0.6 MB. On the larger file the script needed some minutes to finish. I suggest running the script first on a very small file to check that it produces what you want, and then running it on your larger file, where you will need to be patient while it finishes.

By the way: it's fine that your computer has 16 GB, but only parts of the Windows system, Windows file caching and 64-bit applications can make use of more than 2 GB of RAM (or 4 GB of RAM with special code). On Windows x86, all 32-bit applications and other parts of the Windows system can only access the first 2 GB of RAM; therefore, if all 32-bit applications together already use 2 GB of RAM, a 32-bit application runs out of free memory even if an additional 14 GB of free RAM is available. On Windows x64, any 32-bit application can allocate for itself a total (code and data memory) of 2 GB (or 4 GB) of RAM anywhere within the 16 GB address space because of WOW64 (Windows 32-bit on Windows 64-bit).

Hint: use 64-bit browsers on 64-bit Windows by default, even if this also requires 64-bit Sun Java to be installed and updated (manually, as Sun Java does not update the 64-bit version automatically) and could result in ActiveX scripts embedded or used in web pages not working because of pointer arithmetic done in the 32-bit range. That leaves more RAM available in the first 2 GB, into which most 32-bit applications are loaded by default.

Haxxo · Post #7 · Aug 12, 2013 05:20 UTC

Well that's just fantastic! It works perfectly, great commenting on the code.

I would like the script to automatically delete the duplicate when one is found. I know that this code would be written somewhere between lines 212 and 218 of your script, but I'm not a programmer.

If it's not too much trouble to edit the script, it would save me a lot of time (I'd have to learn UltraEdit syntax).

Once again, thanks for the help.

Mofi · Post #8 · Aug 12, 2013 17:49 UTC

Here is a modified version of the script above which deletes in a new file all the lines which are rated as being similar to a line above.

It is not as easy as you thought, because deleting a line decreases the line number of every line below it by 1. Therefore the script has to remember which lines to delete, then mark those lines in a copy of the original file, and finally delete them with a single regular expression Replace All.
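The line number problem can be shown with a short Python sketch. It is only an illustration of the idea of collecting the line numbers first and deleting afterwards in one pass. It filters a list instead of marking lines and running a regular expression Replace All, and the name `remove_lines` and the sample data are made up for this example.

```python
def remove_lines(lines, numbers_to_delete):
    """Return a copy of lines without the given 1-based line numbers."""
    doomed = set(numbers_to_delete)
    return [text for number, text in enumerate(lines, start=1)
            if number not in doomed]

original = ["alpha", "beta", "gamma", "delta", "epsilon"]
to_delete = [2, 4]  # line numbers in the ORIGINAL file: "beta" and "delta"

# Naive approach: delete immediately, one line at a time, top to bottom.
naive = list(original)
for number in to_delete:
    del naive[number - 1]
print(naive)  # ['alpha', 'gamma', 'delta'] -> "epsilon" was removed by mistake

# Deferred approach: remember the numbers, then delete everything in one pass.
print(remove_lines(original, to_delete))  # ['alpha', 'gamma', 'epsilon']
```

After "beta" is deleted, "delta" moves from line 4 to line 3, so the naive loop removes the wrong line. Working on a copy and deleting all remembered lines at once avoids that shift.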

Haxxo · Post #9 · Aug 13, 2013 08:52 UTC

Thank you very much Mofi, that script is fantastic!

It feels like I should be paying $100 an hour for this kind of professional service.