Hi!
I have a list of single lines that I want to check for duplicates, removing any that I find. The lines are not identical, though: they have minor grammar/formatting differences. Each line is a full English sentence.
I think the best approach is to check whether any two lines have more than X words in common.
It should:
- Exclude any words with three or fewer characters.
- Ignore punctuation such as , . : ? / \ ' !
- Compare words case-insensitively.
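
To make the requirements above concrete, here is a minimal Python sketch of what I have in mind. The function names (`significant_words`, `remove_near_duplicates`) and the `threshold` parameter (the "X" above) are just placeholders I made up. It assumes apostrophes are dropped (so "don't" becomes "dont"), every other non-alphanumeric character acts as a word separator, and the first occurrence of each near-duplicate is the one to keep.

```python
import re


def significant_words(line, min_len=4):
    """Return the set of lowercase words in `line` with at least `min_len` characters."""
    # Assumption: apostrophes are removed so contractions stay one word.
    text = line.lower().replace("'", "")
    # Any other punctuation acts as a separator between words.
    words = re.findall(r"[a-z0-9]+", text)
    return {w for w in words if len(w) >= min_len}


def remove_near_duplicates(lines, threshold=3):
    """Keep a line only if it shares at most `threshold` significant words
    with every line already kept."""
    kept = []
    kept_words = []
    for line in lines:
        words = significant_words(line)
        if any(len(words & seen) > threshold for seen in kept_words):
            continue  # more than `threshold` words in common: treat as duplicate
        kept.append(line)
        kept_words.append(words)
    return kept


if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog.",
        "the Quick, brown fox jumped over the lazy dog!",
        "A completely different sentence about something else.",
    ]
    for line in remove_near_duplicates(sample, threshold=3):
        print(line)
```

This compares every line against all lines kept so far, so it is quadratic in the number of lines, which is probably fine for a few thousand lines. A fixed word count also treats long and short sentences differently; a ratio such as shared words divided by total distinct words might be a more robust cutoff, but that goes beyond what I described above.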