Hello,
I am compiling an open-source stemmer dictionary for English and, eventually, for Indian languages. The engine I have written has generated all the lemmatised/expanded forms of the words: nouns, adjectives, adverbs, etc. Each set of expanded forms is separated from the next by a hard return. Since each root word was treated as a separate entity according to its grammatical function, the expanded forms sometimes contain duplicate sets.
An example will make this clear:
Code:
coil
coiled
coiling
coils

coil
coils

coin's
coin
coins
coins'

coin
coined
coining
coins

As can be seen, two sets have been created for each of coil and coin. Since the sets in each pair share the same root word, they should have been merged but, for the reason given above, they are treated as separate entities.
Is it possible to write a script that goes through the sets and, whenever a word is found in both set A and set B, merges the two sets, removes the duplicate forms and, if possible, sorts the result?
The output for the example above would look something like this:
Code:
coil
coiled
coiling
coils

coin's
coin
coins
coined
coining

The sets are not necessarily contiguous; at times they are separated by other sets of words.
Since the data is huge, the script or macro would go a long way in speeding up the process.
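One possible approach, sketched in Python under some assumptions: the sets are separated by blank lines, with one word per line, and two sets belong together whenever they share at least one word, directly or through a chain of other sets. The function names `parse_sets`, `merge_sets` and `format_sets` are made up for this illustration. The sketch uses a union-find (disjoint-set) structure, which copes with non-contiguous sets and should stay fast on large files, but it has not been tried on the real data.

```python
from collections import defaultdict


def parse_sets(text):
    """Split text into lists of words; blank lines are assumed to separate sets."""
    sets, current = [], []
    for line in text.splitlines():
        word = line.strip()
        if word:
            current.append(word)
        elif current:
            sets.append(current)
            current = []
    if current:
        sets.append(current)
    return sets


def merge_sets(sets):
    """Merge all sets that share a word; return sorted, duplicate-free groups."""
    parent = list(range(len(sets)))

    def find(i):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # word -> index of the first set that contained it
    for idx, words in enumerate(sets):
        for word in words:
            if word in first_seen:
                a, b = find(first_seen[word]), find(idx)
                if a != b:
                    # Keep the smaller index as representative so that the
                    # output follows the order of first appearance.
                    parent[max(a, b)] = min(a, b)
            else:
                first_seen[word] = idx

    merged = defaultdict(set)
    for idx, words in enumerate(sets):
        merged[find(idx)].update(words)
    return [sorted(group) for _, group in sorted(merged.items())]


def format_sets(groups):
    """Write the groups back out, one word per line, blank line between sets."""
    return "\n\n".join("\n".join(group) for group in groups)


SAMPLE = """coil
coiled
coiling
coils

coil
coils

coin's
coin
coins
coins'

coin
coined
coining
coins
"""

print(format_sets(merge_sets(parse_sets(SAMPLE))))
```

For a real file, the text would be read from disk and passed through the same three functions. The sorting is plain character order, so `coin` comes before `coin's`, and every distinct form, including `coins'`, is kept.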
Many thanks in advance for helping with a project that will aid researchers in creating better stemmers for English and other languages.