Stemmer dictionary cleanup

dictdoc · May 24, 2013, 16:26 UTC · #1

Hello,
I am compiling an open-source stemmer dictionary for English and eventually for other Indian languages. The engine I have written outputs all lemmatised/expanded forms of the words: nouns, adjectives, adverbs, etc. Each set of expanded forms is separated from the next by a blank line. Since each root word was treated as a separate entity according to its grammatical function, the same root can produce more than one set, and these sets overlap.
An example will make this clear:

coil
coiled
coiling
coils

coil
coils

coin's
coin
coins
coins'

coin
coined
coining
coins

As can be seen, two sets each have been created for coil and coin. Since the two sets in each pair share the same root word, they should have been merged, but for the reason given above they are treated as separate entities.
Is it possible to write a script that goes through the sets and, whenever a word is found in both set A and set B, merges the two sets, if possible sorting the result and removing the duplicate forms?
The output of the above would look something like this:

coil
coiled
coiling
coils

coin's
coin
coins
coins'
coined
coining

The sets to be merged are not necessarily contiguous and may be separated by other sets of words.
Since the data is huge, a script or macro would go a long way in speeding up the process.
Many thanks in advance for helping with a project that will aid researchers in creating better stemmers for English and other languages.

Mofi · Jun 04, 2013, 05:38 UTC · #2

Of course this task can be done with an UltraEdit script, and I have written one for you. It most likely does not use the fastest method, but I suppose you will not run the script often, so performance is not that important.

if (UltraEdit.document.length > 0)  // Is any file opened?
{
   // Define environment for this script.
   UltraEdit.insertMode();
   UltraEdit.columnModeOff();

   // Delete all trailing spaces in the file.
   UltraEdit.activeDocument.trimTrailingSpaces();

   // Load content of file with DOS/Windows line terminators into an array.
   UltraEdit.activeDocument.selectAll();
   if (UltraEdit.activeDocument.isSel())
   {
      var nBlock;
      var asBlocks = UltraEdit.activeDocument.selection.split("\r\n\r\n");
      var aWordArrays = new Array();
      // Create for every block an array containing the words of this block.
      for (nBlock = 0; nBlock < asBlocks.length; nBlock++)
      {
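         // Remove empty blocks. splice() moves the following blocks one position
         // up, so the same index must be checked again on the next loop run.
         if (asBlocks[nBlock] == "")
         {
            asBlocks.splice(nBlock,1);
            nBlock--;
         }
         else aWordArrays[nBlock] = asBlocks[nBlock].split("\r\n");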
      }
      // It could be that the last word in last word array is an empty string if the
      // file ends with a line termination. Remove this empty string from last word array.
      if (aWordArrays[asBlocks.length-1][aWordArrays[asBlocks.length-1].length-1] == "")
      {
         aWordArrays[asBlocks.length-1].pop();
      }
      var nBlocksModified = 0;
      // Check if there are two or more blocks containing at least one identical word.
      for (nBlock = 0; nBlock < (asBlocks.length-1); nBlock++)
      {
         // Define a pointer to the array of words of the block against
         // which the words of all blocks below in the file are compared.
         var asWordsCompare = aWordArrays[nBlock];
         // The word compares are done from last block in file backwards
         // to the block following the current block in the file.
         for (var nOther = asBlocks.length - 1; nOther > nBlock; nOther--)
         {
            // Define a pointer to the array of words of the current
            // block which are searched (= compared) in the block above.
            var asWordsSearch = aWordArrays[nOther];
            // Get the number of words of this block.
            var nSearchCount = asWordsSearch.length;
            // Get current number of words in the block which are
            // compared against all other words in the file below.
            var nCompareCount = asWordsCompare.length;
            // Search for each word of the current block in
            // the words array of the block to compare.
            for (var nWordSearch = 0; nWordSearch < nSearchCount; nWordSearch++)
            {
               // Define a pointer to the current word to compare.
               var sWordCompare = asWordsSearch[nWordSearch];
               for (var nWordCompare = 0; nWordCompare < nCompareCount; nWordCompare++)
               {
                  if (asWordsCompare[nWordCompare] != sWordCompare) continue;
                  // This word exists in both blocks. Append all words not existing
                  // in both blocks to the first block. Then remove the other block.
                  var nWordsMoved = 0;
                  for (var nWordAppend = 0; nWordAppend < nWordSearch; nWordAppend++)
                  {
                     asWordsCompare.push(asWordsSearch[nWordAppend]);
                     nWordsMoved++;
                  }
                  for (nWordSearch++; nWordSearch < nSearchCount; nWordSearch++)
                  {
                     sWordCompare = asWordsSearch[nWordSearch];
                     for (nWordCompare = 0; nWordCompare < nCompareCount; nWordCompare++)
                     {
                        if (asWordsCompare[nWordCompare] == sWordCompare)
                        {
                           // This word exists already in first block. Break the loop.
                           nWordCompare = nCompareCount + 1;
                        }
                     }
                     // Was this word not found in first block, append it to first block.
                     if (nWordCompare == nCompareCount)
                     {
                        asWordsCompare.push(sWordCompare);
                        nWordsMoved++;
                     }
                  }
                  // Output which block is removed in the output window.
                  var sBlock = asWordsSearch.join(', ');
                  UltraEdit.outputWindow.write("Block removed: "+sBlock);
                  // Remove the words of this block from the words array.
                  aWordArrays[nOther].splice(0,nSearchCount);
                  // Remove the pointer to the words array from the pointer array.
                  aWordArrays.splice(nOther,1);
                  // Remove the block string of this block from the blocks array.
                  asBlocks.splice(nOther,1);
                  if (nWordsMoved > 0)   // Was any word moved?
                  {
                     // Output the updated block to which words were appended.
                     sBlock = asWordsCompare.join(', ');
                     UltraEdit.outputWindow.write("Block updated: "+sBlock);
                     // Build the block string new with the additional words.
                     asBlocks[nBlock] = asWordsCompare.join("\r\n");
                  }
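                  // Break the inner word comparing loop.
                  nWordCompare = nCompareCount;
                  // Break the outer word comparing loop.
                  nWordSearch = nSearchCount;
                  // The first block may have gained words which exist also in a block
                  // already checked. Restart the backward scan from the last block.
                  nOther = asBlocks.length;
                  // Increase the block modified counter.
                  nBlocksModified++;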
               }
            }
         }
      }
      if (nBlocksModified > 0)  // Was any block removed or updated?
      {
         // Rebuild the entire file content from the block
         // strings and overwrite still selected file content.
         UltraEdit.selectClipboard(9);
         UltraEdit.clipboardContent = asBlocks.join("\r\n\r\n");
         UltraEdit.clipboardContent += "\r\n";
         UltraEdit.activeDocument.paste();
         UltraEdit.activeDocument.top();
         UltraEdit.clearClipboard();
         UltraEdit.selectClipboard(0);
         // Output a summary information and make output window visible.
         UltraEdit.outputWindow.write("Summary: "+nBlocksModified+" block"+(nBlocksModified > 1 ? "s" : "")+" modified.");
         UltraEdit.outputWindow.showWindow(true);
      }
      else  // Inform user with a message box that nothing changed.
      {
         UltraEdit.activeDocument.top();
         UltraEdit.messageBox("Nothing modified in file.");
      }
   }
}
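
The script above compares every block with every other block word by word, which can become slow on a very large file. A possible faster alternative is a union-find (disjoint-set) pass over the blocks. The following is only a sketch in plain JavaScript under a few assumptions: mergeWordSets and all names inside it are hypothetical and not part of the UltraEdit API, blocks are separated by one or more blank lines, and each merged set is sorted in plain character-code order, so the original word order is not kept. It also merges sets that are connected only through a third set.

```javascript
// Hypothetical helper, not part of the UltraEdit API: merges all
// blank-line separated word sets sharing at least one word.
function mergeWordSets(sText)
{
   // Normalize line terminators, then split on one or more blank lines.
   var asBlocks = sText.replace(/\r\n?/g, "\n").split(/\n{2,}/);
   var anParent = [];     // Union-find parent index of every kept block.
   var aWords = [];       // Word list of every kept block.
   var oFirstBlock = {};  // "w:" + word -> index of first block with it.
   var i, j, sKey;

   // Return the root block of the group containing block n.
   function find(n)
   {
      while (anParent[n] != n)
      {
         anParent[n] = anParent[anParent[n]];  // Path halving.
         n = anParent[n];
      }
      return n;
   }

   for (i = 0; i < asBlocks.length; i++)
   {
      // Collect the non-empty, trimmed words of this block.
      var asLines = asBlocks[i].split("\n");
      var asClean = [];
      for (j = 0; j < asLines.length; j++)
      {
         var sWord = asLines[j].replace(/^\s+|\s+$/g, "");
         if (sWord != "") asClean.push(sWord);
      }
      if (asClean.length == 0) continue;   // Skip empty blocks.
      var nIndex = aWords.length;
      aWords.push(asClean);
      anParent.push(nIndex);
      for (j = 0; j < asClean.length; j++)
      {
         // The prefix avoids clashes with built-in property names.
         sKey = "w:" + asClean[j];
         if (Object.prototype.hasOwnProperty.call(oFirstBlock, sKey))
         {
            // Word seen before: join both groups. The smaller root index
            // wins, so a group stays at the position of its first block.
            var nA = find(oFirstBlock[sKey]);
            var nB = find(nIndex);
            if (nA < nB) anParent[nB] = nA;
            else anParent[nA] = nB;
         }
         else oFirstBlock[sKey] = nIndex;
      }
   }

   // Collect the unique words of every group under its root index.
   var aoGroups = [];
   for (i = 0; i < aWords.length; i++)
   {
      var nRoot = find(i);
      if (!aoGroups[nRoot]) aoGroups[nRoot] = { seen: {}, words: [] };
      var oGroup = aoGroups[nRoot];
      for (j = 0; j < aWords[i].length; j++)
      {
         sKey = "w:" + aWords[i][j];
         if (!Object.prototype.hasOwnProperty.call(oGroup.seen, sKey))
         {
            oGroup.seen[sKey] = true;
            oGroup.words.push(aWords[i][j]);
         }
      }
   }

   // Sort every group and rebuild the text with DOS/Windows terminators.
   var asResult = [];
   for (i = 0; i < aoGroups.length; i++)
   {
      if (aoGroups[i]) asResult.push(aoGroups[i].words.sort().join("\r\n"));
   }
   return asResult.join("\r\n\r\n") + "\r\n";
}
```

Inside UltraEdit the function could be fed with UltraEdit.activeDocument.selection after selectAll(), and the returned string pasted back with the same clipboard steps the script above uses. For the four example sets in the first post it returns two sets: coil, coiled, coiling, coils and coin, coin's, coined, coining, coins, coins'.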

dictdoc · Jul 11, 2013, 10:19 UTC · #3

Sorry for the late response. I was hospitalised for a month and had no access to the internet. I have just got back and tested the finished script, and it works very well.
Thank you for your kind help, and once again my apologies for the late response.