Removing dupes which are in a transitive relationship

dictdoc · Dec 20, 2013, 16:25 UTC · #1

My problem is as described below.

I am working on name variants, and one of the tools developed in C is a Metaphone engine. The engine looks for name variants and joins them to return similar homographs. It is designed for Indian languages, but the examples given below are in English. The engine is based on a large database of homographs with the following structure:

name=name variant

Each line holds a name and one of its variants, separated by an equal sign. An example will make this clear.

Mark=Marc
Mark=Marque

Since the database was prepared manually, duplicates have often been created in which the left-hand and right-hand variants are cross-linked, as in the example below.

Mark=Marc
Marc=Mark

Such additions bloat the database and confuse the engine, which then goes into a loop.
What I need is a script or macro that removes these cross-linked duplicates.

Example of input and output.

Input:

Mark=Marc
Mark=Marque
Marc=Mark
Marque=Mark

Expected output after removal of duplicates:

Mark=Marc
Mark=Marque

The cross-linked data is removed.
Many thanks in advance for your help, and Happy Holidays to all members of the forum.
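
Stated as a rule, the request is to drop a line when its mirror image (the two sides of the equal sign swapped) has already been kept earlier in the file. The following is only a minimal sketch of that rule in plain JavaScript, outside of any editor. The function name `dropReversedPairs` is hypothetical, and the sketch assumes exact, case-sensitive matching and a split at the first equal sign on each line.

```javascript
// Hypothetical helper for illustration; not part of any UltraEdit API.
function dropReversedPairs(lines) {
  var seen = {};   // lines kept so far, used as a lookup table
  var kept = [];
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i];
    var pos = line.indexOf("=");
    if (pos >= 0) {
      // Swap the parts around the first equal sign: "Marc=Mark" -> "Mark=Marc".
      var swapped = line.substring(pos + 1) + "=" + line.substring(0, pos);
      // Skip this line if its mirror image was already kept.
      if (Object.prototype.hasOwnProperty.call(seen, swapped)) continue;
    }
    seen[line] = true;
    kept.push(line);
  }
  return kept;
}

var input = ["Mark=Marc", "Mark=Marque", "Marc=Mark", "Marque=Mark"];
var output = dropReversedPairs(input);
// output is ["Mark=Marc", "Mark=Marque"]
```

Because each line is looked up in a table instead of being compared against every later line, the sketch makes a single pass over the data. Identical lines that are not mirrored (for example `Mark=Marc` appearing twice) are left alone, since only cross-linked pairs were asked about.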

Mofi · Dec 21, 2013, 20:53 UTC · #2

Here is a script that works on your English example when it is copied into an ANSI file with DOS line terminators.

if (UltraEdit.document.length > 0)  // Is any file opened?
{
   // Define environment for this script.
   UltraEdit.insertMode();
   UltraEdit.columnModeOff();

   // Select everything in the active file.
   UltraEdit.activeDocument.selectAll();
   if (UltraEdit.activeDocument.isSel())
   {
      // Determine type of line terminator. Default is DOS.
      var sLineTerm = "\r\n";
      if (typeof(UltraEdit.activeDocument.lineTerminator) == "number")
      {
         if (UltraEdit.activeDocument.lineTerminator == 1) sLineTerm = "\n";
         else if (UltraEdit.activeDocument.lineTerminator == 2) sLineTerm = "\r";
      }
      // Get all lines into an array of strings.
      var asLines = UltraEdit.activeDocument.selection.split(sLineTerm);
      // Remove the last string if it is an empty string.
      if (asLines[asLines.length-1] == "") asLines.pop();

      // Find lines with same terms as on another line and remove them.
      for (var nLine = 0; nLine < (asLines.length-1); nLine++)
      {
         // Optimization: The next line can be commented out if
         // every line is guaranteed to contain an equal sign.
         if (asLines[nLine].indexOf('=') < 0) continue;
         
         // Get the terms around the equal sign exchanged in a new string.
         var sCompare = asLines[nLine].replace(/^(.+?)=(.+)$/,"$2=$1");

         var nCompare = nLine+1;
         while (nCompare < asLines.length)
         {
            // Optimization: The next line can be uncommented to speed up the
            // script if the lines in the file are sorted and the words left and
            // right of the equal sign always start with the same character.
            // if (asLines[nCompare][0] != sCompare[0]) break;

            if (asLines[nCompare] != sCompare) nCompare++;
            else
            {
               asLines.splice(nCompare,1);
               // Optimization: Uncomment next line for decreasing process
               // time if file contains the lines sorted and all duplicate
               // lines were already removed before running the script.
               // break;
            }
         }
      }
      // Append an empty string so that the joined text ends with a
      // line terminator, then paste the joined lines over the
      // selection in the active file via user clipboard 9.
      asLines.push("");
      UltraEdit.selectClipboard(9);
      UltraEdit.clipboardContent = asLines.join(sLineTerm);
      UltraEdit.activeDocument.paste();
      UltraEdit.clearClipboard();
      UltraEdit.selectClipboard(0);
   }
   UltraEdit.activeDocument.top();
}

Let me know if it also works on your Unicode file. Read the comments, especially those starting with Optimization. Depending on the file content, you can reduce the processing time by commenting out or uncommenting 3 lines.
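
The core step of the script is the regular expression replace that exchanges the terms around the equal sign. The short sketch below only illustrates what that replace returns for a few sample lines. The helper name `swapTerms` and the sample strings with no or two equal signs are made up for the illustration.

```javascript
// Same pattern as in the script: the first group is lazy, so the
// split happens at the first equal sign on the line.
var swapPattern = /^(.+?)=(.+)$/;

function swapTerms(line) {   // hypothetical helper name
  return line.replace(swapPattern, "$2=$1");
}

var a = swapTerms("Mark=Marc");      // "Marc=Mark"
var b = swapTerms("Marque=Mark");    // "Mark=Marque"
var c = swapTerms("no equal sign");  // unchanged, the pattern does not match
var d = swapTerms("a=b=c");          // "b=c=a"
```

The swapped string is then compared with the later lines by exact string comparison, so a pair that differs in case or in trailing spaces (for example `marc=Mark` against `Mark=Marc`) is not treated as cross-linked.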

dictdoc · Dec 22, 2013, 06:24 UTC · #3

Dear Mofi,
I tried the script out with and without the commented lines. It worked just great. At present my data is in English so there are no issues. I tried it on a small file in Unicode following your suggestions and it worked just fine.
Many thanks. Merry Christmas and all the best for the New Year.

Ovg · Dec 22, 2013, 11:00 UTC · #4

It seems the script also has no problem with Cyrillic text in Windows-1251 or UTF-8. Thank you, Mofi!