Removing dupes which are in a transitive relationship

    Dec 20, 2013#1

    My problem is as described below.

    I am working on name variants and one of the tools developed in C is a Metaphone Engine. Basically the engine looks for name variants and conjoins them to provide similar homographs. The engine is designed for Indian languages but the examples given below are in English. The Engine is based on a large database of homographs with the following structure:

    name=name variant

    Each line holds a name and one of its variants, separated by an equal sign. An example will make this clear.

    Mark=Marc
    Mark=Marque

    Because the database was prepared manually, duplicates have often been created in which the left-hand and right-hand variants are cross-linked, as in the example below.

    Mark=Marc
    Marc=Mark

    Such additions bloat the database and confuse the engine, which goes into a loop.
    What I need is a script or macro which removes such cross-linked duplicates.

    Example of input and output.

    Input:

    Mark=Marc
    Mark=Marque
    Marc=Mark
    Marque=Mark

    Expected output after removal of duplicates:

    Mark=Marc
    Mark=Marque

    The cross-linked data is removed.
    Many thanks in anticipation of your help. And Happy Holidays to all members of the forum.

      Dec 21, 2013#2

      Here is the script which works on your English example copied into an ANSI file with DOS line terminators.

      if (UltraEdit.document.length > 0)  // Is any file opened?
      {
         // Define environment for this script.
         UltraEdit.insertMode();
         UltraEdit.columnModeOff();
      
         // Select the entire content of the active file.
         UltraEdit.activeDocument.selectAll();
         if (UltraEdit.activeDocument.isSel())
         {
            // Determine type of line terminator. Default is DOS.
            var sLineTerm = "\r\n";
            if (typeof(UltraEdit.activeDocument.lineTerminator) == "number")
            {
               if (UltraEdit.activeDocument.lineTerminator == 1) sLineTerm = "\n";
               else if (UltraEdit.activeDocument.lineTerminator == 2) sLineTerm = "\r";
            }
            // Get all lines into an array of strings.
            var asLines = UltraEdit.activeDocument.selection.split(sLineTerm);
            // Remove the last string if it is an empty string.
            if (asLines[asLines.length-1] == "") asLines.pop();
      
            // Find lines with same terms as on another line and remove them.
            for (var nLine = 0; nLine < (asLines.length-1); nLine++)
            {
               // Optimization: The next line can be commented out if
               // all lines are guaranteed to contain an equal sign.
               if (asLines[nLine].indexOf('=') < 0) continue;
               
               // Get the terms around the equal sign exchanged in a new string.
               var sCompare = asLines[nLine].replace(/^(.+?)=(.+)$/,"$2=$1");
      
               var nCompare = nLine+1;
               while (nCompare < asLines.length)
               {
                  // Optimization: The next line can be uncommented to speed up
                  // the script if the lines in the file are sorted and the words
                  // left and right of the equal sign always start with the same character.
                  // if (asLines[nCompare][0] != sCompare[0]) break;
      
                  if (asLines[nCompare] != sCompare) nCompare++;
                  else
                  {
                     asLines.splice(nCompare,1);
                     // Optimization: Uncomment the next line to reduce the processing
                     // time if the lines in the file are sorted and all duplicate
                     // lines were already removed before running the script.
                     // break;
                  }
               }
            }
            // Append an empty string so that the file ends with a line
            // termination after pasting the joined lines via a user
            // clipboard over the selection in the active file.
            asLines.push("");
            UltraEdit.selectClipboard(9);
            UltraEdit.clipboardContent = asLines.join(sLineTerm);
            UltraEdit.activeDocument.paste();
            UltraEdit.clearClipboard();
            UltraEdit.selectClipboard(0);
         }
         UltraEdit.activeDocument.top();
      }

      Let me know if it also works on your Unicode file. Read the comments, especially those starting with "Optimization". Depending on the file content, you can reduce the processing time of the script by commenting or uncommenting 3 lines.
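
      For a very large database, the nested loops of the script compare each line with every later line. A minimal sketch of a lookup-table variant of the same comparison, written as plain JavaScript without any UltraEdit calls, might look like this. The function name removeReversedPairs and the object used as lookup table are illustrative assumptions and not part of the script; the sketch has not been tried inside UltraEdit.

```javascript
// Hypothetical helper (an assumption, not part of the script above):
// returns a new array without the lines which are the reversed form
// (right=left) of an earlier line that was kept.
function removeReversedPairs(asLines)
{
   var oReversed = {};   // Swapped forms of all lines kept so far.
   var asKept = [];
   for (var nLine = 0; nLine < asLines.length; nLine++)
   {
      var sLine = asLines[nLine];
      // Drop the line if it equals the swapped form of an earlier kept line.
      if (Object.prototype.hasOwnProperty.call(oReversed, sLine)) continue;
      asKept.push(sLine);
      // Remember the swapped form of this line for the lines still to come.
      // Lines without an equal sign are never used for a comparison.
      if (sLine.indexOf('=') >= 0)
      {
         oReversed[sLine.replace(/^(.+?)=(.+)$/, "$2=$1")] = true;
      }
   }
   return asKept;
}

// Usage with the example from the first post.
var asExample = ["Mark=Marc", "Mark=Marque", "Marc=Mark", "Marque=Mark"];
var asCleaned = removeReversedPairs(asExample);
// asCleaned contains "Mark=Marc" and "Mark=Marque".
```

      In the script, the two nested loops could presumably be replaced by asLines = removeReversedPairs(asLines); with everything else left unchanged. As in the script, identical duplicates such as two lines Mark=Marc are not removed, only reversed ones.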
      Best regards from an UC/UE/UES for Windows user from Austria

        Dec 22, 2013#3

        Dear Mofi,
        I tried the script out with and without the commented lines. It worked just great. At present my data is in English so there are no issues. I tried it on a small file in Unicode following your suggestions and it worked just fine.
        Many thanks. Merry Christmas and all the best for the New Year.

          Dec 22, 2013#4

          It seems this script has no problems with Cyrillic text in Windows-1251 or UTF-8. Thank you, Mofi!
          It's impossible to lead us astray for we don't care even to choose the way.