Identifying entries in file 2 that are already in file 1 and reporting only the entries unique to file 2


    Jan 04, 2011#1

    Hello,
    My problem and the solution I need are as follows:

    PROBLEM STATEMENT:
    I have two sets of files.
    File 1 is bilingual, i.e. it contains English and another language, with the structure:
    English=Foreign Language
    File 2 contains new words that I want to add. These are monolingual, i.e. only in English, and do not contain language 2.

    SOLUTION DESIRED:
    The desired solution is to find and store the words in file 2 that are not listed in file 1.

    EXAMPLE:

    FILE 1
    John=Jean
    Marie=Marie
    Teresa=Therese

    FILE 2
    John
    Peter
    Teresa
    Margaret

    The desired output file should have
    Peter
    Margaret

    since John and Teresa are already listed.

    Please help. The files are huge (over 98,000 words), and using third-party tools to identify duplicates is not possible. I have tried to compare columns (column 1 in file 1 and file 2), but don't know how to go about it. A script or macro would be just great.
    With all good wishes for the New Year and many thanks in anticipation,

    Dictdoc


      Jan 04, 2011#2

      In the past I have usually done such a task with a macro. But I am posting a script solution here, which is definitely faster than a macro solution because it does everything in memory.

      Code:

      if (UltraEdit.document.length > 1) {  // Are at least 2 files open?
         UltraEdit.insertMode();
         if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
         else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
         UltraEdit.document[0].hexOff();    // Bilingual file is first (left) one.
         UltraEdit.document[1].hexOff();    // Monolingual file is the second file.
         UltraEdit.document[0].selectAll();
         /* Get all words left of the equal sign from the bilingual file into a
            string with just line-feed as delimiter. The non-capturing OR expression
            (?:\r\n|\n|\r|$) makes this independent of the line terminator type of the
            source file and of whether the last line of the file has a line termination. */
         var sBilingual = UltraEdit.document[0].selection.replace(/(.+)=.*(?:\r\n|\n|\r|$)/g,"$1\n");
         // Get all words from monolingual file again with just line-feed as delimiter.
         UltraEdit.document[1].selectAll();
         var sMonolingual = UltraEdit.document[1].selection.replace(/(.+)(?:\r\n|\n|\r|$)/g,"$1\n");
         /* Split up the 2 strings into words (array of strings). The last string in both
            arrays is always an empty string because both input strings always end with a
            line-feed. Therefore the empty strings are removed from both string arrays. */
         var asWordsPresent = sBilingual.split('\n');
         var asWordsEnglish = sMonolingual.split('\n');
         asWordsPresent.pop();
         asWordsEnglish.pop();
         /* Search in array of already present words in bilingual file for
            the English words in the monolingual file. If the English word
            is present, delete it from the array of English words. */
         var nEnglishIndex = 0;
         while (nEnglishIndex < asWordsEnglish.length) {
      
            var nPresentIndex = 0;
            while (nPresentIndex < asWordsPresent.length) {
               if (asWordsPresent[nPresentIndex] == asWordsEnglish[nEnglishIndex]) {
                  asWordsEnglish.splice(nEnglishIndex,1);
                  break;
               }
               nPresentIndex++;
            }
            // Increment index in array of English words only if the
            // word at current index was not removed from the array.
            if (nPresentIndex == asWordsPresent.length) nEnglishIndex++;
         }
         /* Join the remaining English words and output them, if there
            are English words not present in the bilingual file at all. */
         if (asWordsEnglish.length > 0) {
            var sOutput = asWordsEnglish.join("\r\n");
            UltraEdit.newFile();
            UltraEdit.activeDocument.unixMacToDos();
            UltraEdit.activeDocument.write(sOutput+"\r\n");
         } else {
            UltraEdit.messageBox("All words from monolingual file found in bilingual file!");
         }
      }
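
The nested loops above compare each English word against every word from the bilingual file. For very large lists, one possible alternative is to put the words already present into an object used as a lookup table, so each English word needs only one lookup. The following is a minimal sketch in plain JavaScript and not part of the script above: the function name `filterNewWords` and the `"w:"` key prefix are made up for illustration, and it uses `var` and a plain object on the assumption that the script engine may not offer newer features such as `Set`.

```javascript
// Hypothetical helper: returns the entries of asWordsEnglish that do not
// occur in asWordsPresent, keeping their original order.
function filterNewWords(asWordsPresent, asWordsEnglish) {
   // Plain object used as a lookup table. The "w:" prefix keeps words such
   // as "constructor" from colliding with built-in property names.
   var oPresent = {};
   for (var i = 0; i < asWordsPresent.length; i++) {
      oPresent["w:" + asWordsPresent[i]] = true;
   }
   // Collect every English word that is not in the lookup table.
   var asNewWords = [];
   for (var j = 0; j < asWordsEnglish.length; j++) {
      if (oPresent["w:" + asWordsEnglish[j]] !== true) {
         asNewWords.push(asWordsEnglish[j]);
      }
   }
   return asNewWords;
}

// Example data from the first post.
var asExampleResult = filterNewWords(["John", "Marie", "Teresa"],
                                     ["John", "Peter", "Teresa", "Margaret"]);
// asExampleResult is ["Peter", "Margaret"]
```

In the script, such a function could take the place of the two nested while loops, with its return value joined and written to the new file in the same way as asWordsEnglish.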
      Best regards from an UC/UE/UES for Windows user from Austria


        Jan 06, 2011#3

        Dear Mofi,
        Many thanks for the script. The script is superfast: it processed over 100,000 words in around 2.8 seconds. (Set a time in time out.)
        Thanks once again.
        Dictdoc


          Oct 28, 2011#4

          Hi folks,

          The script code you have here is exactly what I would like to utilize as well. The only difference is that while Dictdoc's request is for words, mine is for numbers. I've tried deciphering the code to see if I could adapt it to suit my needs, but I must admit it's beyond my current capabilities.

          File 1 is used as a reference file.
          File 2 contains possible new numbers I want to add.

          In the example below, only columns 1-8 (the first three number pairs of each line) determine whether a line is considered a duplicate.

          File 1:

          05 92 84 27 88          021549682766
          01 20 03 59 99          154987265244
          77 55 33 32 26          100050048796


          File 2:

          05 92 84 13 58          021549682766
          25 66 84 79 11          145236985478
          01 20 03 22 09          154987265244
          77 55 33 45 76          100050048796
          33 33 56 98 47          965874125632


          Output file:

          25 66 84 79 11          145236985478
          33 33 56 98 47          965874125632

          May I ask for your help in a modification of this script so it would benefit my files as they are written above?


            Oct 28, 2011#5

            It was no problem for me to change the code of the script to your requirements.

            Code:

            if (UltraEdit.document.length > 1)    // Are at least 2 files open?
            {
               UltraEdit.insertMode();
               if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
               else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
               UltraEdit.document[0].hexOff();    // Reference file with the existing numbers.
               UltraEdit.document[1].hexOff();    // Second file with numbers to add, but only the
                                                  // new numbers not already existing in first file.
               /* Get first 3 number pairs from first file with just line-feed as delimiter
                  into a string. The non-capturing OR expression (?:\r\n|\n|\r|$) is used to
                  be independent of line terminator type of source file and the existence
                  of a line termination for the last line of the file. */
               UltraEdit.document[0].selectAll();
               var sReference = UltraEdit.document[0].selection.replace(/(\d\d \d\d \d\d).*(?:\r\n|\n|\r|$)/g,"$1\n");
               var asRefNums = sReference.split("\n");
               asRefNums.pop();                   // Remove empty string at end of array.
               UltraEdit.document[0].top();       // Cancel the selection.
            
               /* Get all lines from second file to an array of
                  strings with each string containing one line. */
               UltraEdit.document[1].selectAll();
               var sNumbersToAdd = UltraEdit.document[1].selection.replace(/\r\n|\n|\r/g,"\n");
               var asAddNums = sNumbersToAdd.split("\n");
               /* Remove last string from array if it is an empty string
                  because the second file ends with a line termination. */
               if (asAddNums[asAddNums.length-1] == "") asAddNums.pop();
               UltraEdit.document[1].top();       // Cancel the selection.
            
               /* Search the array of reference numbers for each line to add
                  to check whether its numbers are already present. */
               var asNewNums = new Array();
               for (var nAddIndex = 0; nAddIndex < asAddNums.length; nAddIndex++)
               {
                  // Always compare only the first 8 characters of every line.
                  var sNumbers = asAddNums[nAddIndex].substr(0,8);
                  for (var nRefIndex = 0; nRefIndex < asRefNums.length; nRefIndex++)
                  {
                     if (asRefNums[nRefIndex] == sNumbers) break;
                  }
                  // Were the 3 numbers found in the reference file?
                  if (nRefIndex == asRefNums.length)
                  {  // No. So this line contains really new numbers.
                     asNewNums.push(asAddNums[nAddIndex]);
                  }
               }
            
               /* Join the lines with really new numbers and output them,
                  if there are any new numbers in the second file at all. */
               if (asNewNums.length > 0)
               {
                  var sOutput = asNewNums.join("\r\n");
                  UltraEdit.newFile();
                  UltraEdit.activeDocument.unixMacToDos();
                  UltraEdit.activeDocument.write(sOutput+"\r\n");
                  UltraEdit.activeDocument.top();
               }
               else UltraEdit.messageBox("All numbers from second file found in first file!");
            }
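
The comparison of the first 8 characters can also be done with a lookup table instead of the inner for loop, which may help when both files are large. The following is a minimal sketch in plain JavaScript and not part of the script above: the function name `filterNewLines`, the parameter `nKeyLength` and the `"k:"` key prefix are made up for illustration. Unlike the script, it takes the key from the start of every reference line with `substring` rather than with a regular expression, so it assumes that each reference line begins with the number pairs.

```javascript
// Hypothetical helper: returns the lines of asAddLines whose first
// nKeyLength characters do not occur at the start of any line in asRefLines.
function filterNewLines(asRefLines, asAddLines, nKeyLength) {
   // Plain object used as a lookup table for the reference keys.
   var oRefKeys = {};
   for (var i = 0; i < asRefLines.length; i++) {
      oRefKeys["k:" + asRefLines[i].substring(0, nKeyLength)] = true;
   }
   // Keep every line whose key is not in the lookup table.
   var asNewLines = [];
   for (var j = 0; j < asAddLines.length; j++) {
      var sKey = "k:" + asAddLines[j].substring(0, nKeyLength);
      if (oRefKeys[sKey] !== true) {
         asNewLines.push(asAddLines[j]);
      }
   }
   return asNewLines;
}

// Example data from the request above.
var asExampleRef = ["05 92 84 27 88          021549682766",
                    "01 20 03 59 99          154987265244",
                    "77 55 33 32 26          100050048796"];
var asExampleAdd = ["05 92 84 13 58          021549682766",
                    "25 66 84 79 11          145236985478",
                    "01 20 03 22 09          154987265244",
                    "77 55 33 45 76          100050048796",
                    "33 33 56 98 47          965874125632"];
var asExampleNew = filterNewLines(asExampleRef, asExampleAdd, 8);
// asExampleNew holds the lines starting with "25 66 84" and "33 33 56".
```

Passing the key length as a parameter is only a convenience for files where a different number of leading columns decides what counts as a duplicate.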


              Oct 29, 2011#6

              Very nice, Mofi!

              Just finished an official test run. File 1 has 1875 lines and File 2 has 9500 lines. It took mere minutes to produce the output file, compared to my old way (the time that took is too embarrassing to mention).

              Gracious kudos to your education and knowledge,

              Thank you much.