Identifying words in file 2 that are already in file 1 and reporting only the words unique to file 2

dictdoc · Post #1 · Jan 04, 2011, 02:56 UTC

Hello,
My problem and the solution I need are as follows:

PROBLEM STATEMENT:
I have two sets of files.
File 1 is bilingual, i.e. it contains English and another language, with the structure:
English=Foreign Language
File 2 is basically new words that I want to add. These are mono-lingual, i.e. only in English and do not contain language 2.

SOLUTION DESIRED:
The desired solution is to find and store the words in file 2 that are not yet listed in file 1.

EXAMPLE:

FILE 1
John=Jean
Marie=Marie
Teresa=Therese

FILE 2
John
Peter
Teresa
Margaret

The desired output file should have
Peter
Margaret
since John and Teresa are already listed.

Please help. The files are huge (over 98,000 words), and using third-party tools to identify duplicates is not possible. I have tried to compare columns (column 1 in file 1 and file 2), but don't know how to go about it. A script or macro would be just great.
With all good wishes for the New Year and many thanks in anticipation,

Dictdoc

Mofi · Post #2 · Jan 04, 2011, 11:38 UTC

In the past I have usually done such a task with a macro. Here is a script solution instead, which is definitely faster than a macro solution because it does everything in memory.

if (UltraEdit.document.length > 1) {  // Are at least 2 files open?
   UltraEdit.insertMode();
   if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
   else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
   UltraEdit.document[0].hexOff();    // Bilingual file is first (left) one.
   UltraEdit.document[1].hexOff();    // Monolingual file is the second file.
   UltraEdit.document[0].selectAll();
   /* Get all words from bilingual file left of the equal sign with just line-feed
      as delimiter into a string. The non-capturing OR expression (?:\r\n|\n|\r|$)
      is used to be independent of line terminator type of source file and the
      existence of a line termination for the last line of the file. */
   var sBilingual = UltraEdit.document[0].selection.replace(/(.+)=.*(?:\r\n|\n|\r|$)/g,"$1\n");
   // Get all words from monolingual file again with just line-feed as delimiter.
   UltraEdit.document[1].selectAll();
   var sMonolingual = UltraEdit.document[1].selection.replace(/(.+)(?:\r\n|\n|\r|$)/g,"$1\n");
   /* Split up the 2 strings into words (array of strings). The last string in both
      arrays is always an empty string because both input strings always end with a
      line-feed. Therefore the empty strings are removed from both string arrays. */
   var asWordsPresent = sBilingual.split('\n');
   var asWordsEnglish = sMonolingual.split('\n');
   asWordsPresent.pop();
   asWordsEnglish.pop();
   /* Search in array of already present words in bilingual file for
      the English words in the monolingual file. If the English word
      is present, delete it from the array of English words. */
   var nEnglishIndex = 0;
   while (nEnglishIndex < asWordsEnglish.length) {

      var nPresentIndex = 0;
      while (nPresentIndex < asWordsPresent.length) {
         if (asWordsPresent[nPresentIndex] == asWordsEnglish[nEnglishIndex]) {
            asWordsEnglish.splice(nEnglishIndex,1);
            break;
         }
         nPresentIndex++;
      }
      // Increment index in array of English words only if the
      // word at current index was not removed from the array.
      if (nPresentIndex == asWordsPresent.length) nEnglishIndex++;
   }
   /* Join the remaining English words and output them, if there
      are English words not present in the bilingual file at all. */
   if (asWordsEnglish.length > 0) {
      var sOutput = asWordsEnglish.join("\r\n");
      UltraEdit.newFile();
      UltraEdit.activeDocument.unixMacToDos();
      UltraEdit.activeDocument.write(sOutput+"\r\n");
   } else {
      UltraEdit.messageBox("All words from monolingual file found in bilingual file!");
   }
}
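
A possible variation, offered only as a sketch and not as part of the script above: the nested loops compare every English word with every word already present, so the work grows with the product of both file sizes. A lookup object lets each word be checked in a single step. The function name findNewWords and its string parameters are hypothetical; inside UltraEdit the two strings would come from the selections of the two documents as in the script above. Unlike the script, this sketch skips empty lines.

```javascript
// Hypothetical helper, not part of the script above: returns the words of
// sMonolingual which are not left of an equal sign in sBilingual.
function findNewWords(sBilingual, sMonolingual) {
   // Split on any line terminator type (DOS, Unix or Mac).
   var asPresentLines = sBilingual.split(/\r\n|\n|\r/);
   var asEnglishWords = sMonolingual.split(/\r\n|\n|\r/);

   // Build the lookup table. Keys are prefixed with "#" so that words like
   // "constructor" cannot collide with built-in object properties.
   var oPresent = {};
   for (var i = 0; i < asPresentLines.length; i++) {
      // Like the greedy (.+)= expression, take everything up to the last "=".
      var nEqualPos = asPresentLines[i].lastIndexOf("=");
      if (nEqualPos > 0) {
         oPresent["#" + asPresentLines[i].substring(0, nEqualPos)] = true;
      }
   }

   // Keep only the English words not found in the lookup table.
   var asNewWords = [];
   for (var j = 0; j < asEnglishWords.length; j++) {
      var sWord = asEnglishWords[j];
      if (sWord === "") continue;   // Skip empty lines.
      if (oPresent["#" + sWord] !== true) asNewWords.push(sWord);
   }
   return asNewWords;
}

// Usage with the example from the first post:
var asResult = findNewWords("John=Jean\r\nMarie=Marie\r\nTeresa=Therese\r\n",
                            "John\r\nPeter\r\nTeresa\r\nMargaret\r\n");
// asResult is ["Peter", "Margaret"]
```

The comparison stays case-sensitive, exactly as in the script.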

dictdoc · Post #3 · Jan 06, 2011, 02:58 UTC

Dear Mofi,
Many thanks for the script. The script is superfast: it processed over 100,000 words in around 2.8 seconds. (Set a time in time out.)
Thanks once again.
Dictdoc

spottiswoad · Post #4 · Oct 28, 2011, 03:55 UTC

Hi folks,

The script you have here is exactly what I would like to use as well. The only difference is that while dictdoc's request is for words, mine is for numbers. I've tried deciphering the code to see if I could adapt it to suit my needs, but I must admit it's beyond my current capabilities.

File 1 is used as a reference file.
File 2 contains possible new numbers I want to add.

In the example below, columns 1-8 of each line are the section that decides whether a line is considered a duplicate or not.

File 1:

05 92 84 27 88 021549682766
01 20 03 59 99 154987265244
77 55 33 32 26 100050048796

File 2:

05 92 84 13 58 021549682766
25 66 84 79 11 145236985478
01 20 03 22 09 154987265244
77 55 33 45 76 100050048796
33 33 56 98 47 965874125632

Output file:

25 66 84 79 11 145236985478
33 33 56 98 47 965874125632

May I ask for your help in a modification of this script so it would benefit my files as they are written above?

Mofi · Post #5 · Oct 28, 2011, 06:06 UTC

It was no problem for me to adapt the code of the script to your requirements.

if (UltraEdit.document.length > 1)    // Are at least 2 files open?
{
   UltraEdit.insertMode();
   if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
   else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
   UltraEdit.document[0].hexOff();    // Reference file with the existing numbers.
   UltraEdit.document[1].hexOff();    // Second file with numbers to add, but only the
                                      // new numbers not already existing in first file.
   /* Get first 3 number pairs from first file with just line-feed as delimiter
      into a string. The non-capturing OR expression (?:\r\n|\n|\r|$) is used to
      be independent of line terminator type of source file and the existence
      of a line termination for the last line of the file. */
   UltraEdit.document[0].selectAll();
   var sReference = UltraEdit.document[0].selection.replace(/(\d\d \d\d \d\d).*(?:\r\n|\n|\r|$)/g,"$1\n");
   var asRefNums = sReference.split("\n");
   asRefNums.pop();                   // Remove empty string at end of array.
   UltraEdit.document[0].top();       // Cancel the selection.

   /* Get all lines from second file to an array of
      strings with each string containing one line. */
   UltraEdit.document[1].selectAll();
   var sNumbersToAdd = UltraEdit.document[1].selection.replace(/\r\n|\n|\r/g,"\n");
   var asAddNums = sNumbersToAdd.split("\n");
   /* Remove last string from array if it is an empty string
      because the second file ends with a line termination. */
   if (asAddNums[asAddNums.length-1] == "") asAddNums.pop();
   UltraEdit.document[1].top();       // Cancel the selection.

   /* Search the array of reference numbers for each of the
      numbers to add, checking whether it is already present. */
   var asNewNums = new Array();
   for (var nAddIndex = 0; nAddIndex < asAddNums.length; nAddIndex++)
   {
      // Always compare only the first 8 characters of every line.
      var sNumbers = asAddNums[nAddIndex].substr(0,8);
      for (var nRefIndex = 0; nRefIndex < asRefNums.length; nRefIndex++)
      {
         if (asRefNums[nRefIndex] == sNumbers) break;
      }
      // Were the 3 numbers found in the reference file?
      if (nRefIndex == asRefNums.length)
      {  // No. So this line contains really new numbers.
         asNewNums.push(asAddNums[nAddIndex]);
      }
   }

   /* Join the lines with really new numbers and output them,
      if there are any new numbers in the second file at all. */
   if (asNewNums.length > 0)
   {
      var sOutput = asNewNums.join("\r\n");
      UltraEdit.newFile();
      UltraEdit.activeDocument.unixMacToDos();
      UltraEdit.activeDocument.write(sOutput+"\r\n");
      UltraEdit.activeDocument.top();
   }
   else UltraEdit.messageBox("All numbers from second file found in first file!");
}
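
As a hedged sketch of the comparison step only: for much larger files, the inner loop over the reference numbers could be replaced by a lookup object keyed on the first characters of each line. The function name filterByPrefix and the parameter nKeyLength are hypothetical and not part of the script above. The sketch also assumes that every reference line starts with the key; the script instead extracts the key with a regular expression.

```javascript
// Hypothetical helper, not part of the script above: keeps the lines of
// asAddLines whose first nKeyLength characters do not start any reference line.
function filterByPrefix(asRefLines, asAddLines, nKeyLength) {
   // Lookup table of all known keys. The "#" prefix avoids collisions
   // with built-in object properties.
   var oKnown = {};
   for (var i = 0; i < asRefLines.length; i++) {
      oKnown["#" + asRefLines[i].substring(0, nKeyLength)] = true;
   }

   // A line is new if its key is not in the lookup table.
   var asNewLines = [];
   for (var j = 0; j < asAddLines.length; j++) {
      var sKey = asAddLines[j].substring(0, nKeyLength);
      if (oKnown["#" + sKey] !== true) asNewLines.push(asAddLines[j]);
   }
   return asNewLines;
}

// Usage with the example data from the request, comparing columns 1-8:
var asRefNums = ["05 92 84 27 88 021549682766",
                 "01 20 03 59 99 154987265244",
                 "77 55 33 32 26 100050048796"];
var asAddNums = ["05 92 84 13 58 021549682766",
                 "25 66 84 79 11 145236985478",
                 "01 20 03 22 09 154987265244",
                 "77 55 33 45 76 100050048796",
                 "33 33 56 98 47 965874125632"];
var asNewNums = filterByPrefix(asRefNums, asAddNums, 8);
// asNewNums contains the lines starting with "25 66 84" and "33 33 56".
```

The whole line is kept in the result; only the first 8 characters take part in the comparison, as in the script.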

spottiswoad · Post #6 · Oct 29, 2011, 00:23 UTC

Very nice, Mofi!

Just finished an official test run. File 1 has 1875 lines and File 2 has 9500 lines. It took mere minutes to produce the output file, compared to my old way (the time that took is too embarrassing to mention).

Gracious kudos to your education and knowledge,

Thank you very much.