How to get the lines in a list file not found in a large data file to a new file?

    Dec 16, 2015#1

    I have two lists of object identifiers. One file has about 6 million rows and the other has about 20 thousand rows.

    I want to find all lines in the smaller list file that are not present in the large data file.

    An example of the lines in both files:

    Code:

    09000071800d230d
    09000071800d230e
    09000071800d230f
    09000071800d2310
    09000071800d2311
    09000071800d2312
    09000071800d2313
    09000071800d2314
    I tried using UltraCompare, but the results are not as expected. I got a few rows marked with >!, which seems fine (as expected, these rows are not present in the 6-million-row file), but also some marked with *, which doesn't look right (one is present, the next is not).

    Thanks

      Dec 16, 2015#2

      In other words, you want to search the large data file for every string in the smaller list file and get a report of which strings from the list file are not found in the data file.

      A file comparison tool like UltraCompare can't be used for this task unless the strings are sorted alphabetically in both files. Even then, with many very similar lines, the comparison result could be wrong because the comparison tool does not know that it should always compare only entire lines. A text comparison tool is not designed to do the job of a database application, which is what is usually used for such tasks.

      A text editor like UltraEdit is also not designed for database tasks like this one. But UltraEdit can open a text file of any size and has built-in scripting support, which makes it possible to code a small script for this "check each line from list file against each line in data file" task.

      Requirements for script execution in UltraEdit:
      • First opened file must be the large data file.
      • Second opened file must be the small list file.
      • Other files are ignored by the script.
      A third file could be the script file itself, containing the code posted below, saved as an ASCII/ANSI file with DOS line terminators under a name like FindUniqueLinesInList.js and executed by clicking Run Active Script in the Scripting menu.

      A new file containing all lines from the second file that were not found in the first file is created only if at least one line from the list file is missing from the data file. Otherwise, a message box is displayed stating that all object identifiers in the list file were found in the data file.
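
For readers who only want to see the underlying logic, here is a minimal sketch of the same "which list lines are missing from the data file" check in plain JavaScript. It is not UltraEdit script code: the function name linesNotInData is hypothetical, and it assumes a JavaScript engine that provides Set (for example Node.js) and enough memory to hold both texts as strings. UltraEdit's own script engine may not support Set.

```javascript
// Hypothetical helper, not part of the UltraEdit scripting API.
// Returns all non-empty lines of listText that do not occur
// as an entire line anywhere in dataText.
function linesNotInData(listText, dataText) {
  // Accept DOS (\r\n), Unix (\n) and old Mac (\r) line terminators.
  var lineBreak = /\r\n|\n|\r/;

  // Collect every data line once for fast whole-line lookups.
  var dataLines = new Set(dataText.split(lineBreak));

  // Keep only list lines which are not empty and not in the data.
  return listText.split(lineBreak).filter(function (line) {
    return line.length > 0 && !dataLines.has(line);
  });
}

// Example: "230d" and "230f" are missing from the data text.
var missing = linesNotInData("230d\r\n230e\r\n230f\r\n", "230e\n2310\n");
console.log(missing.join("\n"));
```

Because whole lines are looked up in a set, a substring match such as "230" inside "2301" never counts as found, which is the same guarantee the anchored regular expression gives in the UltraEdit script.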

      Code:

      if (UltraEdit.document.length >= 2)  // Are at least 2 files opened?
      {
         // Define the environment for the script.
         UltraEdit.insertMode();
         if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
         else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
         UltraEdit.perlReOn();
      
         // The first opened file - most left in open file tabs bar - must
         // be the large data file with the millions of rows (lines).
      
         // Define the parameter for the case sensitive Perl regular expression
         // finds executed in the loop below for each line in list file.
         UltraEdit.document[0].findReplace.mode=0;
         UltraEdit.document[0].findReplace.matchCase=true;
         UltraEdit.document[0].findReplace.matchWord=false;
         UltraEdit.document[0].findReplace.regExp=true;
         UltraEdit.document[0].findReplace.searchDown=true;
         UltraEdit.document[0].findReplace.searchInColumn=false;
      
         // Move caret in data file to top of file.
         UltraEdit.document[0].top();
      
         // The second opened file must be the list file containing the lines
         // to search for in data file. It must be small enough to load it
         // completely into memory as an array of strings for this script.
      
         // This file is made active which avoids display updates on first file
         // if document windows are displayed maximized on script start. Frequent
         // display updates result in a much longer script execution time.
         UltraEdit.document[1].setActive();
      
         UltraEdit.document[1].selectAll();
         if (UltraEdit.document[1].isSel())
         {
            var sLineTerm;
            if (UltraEdit.document[1].lineTerminator <= 0) sLineTerm = "\r\n";
            else if (UltraEdit.document[1].lineTerminator == 1) sLineTerm = "\n";
            else sLineTerm = "\r";
      
            // Get the selected lines as an array of strings.
            var asSearchData = UltraEdit.document[1].selection.split(sLineTerm);
            UltraEdit.document[1].top();  // Just for discarding the selection.
      
            // The finds in first file are done with the strings from list file
            // in reverse order to make it easy to remove all strings found in
            // data file from the array.
            var nDataIndex = asSearchData.length;
            while(nDataIndex > 0)
            {
               nDataIndex--;
               // If the next search string is empty, remove it from the array.
               if (asSearchData[nDataIndex].length == 0)
               {
                  asSearchData.splice(nDataIndex,1);
               }
               else
               {  // A Perl regular expression is used to make sure to search
                  // always for entire lines and not just substrings. This
                  // requires data strings which do not contain characters with
                  // special meaning in Perl regular expression search strings.
                  var sFindExp = "^" + asSearchData[nDataIndex] + "$";
                  if (UltraEdit.document[0].findReplace.find(sFindExp))
                  {
                     // This line is found in data file, remove it from list.
                     asSearchData.splice(nDataIndex,1);
                     UltraEdit.document[0].top();
                  }
               }
            }
      
            // Are there lines from list file not found in data file?
            if (asSearchData.length)
            {
               // Append an empty string so that the last line in
               // the new file also ends with a line terminator.
               asSearchData.push("");
      
               // Create a new file and determine type of line termination.
               UltraEdit.newFile();
               if (UltraEdit.activeDocument.lineTerminator <= 0) sLineTerm = "\r\n";
               else if (UltraEdit.activeDocument.lineTerminator == 1) sLineTerm = "\n";
               else sLineTerm = "\r";
      
               // Write all not found lines into new file line by line as one block.
               UltraEdit.activeDocument.write(asSearchData.join(sLineTerm));
               UltraEdit.activeDocument.top();
            }
            else
            {
               UltraEdit.messageBox("All object identifiers from list file found in data file.","Object Identifiers Check");
            }
         }
      }
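
The script requires search strings without characters that have a special meaning in a Perl regular expression. The hexadecimal identifiers in this topic meet that requirement. If a list file could contain characters like . or ( or +, a small helper along the following lines could escape them first. The function names escapeRegExp and wholeLineExpression are hypothetical and not part of UltraEdit; the sketch assumes that a backslash in front of a punctuation character makes it literal, as in standard Perl syntax.

```javascript
// Hypothetical helper: put a backslash in front of every character
// with a special meaning in a regular expression search string.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build an expression matching only an
// entire line which consists of exactly the given text.
function wholeLineExpression(text) {
  return "^" + escapeRegExp(text) + "$";
}
```

In the script, the find expression could then be built with wholeLineExpression(asSearchData[nDataIndex]) instead of concatenating "^" and "$" directly.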
      
      Best regards from an UC/UE/UES for Windows user from Austria