Find numbers in huge file and output them to files line by line

Find numbers in huge file and output them to files line by line

1
NewbieNewbie
1

    Oct 17, 2015#1

    Hello all and Mofi, you seem to all be way ahead of the game on here.

    I'm a new user of UE as of the last 24 hours. I have a csv file that is about 23 GB and not very organized. It has millions of lines and half of the time I can't even open it the file because of its size.

    Within this file, I need to know how to use a macro or script to pull out specific information from the file and be able to export this information to excel. Within the file, there are numbers that all start with "231" and then is followed by 10 digits. For example "2311234567891" "2311234567892". There are about 18 millions of these numbers within the file that is surrounded by text. My goal is to just extract these numbers, and nothing else, into a separate file that I can use in Excel. Since I know Excel only allows about 1 million rows, there will need to be several exported files so that I can view these in Excel.

    I'm thinking that this should definitely be possible since all of the numbers start with "231".

    Any assistance Mofi or anyone else can give me on the correct macro or script to use and how to accomplish this would be greatly appreciated.

    Thanks

    6,687586
    Grand MasterGrand Master
    6,687586

      Oct 18, 2015#2

      There are already the scripts FindStringsToNewFileExtended.js and Split up large file based on line number count which could be used for this task.

      But I wrote and tested (on a small file) two more scripts which both need function GetFileName not included in the posted code below.

      Both scripts have the variable nLinesPerFile with the value used to determine how many lines (= numbers) are written to an output file before creating one more output file.

      Please read the comments in the scripts for more information.


      1. Using Find in Files

      This script uses Find in Files to get all lines containing a number starting with 231 into a results file being processed further.

      It is necessary to adapt in script code below "C:\\Temp\\Test.csv".

      This script as coded can be used only if no file contains a number of interest more than once. Otherwise it would be necessary to modify:

      Code: Select all

        UltraEdit.activeDocument.findReplace.replace("^.*(\\<231[0-9]+).*$","\\1");
      It is not necessary to open the huge file for using the script with code below.

      Code: Select all

      // Insert here the function GetFileName from FileNameFunctions.js which can be
      // downloaded from http://www.ultraedit.com/files/scripts/FileNameFunctions.js
      
      // Define name of input file with full path.
      var sNameOfInputFileWithPath = "C:\\Temp\\Test.csv";
      
      var nLinesPerFile = 1000000;  // Define number of lines per output file.
      var nFileNumber = 0;          // Define first file number (pre-incremented).
      
      // Define environment for this script.
      UltraEdit.insertMode();
      if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
      else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
      
      // Get file name of active file with path, but without file extension.
      var sFileNameWithPath = GetFileName(sNameOfInputFileWithPath,false,true);
      
      // Define the parameters for the UltraEdit regular expression
      // Find in Files used below to find the strings of interest.
      UltraEdit.ueReOn();
      UltraEdit.frInFiles.filesToSearch=0;
      UltraEdit.frInFiles.directoryStart="";
      UltraEdit.frInFiles.searchInFilesTypes=sNameOfInputFileWithPath;
      UltraEdit.frInFiles.displayLinesDoNotMatch=false;
      UltraEdit.frInFiles.openMatchingFiles=false;
      UltraEdit.frInFiles.ignoreHiddenSubs=true;
      UltraEdit.frInFiles.useOutputWindow=false;
      UltraEdit.frInFiles.reverseSearch=false;
      UltraEdit.frInFiles.unicodeSearch=false;
      UltraEdit.frInFiles.useEncoding=false;
      UltraEdit.frInFiles.searchSubs=false;
      UltraEdit.frInFiles.matchCase=true;
      UltraEdit.frInFiles.matchWord=true;
      UltraEdit.frInFiles.regExp=true;
      
      // Find all lines containing a "word" starting with 231 and having
      // at least one more digit and write them into a new Unicode file.
      UltraEdit.frInFiles.find("231[0-9]+");
      
      // Get number of strings found in file converted from string to integer.
      UltraEdit.perlReOn();
      UltraEdit.activeDocument.findReplace.mode=0;
      UltraEdit.activeDocument.findReplace.matchCase=true;
      UltraEdit.activeDocument.findReplace.matchWord=false;
      UltraEdit.activeDocument.findReplace.regExp=true;
      UltraEdit.activeDocument.findReplace.searchDown=false;
      UltraEdit.activeDocument.findReplace.searchInColumn=false;
      UltraEdit.activeDocument.findReplace.find("[0-9]+(?= time)");
      
      var nTotalFound = parseInt(UltraEdit.activeDocument.selection,10);
      
      if (!nTotalFound)    // Was any string found at all?
      {
         UltraEdit.closeFile(UltraEdit.activeDocument.path,2);
         UltraEdit.messageBox("Could not find any string!");
      }
      else
      {
         // Convert the results file to ASCII/ANSI.
         UltraEdit.activeDocument.unicodeToASCII();
         UltraEdit.activeDocument.top();
         
         // Remove from all lines containing a found string
         // everything before and after the found string.
         UltraEdit.activeDocument.findReplace.mode=0;
         UltraEdit.activeDocument.findReplace.searchDown=true;
         UltraEdit.activeDocument.findReplace.preserveCase=false;
         UltraEdit.activeDocument.findReplace.replaceAll=true;
         UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
         UltraEdit.activeDocument.findReplace.replace("^.*(\\<231[0-9]+).*$","\\1");
      
         // Remove all lines not starting with a found string.
         UltraEdit.activeDocument.findReplace.replace("^(?!231).*\r\n","");
      
         // Is number of found strings lower or equal than maximum number of lines
         // per file, save the results file with with name of input file with an
         // underscore and current file number append and with file extension CSV.
         if (nTotalFound <= nLinesPerFile)
         {
            nFileNumber++;
            UltraEdit.saveAs(sFileNameWithPath + "_" + nFileNumber.toString(10) + ".csv");
            UltraEdit.closeFile(UltraEdit.activeDocument.path,2);
         }
         else
         {
            // If used version of UE/UES does not have property activeDocumentIdx, use function
            // GetFileIndex from http://www.ultraedit.com/forums/viewtopic.php?f=52&t=4596#p26710
            var nResultsFileIndex = UltraEdit.activeDocumentIdx;
            UltraEdit.selectClipboard(9);
      
            var nLinesToSave = nTotalFound;
            do
            {
               var nNextLineNumber = (nLinesToSave < nLinesPerFile) ? nLinesToSave : nLinesPerFile;
               nLinesToSave -= nNextLineNumber;
               UltraEdit.document[nResultsFileIndex].gotoLineSelect(++nNextLineNumber,1);
               UltraEdit.document[nResultsFileIndex].cut();
      
               // Create a new ASCI//ANSI file with DOS line terminators.
               UltraEdit.newFile();
               UltraEdit.activeDocument.unixMacToDos();
               UltraEdit.activeDocument.unicodeToASCII();
      
               // Paste the found strings into the file.
               UltraEdit.activeDocument.paste();
      
               // Save the new file with name of input file with an underscore
               // and current file number append and with file extension CSV
               // and then close the file.
               nFileNumber++;
               UltraEdit.saveAs(sFileNameWithPath + "_" + nFileNumber.toString(10) + ".csv");
               UltraEdit.closeFile(UltraEdit.activeDocument.path,2);
            }
            while(nLinesToSave);
      
            // Clear user clipboard 9 and select clipboard of operating system.
            UltraEdit.clearClipboard(9);
            UltraEdit.selectClipboard(0);
            UltraEdit.closeFile(UltraEdit.document[nResultsFileIndex].path,2);
         }
      
         UltraEdit.messageBox("Found " + nTotalFound.toString(10) + " string" +
                              ((nTotalFound != 1) ? "s" : "") + " saved in " +
                              nFileNumber.toString(10) + " file" +
                              ((nFileNumber > 1) ? "s." : "."));
      }
      
      2. Find the numbers in opened file

      This script finds numbers starting with 231 in opened file from top to bottom and writing them to 1 or more new files.

      Code: Select all

      // Insert here the function GetFileName from FileNameFunctions.js which can be
      // downloaded from http://www.ultraedit.com/files/scripts/FileNameFunctions.js
      
      function SaveFoundStrings()
      {
         // Append to array an empty string to terminate also last line in output file.
         g_asFound.push("");
      
         // Join the found strings with DOS line terminator into user clipboard 9.
         UltraEdit.clipboardContent = g_asFound.join("\r\n");
      
         // Create a new ASCI//ANSI file with DOS line terminators.
         UltraEdit.newFile();
         UltraEdit.activeDocument.unixMacToDos();
         UltraEdit.activeDocument.unicodeToASCII();
      
         // Paste the found strings into the file.
         UltraEdit.activeDocument.paste();
      
         // Save new file with name of input file with an underscore and current
         // file number appended with file extension CSV and then close the file.
         g_nFileNumber++;
         UltraEdit.saveAs(g_sFileNameWithPath + "_" + g_nFileNumber.toString(10) + ".csv");
         UltraEdit.closeFile(UltraEdit.activeDocument.path,2);
      
         // Clear user clipboard 9 and the array to free memory.
         UltraEdit.clearClipboard(9);
         g_asFound = [];
      }
      
      
      if (UltraEdit.document.length > 0)  // Is any file opened?
      {
         // Define number of lines per output file.
         var nLinesPerFile = 1000000;
         // Define first file number (pre-incremented in function SaveFoundStrings).
         var g_nFileNumber = 0;
      
         // Define environment for this script.
         UltraEdit.insertMode();
         if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
         else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
         UltraEdit.selectClipboard(9);
      
         // Move caret to top of the active file.
         UltraEdit.activeDocument.top();
      
         // Get file name of active file with path, but without file extension.
         var g_sFileNameWithPath = GetFileName(UltraEdit.activeDocument.path,false,true);
      
         // If used version of UE/UES does not have property activeDocumentIdx, use function
         // GetFileIndex from http://www.ultraedit.com/forums/viewtopic.php?f=52&t=4596#p26710
         var nInputFileIndex = UltraEdit.activeDocumentIdx;
      
         // Open a new file which speeds up script execution as no display updates
         // are necessary on searching in former active file for the strings of
         // interest when the file window is displayed maximized after opening.
         var nNewFileIndex = UltraEdit.document.length;
         UltraEdit.newFile();
      
         var nTotalFound = 0;    // Total number of found strings.
         var g_asFound = [];     // Define an array for the found strings.
      
         // Define the parameters for the UltraEdit regular expression
         // search used below to find the strings of interest.
         UltraEdit.ueReOn();
         UltraEdit.document[nInputFileIndex].findReplace.mode=0;
         UltraEdit.document[nInputFileIndex].findReplace.matchCase=true;
         UltraEdit.document[nInputFileIndex].findReplace.matchWord=true;
         UltraEdit.document[nInputFileIndex].findReplace.regExp=true;
         UltraEdit.document[nInputFileIndex].findReplace.searchDown=true;
         UltraEdit.document[nInputFileIndex].findReplace.searchInColumn=false;
      
         // Run this loop until no string found anymore.
         while (UltraEdit.document[nInputFileIndex].findReplace.find("231[0-9]+"))
         {
            // Append the found string to array of found strings.
            g_asFound.push(UltraEdit.document[nInputFileIndex].selection);
      
            // Is the limit of found strings (lines) per file reached?
            if (g_asFound.length == nLinesPerFile)
            {
               nTotalFound += g_asFound.length;
               SaveFoundStrings();
            }
         }
      
         if (g_asFound.length)   // Are there found strings not yet saved?
         {
            nTotalFound += g_asFound.length;
            SaveFoundStrings();
         }
      
         // Select clipboard of operating system.
         UltraEdit.selectClipboard(0);
         // Close the new file initially created without saving.
         UltraEdit.closeFile(UltraEdit.document[nNewFileIndex].path,2);
      
         // Output a summary information in a message box.
         if (nTotalFound)
         {
            UltraEdit.messageBox("Found " + nTotalFound.toString(10) + " string" +
                                 ((nTotalFound != 1) ? "s" : "") + " saved in " +
                                 g_nFileNumber.toString(10) + " file" +
                                 ((g_nFileNumber > 1) ? "s." : "."));
         }
         else
         {
            UltraEdit.messageBox("Could not find any string!");
         }
      }
      
      Best regards from an UC/UE/UES for Windows user from Austria