Split up large file based on line number count

    Mar 27, 2012#1

    Hi there!

    I'm trying to split a massive text (xyz) file into usable chunks so I can process them into survey software.

    I am trying to copy by row number (from-to) as I know the size of each chunk that I can deal with.

    I want to copy from row 0 to 1048576 to start, then from row 1048576 to 2097152 as the 2nd set, and so on.

    Is this possible in UE?

    Thanks, sam

      Mar 27, 2012#2

      There is already a macro solution for this task, see Splitting Big Files. Nowadays the task could be done better with an UltraEdit script, but recoding the macro as a script would make sense only if you need to do this regularly and not just once.

        Mar 27, 2012#3

        Thanks Mofi - much appreciated, as I am completely new to this.

        Do I have to copy the code, save it as a .mac file and run it?

        Do I need to define into how many chunks to split up the file, in other words the number of output files?

        Cheers

          Mar 27, 2012#4

          Okay, the macro solution is not easy to set up for a beginner. Therefore I decided to code a script for that task. I first wanted to code it for general usage by every UltraEdit / UEStudio user who needs to split up a file based on number of lines. But I stopped developing the general script a few minutes after starting, because for general usage lots of things must be taken into account, like file names with no file extension, splitting up the contents of a new file not yet saved, Unicode and ASCII/ANSI files, DOS/UNIX/MAC terminated lines, the version of UltraEdit, ...

          So I developed a quick solution for you working only for ASCII/ANSI files. The file to split must be the first file opened in UltraEdit, which is the leftmost file on the open file tabs bar.

          Copy the following code into a new file and save it, for example with file name SplitFile.js. Then run the script by clicking on menu item Run Active Script in menu Scripting. The script copies blocks of 1,048,576 lines into new files and saves each new file into the same directory as the first opened file with the same file name, but with an incrementing number after an underscore before the file extension.

          // First file (most left on file tabs bar) must be the file to split.
          
          if (UltraEdit.document.length > 0) {  // Is any file opened?
          
             var nLinesPerFile = 1048576;
             var nNextLineNum = nLinesPerFile + 1;
             var nFileCount = 0;
          
             // Define the environment for the script.
             UltraEdit.insertMode();
             if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
             else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
             UltraEdit.document[0].hexOff();
             // Move caret to top of the file.
             UltraEdit.document[0].top();
          
             // Quick and dirty solution to get file name without extension
             // and the file extension. Does not work for all file names.
             var nLastPoint = UltraEdit.document[0].path.lastIndexOf('.');
             if (nLastPoint < 0) nLastPoint = UltraEdit.document[0].path.length;
             var sFileName = UltraEdit.document[0].path.substr(0,nLastPoint) + '_';
             var sFileExt = UltraEdit.document[0].path.substr(nLastPoint);
          
             while (1) {
                UltraEdit.document[0].gotoLineSelect(nNextLineNum,1);
                if (!UltraEdit.document[0].isSel()) break;
                UltraEdit.newFile();
                UltraEdit.activeDocument.write(UltraEdit.document[0].selection);
                nFileCount++;
                UltraEdit.saveAs(sFileName + nFileCount + sFileExt);
                UltraEdit.closeFile(UltraEdit.activeDocument.path,2);
                nNextLineNum += nLinesPerFile;
                UltraEdit.document[0].cancelSelect();
             }
             UltraEdit.document[0].top();
             UltraEdit.messageBox(nFileCount + " files created.");
          }

          The script as is requires UltraEdit v17.20 or UEStudio v11.20 or later because of function cancelSelect() and does not work for Unicode files.

          Note: There is a better script below using clipboard for faster copying, supporting Unicode files, and having an additional option to copy to every new file also the first line for CSV files with a header row.
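The naming scheme described above (same name, underscore, running number, then the file extension) can be sketched in plain JavaScript outside of UltraEdit. The function name buildPartName is hypothetical and not part of the posted script. Unlike the quick solution in the script, this sketch searches for the point only in the file name part of the path, which is an added assumption about how paths with a point in a directory name should be handled.

```javascript
// Hypothetical helper, not part of the posted script: builds the output
// file name for part number nPart from the full path of the file to split.
function buildPartName(sFullPath, nPart) {
   // Position of the last directory separator (backslash or slash).
   var nLastSep = Math.max(sFullPath.lastIndexOf("\\"), sFullPath.lastIndexOf("/"));
   var nLastPoint = sFullPath.lastIndexOf(".");
   // A point inside a directory name or at the start of the file name
   // is not treated as the beginning of a file extension.
   if (nLastPoint <= nLastSep + 1) nLastPoint = sFullPath.length;
   return sFullPath.substring(0, nLastPoint) + "_" + nPart + sFullPath.substring(nLastPoint);
}
```

For a path such as C:\My.Dir\data, the quick solution in the script would put the number into the directory name, while this sketch appends it to the file name.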

            Mar 28, 2012#5

            Mofi - I really appreciate this - many many thanks

            I'll let you know how I get on.

            cheers dude
            sam

              Jan 22, 2014#6

              Hi,

              First off, thanks. The above code works great to split my huge data dumps. But is there a way to keep the first line of every split text document the same as in the original file, like a header row? I have UltraEdit Professional Text/HEX Editor Version 20.00.0.1056.

              Thanks,
              nrama002

                Jan 23, 2014#7

                It was no problem to enhance the script with an option to also copy the first line into every file. By setting the boolean variable bCopyFirstLine to true or false, the script below either copies the first line of the file into every new file or does not.

                // First file (most left on file tabs bar) must be the file to split.
                
                if (UltraEdit.document.length > 0) {  // Is any file opened?
                
                   var bCopyFirstLine = true;    // Copy first line into every file?
                                                 // NO = false, YES = true.
                   var nLinesPerFile = 1048576;  // Number of lines per file without the
                                                 // first line on copying it to all files.
                
                   var nFileCount = 0;           // Counts the files created by this script.
                   var nNextLineNum = nLinesPerFile + 1;  // Line number for next block selection.
                
                   // Define the environment for the script.
                   UltraEdit.insertMode();
                   if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
                   else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
                   UltraEdit.document[0].hexOff();
                   // Move caret to top of the file.
                   UltraEdit.document[0].top();
                
                   if (bCopyFirstLine) {
                      // Code to copy first line of file into user clipboard 8 for pasting
                      // it later into every file created on splitting the large file.
                      UltraEdit.selectClipboard(8);
                      UltraEdit.document[0].selectLine();
                      UltraEdit.document[0].copy();
                      UltraEdit.document[0].cancelSelect();
                      UltraEdit.document[0].gotoLine(2,1);
                      nNextLineNum++;
                   }
                   UltraEdit.selectClipboard(9);
                
                   // Quick and dirty solution to get file name without extension
                   // and the file extension. Does not work for all file names.
                   var nLastPoint = UltraEdit.document[0].path.lastIndexOf('.');
                   if (nLastPoint < 0) nLastPoint = UltraEdit.document[0].path.length;
                   var sFileName = UltraEdit.document[0].path.substr(0,nLastPoint) + '_';
                   var sFileExt = UltraEdit.document[0].path.substr(nLastPoint);
                
                   while (1) {
                      UltraEdit.document[0].gotoLineSelect(nNextLineNum,1);
                      if (!UltraEdit.document[0].isSel()) break;
                
                      // Copy the selected block to user clipboard 9.
                      UltraEdit.document[0].copy();
                      UltraEdit.document[0].cancelSelect();
                
                      UltraEdit.newFile();
                
                      if (bCopyFirstLine) {
                         // Paste the first line first into the new file.
                         UltraEdit.selectClipboard(8);
                         UltraEdit.activeDocument.paste();
                         UltraEdit.selectClipboard(9);
                      }
                      // Paste the copied block into the new file.
                      UltraEdit.activeDocument.paste();
                      nFileCount++;
                      UltraEdit.saveAs(sFileName + nFileCount + sFileExt);
                      UltraEdit.closeFile(UltraEdit.activeDocument.path,2);
                      nNextLineNum += nLinesPerFile;
                   }
                   UltraEdit.document[0].top();
                
                   // Clear the used user clipboards and reselect system clipboard.
                   UltraEdit.clearClipboard();
                   if (bCopyFirstLine) {
                      UltraEdit.selectClipboard(8);
                      UltraEdit.clearClipboard();
                   }
                   UltraEdit.selectClipboard(0);
                   UltraEdit.messageBox(nFileCount + " files created.");
                }
                
                The script should now also work for Unicode files as long as UltraEdit is configured to create a new file as UTF-16 or UTF-8 file.

                This enhanced script should also be faster than the first version, because it uses the commands copy and paste to copy a large block into a new file instead of writing a selection from the first file into the new file, which usually takes longer than pasting a large block.

                Instructions for usage of this UE/UES script:
                • Open the large / huge file as first file.
                • Then create a new ASCII file as second file and copy and paste the script code above into this file.
                • Change boolean value of variable bCopyFirstLine at top of script to false if first line should not be copied into each file.
                • Change number value of variable nLinesPerFile at top of script to whatever you want.
                • Save the file anywhere for example with name SplitLargeFileByLineNumber.js.
                • Execute the script by clicking on Run Active Script in menu Scripting.
                The script does not modify the large / huge file opened as first file.

                The script can also be added via Scripting - Scripts to the list of regularly used scripts, making it possible to execute the script without opening it via menu Scripting or via the Script List.
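The effect of bCopyFirstLine and nLinesPerFile can be illustrated with a small sketch that works on an array of lines instead of an UltraEdit document. The function name splitLines and the array-based input are assumptions made for illustration only; the real script works with selections and clipboards inside UltraEdit.

```javascript
// Hypothetical illustration, not the UltraEdit script itself: splits an
// array of lines into chunks of nLinesPerFile lines. With bCopyFirstLine
// set to true, the first line is treated as header row, is put at top of
// every chunk, and does not count towards nLinesPerFile.
function splitLines(asLines, nLinesPerFile, bCopyFirstLine) {
   if (nLinesPerFile < 1) throw new Error("nLinesPerFile must be at least 1");
   var aasFiles = [];
   var nStart = bCopyFirstLine ? 1 : 0;
   for (var nLine = nStart; nLine < asLines.length; nLine += nLinesPerFile) {
      var asChunk = asLines.slice(nLine, nLine + nLinesPerFile);
      if (bCopyFirstLine) asChunk.unshift(asLines[0]);
      aasFiles.push(asChunk);
   }
   return aasFiles;
}
```

As in the script, nLinesPerFile is here the number of lines per file without the header line, so each chunk holds up to nLinesPerFile + 1 lines when the header is copied.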
                Best regards from an UC/UE/UES for Windows user from Austria

                  May 22, 2015#8

                  mofi,

                  Thanks for pointing me to this topic with the script above. I always wanted to do some UE scripting, I just do not have the time. For what it is worth, I have been using UE over the last ten years and this is the first issue that I have not been able to resolve on my own.

                  Aside from setting bCopyFirstLine to false, I ran the script as is with UE v19.10.0.1012 and five blank text files were created (0 KB). It took under a minute to create each one and I received the error "Cannot Allocate Memory" five times.

                  I then started reducing nLinesPerFile incrementally to determine which line count would work as I wanted it to; 238,020 turned out to be the magic number. The time it took to split the one large file into 19 smaller files was 12 minutes, faster than I expected, and the file sizes varied between 269 and 289 MB.

                  So would the current version of UE allow me to split the one large file into fewer than ten semi-large files without experiencing a memory error?

                    May 23, 2015#9

                    I don't have such a large file to evaluate whether UE v22.0.0.66 can copy a larger block via clipboard to a new file. But I don't think so, as UE v22 is also still a 32-bit application. The maximum number of bytes that can be copied at once depends on
                    • the architecture of the application: x86 or x64,
                    • whether the application was coded to handle addresses above 2 GB and linked with the large address aware option, see the Wikipedia article about x86-64,
                    • and the largest free block in memory accessible by the application.
                    The last point is what an application can't really control and what finally determines the largest block which can be copied at once. For example, it does not help an x86 application that is not large address aware to have 1.5 GB of free memory in total if the accessible 2 GB is fragmented by earlier memory allocations and frees into several smaller free blocks. It might be helpful to restart Windows to reduce memory usage to a minimum, then start UltraEdit and run the script for splitting up the file.

                    But it would be much better to edit the script and use the following enhanced version, which splits a huge file into several large files (tested only on a small file).

                    // First file (most left on file tabs bar) must be the file to split.
                    
                    if (UltraEdit.document.length > 0) {  // Is any file opened?
                    
                       // Set number of lines per block and number of blocks per file here!
                       // Number of lines per file = nLinesPerBlock * nBlocksPerFile
                       var nLinesPerBlock = 100000;
                       var nBlocksPerFile = 5;
                       var bCopyFirstLine = false;   // Copy first line into every file.
                    
                       var nNextLineNum = nLinesPerBlock + 1;
                       var nFileCount = 0;
                    
                       // Define the environment for the script.
                       UltraEdit.insertMode();
                       if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
                       else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
                       UltraEdit.document[0].hexOff();
                       // Move caret to top of the file.
                       UltraEdit.document[0].top();
                    
                       if (bCopyFirstLine) {
                          // Code to copy first line of file into user clipboard 8 for pasting
                          // it later into every file created on splitting the large file.
                          UltraEdit.selectClipboard(8);
                          UltraEdit.document[0].selectLine();
                          UltraEdit.document[0].copy();
                          UltraEdit.document[0].cancelSelect();
                          UltraEdit.document[0].gotoLine(2,1);
                          nNextLineNum++;
                       }
                       UltraEdit.selectClipboard(9);
                    
                       // Quick and dirty solution to get file name without extension
                       // and the file extension. Does not work for all file names.
                       var nLastPoint = UltraEdit.document[0].path.lastIndexOf('.');
                       if (nLastPoint < 0) nLastPoint = UltraEdit.document[0].path.length;
                       var sFileName = UltraEdit.document[0].path.substr(0,nLastPoint) + '_';
                       var sFileExt = UltraEdit.document[0].path.substr(nLastPoint);
                    
                       while (1) {
                          UltraEdit.document[0].gotoLineSelect(nNextLineNum,1);
                          if (!UltraEdit.document[0].isSel()) break;
                    
                          // Copy the selected block to user clipboard 9.
                          UltraEdit.document[0].copy();
                          UltraEdit.document[0].cancelSelect();
                    
                          UltraEdit.newFile();
                    
                          // New files are always created with a temporary file independent on
                          // what is selected in configuration for usage of temporary files as
                          // a new file is not yet saved with a name in a specified directory.
                          // This means also the undo feature is always enabled for new files.
                    
                          // Save the still empty new file with right name, close the file and
                          // immediately open it again. If option "Open file without temp file
                          // but NO Prompt" is selected at "Advanced - Configuration - File
                          // Handling - Temporary Files" with a threshold value of 0, the new
                          // file is opened now without usage of a temporary file which means
                          // without Undo feature enabled for this file making it faster
                          // copying and pasting large blocks into this new file.
                    
                          nFileCount++;
                          var sFileNameWithPath = sFileName + nFileCount + sFileExt;
                          UltraEdit.saveAs(sFileNameWithPath);
                          UltraEdit.closeFile(UltraEdit.activeDocument.path,2);
                          UltraEdit.open(sFileNameWithPath);
                    
                          if (bCopyFirstLine) {
                             // Paste the first line first into the new file.
                             UltraEdit.selectClipboard(8);
                             UltraEdit.activeDocument.paste();
                             UltraEdit.selectClipboard(9);
                          }
                          // Paste the copied block into the new file.
                          UltraEdit.activeDocument.paste();
                    
                          // Copy up to nBlocksPerFile - 1 (here 4) more blocks from input
                          // file to current new file. With the values set above each new
                          // file then contains 500,000 lines if input file has enough
                          // lines remaining.
                          for (var nBlock = 1; nBlock < nBlocksPerFile; nBlock++ )
                          {
                             nNextLineNum += nLinesPerBlock;
                             UltraEdit.document[0].gotoLineSelect(nNextLineNum,1);
                             if (!UltraEdit.document[0].isSel()) break;
                             UltraEdit.document[0].copy();
                             UltraEdit.document[0].cancelSelect();
                             UltraEdit.activeDocument.paste();
                          }
                    
                          // Close the file with saving although if empty file was really
                          // opened without usage of a temporary file each paste was directly
                          // written to storage media and therefore the file does not need
                          // to be explicitly saved because everything is saved already.
                          UltraEdit.closeFile(UltraEdit.activeDocument.path,1);
                          nNextLineNum += nLinesPerBlock;
                       }
                       UltraEdit.document[0].top();
                    
                       // Clear the used user clipboards and reselect system clipboard.
                       UltraEdit.clearClipboard();
                       if (bCopyFirstLine) {
                          UltraEdit.selectClipboard(8);
                          UltraEdit.clearClipboard();
                       }
                       UltraEdit.selectClipboard(0);
                       UltraEdit.messageBox(nFileCount + " files created.");
                    }
                    
                    This script copies just 100,000 lines at once via clipboard to the output files. But up to 5 blocks of 100,000 lines each are now copied into one output file before the next output file is created. This has the advantage that UltraEdit needs a smaller free memory block in accessible RAM than when trying to copy 500,000 lines at once from the file to split into each output file.

                    But there is one problem left if the output files should be several hundred MB. A new file is always created with usage of a temporary file. This means copying several large blocks results in allocating several large blocks in memory for the undo feature, which could easily result again in an out of memory condition.

                    But there is a workaround for this problem.

                    The configuration option Open file without temp file but NO Prompt must be selected at Advanced - Configuration - File Handling - Temporary Files and 0 must be set for Threshold for above (KB). The script now saves each new file immediately after creation, closes it and re-opens the just created output file. The output file is then opened without usage of a temporary file, which also means with the undo feature disabled for this file. Copying and pasting the up to 5 x 100,000 lines is done without recording anything for the undo feature.

                    Additionally, on final close of each output file, UltraEdit does not need to copy the large amount of data from a temporary file to the final location of the output file, as there is no temporary file. This should make splitting a huge file into several still very large files faster and decrease total script execution time.
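How nLinesPerBlock and nBlocksPerFile work together can be sketched without UltraEdit as a plan of line ranges. The function name planSplit and the returned data structure are hypothetical and serve only as illustration. The sketch ignores the bCopyFirstLine option and assumes the total number of lines is known in advance, which the script itself does not need.

```javascript
// Hypothetical planning helper, not part of the posted script: returns for
// every output file the list of blocks (first and last line number,
// 1-based, inclusive) which would be copied into that file.
function planSplit(nTotalLines, nLinesPerBlock, nBlocksPerFile) {
   if (nLinesPerBlock < 1 || nBlocksPerFile < 1) {
      throw new Error("nLinesPerBlock and nBlocksPerFile must be at least 1");
   }
   var aFiles = [];
   var nFirst = 1;
   while (nFirst <= nTotalLines) {
      var aBlocks = [];
      // Each copy and paste step handles only nLinesPerBlock lines, but up
      // to nBlocksPerFile blocks are collected in the same output file.
      for (var nBlock = 0; nBlock < nBlocksPerFile && nFirst <= nTotalLines; nBlock++) {
         var nLast = Math.min(nFirst + nLinesPerBlock - 1, nTotalLines);
         aBlocks.push({ first: nFirst, last: nLast });
         nFirst = nLast + 1;
      }
      aFiles.push(aBlocks);
   }
   return aFiles;
}
```

For example, planSplit(1200000, 100000, 5) describes three output files: two files with 5 blocks each and one file with the remaining 2 blocks.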

                      Jun 01, 2015#10

                      Mofi,

                      Just wanted to say Thanks for the script. :D

                      I tried it on one of our moderate-size data sets: 4,517,595 data rows with an average row length of 1,500 characters, and I left the row count at half a million. It ran and produced 10 files in 22 minutes.

                        Jan 24, 2018#11

                        Mofi,

                        I am running the same script on a data set that is 2,680,905 rows with 2480 characters. The text file I am trying to process is 6.5 GB. I left the row count at 500,000 and it produces 6 files in a few seconds but only the 6th file has any data in it. I can get it to work when I run at 100,000 lines but would rather not have 27 files.

                        Is there something I am missing?

                          Jan 25, 2018#12

                          Well, the script code itself obviously works. The problem could be that more than 1 GB of data is selected and must be copied to the clipboard. This requires a free block in RAM of that size. If the 6.5 GB file is UTF-8 encoded and UltraEdit detected that the file is UTF-8 encoded, then a free block in RAM of double the number of selected characters is required for copying the Unicode data into clipboard. I suppose that, independent of how much total free RAM your machine has, there is no such large free RAM block available for UltraEdit. Perhaps the script works when Windows is restarted and the script is executed before the RAM is fragmented by starting and exiting applications.

                          What do you have configured for nLinesPerBlock and nBlocksPerFile? (Using the last script posted by me here is highly recommended for such a huge file which is split into files which are also huge.)

                          How about dividing nLinesPerBlock by 5 and multiplying nBlocksPerFile by 5, to copy multiple smaller blocks from the huge file to the new files?
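The reasoning above can be put into numbers with a small sketch. Both function names are hypothetical. The estimate assumes one byte per character for ASCII/ANSI data, ignores line terminators, and takes the factor 2 for Unicode data from the explanation above; it is only a rough estimate and not a statement about how UltraEdit allocates memory internally.

```javascript
// Hypothetical estimate of the memory needed to copy one block at once.
function estimateBlockBytes(nLines, nAvgCharsPerLine, bUnicode) {
   return nLines * nAvgCharsPerLine * (bUnicode ? 2 : 1);
}

// Hypothetical helper: divides nLinesPerBlock by nFactor and multiplies
// nBlocksPerFile by nFactor, so the number of lines per file stays the same.
function rebalance(nLinesPerBlock, nBlocksPerFile, nFactor) {
   if (nFactor < 1 || nLinesPerBlock % nFactor !== 0) {
      throw new Error("nLinesPerBlock must be divisible by nFactor");
   }
   return {
      nLinesPerBlock: nLinesPerBlock / nFactor,
      nBlocksPerFile: nBlocksPerFile * nFactor
   };
}
```

With 500,000 lines of 2,480 characters, estimateBlockBytes returns 1,240,000,000 bytes for ANSI data and twice that for Unicode data. After rebalance(500000, 1, 5), a block has 100,000 lines and needs one fifth of that memory, while each output file still gets 500,000 lines.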

                            Jan 25, 2018#13

                            marc.a.branham wrote: I am running the same script on a data set that is 2,680,905 rows with 2480 characters. The text file I am trying to process is 6.5 GB. I left the row count at 500,000 and it produces 6 files in a few seconds but only the 6th file has any data in it. I can get it to work when I run at 100,000 lines but would rather not have 27 files.

                            Is there something I am missing?

                            Unfortunately I ran into the same issue. You are going to be limited by Windows' 2 GB clipboard size, and as a result you are going to have multiple files. The longer the data rows, the fewer lines per file you will have. I have to adjust the line count for every large file I split; currently I run between 125,000 and 250,000 lines per file, using increments of 25,000.
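Adjusting the line count to the row length can be sketched as a small calculation. The function name suggestLinesPerFile and the idea of a fixed byte budget per copy operation are assumptions for illustration. A budget value which works must be found by trial on the machine in question, as described above.

```javascript
// Hypothetical helper: largest line count, rounded down to a multiple of
// nStep, for which the estimated block size stays within nMaxBytes.
function suggestLinesPerFile(nAvgBytesPerLine, nMaxBytes, nStep) {
   if (nAvgBytesPerLine < 1 || nStep < 1) {
      throw new Error("nAvgBytesPerLine and nStep must be at least 1");
   }
   var nMaxLines = Math.floor(nMaxBytes / nAvgBytesPerLine);
   return Math.floor(nMaxLines / nStep) * nStep;
}
```

With an assumed budget of 300 MB (300 * 1024 * 1024 bytes), rows of 1,500 bytes on average and increments of 25,000, the function returns 200,000 lines. Longer rows give a smaller line count for the same budget.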

                              Jan 25, 2018#14

                              Mofi wrote: What do you have configured for nLinesPerBlock and nBlocksPerFile? (Using the last script posted by me here is highly recommended for such a huge file which is split into files which are also huge.)

                              How about dividing nLinesPerBlock by 5 and multiplying nBlocksPerFile by 5, to copy multiple smaller blocks from the huge file to the new files?

                              I was using 500,000 for nLinesPerBlock and 1 for nBlocksPerFile. I didn't quite understand what nBlocksPerFile represented, but following your advice worked and allowed me to create 6 files as opposed to the 27 I was getting with 100,000 for nLinesPerBlock and 1 for nBlocksPerFile.

                              Thank you guys for your help. 

                                Apr 03, 2020#15

                                Mofi,

                                I'm using an older version of UE (15.10.0.1018) on a Windows 2012 R2 server, and I'm trying to split a large file of about 1.5 GB. I've tried using the SplitFile.js listed from 3/27/12 above and it fails due to the reasons you stated. I basically want to do the same thing and am not familiar with JavaScript. Is it possible for you to adapt the script from 3/27/12 so that it will work with version 15? If not, I'll see if I can get hold of a JavaScript book and modify it myself.

                                Thanks again.
