Split a file based on number of bytes while respecting line endings

johnmsch · Post #1 · May 18, 2024, 23:35 UTC

I've seen various posts about splitting files by line numbers, etc. but nothing about file size.
For example, I have a 2.7 GB text file that needs to be split into no-larger-than 10 MB chunks, preferably each chunk ending at the end of a line.
Any ideas?

Mofi · Post #2 · May 20, 2024, 11:53 UTC

That can be done with the following UltraEdit script. Please read at least the comments at top of the script.

/* Script Name:   SplitFileBasedOnBytes.js
   Creation Date: 2024-05-20
   Last Modified: 2024-05-20
   Copyright:     Copyright (c) 2024 by Mofi

This script splits up by default the active file based on the number
of bytes per file defined with the variable nBytesPerFile. The script
prompts the user for the number of bytes per file if the variable is
defined in this script with a value less than 1.

If the active file has the file extension JS, the script assumes that the
active file is the script file itself and splits up in this case the first
file in the list of opened files. If the first opened file has also the file
extension JS, the script splits up the second opened file if there is opened
a second file too. That makes it possible to open the file to split and the
script file in any order and run the active script for splitting up the other
opened file. It is advisable to add the script to the script list on using
it often for running it on active file from the script list without opening
the script file at all in UltraEdit or UEStudio.

There are supported:

1. Binary files which are all files opened in hex edit mode on script start.
2. Text files with ANSI encoding (= one byte per character).
3. Text files with UTF-8 Unicode encoding without or with a byte order mark.
4. Text files with UTF-16 Unicode encoding without or with a byte order mark.

The text files can have DOS/Windows, Unix/Linux or Mac < OS X line endings.

The created files have the same character encoding as the file to split
and have all also a byte order mark (BOM) if the file to split has a BOM.
The number of bytes of the BOM are considered on number of bytes per file.

The number of bytes per file is made even if the file to split is a UTF-16
encoded Unicode text file.

The script considers line endings and tries to let each created text file
end with a line termination by not copying bytes of characters into a new
file which are beyond the last line ending in the new file. But it is
possible that a new text file does not end with a line termination if the
number of bytes per file is lower than the length of a line in the text file
to split up into smaller chunks. The last created file always ends as the
file to split ends. There is never added a line ending in a created text file.

It is possible with number of bytes per file less than length of a line in
a text file that the bytes of a UTF-8 or UTF-16 encoded character or the
newline character(s) of a line termination are written into two different
files if there is no complete line termination in a newly created text file.
That means the last/first character of newly created text files are invalid
after the file splitting. But that can really occur only if the number of
bytes per file is less than the length of the longest line in a text file
to split.

The file to split must be a saved file because its file size property
is used by this script.

This function is copyright protected by Mofi for free usage by UE/UES
users. The author cannot be made responsible for any damage caused by
this function. You use it at your own risk. */

var nBytesPerFile = 0;     // Number of bytes per file


// === CreateNewBinaryFile ===================================================

// This function creates a new binary file and
// turns on the hex edit mode for the new file.

function CreateNewBinaryFile()
{
   UltraEdit.newFile();
   // Make sure the line ending type of the new file is DOS/Windows.
   UltraEdit.activeDocument.unixMacToDos();
   // Make sure the new file is an ANSI encoded file and not a UTF-8 or
   // UTF-16 encoded Unicode file to save the binary or text data correctly.
   if ((UltraEdit.activeDocument.encoding == 65001) &&
       (typeof(UltraEdit.activeDocument.UTF8ToASCII) == "function"))
   {
      UltraEdit.activeDocument.UTF8ToASCII();
   }
   else UltraEdit.activeDocument.unicodeToASCII();
   // The file must have at least one byte before the hex edit mode can
   // be turned on. A space is written into the file which is deleted
   // after turning on the hex edit mode for the new file.
   UltraEdit.activeDocument.write(" ");
   UltraEdit.activeDocument.hexOn();
   UltraEdit.activeDocument.gotoPos(0);
   UltraEdit.activeDocument.hexDelete(1);
}


// === GetFileIndex ==========================================================

// Posted at https://forums.ultraedit.com/viewtopic.php?f=52&t=4596#p26710
// Based on  https://forums.ultraedit.com/viewtopic.php?f=52&t=4571

function GetFileIndex (sFullNameOfFile)
{
   // Is the passed value not a string because simply nothing passed?
   if (typeof(sFullNameOfFile) != "string")
   {
      // With UltraEdit v16.00 and later there is a property which holds active document index.
      if (typeof(UltraEdit.activeDocumentIdx) == "number") return UltraEdit.activeDocumentIdx;
      sFullNameOfFile = UltraEdit.activeDocument.path;
   }
   else if (!sFullNameOfFile.length) // It is a string. Is the string empty?
   {
      if (typeof(UltraEdit.activeDocumentIdx) == "number") return UltraEdit.activeDocumentIdx;
      sFullNameOfFile = UltraEdit.activeDocument.path;
   }
   // Windows file systems are not case sensitive. So best make all file
   // names lowercase before comparing the name of the file to search
   // for with the names of the already opened files. Users of UEX should
   // use a case sensitive file name comparison and therefore don't need
   // toLowerCase() here and in the following loop.
   var sFileNameToCompare = sFullNameOfFile.toLowerCase();

   // Compare the name of the file to search for with the (lowercase) file
   // names of all already opened files. Return the document index of the
   // file already opened when found.
   for (var nDocIndex = 0; nDocIndex < UltraEdit.document.length; nDocIndex++)
   {
      if (UltraEdit.document[nDocIndex].path.toLowerCase() == sFileNameToCompare)
      {
         return nDocIndex;
      }
   }
   return -1; // This file is not open.
}

// === SplitFileBasedOnBytes =================================================

if (UltraEdit.document.length > 0)  // Is any file opened?
{
   // Determine the file to split.
   var nDocIndex = 0;
   // Does the active file not have the file extension JS?
   if (!UltraEdit.activeDocument.isExt("js"))
   {
      nDocIndex = GetFileIndex();   // Yes, split active file.
   }
   else
   {  // Does the first opened file not have the file extension JS?
      if (!UltraEdit.document[0].isExt("js"))
      {
         nDocIndex = 0;    // Yes, split first opened file.
      }
      // Is there a second file also opened in UE/UES?
      else if (UltraEdit.document.length > 1)
      {
         nDocIndex = 1;    // Yes, split second opened file.
      }
   }
   var oFileToSplit = UltraEdit.document[nDocIndex];

   // Get the file size and the fully qualified file name of file to split.
   var nFileSize = oFileToSplit.fileSize;
   var sFullName = oFileToSplit.path;

   while (nBytesPerFile < 1)
   {
      nBytesPerFile = UltraEdit.getValue("Please enter the number of bytes per file:",1);
   }

   // Is the file size of the file greater than the number of bytes per file?
   if (nFileSize > nBytesPerFile)
   {
      var bBinaryFile = false;   // Information about type of file to split
      var nByteOrderMark = 0;    // Number of bytes for the byte order mark
      // Number of bytes to copy with considering the bytes of the BOM
      var nBytesToCopy = nBytesPerFile;
      var sEncodingInfo = "";    // Information text for the encoding type
      var sFileTypeInfo = "";    // Information text for the file type
      var sLineTerminator = "";  // Hexadecimal string for line termination

      UltraEdit.insertMode();    // Make sure the insert mode is active.
      UltraEdit.ueReOn();        // Define the search engine to use.

      // Is the file to split displayed in text edit mode?
      if (!oFileToSplit.isHexModeOn())
      {
         // The file to split is interpreted as text file by this script.
         // Get the line terminator type of the text file and define
         // accordingly a string for searching hexadecimal in the
         // binary data of the file for the newline character(s).
         switch (oFileToSplit.lineTerminator)
         {
            case -1:
            case  1: sLineTerminator = "0A";
                     sFileTypeInfo = " Unix/Linux text";
                     break;
            case -2:
            case  2: sLineTerminator = "0D";
                     sFileTypeInfo = " Mac < OS X text";
                     break;
            default: sLineTerminator = "0D0A";
                     sFileTypeInfo = " DOS/Windows text";
                     break;
         }

         // Get the character encoding type of the text file and define
         // accordingly the number of bytes of a possibly existing byte
         // order mark if the text file is a UTF-8 or UTF-16 Unicode file.
         // The search string for the newline character(s) must be
         // adapted if the file is a UTF-16 encoded Unicode text file.
         switch (oFileToSplit.encoding)
         {
            case  1200: nByteOrderMark = 2;
                        sEncodingInfo = "UTF-16 LE";
                        if (sLineTerminator.length > 2)
                        {
                           sLineTerminator = "0D000A00";
                        }
                        else
                        {
                           sLineTerminator += "00";
                        }
                        break;
            case  1201: nByteOrderMark = 2;
                        sEncodingInfo = "UTF-16 BE";
                        if (sLineTerminator.length > 2)
                        {
                           sLineTerminator = "000D000A";
                        }
                        else
                        {
                           sLineTerminator = "00" + sLineTerminator;
                        }
                        break;
            case 65001: nByteOrderMark = 3;
                        sEncodingInfo = "UTF-8";
                        break;
            default:    sEncodingInfo = "ANSI";
                        break;
         }

         // Switch to hex edit mode and move caret to top of the file.
         oFileToSplit.hexOn();
         oFileToSplit.gotoPos(0);

         // Find out if the file is a Unicode encoded text file with a
         // byte order mark and copy the bytes of the byte order mark to
         // the user clipboard 8 for pasting them later into every new file.
         if (nByteOrderMark > 0)
         {
            // Make sure the number of bytes per file is even
            // if the file to split is a UTF-16 encoded Unicode file.
            if ((nByteOrderMark == 2) && (nBytesPerFile & 1))
            {
               nBytesPerFile--;
               nBytesToCopy = nBytesPerFile;
            }
            // The file size of the Unicode encoded file must be greater
            // than or equal to the number of bytes of the byte order mark
            // as otherwise the file cannot have a byte order mark at all.
            if (nFileSize >= nByteOrderMark)
            {
               UltraEdit.selectClipboard(8);
               oFileToSplit.gotoPosSelect(nByteOrderMark);
               oFileToSplit.copy();
            }
            else
            {
               nByteOrderMark = 0;
            }
         }

         // Could the text file have a byte order mark?
         if (nByteOrderMark > 0)
         {
            // Copy and paste just the first two or three bytes into a
            // new binary file and use a hexadecimal find for finding out
            // if these bytes are of a UTF-8, UTF-16 LE or UTF-16 BE BOM.
            CreateNewBinaryFile();
            UltraEdit.activeDocument.paste();
            UltraEdit.activeDocument.gotoPos(0);
            UltraEdit.activeDocument.findReplace.mode=0;
            UltraEdit.activeDocument.findReplace.matchCase=false;
            UltraEdit.activeDocument.findReplace.matchWord=false;
            UltraEdit.activeDocument.findReplace.regExp=false;
            UltraEdit.activeDocument.findReplace.searchDown=true;
            UltraEdit.activeDocument.findReplace.searchAscii=false;
            if (nByteOrderMark == 3)            // UTF-8 BOM?
            {
               if (UltraEdit.activeDocument.findReplace.find("EFBBBF"))
               {
                  sEncodingInfo += " with BOM";
               }
               else
               {
                  UltraEdit.clearClipboard();   // UTF-8 without BOM
                  nByteOrderMark = 0;
               }
            }
            else if (nByteOrderMark == 2)       // UTF-16 BOM?
            {
               if (UltraEdit.activeDocument.findReplace.find("FFFE"))
               {
                  sEncodingInfo += " with BOM"; // UTF-16 LE BOM
               }
               else if (UltraEdit.activeDocument.findReplace.find("FEFF"))
               {
                  sEncodingInfo += " with BOM"; // UTF-16 BE BOM
               }
               else
               {
                  UltraEdit.clearClipboard();   // UTF-16 without BOM
                  nByteOrderMark = 0;
               }
            }
            // Close the binary file without saving as it is no longer needed.
            UltraEdit.closeFile(UltraEdit.activeDocument.path,2);
            // Reduce the number of bytes to copy from the file to split
            // into each new binary file by the number of bytes of the BOM
            // which can be also zero if the Unicode text file has no BOM.
            nBytesToCopy -= nByteOrderMark;
         }
      }
      else  // The file to split is interpreted as binary file by this script.
      {
         sFileTypeInfo = "binary";
         bBinaryFile = true;
      }

      // Write information about kind of file to split into the output window.
      UltraEdit.outputWindow.write(sFullName + " detected as " +
                      sEncodingInfo + sFileTypeInfo + " file.");

      // Are there still bytes to copy after subtracting
      // the number of bytes of the byte order mark?
      if (nBytesToCopy > 0)
      {
         // Quick and dirty solution to get file name without extension
         // and the file extension. That does not work for all file names.
         // Use the functions in FileNameFunctions.js as described at
         // https://forums.ultraedit.com/viewtopic.php?f=52&t=6762
         // for getting the file name with path assigned to sFileName
         // and the file extension with the dot to sFileExt on using the
         // script on a file on which this simple code does not work.
         var nLastPointPos = sFullName.lastIndexOf('.');
         if (nLastPointPos < 0) nLastPointPos = sFullName.length;
         var sFileName = sFullName.substr(0,nLastPointPos) + '_';
         var sFileExt = sFullName.substr(nLastPointPos);

         // The following loop does the file splitting job.
         // The user clipboard 9 is used for copying the bytes.
         UltraEdit.selectClipboard(9);
         var bOneMoreFile = true;
         var nFileCount = 0;
         var nLineTerminator = sLineTerminator.length / 2;
         var nNextPosition = nByteOrderMark;
         do
         {
            // Move the caret to the beginning of the next block.
            oFileToSplit.gotoPos(nNextPosition);
            // Select the next block of bytes to copy to a new file if the
            // file size is still greater than the byte offset of the last
            // byte of the block to copy. Otherwise select all bytes to the
            // end of the file which will be the last copied block of bytes.
            if (nFileSize > (nNextPosition + nBytesToCopy))
            {
               oFileToSplit.gotoPosSelect(nNextPosition + nBytesToCopy);
            }
            else
            {
               oFileToSplit.selectToBottom();
               bOneMoreFile = false;
            }

            // Copy the block of selected bytes to user clipboard 9 and
            // create a new binary file which becomes the active file.
            oFileToSplit.copy();
            CreateNewBinaryFile();

            // If the file to split is a Unicode encoded text file with
            // a byte order mark, paste first into the new file the BOM
            // from user clipboard 8. Then paste the copied bytes.
            if (nByteOrderMark > 0)
            {
               UltraEdit.selectClipboard(8);
               UltraEdit.activeDocument.paste();
               UltraEdit.selectClipboard(9);
            }
            UltraEdit.activeDocument.paste();

            // Is the file to split a text file and this is not the
            // last block of the file to write into a new file?
            if (!bBinaryFile && bOneMoreFile)
            {
               // Define the parameters for searching upwards in the binary
               // new file with a hexadecimal search for the last occurrence
               // of a line ending if there is one at all in the new file.
               UltraEdit.activeDocument.findReplace.mode=0;
               UltraEdit.activeDocument.findReplace.matchCase=false;
               UltraEdit.activeDocument.findReplace.matchWord=false;
               UltraEdit.activeDocument.findReplace.regExp=false;
               UltraEdit.activeDocument.findReplace.searchDown=false;
               UltraEdit.activeDocument.findReplace.searchAscii=false;
               UltraEdit.activeDocument.findReplace.find(sLineTerminator);
               if (UltraEdit.activeDocument.isFound() && UltraEdit.activeDocument.currentPos)
               {
                  var nLastBytePosition = UltraEdit.activeDocument.currentPos + nLineTerminator;
                  // Determine the number of bytes to delete after last
                  // found line termination to the end of the new file.
                  var nBytesToDelete = nBytesPerFile - nLastBytePosition;
                  if (nBytesToDelete > 0) // Are there any bytes to delete?
                  {
                     // Yes, do that from current position at end of the
                     // last found line termination and reduce the next
                     // position by the number of deleted bytes to copy
                     // them once again as first bytes of the next block.
                     UltraEdit.activeDocument.gotoPos(nLastBytePosition);
                     UltraEdit.activeDocument.hexDelete(nBytesToDelete);
                     nNextPosition -= nBytesToDelete;
                  }
               }
            }

            // Increment the file counter and save the new file. Then close
            // the new file and increment the position in the file to split
            // by the number of last copied bytes to get start position of
            // the next block to copy if there is one more to copy at all.
            nFileCount++;
            UltraEdit.saveAs(sFileName + nFileCount + sFileExt);
            UltraEdit.closeFile(UltraEdit.activeDocument.path,2);
            nNextPosition += nBytesToCopy;
         }
         while (bOneMoreFile);

         // Clear the used user clipboards and reselect system clipboard.
         UltraEdit.clearClipboard();
         if (nByteOrderMark > 0)
         {
            UltraEdit.selectClipboard(8);
            UltraEdit.clearClipboard();
         }
         UltraEdit.selectClipboard(0);

         // Inform the user about the file splitting result.
         UltraEdit.outputWindow.write("Created " + nFileCount + " " +
                          sEncodingInfo + sFileTypeInfo + " files.");
      }

      // Move the caret to top of the split file and switch back to the text
      // edit mode if this file was displayed initially in text edit mode.
      oFileToSplit.gotoPos(0);
      if (!bBinaryFile)
      {
         oFileToSplit.hexOff();
      }
   }
   else  // The file cannot be split up into at least two parts as
   {     // it has not more bytes than defined with nBytesPerFile.
      UltraEdit.outputWindow.write(sFullName + " has with " + nFileSize +
                     " bytes not more than " + nBytesPerFile + " bytes.");
      UltraEdit.outputWindow.write("There is no file splitting to do.");
   }

   // Show the output window if it is currently not visible.
   if (!UltraEdit.outputWindow.visible)
   {
      UltraEdit.outputWindow.showWindow(true);
   }
}
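
For readers who want to see the core idea of the main loop without the UltraEdit specific commands, here is a minimal sketch in plain JavaScript. It is not part of the script above. The function name planChunks and its parameters are hypothetical, the file content is assumed to be available as an array of byte values, and a byte order mark is not handled. It only computes where each chunk would start and end.

```javascript
// Hypothetical helper, not part of SplitFileBasedOnBytes.js.
// bytes:      array (or Uint8Array) with the bytes of the file, no BOM
// maxBytes:   maximum number of bytes per chunk
// terminator: array with the bytes of one line termination, e.g. [13, 10]
// Returns an array of [start, end) byte offsets, one entry per chunk.
function planChunks(bytes, maxBytes, terminator) {
   var ranges = [];
   var start = 0;
   // As long as more than maxBytes remain, cut one more chunk.
   while (bytes.length - start > maxBytes) {
      var end = start + maxBytes;
      var cut = -1;
      // Search upwards in the block for the last complete line termination.
      // A termination at the very beginning of the block is ignored.
      for (var pos = end - terminator.length; pos > start; pos--) {
         var match = true;
         for (var i = 0; i < terminator.length; i++) {
            if (bytes[pos + i] !== terminator[i]) { match = false; break; }
         }
         if (match) { cut = pos + terminator.length; break; }
      }
      // Without a line termination the block is cut at maxBytes.
      if (cut > start) end = cut;
      ranges.push([start, end]);
      start = end;   // The bytes cut off become the start of the next chunk.
   }
   ranges.push([start, bytes.length]);   // The last chunk ends as the file ends.
   return ranges;
}

// "aa\nbbb\ncc\n" with at most 5 bytes per chunk and LF as line termination.
var aRanges = planChunks([97,97,10,98,98,98,10,99,99,10], 5, [10]);
// aRanges is [[0,3],[3,7],[7,10]]
```

The script reaches the same result in a different way: it pastes the full block into a new file in hex edit mode, runs a hexadecimal find upwards for the line termination and deletes the bytes after it.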

The script was developed and tested with UltraEdit for Windows / UEStudio v2024.0.0.28. The tests used small binary and small text files with various encodings and line endings, and an ANSI (Windows-1252) encoded text file with DOS/Windows line endings, a file size of 6,100,209,960 bytes (~5.68 GiB) and 265,226,520 lines, which was split up into twelve chunks of up to 524,288,000 bytes (500 MiB).
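
The value to assign to nBytesPerFile and the number of files to expect can be worked out with simple arithmetic. The following is a small sketch with two hypothetical helper functions which are not part of the script. It assumes that the 10 MB limit of the question means 10 MiB (10 * 1024 * 1024 bytes). Use 10000000 instead if decimal megabytes are meant.

```javascript
// Hypothetical helpers for choosing nBytesPerFile; not part of the script.
function mebibytesToBytes(nMiB) {
   return nMiB * 1024 * 1024;
}

// Lower bound for the number of created files. A text file can result in
// more files because each chunk is shortened to end with a line termination.
function minimumFileCount(nFileSize, nBytesPerFile) {
   return Math.ceil(nFileSize / nBytesPerFile);
}

// Value to assign at top of the script for chunks of at most 10 MiB.
var nBytesPerFile = mebibytesToBytes(10);   // 10485760

// The large test file: 6,100,209,960 bytes in chunks of 524,288,000 bytes.
var nFiles = minimumFileCount(6100209960, 524288000);   // 12
```

With nBytesPerFile set to a value of 1 or greater in the script, the prompt for the number of bytes is skipped.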

I did not test the script with older versions of UE/UES. It should also work with older versions. Please let me know if the script fails to split up a file correctly with an older version of UE/UES by posting a reply with information about the version of UE or UES used, the type of the file to split, its encoding, its line ending type, its file size and the number of bytes per file.
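
When reporting the encoding and line ending type of a file, it can help to know which byte sequence the script searches for as line termination. The sketch below mirrors the two switch statements of the script in one hypothetical function, lineTerminatorHex, which is not part of the script. The meaning of the numeric values is taken from the script itself: -1 or 1 for Unix/Linux, -2 or 2 for Mac < OS X, anything else for DOS/Windows, and the encodings 1200 and 1201 for UTF-16 LE and UTF-16 BE.

```javascript
// Hypothetical helper mirroring the switch statements of the script.
// nLineTerminator: value of the lineTerminator property of the document
// nEncoding:       value of the encoding property of the document
// Returns the hexadecimal search string for the line termination.
function lineTerminatorHex(nLineTerminator, nEncoding) {
   var sHex;
   switch (Math.abs(nLineTerminator)) {
      case 1:  sHex = "0A";   break;   // Unix/Linux
      case 2:  sHex = "0D";   break;   // Mac < OS X
      default: sHex = "0D0A"; break;   // DOS/Windows
   }
   // ANSI and UTF-8 use one byte per newline character.
   if (nEncoding != 1200 && nEncoding != 1201) return sHex;
   // UTF-16 uses two bytes per newline character. The null byte is
   // the second byte for little endian and the first for big endian.
   var sWide = "";
   for (var i = 0; i < sHex.length; i += 2) {
      var sByte = sHex.substr(i, 2);
      sWide += (nEncoding == 1200) ? sByte + "00" : "00" + sByte;
   }
   return sWide;
}

var sDosUtf16LE = lineTerminatorHex(0, 1200);    // "0D000A00"
var sUnixUtf8   = lineTerminatorHex(1, 65001);   // "0A"
```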

johnmsch · Post #3 · May 20, 2024, 19:37 UTC

@Mofi saves the day again!
Thank you sir