Split text files by keyword

671
Advanced User

    Oct 11, 2019#1

    It occurs to me that I am not the first to want to split text files by keyword.
    Are there any solutions floating around out there?
    I did search the forum but surprisingly found nothing.
    I have found a tool called TEXTWEDGE on SourceForge which seems to work, so in a sense I have a solution already.
    I would love to be able to do this from within UE though.
    If starting from scratch, I am thinking a script could do such a thing?
    Many thanks for any replies

    6,602547
    Grand Master

      Oct 13, 2019#2

      This is quite easy with an UltraEdit script or an UltraEdit macro. The forum search finds 147 matches for +split* +files in the two forums Macros and Scripts alone. A web search with the term site:forums.ultraedit.com split file also finds lots of UltraEdit forum topics about file splitting with a macro or script. So there are already many customized file-splitting macros and scripts.

      The general issue with file splitting is that a general-purpose script is not what most users need. Every user has slightly different requirements on how a large file should be split into smaller files, how the smaller files should be named, etc. How a large file can be split with a macro or script also depends on the configuration settings for handling large files. It makes a difference whether line counting is enabled in the configuration even when opening a very large file in UltraEdit, or whether this setting is disabled to work more quickly with very large files of hundreds of MB or even several GB.

      Let us know your requirements for splitting a file based on a string entered when starting the script or macro. Please define all requirements as precisely as possible.
      • Should the string be interpreted as a literal string or as a regular expression?
      • Which regular expression engine should be used if the entered string is interpreted as a regular expression?
      • Can the string be found anywhere in the file or only within a specific context, like at the beginning of a line, at the end of a line, or between other strings?
      • Is the line containing the string the last line of the block to write into a new file or the first line of the next block?
      • What should the file names of the new files be? Simply File1.txt, File2.txt, or a name derived from the file being split?
      • Should a leading 0 be used in the file name to the left of the numbers 1 to 9 in case more than 9 files are created during the split operation?
      • Or should the new files be saved with a file name that depends on a string within the block, as many customized split scripts/macros do?
      • Should the line ending type of the new files be fixed to DOS or UNIX, or should it be derived from the line ending type of the file to split?
      • Should the character encoding of the new files be fixed to ANSI, UTF-8 or UTF-16, or derived from the file to split?
      • If the file name or file path is derived in any way from the file to split, what should be used when the file to split is a new, unsaved file?
      • etc.
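
To illustrate the first and third questions: if the entered string is meant literally but is searched with a regular expression engine, its special characters have to be escaped first, and a line-start anchor restricts matches to the beginning of a line. Here is a minimal sketch in plain JavaScript that uses no UltraEdit API. The function name lineStartPattern is made up for this example.

```javascript
// Hypothetical helper: build a regular expression that matches a literal
// keyword only at the beginning of a line.
function lineStartPattern(keyword) {
  // Escape every character with a special meaning in a regular expression.
  var escaped = keyword.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // "m" lets ^ match at the start of every line, not just the start of the text.
  return new RegExp("^" + escaped, "m");
}

var pattern = lineStartPattern("<hostname>disp");
console.log(pattern.test("Text output\n<hostname>disp command1")); // true
console.log(pattern.test("Text output <hostname>disp command1"));  // false
```

Whether the escaping is needed at all depends on the answer to the first question; a purely literal search would not need it.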
      Best regards from an UC/UE/UES for Windows user from Austria

        Oct 15, 2019#3

        Thanks @Mofi, sorry for the delay. I need to absorb what you have said and do a few better searches. I will come back and post here in due course.

          Oct 30, 2019#4

          Greetings Mofi and UE Fans

          I have assembled a list of Macro (surprisingly a lot of posts) and Script posts of interest which I am trying to understand.
          By answering Mofi's questions above I have ended up writing a more comprehensive spec below.
          Any comments or responses greatly appreciated as always.

          Target specification for splitting a file by keyword/string
          - Keyword/String by which the file is split should be a literal
          - The format of the source file is such that the Keyword/String will occur at the beginning of a line.
          - The source file is a series of outputs of show commands from the CLI of a Huawei/Cisco/Juniper router/switch.
          - The string will be of the form <hostname> show command
          - The line containing the string should be the first line of the next file
          - Line ending type of the new files should be fixed DOS, preferably.
          - File names of the new files would ideally be a combination of the source filename as a prefix and the remainder of the line containing the string. (See example below)
          - The angle brackets < and > would need to be dropped from the filename
          - Prefer a leading zero (0) in output filenames preceding the numbers 1 to 9, regardless of whether 9 or fewer files are created during the split operation. So numbering would always be 2 digits, for single-digit and multi-digit numbers alike.
          - Character encoding of the new files: not sure. Will read up on what this means. Maybe derived from the file to split?

          Macro Forum Interesting posts
          http://forums.ultraedit.com/how-to-split-a-single-file-into-multiple-text-file-t17844.html
          http://forums.ultraedit.com/how-to-split-a-large-file-into-smaller-files-depen-t15397-s15.html
          http://forums.ultraedit.com/macro-to-split-a-large-file-and-save-the-new-ones--t15399.html
          http://forums.ultraedit.com/how-to-split-a-large-file-into-smaller-files-depen-t15397.html
          http://forums.ultraedit.com/splitting-based-on-content-t3710.html
          http://forums.ultraedit.com/a-question-about-splitting-files-t3726.html
          http://forums.ultraedit.com/splitting-text-file-t3609.html
          http://forums.ultraedit.com/spliting-files-saving-but-line-starts-with-same-wo-t2393.html

          Script Forum Interesting posts
          http://forums.ultraedit.com/how-to-split-a-single-file-into-multiple-several-f-t17170.html
          http://forums.ultraedit.com/how-to-split-a-file-with-a-script-based-on-a-speci-t6704.html

          Source File Format
          Filename format =
          YYYY-MM-DD hhhmmmss hostname.log
          e.g. 2019-10-14 06h51m59 hostname.log

          File content format
          This would be a series of commands carried out at CLI as shown below (NOTE in preview the full file does not show up here)
          <hostname>disp command1
          Text output
          Text output
          Text output
          Text output
          <hostname>disp command2
          Text output
          Text output
          Text output
          Text output
          <hostname>disp command3
          Text output
          Text output
          Text output
          Text output
          <hostname>disp command4
          Text output
          Text output
          Text output
          Text output
          <hostname>disp command5
          Text output
          Text output
          Text output
          Text output

          Destination Files Format
          The desired output for the case above would be 5 files, each named with the original filename followed by the command in question as a suffix.
          For the example above:

          File 1 name = 2019-10-14 06h51m59 hostname disp command1.log
          File1 contents:
          <hostname>disp command1
          Text output
          Text output
          Text output
          Text output
          File 2 name = 2019-10-14 06h51m59 hostname disp command2.log
          File2 contents:
          <hostname>disp command2
          Text output
          Text output
          Text output
          Text output
          File 3 name = 2019-10-14 06h51m59 hostname disp command3.log
          File3 contents:
          <hostname>disp command3
          Text output
          Text output
          Text output
          Text output
          File 4 name = 2019-10-14 06h51m59 hostname disp command4.log
          File4 contents:
          <hostname>disp command4
          Text output
          Text output
          Text output
          Text output
          File 5 name = 2019-10-14 06h51m59 hostname disp command5.log
          File5 contents:
          <hostname>disp command5
          Text output
          Text output
          Text output
          Text output
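
The naming rule in this example could be sketched roughly as follows. This is plain JavaScript with no UltraEdit calls, and buildTargetName is a hypothetical helper. It makes two assumptions: the prompt is everything between < and > at the start of the line, and characters not allowed in Windows file names are replaced by an underscore.

```javascript
// Hypothetical helper: combine the source file name with the command
// found on the first line of a block.
function buildTargetName(sourceName, firstLine) {
  // Split "name.ext" at the last dot; a name without a dot has no extension.
  var dot = sourceName.lastIndexOf(".");
  var base = dot > 0 ? sourceName.substring(0, dot) : sourceName;
  var ext = dot > 0 ? sourceName.substring(dot) : "";

  // Drop the leading <...> prompt and surrounding tabs or spaces.
  var command = firstLine.replace(/^<[^>]*>[\t ]*/, "").replace(/[\t ]+$/, "");
  // Replace characters that are not allowed in a Windows file name.
  command = command.replace(/[\x00-\x1F"*\/:<>?\\|]/g, "_");

  return command.length ? base + " " + command + ext : base + ext;
}

console.log(buildTargetName("2019-10-14 06h51m59 hostname.log",
                            "<hostname>disp command1"));
// 2019-10-14 06h51m59 hostname disp command1.log
```

A command that contains a slash, such as an interface name, would end up with an underscore in the file name under these assumptions.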

            Oct 30, 2019#5

            That is a very good requirements specification for the coding task. The better solution in this case is an UltraEdit script.

            There are just two requirements unclear regarding the format of the file names of the files to create.

            2019-10-14 06h51m59 hostname comes from the name of the file to split. disp command1 comes from the line containing <hostname> at the beginning of the line. So there is no guarantee that a new file does not get the same file name as a file created before, which happens if the source file contains the same string after <hostname> two or more times. The example file names also contain no automatically incremented number with a leading 0 for the files 1 to 9. Such an incremented number should be somewhere in the file name, best between 2019-10-14 06h51m59 hostname and disp command1, to make sure each file to create has a unique file name.
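
The effect of such a number can be sketched in a few lines of plain JavaScript. The helper name numberedNames and the exact position of the number are assumptions for illustration only, not a fixed part of the requirements.

```javascript
// Hypothetical helper: put a two-digit running number between the source
// file name and the command so that repeated commands get distinct names.
function numberedNames(base, commands, ext) {
  var names = [];
  for (var i = 0; i < commands.length; i++) {
    var n = i + 1;
    var number = (n < 10 ? "0" : "") + n; // leading 0 for 1 to 9
    names.push(base + " " + number + " " + commands[i] + ext);
  }
  return names;
}

// The same command twice no longer produces the same file name.
var names = numberedNames("2019-10-14 06h51m59 hostname",
                          ["disp command1", "disp command1"], ".log");
console.log(names[0]); // 2019-10-14 06h51m59 hostname 01 disp command1.log
console.log(names[1]); // 2019-10-14 06h51m59 hostname 02 disp command1.log
```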

              Oct 30, 2019#6

              Ah, of course, this would happen. Thank you Mofi.
              I should have specified a 2-digit number as you say.
              If/when I get that far, I would, if possible, merely increment the last 2 digits of the timestamp in the existing source filename.
              This is in fact the seconds field (ss) at the end of the timestamp format below:
              YYYY-MM-DD hhhmmmss hostname.log
              That is the 59 in the sample source filename
              2019-10-14 06h51m59 hostname.log
              So the output files in the example above would then be
              File 1 name = 2019-10-14 06h51m59 hostname disp command1.log
              File 2 name = 2019-10-14 06h51m60 hostname disp command2.log
              File 3 name = 2019-10-14 06h51m61 hostname disp command3.log
              File 4 name = 2019-10-14 06h51m62 hostname disp command4.log
              File 5 name = 2019-10-14 06h51m63 hostname disp command5.log

              If that 2-digit field loops as 97 > 98 > 99 > 00 > 01, that would allow for 100 duplicates, which would be enough I think.
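
The wrap-around described here amounts to counting modulo 100. A minimal sketch in plain JavaScript, with wrappedSecond as a made-up name:

```javascript
// Hypothetical helper: add the file index to the seconds value taken from
// the source file name and wrap from 99 back to 00.
function wrappedSecond(startSecond, fileIndex) {
  var n = (startSecond + fileIndex) % 100;
  return (n < 10 ? "0" : "") + n; // always two digits
}

console.log(wrappedSecond(59, 0));  // 59
console.log(wrappedSecond(59, 1));  // 60
console.log(wrappedSecond(59, 40)); // 99
console.log(wrappedSecond(59, 41)); // 00
console.log(wrappedSecond(59, 42)); // 01
```

Values such as 60 or 61 are no longer valid seconds, so the field acts purely as a counter, and the 101st file would repeat the number of the first.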

                Oct 31, 2019#7

                Here is the script for the specific file splitting task as described above. The code of the functions GetFileExt and GetFileName must be copied into this script before it can be used on a file.

                If the script is executed on a new, unnamed file, it saves the blocks into directory C:\Temp with file name Part01*.txt, Part02*.txt, etc. The directory C:\Temp must exist or the script fails to save the files.

                function SaveBlock ()
                {
                   // Create a new file with character encoding and line ending type
                   // as defined in configuration of UltraEdit/UEStudio.
                   UltraEdit.newFile();
                
                   // Convert the empty file to same encoding as file to split.
                   if (g_nEncoding == 0)   // New file should not be Unicode encoded.
                   {
                      if (UltraEdit.activeDocument.encoding == 65001)
                      {
                         UltraEdit.activeDocument.UTF8ToASCII();
                      }
                      else if ((UltraEdit.activeDocument.encoding == 1200) ||
                               (UltraEdit.activeDocument.encoding == 1201))
                      {
                         UltraEdit.activeDocument.unicodeToASCII();
                      }
                   }
                   else if (g_nEncoding == 1) // New file should be UTF-8 Unicode encoded.
                   {
                      if ((UltraEdit.activeDocument.encoding == 1200) ||
                          (UltraEdit.activeDocument.encoding == 1201))
                      {
                         UltraEdit.activeDocument.unicodeToASCII();
                         UltraEdit.activeDocument.ASCIIToUTF8();
                      }
                      else if (UltraEdit.activeDocument.encoding != 65001)
                      {
                         UltraEdit.activeDocument.ASCIIToUTF8();
                      }
                   }
                   else  // New file should be UTF-16 LE Unicode encoded.
                   {
                      if (UltraEdit.activeDocument.encoding == 65001)
                      {
                         UltraEdit.activeDocument.UTF8ToASCII();
                         UltraEdit.activeDocument.ASCIIToUnicode();
                      }
                      else if (UltraEdit.activeDocument.encoding != 1200)
                      {
                         UltraEdit.activeDocument.ASCIIToUnicode();
                      }
                   }
                
                   // Make sure the new file uses DOS/Windows line endings.
                   UltraEdit.activeDocument.unixMacToDos();
                
                   // Paste the block from user clipboard 9 into the new file.
                   UltraEdit.activeDocument.paste();
                
                   // Get character position of first new line character in clipboard.
                   var nEndOfFirstLine = UltraEdit.clipboardContent.search(/[\r\n]/);
                   if (nEndOfFirstLine < 0)   // For safety reasons, should be never true.
                   {
                      nEndOfFirstLine = UltraEdit.clipboardContent.length;
                   }
                
                   // Get first line in clipboard.
                   var sFirstLine = UltraEdit.clipboardContent.substring(0,nEndOfFirstLine);
                
                   // Get the string after <hostname> and 0 or more tabs or spaces to end of line.
                   var sVariableNamePart = sFirstLine.replace(/<hostname>[\t ]*(.+)$/,"$1");
                
                   // Is there anything at all after <hostname>?
                   if (sVariableNamePart == sFirstLine)
                   {
                      sVariableNamePart = ""; // No, there is no variable file name part.
                   }
                   else
                   {
                      // Replace all characters not allowed in a file name by an underscore.
                      sVariableNamePart = sVariableNamePart.replace(/[\x00-\x1F\"*\/:<>?\\|]/g,"_");
                      // Add a space at beginning of variable name part.
                      sVariableNamePart = " " + sVariableNamePart;
                   }
                
                   var sFileName;
                   var sFileNumber;
                   var nFileNumber = -1;
                
                   // Get second from file name if the file name starts
                   // with date/time in the format YYYY-MM-DD HHhmmmss.
                   var sSecond = g_sFileName.replace(/^.*\\[12][0-9]{3}-[01][0-9]-[0-3][0-9] [0-2][0-9]h[0-5][0-9]m([0-6][0-9]).*$/,"$1");
                   if (sSecond != g_sFileName)
                   {
                      // Convert the second to an integer and add current file count value.
                      nFileNumber = parseInt(sSecond,10) + g_nFileCount;
                      // If the value is greater than 99, wrap around so that 99 is followed by 00.
                      if (nFileNumber > 99) nFileNumber %= 100;
                      // Convert the file number to a string using decimal system.
                      sFileNumber = nFileNumber.toString(10);
                      g_nFileCount++;   // Increment the file counter by one.
                   }
                   else
                   {
                      g_nFileCount++;   // Increment the file counter by one.
                      // Convert the file number to a string using decimal system.
                      sFileNumber = g_nFileCount.toString(10);
                   }
                
                   // Prepend the file number string with one or more leading zeros if
                   // it has less digits than the predefined string g_sLeadingZeros.
                   if (sFileNumber.length < g_sLeadingZeros.length)
                   {
                      sFileNumber = g_sLeadingZeros.substr(sFileNumber.length) + sFileNumber;
                   }
                
                   if (nFileNumber < 0)
                   {
                      // Just append the file number to file name string.
                      sFileName = g_sFileName + sFileNumber;
                   }
                   else
                   {
                      // Replace the second in file name by the file number string.
                      sFileName = g_sFileName.replace(/^(.*\\[12][0-9]{3}-[01][0-9]-[0-3][0-9] [0-2][0-9]h[0-5][0-9]m)[0-6][0-9]/,"$1"+sFileNumber);
                   }
                
                   sFileName += sVariableNamePart + g_sFileExt;
                   UltraEdit.saveAs(sFileName);
                   UltraEdit.closeFile(UltraEdit.activeDocument.path,2);
                }
                
                
                if (UltraEdit.document.length > 0)  // Is any file opened?
                {
                   // Define a string for leading zeros in file name. The number
                   // of 0s in this string determines the minimum number of the
                   // digits of the incremented number in the file names.
                   var g_sLeadingZeros = "00";
                
                   var g_nFileCount = 0;  // Counts the files created by this script.
                
                   // Get document index of active file.
                   var nDocIndex = UltraEdit.activeDocumentIdx;
                
                   var g_sFileName = GetFileName(nDocIndex,false,true);
                   var g_sFileExt = GetFileExt(nDocIndex,true);
                
                   // If the active file has no file name and path, use "Part" as file name
                   // and store the files in the directory C:\Temp which must exist.
                   if (!g_sFileName.length) g_sFileName = "C:\\Temp\\Part";
                   // If the active file has no file extension, use ".txt" as file extension.
                   if (!g_sFileExt.length) g_sFileExt = ".txt";
                
                   var g_nEncoding = 0; // Use by default "ANSI" encoding for new files.
                
                   if (UltraEdit.activeDocument.encoding == 65001)          // UTF-8
                   {
                      g_nEncoding = 1;  // Use "UTF-8" encoding for new files.
                   }
                   else if ((UltraEdit.activeDocument.encoding == 1200) ||  // UTF-16 LE
                            (UltraEdit.activeDocument.encoding == 1201))    // UTF-16 BE
                   {
                      g_nEncoding = 2;  // Use "UTF-16 LE" encoding for new files.
                   }
                
                   var nBlockBegin = 1; // Line number at beginning of current block
                   var nBlockEnd = 1;   // Line number at end of current block
                
                   // Perl regular expression search string to find beginning of next block.
                   var sSearchExp = "[\\r\\n]\\K(?=<hostname>)";
                
                   // Use user clipboard 9 for copying the data blocks.
                   UltraEdit.selectClipboard(9);
                
                   // Define environment for this script.
                   UltraEdit.insertMode();
                   if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
                   else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
                
                   // Move caret to first line column 2 for the first search
                   // because the first <hostname> at top of file must be ignored.
                   UltraEdit.activeDocument.gotoLine(1,2);
                
                   // Use the Perl regular expression engine for the searches.
                   UltraEdit.perlReOn();
                   UltraEdit.document[nDocIndex].findReplace.mode=0;
                   UltraEdit.document[nDocIndex].findReplace.matchCase=true;
                   UltraEdit.document[nDocIndex].findReplace.matchWord=false;
                   UltraEdit.document[nDocIndex].findReplace.regExp=true;
                   UltraEdit.document[nDocIndex].findReplace.searchDown=true;
                   UltraEdit.document[nDocIndex].findReplace.searchInColumn=false;
                
                   while (UltraEdit.document[nDocIndex].findReplace.find(sSearchExp))
                   {
                      nBlockEnd = UltraEdit.document[nDocIndex].currentLineNum;
                      if(nBlockEnd == nBlockBegin)  // For safety avoid an endless loop.
                      {
                         UltraEdit.document[nDocIndex].key("RIGHT ARROW");
                         continue;
                      }
                
                      // Select from end of block to beginning of the block to copy.
                      UltraEdit.document[nDocIndex].gotoLineSelect(nBlockBegin,1);
                      // Copy the selection to user clipboard 9.
                      UltraEdit.document[nDocIndex].copy();
                      // Discard the selection and move caret to first line
                      // second column of the next block to copy next.
                      UltraEdit.document[nDocIndex].gotoLine(nBlockEnd,2);
                      // The next block begins where the copied block ends.
                      nBlockBegin = nBlockEnd;
                      // Save this block into a new file.
                      SaveBlock();
                   }
                
                   if (g_nFileCount) // Was at least one block saved into a new file?
                   {
                      // Copy the last block into one more file.
                      UltraEdit.document[nDocIndex].gotoLine(nBlockBegin,1);
                      UltraEdit.document[nDocIndex].selectToBottom();
                      UltraEdit.document[nDocIndex].copy();
                      SaveBlock();
                      UltraEdit.outputWindow.write("Split file into " + g_nFileCount.toString() + " blocks.");
                      UltraEdit.clearClipboard();
                   }
                   else
                   {
                      UltraEdit.outputWindow.write("Nothing found to split the active file.");
                   }
                
                   UltraEdit.selectClipboard(0);
                   UltraEdit.document[nDocIndex].top();
                   UltraEdit.outputWindow.showWindow(true);
                }
                
                I tested the script (not fully) with UltraEdit v22.20 and v26.20.

                Without code changes, it does not work in older versions of UltraEdit that do not support all methods and properties used in this script.

                The script is written to split a file of any size, even several GB, as long as line counting is enabled, because the script uses line numbers. But if the files to split are always smaller than 20 MB, a different approach would finish faster: load the entire file content once into the memory of the JavaScript core and process the data there. Every UltraEdit window update that a script avoids reduces the script execution time.
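
The in-memory approach for small files could look roughly like the following sketch. It is plain JavaScript working on a string and is not a replacement for the script above: splitIntoBlocks is a hypothetical helper, and getting the file content into the string and saving each block are left out on purpose.

```javascript
// Hypothetical helper: split a text held in memory into blocks, each block
// starting with a line that begins with the marker string.
function splitIntoBlocks(text, marker) {
  var lines = text.split(/\r\n|\r|\n/);
  // A text ending with a line break yields one empty trailing element.
  if (lines.length && lines[lines.length - 1] === "") lines.pop();

  var blocks = [];
  var current = [];
  for (var i = 0; i < lines.length; i++) {
    // A marker line closes the block collected so far and opens a new one.
    if (lines[i].indexOf(marker) === 0 && current.length) {
      blocks.push(current.join("\r\n") + "\r\n");
      current = [];
    }
    current.push(lines[i]);
  }
  if (current.length) blocks.push(current.join("\r\n") + "\r\n");
  return blocks; // every block uses DOS line endings
}

var blocks = splitIntoBlocks(
  "<hostname>disp command1\nText output\n<hostname>disp command2\nText output\n",
  "<hostname>");
console.log(blocks.length); // 2
```

As in the script above, anything before the first marker line ends up in the first block.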

                  Nov 02, 2019#8

                  Hi Mofi
                  I have no right to expect such a comprehensive reply and I am blown away by your help.
                  I will test it as soon as I have time and report results here.
                  Will also look at the code and see if I can work out what is being done.
                  Thank you Mofi, very grateful.