How to split a large file into smaller files depending on beginning of line?


    Jan 10, 2015#1

    Hello everyone!

    My text file is getting larger and larger every day. So I need to split it up into smaller files. :oops:

    The text file is about 10 MB in size. (There are no blank lines between the people.)

    Code: Select all

    1LINE_jimmy_home                  http://jimmy.blogspot.com/
    1LINE_jimmy_company               https://jimmy.company.blogspot.com/
    1LINE_jimmy_twi                   https://twtter.jimmy.com/
    2LINE_Sam_home                    http://sam.home.com/index.php
    2LINE_Sam_blog                    http://sam.blog.com 
    3LINE_Jane_work                   http://jane.company.net/floor2
    3LINE_Jane_twitter                http://twitter.jane.com/
    ...
    ...
    ...
    99999LINE_jenna_home              http://twitter.jenna.com/
    99999LINE_jenna_work              http://workjenna.com/
    =========================================================

    I would like each person's lines saved in a separate text file, like this:

    1LINE.txt

    Code: Select all

    1LINE_jimmy_home                  http://jimmy.blogspot.com/
    1LINE_jimmy_company               https://jimmy.company.blogspot.com/
    1LINE_jimmy_twi                   https://twtter.jimmy.com/

    2LINE.txt

    Code: Select all

    2LINE_Sam_home                    http://sam.home.com/index.php
    2LINE_Sam_blog                    http://sam.blog.com

    3LINE.txt

    Code: Select all

    3LINE_Jane_work                   http://jane.company.net/floor2
    3LINE_Jane_twitter                http://twitter.jane.com/
    ...
    ...

    99999LINE.txt

    Code: Select all

    99999LINE_jenna_home              http://twitter.jenna.com/
    99999LINE_jenna_work              http://workjenna.com/

    I'm a total newbie. I don't know how to do this.
    Please help me.


      Jan 11, 2015#2

      Do the code blocks show the real content of the large file?

      What separates the string at the beginning of a line, which identifies the lines of a block and should be used as the file name, from the rest of the line to copy into the smaller file? Is the separator the first underscore found on each line?
      It is not really clear to me what the content of the file looks like.

      Is it guaranteed that the string at the beginning of a line never contains a character which is not allowed in a file name, like ?*: ?

      Could the string at the beginning of a line contain characters with a special meaning in a Perl regular expression, like .+$ ?

      Which version of UltraEdit do you use?

      Is it possible to use an UltraEdit script instead of an UltraEdit macro?

      In scripts, variables can be used, which makes this splitting task easier to code.

      It would be best if you copy the first 10-30 lines of your large file into a new text file named input.txt. Next, create the smaller files for those lines manually. Then pack input.txt and the smaller files with ZIP or RAR and attach this archive file to your next post together with the answers to my questions. Those example files would make it very clear what the macro or script should do, and we could also use them to test the macro/script after coding it.
      Best regards from an UC/UE/UES for Windows user from Austria


        Jan 11, 2015#3

        Big Thanks Mofi! And I'm so sorry for my poor explanation.
        I am not even sure that I could give you more details from this post.
        Version : UltraEdit v12

        "10001LINE" is something like a personal ID. Each person has a unique number followed by "LINE".

        It doesn't matter how it is done. I just wish I could split one big file up into smaller files.
        If possible, please help me with an easy and simple way because I am a horrible newbie. :(

        As you asked, I attached a .zip file (deleted by Mofi later).

        Thanks again, have a good time, Mofi


          Jan 12, 2015#4

          Here is a quickly recorded and then edited macro.

          Code: Select all

          InsertMode
          ColumnModeOff
          HexOff
          PerlReOn
          Bottom
          IfColNumGt 1
          InsertLine
          EndIf
          Top
          Clipboard 9
          Loop
          Find MatchCase RegExp "^(\d+LINE_).*\r?\n(?:\1.*\r?\n)*"
          IfNotFound
          ExitLoop
          EndIf
          Copy
          NewFile
          Paste
          Top
          Find MatchCase RegExp "^(\d+LINE)"
          SaveAs "^s.txt"
          CloseFile NoSave
          IfEof
          ExitLoop
          EndIf
          EndLoop
          ClearClipboard
          Clipboard 0
          Top
          UnixReOff

          This macro worked with UE v21.30.0.1016 on your input file and produced the output files as provided by you, with the difference that the last line in every file also has a line termination. Let me know if it does not work in UE v12.xx, so that a different solution can be found.

          Edit:

          The macro works as posted with UE v12.20b+1. It fails with UE v12.00a+1 because the Perl regexp engine in that version does not support back-references in the search string. The regular expression search string is valid in UE v12.10, but the macro does not work with this version because of several bugs in the Perl regexp engine which were introduced with UE v12.00 and fixed later in v12.20. In other words, whether this macro works depends on the exact version of UltraEdit v12.

          It would be possible to code this differently using the UltraEdit regular expression engine and bookmarks, which would be slower but would also work with older versions of UltraEdit. However, you should think about upgrading to the latest version of UltraEdit.
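
For readers without UltraEdit, the same splitting idea can be sketched in Python. This is only an illustration of the grouping logic, not a translation of the macro: the function names `block_key` and `split_by_prefix` are made up here, and it assumes every relevant line starts with digits followed by `LINE_`.

```python
import re
from itertools import groupby
from pathlib import Path

# Assumption: an ID is one or more digits followed by "LINE" and an underscore.
PREFIX = re.compile(r"^(\d+LINE)_")


def block_key(line):
    """Return the leading ID such as '1LINE', or None if the line has none."""
    m = PREFIX.match(line)
    return m.group(1) if m else None


def split_by_prefix(text, out_dir):
    """Write each run of consecutive lines sharing an ID to <ID>.txt."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for key, lines in groupby(text.splitlines(), key=block_key):
        if key is None:
            # Lines without an ID (for example a ===== rule) are skipped.
            continue
        target = out_dir / f"{key}.txt"
        # Every line, including the last one, gets a line termination.
        target.write_text("\n".join(lines) + "\n", encoding="utf-8")
        written.append(target.name)
    return written
```

Like the macro, this sketch works on consecutive lines only. If the same ID shows up again further down the file, the later block overwrites the earlier output file.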

            Jan 12, 2015#5

            OMG big thanks Mofi!
            It works like a charm!
            I really appreciate your help. I couldn't have done it without you!
            But I was kind of shocked because your code was shorter and simpler than I expected. Of course, that means you are great.

            BTW: Mofi, one more question.
            Could I get the .txt files with URLs only? I mean, not including person's name, id etc.
            Only the part that starts with URLs like "http...(https....)"


              Jan 12, 2015#6

              Well, it is of course no problem to remove everything to the left of http or ftp on all copied lines before saving the new file.

              Code: Select all

              InsertMode
              ColumnModeOff
              HexOff
              Bottom
              IfColNumGt 1
              InsertLine
              EndIf
              Top
              TrimTrailingSpaces
              PerlReOn
              Find MatchCase RegExp "(\r?\n|\r)\1+"
              Replace All "\1"
              Clipboard 9
              Loop
              PerlReOn
              Find MatchCase RegExp "^(\d+LINE_).*\r?\n(?:\1.*\r?\n)*"
              IfNotFound
              ExitLoop
              EndIf
              Copy
              NewFile
              Paste
              Top
              SelectLine
              Copy
              EndSelect
              Top
              Find MatchCase RegExp "^.*?(http|ftp)"
              Replace All "\1"
              Paste
              SelectToTop
              UnixReOff
              Find MatchCase RegExp "[~^r^n0-9A-Za-z]+"
              Replace All SelectText "_"
              EndSelect
              Top
              Find MatchCase RegExp "_^{http^}^{ftp^}*$"
              Replace ""
              SelectToTop
              Copy
              EndSelect
              DeleteLine
              SaveAs "^c.txt"
              CloseFile NoSave
              IfEof
              ExitLoop
              EndIf
              EndLoop
              ClearClipboard
              Clipboard 0
              Top
              UnixReOff

              Explanation of the Perl regular expression search string: ^(\d+LINE_).*\r?\n(?:\1.*\r?\n)*

              ^ ... start each match at the beginning of a line.

              (...) ... is a capturing group. Whatever is matched by the expression inside the parentheses is stored for each find/replace and can be back-referenced within the search or replace string with \1, as done later in this expression.

              .* ... any character except newline characters (carriage return and line-feed), zero or more times.

              \r? ... a carriage return which may exist, but need not (it is missing in UNIX files).

              \n ... a line-feed which must exist.

              (?:...) ... a non-capturing group. The string matched by the expression inside is not stored for back-referencing. The parentheses just create a group for other purposes, as done here.

              \1 ... back-references the string found first at the beginning of a line. So the next line(s) must start with the same string as the first found line.

              .*\r?\n ... once more the rest of the line with its line termination.

              * ... is here, at the end of the search string, a multiplier for the entire expression in the non-capturing group. This expression can be applied greedily zero or more times. So it matches all lines below the first found line which start with the same string as the first found line. Greedy means here as many lines as possible.

              One more note: A Perl regular expression cannot match an unlimited number of characters. So the content of a smaller file could be wrong if thousands of lines in the large file start with the same string.
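
The expression explained above can also be tried outside UltraEdit. The following is a small sketch using Python's `re` module, which is a different engine than the Perl regexp engine in UltraEdit but supports the same constructs used here (`^` in multiline mode, capturing group, non-capturing group and the `\1` back-reference). The sample text is shortened from the first post.

```python
import re

# Same search string as in the macro; re.MULTILINE lets ^ match at every line start.
BLOCK = re.compile(r"^(\d+LINE_).*\r?\n(?:\1.*\r?\n)*", re.MULTILINE)

sample = (
    "1LINE_jimmy_home  http://jimmy.blogspot.com/\n"
    "1LINE_jimmy_twi   https://twtter.jimmy.com/\n"
    "2LINE_Sam_home    http://sam.home.com/index.php\n"
)

# Each match is one block: group(1) is the captured prefix, group(0) all of its lines.
blocks = [(m.group(1), m.group(0)) for m in BLOCK.finditer(sample)]

for prefix, lines in blocks:
    print(prefix, "->", lines.count("\n"), "line(s)")
```

The first match covers both lines starting with `1LINE_`, because the back-reference forces every following line to begin with exactly the prefix captured on the first line of the block.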

                Jan 12, 2015#7

                Thanks! It also works fine.
                But Mofi, sorry to bother you with a third question. Until now, when I had to do this job for a specific person's ID, I searched for the person I needed, cut their lines one by one manually, created .txt files and pasted the lines one by one. :oops:

                Is there any shortcut to do this job for a specific person's ID using your code?

                I mean, I enter the person's ID number or name manually in some prompt, and then the job is done automatically for that person. (If possible, it should work with more than one person. If not, at least with one person.)


                  Jan 13, 2015#8

                  Okay, here is one more macro for individual blocks based on the name or the identification number of a person. Because UltraEdit macros, unlike UltraEdit scripts, cannot store an entered string in a variable, the entered string must be written into a new line at the top of the large file, which is deleted at the end.

                  Code: Select all

                  InsertMode
                  ColumnModeOff
                  HexOff
                  Bottom
                  IfColNumGt 1
                  InsertLine
                  EndIf
                  Top
                  TrimTrailingSpaces
                  PerlReOn
                  Find MatchCase RegExp "(\r?\n|\r)\1+"
                  Replace All "\1"
                  ",
                  "
                  Top
                  Key END
                  GetString "Enter persons or ids separated by commas:"
                  Clipboard 9
                  UnixReOff
                  Loop
                  Find MatchCase Up ","
                  Delete
                  IfNotFound
                  ExitLoop
                  EndIf
                  Find MatchCase RegExp "?+$"
                  Cut
                  Find MatchCase "^c"
                  IfFound
                  Key HOME
                  PerlReOn
                  Find MatchCase RegExp "^(\d+LINE_).*\r?\n(?:\1.*\r?\n)*"
                  Copy
                  NewFile
                  Paste
                  Top
                  SelectLine
                  Copy
                  EndSelect
                  Top
                  Find MatchCase RegExp "^.*?(http|ftp)"
                  Replace All "\1"
                  Paste
                  SelectToTop
                  UnixReOff
                  Find MatchCase RegExp "[~^r^n0-9A-Za-z]+"
                  Replace All SelectText "_"
                  EndSelect
                  Top
                  Find MatchCase RegExp "_^{http^}^{ftp^}*$"
                  Replace ""
                  SelectToTop
                  Copy
                  EndSelect
                  DeleteLine
                  SaveAs "^c.txt"
                  CloseFile NoSave
                  EndSelect
                  Top
                  Key END
                  EndIf
                  EndLoop
                  ClearClipboard
                  Clipboard 0
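
What the macro above does with the entered string can be sketched in Python as follows. This is a simplified model under stated assumptions, not a translation of the macro: `parse_terms` and `first_block_for` are hypothetical names, the search is case-sensitive like `MatchCase`, and the whole block of the matching ID is returned.

```python
import re

# Assumption: an ID is one or more digits followed by "LINE" and an underscore.
ID = re.compile(r"^(\d+LINE)_")


def parse_terms(entered):
    """Split the comma-separated user input into non-empty, trimmed terms."""
    return [t.strip() for t in entered.split(",") if t.strip()]


def first_block_for(term, lines):
    """Return all lines sharing the ID of the FIRST line that contains term."""
    for line in lines:
        m = ID.match(line)
        if m and term in line:
            prefix = m.group(1) + "_"
            return [other for other in lines if other.startswith(prefix)]
    return []  # term not found anywhere
```

Note that only the first line containing the entered string decides which block is saved, so a string shared by several persons yields just one block.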

                    Jan 13, 2015#9

                    Mofi! Your code is awesome.
                    It works just like I would expect it to!
                    So far I have tested it with 1 to 7 persons, and it works great.

                    I can't thank you enough.
                    Have a good time, Mofi

                      Jan 14, 2015#10

                      Hi, Mofi!

                      It's me again. I have another question. :o

                      About your second macro: Could the .txt file names be the whole profile part, like ID number, name, etc. (everything on the left), but without the URL part (everything on the right)? I mean the whole left part should become the .txt file name.

                      And of course, if a person has more than one line, the profile part of the first line should be the .txt file name.

                      For example, as you can see above, jimmy has 3 lines. So it would be like this: "1LINE_jimmy_home.txt"

                      Note: There could also be characters other than word characters (letters, digits, underscores) to the left of the URL, like commas and spaces.


                        Jan 16, 2015#11

                        Okay, I updated macro 2 (split the entire file into smaller files) and macro 3 (save individual blocks into new files) to fulfill the new requirement for the file name. Each string of 1 or more characters which are not a letter or a digit, like underscore, comma, space, question mark, ..., is replaced by a single underscore to get a file name consisting only of letters and digits with underscores between the words.

                        Recoding the macros was hard work because of some bugs in UE v12.20b+1, all fixed years ago in later versions of UltraEdit. Coding the macros would have been much easier for the currently latest UE v21.30.0.1016. I could also reduce the size of the third macro with some small improvements at its beginning.
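
The file name rule described above (each run of characters that are not letters or digits becomes a single underscore) is easy to express outside the macro language. A minimal Python sketch, where `make_file_name` is a hypothetical helper and stripping a trailing underscore is an assumption about the desired result:

```python
import re


def make_file_name(left_part):
    """Collapse every run of non-alphanumeric characters into one underscore."""
    name = re.sub(r"[^0-9A-Za-z]+", "_", left_part)
    # Assumption: an underscore produced by trailing spaces is not wanted.
    return name.rstrip("_") + ".txt"


print(make_file_name("1LINE_jimmy_home, sundale   "))
```

The character class `[^0-9A-Za-z]+` corresponds to the UltraEdit expression `[~^r^n0-9A-Za-z]+` used in the macros, except that the line terminators need no special handling here because a single line is passed in.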

                          Jan 16, 2015#12

                          Thank you is not enough to express my appreciation for all your hard work.
                          But, Mofi, first of all, I am So Sorry. Please forgive me. :oops:

                          I made a horrible typo. My version number is "v21" (the latest) not "v12". OMG could you forgive me? :oops: :oops: :oops:

                          I was testing your latest modified macros 2, 3 over and over again. And it didn't work at all.

                          So I read your instructions slowly and repeatedly. And then I noticed the color of the version number you wrote. Red!

                          OMG

                          Could you make them work properly for the latest v21?
                          But I'd like the URL part to be saved as it is, I mean just the URLs without any modification like underscores. Is that possible?
                          This is what I expect:

                          Before:

                          Code: Select all

                          1LINE_jimmy_home 3A-NHATTANVILLE, NY 100272 st.32   http://jimmy.blogspot.com/
                          1LINE_jimmy_company                                                https://jimmy.company.blogspot.com/
                          1LINE_jimmy_twi                                                        https://twtter.jimmy.com/

                          After:
                          1LINE_jimmy_home 3A-NHATTANVILLE, NY 100272 st.32.txt

                          Code: Select all

                          http://jimmy.blogspot.com/
                          https://jimmy.company.blogspot.com/
                          https://twtter.jimmy.com/

                          Those 3 URLs are the contents of the .txt file.

                          Before:

                          Code: Select all

                          2LINE_Sam_pay monthly         http://sam.blog.com/pay.xls
                          2LINE_Sam_picture                http://sam.home.com/holidays.zip

                          After:
                          2LINE_Sam_pay monthly.txt

                          Code: Select all

                          http://sam.blog.com/pay.xls
                          http://sam.home.com/holidays.zip

                          Before:

                          Code: Select all

                          3LINE_Jane_work II              http://jane.company.net/floor2

                          After:
                          3LINE_Jane_work II.txt

                          Code: Select all

                          http://jane.company.net/floor2


                            Jan 16, 2015#13

                            smallville wrote: My version number is "v21" (the latest) not "v12".
                            You would have saved us both a lot of time by copying the version information right from the About dialog of UltraEdit into the edit field in the browser window.

                            Yes, the version information can be selected with the mouse and copied with Ctrl+C, or with a right click to open the context menu and a left click on Copy.
                            smallville wrote: But I'd like the URL part to be saved as it is, I mean just urls without any modification, like underscores.
                            It was totally unexpected for me that the macros working with UE v12.20b+1 did not produce the same correct output files with UE v21.30.0.1016. But the contents of the output files were indeed wrong with the currently latest version of UltraEdit.

                            I could quickly find out why. There is a new bug in UE v21.30.0.1016 which, as I found out later, was introduced with UE v19.00. The caret is erroneously moved to the top of the file after running the first Perl regular expression Replace All, which is executed from the beginning of the last line of the file to the end of the file to remove the URL from just the last line. A Replace All should never change the position of the caret. Therefore the second Perl regular expression Replace All, which replaces 1 or more characters not suitable for a file name with an underscore, was executed on the entire file instead of on the last line only.

                            Of course I will report this bug to IDM support quickly, as it can easily result in many macros and scripts not working correctly.

                            I updated macro 2 and 3 once more with a workaround for this issue. The two replaces for preparing the file name are now done with the UltraEdit regexp engine, where this erroneous caret move does not occur. The macros now work with UE v21.30.0.1016 as well as with UE v12.20b+1 and most likely with all other versions of UltraEdit between those 2 versions.

                            By the way: With UE v21.xx (any UE >= v13.00) it would have been better to code everything as a script, as I mentioned already in my first post. With a script, the comma-separated names or ids entered by the user could have been written directly into a string variable for further processing instead of into the input file. The file name could also have been prepared directly in a string variable instead of in the output file. And finally, with an input file that is not too large, it would even have been possible to search for the blocks in memory instead of in the input file, which would avoid lots of display updates and would therefore be much faster. However, now we have the macros and I don't want to recode them once more as scripts.
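
The in-memory approach described above, with file names and contents prepared in variables, could look roughly like the following. It is written in Python rather than as an UltraEdit script, so it makes no claims about the UltraEdit scripting API. The name `urls_by_block` and the exact pattern are assumptions: a line is taken to consist of the left part, whitespace, and one URL starting with http, https or ftp.

```python
import re

# Assumption: <digits>LINE_<left part> <whitespace> <URL without spaces>
ROW = re.compile(
    r"^(?P<left>(?P<id>\d+LINE)_.*?)\s+(?P<url>(?:https?|ftp)://\S+)\s*$"
)


def urls_by_block(text):
    """Map the left part of each block's first line (+ '.txt') to its URLs."""
    names = {}  # ID -> file name taken from the first line of the block
    urls = {}   # ID -> URLs of the block in input order
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # blank lines and lines without a URL are ignored
        names.setdefault(m["id"], m["left"] + ".txt")
        urls.setdefault(m["id"], []).append(m["url"])
    return {names[key]: urls[key] for key in names}
```

The left part is used unchanged as the file name, as requested in post #12. Characters that are not valid in file names on the target system would still have to be handled before writing the files.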

                              Jan 16, 2015#14

                              Thanks Mofi!

                              About macro 3: it works unless I enter something that several people have in common, like "home".

                              For example, if I enter "home" with macro 3, it creates a .txt file only for the first person who has "home" in his/her profile (left part). In this case only jimmy gets a .txt file. Sam and jenna don't get a .txt file. Could I have .txt files for all of them?

                              And about macro 2: it does not work properly with blank/empty lines between the lines. It works only for jimmy. For the others it creates 2 .txt files per person where there should be 1 .txt file per person. (Of course, in this case macro 3 won't work either.) Could the macro be adapted to work for such file content, too?

                              Code: Select all

                              1LINE_jimmy_home, sundale         http://jimmy.blogspot.com/
                              1LINE_jimmy_company               https://jimmy.company.blogspot.com/
                              1LINE_jimmy_twi                   https://twtter.jimmy.com/
                              2LINE_Sam_home                    http://sam.home.com/index.php
                              
                              2LINE_Sam_blog                    http://sam.blog.com
                              3LINE_Jane_work                   http://jane.company.net/floor2
                              
                              3LINE_Jane_twitter                http://twitter.jane.com/
                              
                              99999LINE_jenna_home              http://twitter.jenna.com/
                              
                              99999LINE_jenna_work              http://workjenna.com/
                              


                                Jan 18, 2015#15

                                I updated macro 2 and 3 once more with commands to remove all trailing spaces and delete all blank lines before searching, copying to a new file and saving each new file. The code creating the file name for each new file was also edited once more to produce correct results for all versions of UE since UE v12.20b+1.
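
The cleanup step mentioned above (trim trailing spaces, delete blank lines, make sure the last line is terminated) can be illustrated with a few lines of Python. This is just a sketch of the idea with a hypothetical function name, not what the macro commands do internally.

```python
def normalize(text):
    """Strip trailing whitespace, drop empty lines, end with one newline."""
    kept = [line.rstrip() for line in text.splitlines()]
    kept = [line for line in kept if line]  # lines that were blank are removed
    return "\n".join(kept) + "\n" if kept else ""
```

After this step all lines of one person are adjacent again, so a block search like the one used in the macros finds one block per person.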

                                Two macros are necessary for the new requirement of finding all blocks containing ANYWHERE (not just on the left side) one of the strings entered at the beginning, and saving ALL lines with the same ID as a line containing the searched string. Nested loops are not possible in a macro, only in a script, which is the reason why 2 macros are necessary now.

                                The first macro must be created first with the name SaveAllFound:

                                Code: Select all

                                Loop
                                Find MatchCase "^c"
                                EndSelect
                                IfNotFound
                                ExitLoop
                                EndIf
                                Key HOME
                                Find MatchCase RegExp "%[0-9]+LINE_"
                                Clipboard 9
                                Copy
                                EndSelect
                                Top
                                Find MatchCase "^c"
                                EndSelect
                                Key HOME
                                PerlReOn
                                Find MatchCase RegExp "^(\d+LINE_).*\r?\n(?:\1.*\r?\n)*"
                                Copy
                                EndSelect
                                NewFile
                                Paste
                                Top
                                SelectLine
                                Copy
                                EndSelect
                                Top
                                Find MatchCase RegExp "^.*?(http|ftp)"
                                Replace All "\1"
                                Paste
                                SelectToTop
                                UnixReOff
                                Find MatchCase RegExp "[~^r^n0-9A-Za-z]+"
                                Replace All SelectText "_"
                                EndSelect
                                Top
                                Find MatchCase RegExp "_^{http^}^{ftp^}*$"
                                Replace ""
                                SelectToTop
                                Copy
                                EndSelect
                                DeleteLine
                                SaveAs "^c.txt"
                                CloseFile NoSave
                                ClearClipboard
                                Clipboard 8
                                UnixReOff
                                IfEof
                                ExitLoop
                                EndIf
                                EndLoop
                                ClearClipboard

                                The second macro can have any name:

                                Code: Select all

                                InsertMode
                                ColumnModeOff
                                HexOff
                                Bottom
                                IfColNumGt 1
                                InsertLine
                                EndIf
                                Top
                                TrimTrailingSpaces
                                PerlReOn
                                Find MatchCase RegExp "(\r?\n|\r)\1+"
                                Replace All "\1"
                                ",
                                "
                                Top
                                Key END
                                GetString "Enter persons or ids separated by commas:"
                                UnixReOff
                                Loop
                                Find MatchCase Up ","
                                Delete
                                IfNotFound
                                ExitLoop
                                EndIf
                                Find MatchCase RegExp "?+$"
                                Clipboard 8
                                Cut
                                PlayMacro 1 "SaveAllFound"
                                EndSelect
                                Top
                                Key END
                                EndLoop
                                Clipboard 0

                                This second macro must be executed by the user to find all blocks containing ANYWHERE one of the entered strings separated by commas.
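
In a language with nested loops the same requirement fits into one small function. The following Python sketch is only meant to show the logic the two macros implement together. `blocks_containing` is a hypothetical name, and the match is a plain case-sensitive substring test anywhere in the line.

```python
import re

# Assumption: an ID is one or more digits followed by "LINE" and an underscore.
ID = re.compile(r"^(\d+LINE)_")


def blocks_containing(terms, lines):
    """Return {ID: all lines with that ID} for IDs where any line has any term."""
    groups = {}
    for line in lines:
        m = ID.match(line)
        if m:
            groups.setdefault(m.group(1), []).append(line)
    # Nested loop: every term is checked against every line of every block.
    return {
        key: block
        for key, block in groups.items()
        if any(term in line for term in terms for line in block)
    }
```

With the sample from post #14, the term "home" selects the blocks of jimmy, Sam and jenna, and each selected block contains all lines of that person, not only the line that matched.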

                                Note: This was the last time that I edited the macros because of new requirements. In future you have to do further edits yourself, or somebody else here in the user-to-user forum helps you with further questions. I don't want to spend more time on doing the coding job for you.