Splitting based on content

Basic User

    Oct 05, 2006 #1

    I have a huge file containing a shop's sales for the past 6 months. One of the fields is the exact sales date.

    I'll show a simplified example below, where the last field is the sales date. What I would like to do is create a separate file per sales date.
    There is no identifier that marks the start or end of a day, and the number of rows varies from day to day.

    In Excel, the "Subtotals" feature has an "At each change in" option that computes statistics whenever the value of a certain column changes. That is the kind of thing I'm looking for.

    Thanks for your help!

    "artist 1 ";"Title"; 1; 09,99;20060828
    "artist 2 ";"Title"; 1; 09,99;20060828
    "artist 1 ";"Title"; 1; 05,99;20060829
    "artist 2 ";"Title"; 1; 06,99;20060829
    "artist 3 ";"Title"; 1; 03,99;20060829
    "artist 1 ";"Title"; 1; 03,99;20060830
    

    Grand Master

      Oct 05, 2006 #2

      Should be no problem. The following macro works for your example. The macro property Continue if a Find with Replace not found must be checked for this macro.

      Because of the focus issue after closing a file, described at Problem with Previous Window/Tab Command, make sure that your CSV file is the only open file or that it is the rightmost file in the open file tabs.

      I don't have time right now to explain the macro, but I think it's not too difficult to understand.

      InsertMode
      ColumnModeOff
      HexOff
      Bottom
      IfColNum 1
      Else
      "
      "
      EndIf
      Top
      Clipboard 9
      Key END
      StartSelect
      Find Up Select ";"
      Key RIGHT ARROW
      Copy
      EndSelect
      Key RIGHT ARROW
      Loop
      Find "^c"
      IfFound
      Key LEFT ARROW
      Else
      Key HOME
      IfColNumGt 1
      Key HOME
      EndIf
      Key DOWN ARROW
      SelectToTop
      Clipboard 8
      Cut
      NewFile
      Paste
      Top
      Clipboard 9
      Paste
      ".csv"
      SelectToTop
      Cut
      SaveAs "^c"
      CloseFile
      IfEof
      ExitLoop
      Else
      Key END
      StartSelect
      Find Up Select ";"
      Key RIGHT ARROW
      Copy
      EndSelect
      Key RIGHT ARROW
      EndIf
      EndIf
      EndLoop
      CopyFilePath
      CloseFile NoSave
      Open "^c"
      ClearClipboard
      Clipboard 8
      ClearClipboard
      Clipboard 0

      Best regards from an UC/UE/UES for Windows user from Austria

      Basic User

        Oct 05, 2006 #3

        Wow....magic! :D

        Works great. So far in my tests I've had only one file error (couldn't write to disk), but apart from that it works like a charm on the 50 MB test file I created.

        For the real thing (400 MB) it will take the PC a night of hard work, I think, but that's way better than doing the manual copy and paste over and over again myself...

        One little question: this macro is based on the date being the last field (which is perfect for this file); what if there were, let's say, two fields after the date? How should I adapt the macro?
        Two more "Find Up Select ";""?

        Thanks an awful lot, mofi!

        Grand Master

          Oct 05, 2006 #4

          Wow ... 400 MB! That's very important information. My first macro modifies the source file and, when finished, restores it by closing it without saving and reopening it. That's okay for normal files, but not for a 400 MB file, which is hopefully opened without a temp file, so that all changes are permanent.

          I have changed some lines in the macro so that it now works without modifying the source file. This should increase the speed of the macro a lot.

          The macro contains some additional commands to make sure it works independently of the configuration options Home Key Always Goto Column 1 and Bookmark column with line (the second currently only for UEStudio 6.00+; it will be available in UE in the next major release). It is also independent of the current regular expression engine because it is regex free.

          How it works:

          First, the macro checks whether the last line of the source file is terminated with EOL character(s). This is important because the macro uses Key DOWN ARROW together with IfEof: if the last line of the file is not terminated, Key DOWN ARROW has no effect there, the end of file is never reached, and the macro loops endlessly. Appending the missing line termination is the only possible modification of the source file!

          The macro now works with 2 bookmarks, so next it clears any existing bookmarks.

          Next it selects the date string in the first line and copies it to user clipboard 9. I have inserted the red lines (Find Up ";", Find Up ";" and Key LEFT ARROW) to show you how to select the date string if it is not at the end of a line.

          Find Up Select ";" selects from the current cursor position up to and including the found string. Because the ';' should not be part of the file name, select mode is started before this find, so the following Key RIGHT ARROW is executed in select mode and shrinks the selection by the ';'. Find Select "" is the same as holding the SHIFT key while pressing the Find Next button in the find dialog.

          Key RIGHT ARROW moves the cursor one character to the right to make sure that the string just copied is not found again in the following loop (not really needed, but safer).

          The main loop always searches for the current date string in clipboard 9.

          If it is found again, the macro unselects the found string and moves the cursor one character left before continuing the search. This is not really needed either, but it is safer.

          If the date string in clipboard 9 is not found, the cursor is in the last line with this date string. The macro now sets the cursor to the start of the next line and bookmarks that line. This would fail at the last line of the file if it were not terminated with CRLF (or only LF or only CR, depending on the file format and current edit mode).

          Next, clipboard 8 is selected and everything from the current cursor position back to the previous bookmark is selected (same as pressing Shift+F2 for Search - Next Bookmark with selecting).

          The selected block is copied into clipboard 8, the bookmark here is cleared, and the cursor moves down to the remaining bookmark, where the next date block starts.

          Then the macro opens a new file, pastes the block, moves the cursor to the top, inserts the date string there and appends ".csv" to form the file name. It selects the file name, cuts it from the file, saves the new file as "date string.csv" and closes it.

          Back in the source file, it checks whether the end of file is reached. If so, it exits the loop. If not, it again selects the new date string in the already bookmarked line, copies it to clipboard 9, sets the cursor to a position in the current line where the date string cannot be found again, and continues the loop.

          After the loop, it clears the remaining bookmark at the end of the file, clears the 2 used clipboards to free RAM and switches back to the Windows clipboard.
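
For readers who want to check the logic outside UltraEdit, the steps above can be sketched in Python. This is only a minimal sketch, not the macro itself: the names split_runs and write_runs are made up for illustration, it assumes the date is the last ';'-separated field, and, like the macro, it splits on every ';' without honoring double quotes.

```python
from itertools import groupby
from pathlib import Path


def split_runs(lines, field_index=-1, sep=";"):
    """Yield (key, block) for each run of consecutive lines sharing a key field."""
    def key_of(line):
        # Plain split: a ';' inside quoted text is NOT handled here.
        return line.rstrip("\r\n").split(sep)[field_index].strip()

    for key, block in groupby(lines, key=key_of):
        yield key, list(block)


def write_runs(source_path, out_dir, suffix=".csv"):
    """Write each run to <out_dir>/<key><suffix>; the source file is only read."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # newline="" keeps the original line endings untouched on read and write.
    with open(source_path, newline="") as src:
        for key, block in split_runs(src):
            # Append mode: a key that reappears later extends its file.
            with open(out / f"{key}{suffix}", "a", newline="") as dst:
                dst.writelines(block)
```

The source file is read line by line and never modified, which is the same goal as the bookmark-based version of the macro. Appending rather than overwriting is a design choice of this sketch for the case where a date shows up again later in the file.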

          Once again: the macro property Continue if a Find with Replace not found must be checked for this macro. And because of the focus issue after closing a file, described at Problem with Previous Window/Tab Command, make sure that your CSV file is the only open file or that it is the rightmost file in the open file tabs.

          InsertMode
          ColumnModeOff
          HexOff
          Bottom
          IfColNum 1
          Else
          "
          "
          EndIf
          Loop
          GotoBookMark
          IfEof
          ExitLoop
          Else
          ToggleBookmark
          Bottom
          EndIf
          EndLoop
          Top
          ToggleBookmark
          Clipboard 9
          Key END
          Find Up ";"
          Find Up ";"
          Key LEFT ARROW

          StartSelect
          Find Up Select ";"
          Key RIGHT ARROW
          Copy
          EndSelect
          Key RIGHT ARROW
          Loop
          Find "^c"
          IfFound
          Key LEFT ARROW
          Else
          Key HOME
          IfColNumGt 1
          Key HOME
          EndIf
          Key DOWN ARROW
          ToggleBookmark
          Clipboard 8
          GotoBookMarkSelect
          Copy
          EndSelect
          ToggleBookmark
          GotoBookMark
          NewFile
          Paste
          Top
          Clipboard 9
          Paste
          ".csv"
          SelectToTop
          Cut
          SaveAs "^c"
          CloseFile
          IfEof
          ExitLoop
          Else
          Key END
          Find Up ";"
          Find Up ";"
          Key LEFT ARROW

          StartSelect
          Find Up Select ";"
          Key RIGHT ARROW
          Copy
          EndSelect
          Key RIGHT ARROW
          EndIf
          EndIf
          EndLoop
          ToggleBookmark
          ClearClipboard
          Clipboard 8
          ClearClipboard
          Clipboard 0

          Basic User

            Oct 06, 2006 #5

            Great explanation; it really helps to understand what the differences are and how it works. It is indeed quite a bit faster now, but I still get file save errors about once in every 3 or 4 saved files.

            I get the message "File Error" and after that "File/device maybe readonly, or open for write by another application"; then I get the chance to save the new tab manually.
            Any idea what can cause this error? There is plenty of room on the disk, and as the other files are saved OK, I can't imagine it has something to do with access rights...

            Thanks a lot once again!

            Michael

            Grand Master

              Oct 06, 2006 #6

              Looks like the file name in clipboard 9 is sometimes not a valid file name. I once noticed on a very slow computer (Pentium 166 MHz) that the command Top after a big Paste had not completed before the macro continued with the next command, so the next command was executed somewhere in the middle of the file. Such a synchronization problem in your macro would produce a very long, invalid file name and also partly destroyed file content!

              Replace the section

              Top
              Clipboard 9
              Paste
              ".csv"
              SelectToTop
              Cut


              in your macro with


              Clipboard 9
              Paste
              ".csv"
              StartSelect
              Key HOME
              Cut
              EndSelect


              With this modification the file name is created at the end of the new file instead of at the top, so UltraEdit does not have to move the cursor up to the top of the file. This works for your macro because the last line of the new file is always terminated with EOL character(s), so the file name is created on a blank line at the bottom of the file.

              And I really hope that the last 2 columns of your CSV file never contain an escaped semicolon. Column text is enclosed in double quotes, so a ';' inside double-quoted column text should not be interpreted as a delimiter according to the CSV standard. The macro does not handle such exceptions.
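
To illustrate the quoting caveat: outside UltraEdit, a CSV-aware parser can pick out a field even when a quoted column contains a ';'. The following is a small Python sketch using the standard csv module; the helper name date_field and the sample line are hypothetical and not taken from the real data.

```python
import csv
import io


def date_field(line, index=-1):
    """Return one field of a ';'-delimited line, honoring double-quoted text."""
    reader = csv.reader(io.StringIO(line), delimiter=";", quotechar='"')
    row = next(reader)
    return row[index].strip()


# Hypothetical line: the date is third from the end and a later
# quoted column contains a ';'.
line = '"a";20060828;"x;y";"z"\n'
print(date_field(line, index=-3))             # csv-aware: 20060828
print(line.rstrip("\n").split(";")[-3])       # naive split: "x
```

A plain split on ';' counts the semicolon inside "x;y" as a delimiter and therefore picks the wrong column, which is the situation the macro cannot handle either.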

              Basic User

                Oct 06, 2006 #7

                Well, now I can give you a very big "Muchas gracias" from Holland! The file has just been split up into 198 separate parts without any error!

                I just concatenated those parts into a new total file to check for missing records; the size is almost equal. I'm missing just three records, and the UW File Compare is already doing its very best to determine which ones are missing (I'm going to add them manually).

                Thanks a lot. UE was already one of my favorite "couldn't do without" programs, and this kind of feature and support only makes that feeling stronger.

                Have a nice weekend Mofi!
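
The check for missing records described in this post can also be sketched in a few lines of Python. This is an illustration only: the function name missing_records is made up, and the sketch assumes the unique records fit in memory, which may not hold for every 400 MB file.

```python
from collections import Counter


def missing_records(original_lines, rejoined_lines):
    """Return records that occur more often in the original than in the rejoined data."""
    remaining = Counter(line.rstrip("\r\n") for line in original_lines)
    remaining.subtract(line.rstrip("\r\n") for line in rejoined_lines)
    # Repeat each record as many times as it is still missing.
    return [rec for rec, n in remaining.items() if n > 0 for _ in range(n)]
```

Both arguments can be open file objects, for example missing_records(open("total.csv"), open("rejoined.csv")), where both file names are placeholders. Line endings are stripped before comparing, so a DOS/UNIX difference between the two files does not produce false hits.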

                4

                  Oct 11, 2006 #8

                  Adapted the code mentioned above. Works fine :lol: for record lengths up to 2600.

                  With record lengths above 4500, only the first 12 records are OK; the rest produce files (with correct names) of 0 bytes.

                  I already changed the configuration setting maximum columns before line wraps to 20000; otherwise the lines get wrapped.

                  Here's my code (modified by Mofi - see posts below):

                  InsertMode
                  ColumnModeOff
                  HexOff
                  Bottom
                  IfColNum 1
                  Else
                  "
                  "
                  EndIf
                  Loop
                  GotoBookMark
                  IfEof
                  ExitLoop
                  Else
                  ToggleBookmark
                  Bottom
                  EndIf
                  EndLoop
                  Top
                  Find "field string to break line"
                  Replace All "field string to break line^p#!?#"

                  ToggleBookmark
                  Clipboard 9
                  Key HOME
                  Find "SOURCE1</fieldLabel><fieldvalue"
                  Key RIGHT ARROW
                  StartSelect
                  Find Select "_OMRA"
                  Copy
                  EndSelect
                  Key RIGHT ARROW
                  Loop
                  Find "^c"
                  IfFound
                  Key LEFT ARROW
                  Else
                  Key HOME
                  IfColNumGt 1
                  Key HOME
                  EndIf
                  Key DOWN ARROW
                  ToggleBookmark
                  Clipboard 8
                  GotoBookMarkSelect
                  Copy
                  EndSelect
                  ToggleBookmark
                  GotoBookMark
                  NewFile
                  Paste
                  Key UP ARROW
                  Key END
                  IfColNum 1
                  "Nothing selected or copied to clipboard! Macro execution stopped!

                  Check position in source file and content of active clipboard 8 and also of clipboard 9."
                  ExitMacro
                  Else
                  Key HOME
                  IfColNumGt 1
                  Key HOME
                  EndIf
                  Key DOWN ARROW
                  EndIf

                  Clipboard 9
                  Paste
                  ".xml"
                  StartSelect
                  Key HOME
                  Cut
                  EndSelect
                  GetValue "Continue macro execution (0/1) ?"
                  Key LEFT ARROW
                  IfCharIs "0"
                  Key DEL
                  ExitMacro
                  Else
                  Key DEL
                  EndIf

                  Top
                  Find "^p#!?#"
                  Replace All ""

                  SaveAs "^c"
                  CloseFile
                  IfEof
                  ExitLoop
                  Else
                  Key HOME
                  Find "SOURCE1</fieldLabel><fieldvalue"
                  Key RIGHT ARROW
                  StartSelect
                  Find Select "_OMRA"
                  Copy
                  EndSelect
                  Key RIGHT ARROW
                  EndIf
                  EndIf
                  EndLoop
                  ToggleBookmark
                  ClearClipboard
                  Clipboard 8
                  ClearClipboard
                  Clipboard 0
                  Top
                  Find "^p#!?#"
                  Replace All ""

                  Grand Master

                    Oct 11, 2006 #9

                    You forgot to mention which version of UltraEdit you have.

                    A maximum columns before wrap value of 20,000 has been possible since v10.10. Prior versions have a limit of 4096.
                    The maximum number of bytes in the clipboard or in a selection used in a search with ^c or ^s has been 30,000 since v9.20. Hopefully you do not exceed this limit.

                    Your macro looks good. I could not see any mistake. So maybe UltraEdit really has a bug when the old 4096 limit is crossed.
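
To find out which records actually cross the old 4096 limit, the file can be scanned outside UltraEdit. The following Python sketch is only an aid for narrowing down the problem; the function name long_lines is hypothetical, lengths are counted in bytes, and lines are assumed to end in LF or CRLF (a file with CR-only line endings would be read as one long line).

```python
def long_lines(path, limit=4096):
    """Yield (line_number, byte_length) for lines longer than `limit` bytes."""
    with open(path, "rb") as f:
        for number, raw in enumerate(f, start=1):
            # Measure the record without its line terminator.
            length = len(raw.rstrip(b"\r\n"))
            if length > limit:
                yield number, length
```

For example, list(long_lines("records.xml")) returns the line numbers and lengths of all records above the limit, where records.xml stands for the real file name.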

                    Insert the following code after the 2 macro commands NewFile and Paste, before Clipboard 9:

                    Key UP ARROW
                    Key END
                    IfColNum 1
                    "Nothing selected or copied to clipboard! Macro execution stopped!

                    Check position in source file and content of active clipboard 8 and also of clipboard 9."
                    ExitMacro
                    Else
                    Key HOME
                    IfColNumGt 1
                    Key HOME
                    EndIf
                    Key DOWN ARROW
                    EndIf

                    With this additional code the macro will exit if nothing was pasted into the new file. Maybe you can see in the source file why.

                    4

                      Oct 12, 2006 #10

                      Version of UltraEdit is 10.10c.

                      Changed the max limit to 9000 characters => 62 get processed, rest with 0 bytes length. Macro doesn't stop when 0 bytes files are written.

                      Grand Master

                        Oct 12, 2006 #11

                        This is getting more and more suspicious. The maximum columns before line wraps setting influences how many blocks are successfully saved to a file? Sounds like a problem of v10.10c. In the UltraEdit v12.10b history I found the following line for UltraEdit v11.10b:
                        Fixed heap corruption in undo buffer, specifically search/replace operations on files with long lines
                        Well, the undo buffer is not used here, but who knows!

                        Is your XML file a UTF-8 or UTF-16 file (Unicode editing)? See the status bar at the bottom of the UE window.

                        There are known issues with Unicode editing.

                        I have merged my first debugging suggestion (gray color: the block from Key UP ARROW to the EndIf after Key DOWN ARROW), which you have inserted correctly, with a new one (red color: the GetValue block) in your initial post. This new code now asks you for every file whether to continue or not, so you can look at what the new file contains before it is saved and the macro continues.

                        But I think you are debugging a bug of UltraEdit v10.10c here.

                        Maybe you can break up the long lines with a search and replace in the source file at the top of the macro and undo it in every new file before saving. This depends on the content of your file. I have inserted a suggestion in green (the Find / Replace All commands using "#!?#") in the macro code in your initial post.

                        4

                          Oct 12, 2006 #12

                          Sorry, no luck at all.
                          When loading the file I get the question: Do you want to convert file xxx to DOS format? When I don't convert, the text UNIX is shown at the bottom. When I convert, the text DOS is shown.

                          When running the macro with the question (0/1) to answer, it runs fine.
                          When I leave this question out, the same error occurs, even when splitting the line in two parts (and reducing the max record length in the config to 3000).

                          I think I have run out of options, so I will now pass the files to a Unix box and do the splitting over there, as getting a new release of UltraEdit will take months.

                          Thanks for your effort !

                          Grand Master

                            Oct 12, 2006 #13

                            Okay, no Unicode, only ASCII files with UNIX line endings. To avoid trouble with Unix files on Windows, you should set the config option Automatically convert to DOS format at General - Load/Save/Conversions - Unix/Mac file detection/conversion in the configuration dialog AND additionally Save file as input format (UNIX/MAC/DOS). With these settings you always edit in (WIN)DOS mode (good for copying and pasting with other applications), but the file is always saved in its original format, UNIX or DOS.

                            Since it runs fine with the question, it looks like a timing problem after the paste. You could try to help UE synchronize by inserting the following instead of the gray block:

                            Top
                            Bottom
                            " "
                            Key BACKSPACE

                            Maybe this helps. If not, UE is shareware. You can download and install the latest version and test the macro with v12.10b. You only need to rename your existing UltraEdit program directory and create a backup of the uedit32.* files in the Windows directory before you temporarily install the latest version 12.10b into a directory with the same name as your v10.10c directory had before the rename.

                            After your test you can delete the program directory of the new version, delete the *.mfg, *.pfg, *.tfg and uedit32.* files in the Windows directory, restore your uedit32.* backups and rename the UltraEdit program directory back. Then your registered version is restored.

                            4

                              Oct 13, 2006 #14

                              Downloaded the latest version (12.20). Altered some settings in the config (pre version 11 bookmark style), but it still doesn't work.
                              I already did the job on a Unix box.
                              So I will stop investigating this now.
                              Thanks for the effort!

                              Basic User

                                Nov 28, 2006 #15

                                Mofi and/or others of course :lol: ,

                                the macro did its job very well, and I'm still very happy with it. However, I tried to adjust it, as I needed the same principle but with a different selection...

                                I now wanted to split based on the first eight characters of every line. That's the date field, and using those positions I planned to make 300 files out of 1 total file (2.5 million records, 450 MB in size).
                                It seemed to work OK, but halfway through it stopped working: it made several bookmarks within one date selection and then created one huge file for all dates after that one.

                                Any idea what went wrong? I can't find any discrepancy in the file itself, and I now doubt whether the splits that were made are OK.

                                Here's the code I used. Can someone please check it for mistakes?

                                Thanks once again,
                                Michael

                                InsertMode
                                ColumnModeOff
                                HexOff
                                Bottom
                                IfColNum 1
                                Else
                                "
                                "
                                EndIf
                                Loop 
                                GotoBookMark
                                IfEof
                                ExitLoop
                                Else
                                ToggleBookmark
                                Bottom
                                EndIf
                                EndLoop
                                Top
                                ToggleBookmark
                                Clipboard 9
                                Key HOME
                                StartSelect
                                Key RIGHT ARROW
                                Key RIGHT ARROW
                                Key RIGHT ARROW
                                Key RIGHT ARROW
                                Key RIGHT ARROW
                                Key RIGHT ARROW
                                Key RIGHT ARROW
                                Key RIGHT ARROW
                                Copy 
                                EndSelect
                                Loop 
                                Find "^c"
                                IfFound
                                Key LEFT ARROW
                                Else
                                Key HOME
                                IfColNumGt 1
                                Key HOME
                                EndIf
                                Key DOWN ARROW
                                ToggleBookmark
                                Clipboard 8
                                GotoBookMarkSelect
                                Copy 
                                EndSelect
                                ToggleBookmark
                                GotoBookMark
                                NewFile
                                Paste 
                                Clipboard 9
                                Paste 
                                ".txt"
                                StartSelect
                                Key HOME
                                Cut 
                                EndSelect
                                SaveAs "^c"
                                CloseFile
                                IfEof
                                ExitLoop
                                Else
                                Key HOME
                                StartSelect
                                Key RIGHT ARROW
                                Key RIGHT ARROW
                                Key RIGHT ARROW
                                Key RIGHT ARROW
                                Key RIGHT ARROW
                                Key RIGHT ARROW
                                Key RIGHT ARROW
                                Key RIGHT ARROW
                                Copy 
                                EndSelect
                                Key RIGHT ARROW
                                EndIf
                                EndIf
                                EndLoop
                                ToggleBookmark
                                ClearClipboard
                                Clipboard 8
                                ClearClipboard
                                Clipboard 0
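
As a cross-check for splits made with the macro above, the expected blocks can be computed independently by comparing only the first eight characters of each line. This Python sketch is not a diagnosis of the macro; the function name split_by_prefix is made up, and it assumes every line starts with the eight-character date.

```python
from itertools import groupby


def split_by_prefix(lines, width=8):
    """Group consecutive lines whose first `width` characters are equal."""
    return [
        (prefix, list(block))
        for prefix, block in groupby(lines, key=lambda line: line[:width])
    ]
```

Because only the line start is compared, a date that also appears further right in a later record does not extend a block. The number of blocks and their line counts can then be compared with the files the macro produced.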
                                
