Split a CSV file into multiple files based on different values in one data column

Split a CSV file into multiple files based on different values in one data column

2

    Nov 07, 2021#1

    Hello,

    I have a problem that I could not figure out and I am really stressing with it.

    I have multiple CSV files of data for a lot of stocks in the same format like the following:

    Code: Select all

    DATE         TIME_M               EX   SYM_ROOT   SYM_SUFFIX   SIZE    PRICE   TR_CORR   TR_SEQNUM   TR_ID            TR_SOURCE   TR_RF
    02/01/2018   7:01:50.176780000    K    APA                     200     42.4    00        114501      52983525027889   C
    02/01/2018   7:01:50.370307000    K    APA                     100     42.4    00        114601      52983525027890   C
    02/01/2018   9:30:01.005430000    N    AR                      18462   19.24   00        814401      52983644893497   C
    02/01/2018   9:30:01.988385000    Y    AR                      100     19.2    00        839801      52983525027899   C
    02/01/2018   8:49:46.103487000    D    ARCH                    1434    93.16   00        420601      71675222775888   C           T
    02/01/2018   9:31:05.105643000    N    ARCH                    1190    93.84   00        1645101     52983812076865   C
    02/01/2018   8:49:45.295761007    D    ARLP                    2777    19.7    00        5490        1                N           Q
    02/01/2018   8:59:16.906599111    D    ARLP                    30      19.85   00        6036        1                N           N
    02/01/2018   9:30:09.062350000    N    AT                      8823    2.35    00        919201      52983661088591   C
    02/01/2018   9:33:26.099352000    D    AT                      300     2.375   00        3215501     71675240855880   C           T
    02/01/2018   9:30:00.089795661    Q    AXAS                    34473   2.52    00        9410        1                N
    02/01/2018   9:30:00.089826676    Q    AXAS                    34473   2.52    00        9412        2                N
    02/01/2018   4:31:49.886368000    T    BBL                     500     40.62   00        71001       62879129944153   C
    02/01/2018   6:01:21.022612000    T    BBL                     200     40.69   00        82401       62879129948234   C
    02/01/2018   9:30:00.105336000    N    BC                      7491    55.45   00        765401      52983594498378   C
    02/01/2018   9:30:00.551813000    D    BC                      82      55.19   00        796801      71675222898246   C           T
    02/01/2018   9:30:01.052533000    N    BCEI                    1422    27.83   00        820801      52983594177109   C
    02/01/2018   9:30:07.129540000    D    BCEI                    100     27.55   00        904501      71675223294802   C           T
    02/01/2018   13:02:48.518108378   D    BKEP                    100     5.3     00        1188175     51               N           Q
    02/01/2018   13:02:48.519017865   Q    BKEP                    2151    5.3     00        1188176     4                N
    
    I want to split one very large file into multiple files according to the value in the fourth data column as displayed below:

    File 1:

    Code: Select all

    DATE         TIME_M               EX   SYM_ROOT   SYM_SUFFIX   SIZE    PRICE   TR_CORR   TR_SEQNUM   TR_ID            TR_SOURCE   TR_RF
    02/01/2018   7:01:50.176780000    K    APA                     200     42.4    00        114501      52983525027889   C
    02/01/2018   7:01:50.370307000    K    APA                     100     42.4    00        114601      52983525027890   C
    
    File 2:

    Code: Select all

    DATE         TIME_M               EX   SYM_ROOT   SYM_SUFFIX   SIZE    PRICE   TR_CORR   TR_SEQNUM   TR_ID            TR_SOURCE   TR_RF
    02/01/2018   9:30:01.005430000    N    AR                      18462   19.24   00        814401      52983644893497   C
    02/01/2018   9:30:01.988385000    Y    AR                      100     19.2    00        839801      52983525027899   C
    
    File 3:

    Code: Select all

    DATE         TIME_M               EX   SYM_ROOT   SYM_SUFFIX   SIZE    PRICE   TR_CORR   TR_SEQNUM   TR_ID            TR_SOURCE   TR_RF
    02/01/2018   8:49:46.103487000    D    ARCH                    1434    93.16   00        420601      71675222775888   C           T
    02/01/2018   9:31:05.105643000    N    ARCH                    1190    93.84   00        1645101     52983812076865   C
    
    File n: ...

    More important facts:
    1. I use UltraEdit version 28.20.0.70 on Windows 10 Home.
    2. I use UltraEdit with the default configuration settings.
    3. The CSV files are ASCII files with Unix line endings.
    4. The CSV files use the horizontal tab character as separator between the values.
    5. The CSV files do not contain values containing a horizontal tab character or multi-line values, i.e. there is no double quoted value in any CSV file.
    6. The CSV files are all very large with hundreds of MB or even some GB with around 60-70 millions of lines.
    7. The data rows (lines) in the CSV files are always sorted alphabetically according to the value in for data column.
      So all data rows to put into a new file are in a CSV file in one more or less large block which can be millions of lines too.
    I uploaded one CSV file compressed into a ZIP file with about 154 MB on Google Drive. The CSV file has 821.106.452 bytes and 10.808.060 lines. The first block with APA has 298.419.794 bytes in 3.861.285 lines.

    Could you please help me with the macro or script in UltraEdit to do this.

    Thank you so much!

    6,675585
    Grand MasterGrand Master
    6,675585

      Nov 20, 2021#2

      Please read first the power tip Large file text editor. There should be used to following configuration settings for the usage of the attached UltraEdit script:
      1. Select at Filehandling - DOS/Unix/Mac handling the option Never prompt to files to DOS format.
        That is important because the CSV files use Unix line endings. They should be handled without a conversion to DOS/Windows line endings on file open.
      2. Select at File handling - Temporary files the option Open file without temp file but NO prompt.
        WARNING: If this option is selected, all changes are permanent!
        So be careful not modifying in any way an opened CSV file. Pressing key Del modifies with this setting the opened file without being able to undo this modification. Best would be to work always with a copy of the file to split on using this option which is highly recommend for working with such large files.
      3. Uncheck at Editor display - Miscellaneous the option Disable line numbers.
        ATTENTION: This is the opposite of what the power tip suggests because of the script needs the line numbers.
      The other configuration settings do not really matter in this case. The CSV files are not syntax highlighted at all. Therefore code folding is not applied automatically on file open and there is not done a search for function strings by UltraEdit on the not syntax highlighted CSV files. The files are not XML files and for that reason the XML manager configuration does not matter too.

      The UltraEdit script in the attached ZIP file is designed for splitting up the active file on start of script execution. So the script file must be added to the Script list and executed from this list. On using the ribbon mode click on last ribbon tab Advanced in fourth group Script on second item All scripts to open the Scripts window. The button Add can be used to add the script without or with a description. Then the script can be executed by clicking on same ribbon tab in same group on item Play script, or by opening the Script list by checking this item on ribbon tab Layout and double clicking on the script in the list.

      The script creates the files in the directory of the active file on script start. The file name is the found value string in fourth data column. The file extension of the active file on script start is used also for the file extension of the created files.

      The script opens the output window and shows the file name of each saved file during the script execution. There is the boolean variable bShowFullProgress at top of the script of which value can be modified to true to get even more information displayed during execution of the script in the output window.

      There is also at top of the script the variable nMaxLinesPerBlock with the value 500000. This value defines how many lines with same value in fourth data column are copied at once from the file to split to the new file. It was not possible with using 32-bit and not 64-bit UltraEdit v28.20.0.92 on my PC with just 4 GB of total RAM and 3 GB of free RAM according to Windows to copy the entire APA block of the large example file to clipboard. UltraEdit respectively Windows could not allocate a large enough memory block for all lines with data value APA. That would have perhaps worked with using 64-bit UltraEdit. But it is unpredictable how large a block of lines with same data value can be in your CSV files with some GB. So I decided to code the script to copy always not more than 500.000 lines at once from active file on script start to the new file. You may need to decrease this value if you get ever displayed a message box with Cannot allocate memory because of the lines are longer than the lines in the large example file on Google Drive. Cancel in such a case the script execution, open the script file, reduce this value, save and close the script file and run it once again on the same CSV file.
      Split_Large_TSV_File.zip (3.42 KiB)   7
      This ZIP file contains the UltraEdit script file to split the large TSV (tab separated values) file.
      Best regards from an UC/UE/UES for Windows user from Austria

      2

        Jan 01, 2022#3

        Due to some private problem I just read your reply now!

        Thank you very much Mofi, it is very kind of you. I really appreciate this.

        Minh Thu

        2
        NewbieNewbie
        2

          May 23, 2023#4

          How would the regular expression need to be modified if I have columns separated by commas in my csv. 
          Any assistance would be much appreciated 

          6,675585
          Grand MasterGrand Master
          6,675585

            May 23, 2023#5

            Replace the six occurrences of \\t in the script by the comma character , and that´s it.
            Best regards from an UC/UE/UES for Windows user from Austria

            2
            NewbieNewbie
            2

              May 24, 2023#6

              Mofi, thank you so much, worked perfectly.