Tapatalk

Data thinning a text file - select 4th row to a new file

Data thinning a text file - select 4th row to a new file

3

    Dec 10, 2014#1

    Hello!

    I need a bit of help. In searching the forums I saw some mentions of what I need to do in UltraEdit but not the solution.

    I need to thin down some database dump files of high resolution data something similar to the example below. These files are several GB's in size. Basically I need it to copy the first line and then every forth line to the end of the file into a new text file which would be one quarter in size. Basically turning the high resolution data into not so high resolution. Thank you in advance!

    Code: Select all

    Tag,Time Stamp,Value,Base Status,User Status
    CAL.UISP1:2587782246_COMMSTAT,5/21/2011 15:21:29.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,5/21/2011 15:21:42.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,5/26/2011 08:48:06.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,5/26/2011 08:48:47.000,2:FAILED: No Res,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:23:30.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:24:19.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:01.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:03.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:05.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:08.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:10.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:12.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:14.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:16.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:20.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:22.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:24.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:26.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:28.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:32.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:34.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:37.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:03:10.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:03:17.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:05:07.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:05:18.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:06:15.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:06:20.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:06:48.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:07:04.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:07:28.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:07:36.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:08:51.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:09:15.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:12:08.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:12:20.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/18/2011 10:13:13.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/18/2011 10:13:21.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/21/2011 15:23:23.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/21/2011 15:23:30.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/24/2011 06:50:10.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,10/24/2011 06:50:16.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,11/18/2011 07:49:26.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,11/18/2011 07:49:31.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,11/18/2011 07:52:27.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,11/18/2011 07:52:33.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,11/18/2011 07:52:47.000,1:In Progress   ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,11/18/2011 07:52:50.000,3:Succeeded     ,0x0003,0x00000000
    CAL.UISP1:2587782246_COMMSTAT,11/18/2011 07:55:48.000,1:In Progress   ,0x0003,0x00000000

    6,685587
    Grand MasterGrand Master
    6,685587

      Dec 11, 2014#2

      My suggestion is as follows for a file with DOS/Windows line terminators as I think this is the fastest:
      1. Create a copy of the huge file with Windows Explorer.
      2. Start UltraEdit, but do NOT open the file.
      3. Click on Search - Replace in Files.
      4. As search string enter (paste) the string ^(.*)(?:\r\n.*){3}$
      5. As replace string enter (paste) the string \1
      6. Select Files listed.
      7. For In files/types enter the name of the file to shrink.
      8. For Directory browse to the directory containing the file to shrink.
      9. Check option Match case as case-sensitive replaces are in general faster.
      10. Check Regular Expressions and select Perl as regular expression engine.
      11. All other options should not be checked.
      12. Click on button Replace All and wait some minutes until UltraEdit finished removing the lines.
      The file should be now much smaller and is definitely heavy fragmented stored on hard disk. Moving it to a different partition is the fastest method to get the file stored defragmented if the destination partition was defragmented before moving the file.

      The result of the Replace in Files for your example block is:

      Code: Select all

      Tag,Time Stamp,Value,Base Status,User Status
      CAL.UISP1:2587782246_COMMSTAT,5/26/2011 08:48:47.000,2:FAILED: No Res,0x0003,0x00000000
      CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:03.000,3:Succeeded     ,0x0003,0x00000000
      CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:12.000,3:Succeeded     ,0x0003,0x00000000
      CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:22.000,3:Succeeded     ,0x0003,0x00000000
      CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:32.000,3:Succeeded     ,0x0003,0x00000000
      CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:03:17.000,3:Succeeded     ,0x0003,0x00000000
      CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:06:20.000,3:Succeeded     ,0x0003,0x00000000
      CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:07:36.000,3:Succeeded     ,0x0003,0x00000000
      CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:12:20.000,3:Succeeded     ,0x0003,0x00000000
      CAL.UISP1:2587782246_COMMSTAT,10/21/2011 15:23:30.000,3:Succeeded     ,0x0003,0x00000000
      CAL.UISP1:2587782246_COMMSTAT,11/18/2011 07:49:31.000,3:Succeeded     ,0x0003,0x00000000
      CAL.UISP1:2587782246_COMMSTAT,11/18/2011 07:52:50.000,3:Succeeded     ,0x0003,0x00000000
      CAL.UISP1:2587782246_COMMSTAT,11/18/2011 07:55:48.000,1:In Progress   ,0x0003,0x00000000
      Best regards from an UC/UE/UES for Windows user from Austria

      3

        Dec 14, 2014#3

        Mofi, thanks for the prompt reply! Just want to acknowledge that I saw your answer. We are in low gear for the holidays at the office, but I am anxious to try it ASAP! :D

          Dec 15, 2014#4

          Mofi, I got to work with your solution today.  And it works fabulously! I did have to tweak on it just a little to get it to do what I needed for the record. To get to the every 4th row I changed your search string to ^(.*)(?:\r\n.*){4}$ and your replace string with \1\r\n to get a CR that was needed in the actual file I was thinning. I have attached a file comparison screenshot as a before and after. Anyway, thanks again! Good stuff!

          FileCompare.jpg (206.97KiB)
          File Compare

          6,685587
          Grand MasterGrand Master
          6,685587

            Jan 09, 2015#5

            I'm not sure why it was necessary for you to modify the search and replace strings.

            Using your search and replace strings on your data results in following file contents on test data.

            Code: Select all

            Tag,Time Stamp,Value,Base Status,User Status
            
            CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:23:30.000,1:In Progress   ,0x0003,0x00000000
            
            CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:08.000,3:Succeeded     ,0x0003,0x00000000
            
            CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:20.000,1:In Progress   ,0x0003,0x00000000
            
            CAL.UISP1:2587782246_COMMSTAT,10/12/2011 10:25:32.000,3:Succeeded     ,0x0003,0x00000000
            
            CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:05:07.000,1:In Progress   ,0x0003,0x00000000
            
            CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:07:04.000,3:Succeeded     ,0x0003,0x00000000
            
            CAL.UISP1:2587782246_COMMSTAT,10/13/2011 11:12:08.000,1:In Progress   ,0x0003,0x00000000
            
            CAL.UISP1:2587782246_COMMSTAT,10/21/2011 15:23:30.000,3:Succeeded     ,0x0003,0x00000000
            
            CAL.UISP1:2587782246_COMMSTAT,11/18/2011 07:52:27.000,1:In Progress   ,0x0003,0x00000000
            
            So 4 lines instead of 3 lines are removed and an empty line is inserted instead between all the remaining lines.

            However, the final result is what really matters.
            Best regards from an UC/UE/UES for Windows user from Austria