Find text & extract to new file

Newbie

    Jul 17, 2012 #1

    Hi,

    I'm a total novice with UltraEdit, macros, scripts, etc.

    I have a lot of multi-GB CSV files, ranging from 2 to 12 GB. What I need to do is find some text, extract the lines that contain it to a new file, and save that file under a custom name.

    For example: find "English", extract/copy all matching lines, and paste them into a new file called english.csv.

    I tried to do it with Search/Replace and "List Lines Containing String", but it's so bloody slow. And when I try to copy a few hundred thousand lines, UltraEdit hangs, and it crashes from time to time.

    Any help will be appreciated.

    Master

      Jul 17, 2012 #2

      In my considered opinion, if this is something you need to do on any kind of regular basis, then you need a specialized tool that does what you are asking, not a text editor that has search capabilities. The editor would have to load the file. I personally think you would be best off leaving the file on the drive and letting a stand-alone compiled program search it line by line, writing the lines that meet the criteria to a new file.

      This would require a relatively small amount of code, and is something that could be run from the command line or a batch file to repeat the process on a recurring basis, changing the text to search for each time as needed. Or perhaps better yet, a small Windows program that will ask for "text to search for", "file to search in", and "save as filename".
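A minimal sketch of the line-by-line approach described above, written in Python for brevity rather than in one of the compiled languages mentioned, so it illustrates the technique and not the speed. The function name `extract_lines` and the file names are hypothetical, and the sketch assumes the CSV files use an ASCII-compatible encoding such as UTF-8 or ANSI (it would not match text in UTF-16 files).

```python
def extract_lines(src_path, dst_path, needle, encoding="utf-8"):
    """Copy every line of src_path that contains needle into dst_path.

    Returns the number of lines written. The names and the default
    encoding are illustrative assumptions, not part of any existing tool.
    """
    needle_bytes = needle.encode(encoding)
    matches = 0
    # Binary mode avoids decoding every line and keeps the original
    # line endings (CRLF or LF) untouched in the output file.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:  # the file object yields one line at a time
            if needle_bytes in line:
                dst.write(line)
                matches += 1
    return matches
```

Called as `extract_lines("input.csv", "english.csv", "English")`, this never loads the whole file: only the current line is held in memory, so memory use stays small as long as individual lines are of ordinary length.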

      Nothing will be lightning fast on files that size, but a compiled program using C, C++, C#, or Object Pascal will give you the fastest possible solution for what you are looking for.

      What you are trying to do is called "Data Mining". Data mining is best left up to optimized, specialized programs. UltraEdit was created to write such specialized programs, not to do the actual data mining.

      If you need such a program written for you, I would be willing to help. I can't charge anyone else this year, because it would throw me into a higher income bracket and I'd pay too much in taxes, so this is better timing for you than for me. I would have to do the project "pro bono". I have a bit of spare time this week, but probably not after this week. If you are interested in having this professionally handled without charge, you may contact me on my contact page. You are welcome to review my privacy policy linked at the bottom of that page before contacting me.

      Why do I offer this? I like helping out the UltraEdit community from time to time.

      You may also wait and see if anyone else has another idea. You don't need to respond to my post, except by contacting me if you decide to do so.

      Grand Master

        Jul 18, 2012 #3

        A macro or script is even slower than using a Find with List Lines Containing String as UltraEdit updates the display on every found line.

        Windows has a command line tool that finds lines containing a string and outputs all lines found. The name of this tool is Find. Enter Find /? in a command prompt window to get help on this command. But I don't know if Find is capable of searching in files of several GB. Example:

        Find "English" "name of input file" > "name of output file"

        Master

          Jul 18, 2012 #4

          Mofi's idea is best. Try that before you try anything else. Hopefully it will work with the large files.

          I'm afraid, however, that you may run into limitations with that method. It is quite possible that there is a "line number" limitation that would limit the number of lines in a file it can process to either 2,147,483,647 or 4,294,967,295, depending on whether Find uses a signed or unsigned 32-bit integer.
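The two figures are the largest values of a signed and an unsigned 32-bit integer. Whether Find actually keeps such a counter is speculation in the post above; the sketch below only does the arithmetic, using hypothetical helper names and an assumed average line length, to show when a file of a given size could even reach those limits.

```python
INT32_MAX = 2**31 - 1    # largest signed 32-bit value
UINT32_MAX = 2**32 - 1   # largest unsigned 32-bit value


def estimated_lines(file_size_bytes, bytes_per_line):
    """Rough line count for a file, given an assumed average line
    length in bytes (line terminator included)."""
    return file_size_bytes // bytes_per_line


def counter_could_overflow(file_size_bytes, bytes_per_line, counter_max=INT32_MAX):
    """True if the estimated line count exceeds what the counter can hold."""
    return estimated_lines(file_size_bytes, bytes_per_line) > counter_max


TWELVE_GB = 12 * 1024**3  # the largest file size mentioned in the thread

# Lines of 10 bytes or more stay below the signed 32-bit limit.
print(counter_could_overflow(TWELVE_GB, 10))             # False
# Only near-empty lines (2 bytes each, i.e. just CRLF) exceed the unsigned limit.
print(counter_could_overflow(TWELVE_GB, 2, UINT32_MAX))  # True
```

For a 12 GB file, lines averaging 10 bytes give about 1.29 billion lines, which is under the signed limit. A file of that size would need lines averaging about 3 bytes or less, terminator included, to exceed the unsigned one, so ordinary CSV data is unlikely to get there.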

          I do have a solution available, with source code included, using Object Pascal, that will handle up to 9,223,372,036,854,775,807 lines in a single file. You would use it the same way as the Windows Find command, except it will only process a single file at a time, and it does not display the filename back to the screen or pipe it into a file, so you will get a "cleaner" solution.

          Let me know if you need it. I'd be happy to post it here with source if there is anyone that wants it. You could always take the source and compile it yourself if you don't trust an executable. I haven't tried it with Free Pascal, but it may work with that one.

          Newbie

            Jul 19, 2012 #5

            Thanks a mil, Mofi. The Find command works like a charm, and so far it works on files with millions of lines. The output file had around 1.4 million lines, and it only took around an hour to extract them.

            rhapdog, I would love to see your Pascal code and try it. I had Pascal in high school, but that was a long time ago and I'd actually forgotten that it exists.

            Master

              Jul 19, 2012 #6

              Okay, I'm attaching an archive that contains the source code.

              I originally compiled the program with Delphi 2010, but recompiled it with FreePascal 2.6.0, just to make sure it would work. I used the -Mdelphi switch with the FPC compiler.
              To compile with FreePascal, I used the following command.

              FPC -B -Mdelphi "FindLinesContString.pas"
              If you want the executable that goes with it, in case you have trouble getting the compiler to work, let me know. I did a default install of FreePascal, and ran the above command from the command line after making sure it was added to my path. It should be no problem for you.
              Attachment: FindLinesContString.zip (2.16 KiB)