Extracting Lines from a File

mcgint · Mar 01, 2007 · #1 · 2007-03-01T20:06+00:00

I have a number of multi-gigabyte text files that I want to import into a database manager. They are nested files containing a dozen different types of records, each of which has a different layout. The 12th and 13th characters in each line denote the record type. I would like to copy all of the lines in the file that are record type "01" and paste them into a new file. I thought this macro would work, but it doesn't loop; it runs only on the first line it finds, then stops. Any help anyone can offer will be appreciated.

InsertMode
ColumnModeOff
HexOff
UnixReOff
Loop
Find RegExp "%???????????01*^p"
Copy
NextWindow
Paste
NextWindow
IfEof
EndLoop
EndIf
ExitLoop

pietzcker · Mar 01, 2007 · #2 · 2007-03-01T20:14+00:00

Erm, I don't know the UE macro language well, but shouldn't you swap EndLoop and ExitLoop?

Tim

mcgint · Mar 01, 2007 · #3 · 2007-03-01T20:20+00:00

Swapping the locations of "ExitLoop" and "EndLoop" did work. Thanks.

Bego · Mar 01, 2007 · #4 · 2007-03-01T20:21+00:00

There is also a command "PreviousWindow".

I didn't understand this 100%. Do you have many files open?
Maybe you can do your regexp search in the search box (Ctrl+F) and tick the checkbox "list lines containing string" (or something similar).
Then you can copy the result (only your 01 lines) to the clipboard or into a new file.

mcgint · Mar 01, 2007 · #5 · 2007-03-01T20:35+00:00

I thought about the "list line" option, but I'm going to have 300,000 to 400,000 lines per file that I need to copy, and that seemed like too much for the clipboard.

mrainey56 · Mar 01, 2007 · #6 · 2007-03-01T20:59+00:00

Maybe you could break the files up into sections and do the List Lines/Clipboard method, which is extremely fast. I think it'll take all day with the combination of a huge file and a loop.
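
For anyone who wants to script the "break the files up into sections" step, here is a minimal Python sketch. The function name split_into_sections, the ".partNNNN" naming scheme and the default section size are my own assumptions, not anything UltraEdit provides; the iso-8859-1 encoding is assumed only because it accepts every byte value.

```python
import itertools

def split_into_sections(path, lines_per_section=100000):
    """Write path.part0001, path.part0002, ... and return the names created."""
    created = []
    # newline="" keeps the original line endings untouched
    with open(path, "r", encoding="iso-8859-1", newline="") as source:
        for index in itertools.count(1):
            # read the next block of lines without loading the whole file
            chunk = list(itertools.islice(source, lines_per_section))
            if not chunk:
                break
            part_name = "%s.part%04d" % (path, index)
            with open(part_name, "w", encoding="iso-8859-1", newline="") as part:
                part.writelines(chunk)
            created.append(part_name)
    return created
```

Each section can then be opened on its own for the List Lines/Clipboard method. Only one section's worth of lines is held in memory at a time.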

Mofi · Mar 18, 2007 · #7 · 2007-03-18T20:13+00:00

Is this problem solved? I have some other ideas for how to do this job.

mcgint · Mar 19, 2007 · #8 · 2007-03-19T18:04+00:00

Not really. I tried the list lines option, but with the gargantuan files I'm working with (I think the biggest is 6 GB), it seemed to take forever. I think I need to learn PERL or Python.

Mofi · Mar 20, 2007 · #9 · 2007-03-20T12:37+00:00

Okay, I have 3 suggestions.

1) Delete all lines which do not have record type 01 and save the result under a new file name.

InsertMode
ColumnModeOff
HexOff
UnixReOff
Top
Find RegExp "%???????????[~0]*^p"
Replace All ""
Find RegExp "%???????????0[~1]*^p"
Replace All ""
SaveAs ""
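
To show what the two Replace All steps do, here is a rough Python sketch. The translation of the UltraEdit-style patterns into Python's re syntax is my assumption (11 arbitrary characters, then a 12th character that is not "0", or a "0" followed by a 13th character that is not "1"), and the function name keep_record_01 is made up for this illustration.

```python
import re

# Assumed equivalents of the two UltraEdit-style patterns above.
# "." never matches a newline, so each pattern stays within one line.
NOT_ZERO = re.compile(r"^.{11}[^0\n].*\n?", re.MULTILINE)
ZERO_NOT_ONE = re.compile(r"^.{11}0[^1\n].*\n?", re.MULTILINE)

def keep_record_01(text):
    """Delete every line whose 12th/13th characters are not "01"."""
    # first pass: 12th character is not "0"
    text = NOT_ZERO.sub("", text)
    # second pass: 12th character is "0" but 13th is not "1"
    return ZERO_NOT_ONE.sub("", text)
```

In this sketch, lines shorter than 12 characters match neither pattern and are left in place, so they would have to be removed separately if they occur.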

2) Collecting the lines of interest in a clipboard in several steps.

This solution needs 2 macros because nested loops are not possible.

First create the following macro, which contains the inner loop, with the name "Copy2000Lines".

ClearClipboard
Loop 2000
Find RegExp "%???????????01*^p"
IfFound
CopyAppend
Else
PreviousWindow
Paste
ExitLoop
EndIf
EndLoop

The main macro has the following code:

InsertMode
ColumnModeOff
HexOff
UnixReOff
Top
NewFile
NextWindow
Clipboard 9
Loop
PlayMacro 1 "Copy2000Lines"
IfNameIs ""
ExitLoop
Else
PreviousWindow
Paste
NextWindow
EndIf
EndLoop
ClearClipboard
Clipboard 0

I think this solution is also quite easy to understand. The first macro always copies a maximum of 2000 lines to user clipboard 9. If the string is no longer found before 2000 lines have been collected, it pastes the lines found so far into the new file and exits the loop. If 2000 lines have been copied to the clipboard, the loop ends without switching to the new file.

The main macro makes the necessary preparations and runs the first macro in a loop. After the first macro finishes, it checks whether the current file has no name. If so, the first macro has already pasted the rest of the found lines into the new, unnamed file. If the first macro found 2000 lines, so that the named source file still has the focus, the main macro pastes the found lines into the new file and calls the first macro again.

You can increase the number 2000 if the macro also works with a higher number.

All 3 macros above require the macro property Continue if a Find with Replace not found to be enabled.
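
The batching idea behind suggestion 2 can be sketched in Python as follows. This is only an illustration of the logic, not a translation of the macros: the function name copy_in_batches and its parameters are hypothetical, and a Python list stands in for user clipboard 9.

```python
def copy_in_batches(lines, output, batch_size=2000, record_type="01"):
    """Append matching lines to output in batches; return the number of flushes."""
    batch = []      # plays the role of the user clipboard
    flushes = 0
    for line in lines:
        if line[11:13] == record_type:
            batch.append(line)            # comparable to CopyAppend
            if len(batch) == batch_size:
                output.writelines(batch)  # comparable to Paste in the new file
                batch.clear()             # comparable to ClearClipboard
                flushes += 1
    if batch:
        # the source is exhausted before the batch is full: paste the rest
        output.writelines(batch)
        flushes += 1
    return flushes
```

Here lines can be an open file object and output a second file opened for writing, so only one batch is held in memory at a time.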

3) Using Find In Files to get the list of lines

Finally, you can use Find In Files with your search string and the option "Results to Edit Window", convert the results file to ASCII if necessary, and then use a few regex Replace All commands to delete everything except the content of the found lines.

pietzcker · Mar 21, 2007 · #10 · 2007-03-21T00:45+00:00

mcgint wrote: I think I need to learn PERL or Python.

In Python, this would do the job. Put the script in the same directory as the files you need to examine and run it. You will probably need to correct the filename wildcard expression (here "*.txt"). The script will generate a new file with the name of the original file + ".out" for each file in the directory that matches the wildcard expression. No checks for errors (read/write permissions etc.) in this brief script.

Code:

# -*- coding: iso-8859-1 -*-
import os, glob
which_files="*.txt" # enter correct wildcard here
match_string="01"   # string to match at pos. 12/13

def all_files():
    # yield every file in the current directory that matches the wildcard
    for filename in glob.glob(os.path.join(".", which_files)):
        yield filename

if __name__ == '__main__':
    for filename in all_files():
        print("Processing file", filename)
        file_in = open(filename, "r")
        file_out = open(filename+".out", "w")
        try:
            # read line by line so the whole file is never held in memory
            for line in file_in:
                # slicing never raises; lines shorter than 13 characters simply don't match
                if line[11:13]==match_string:
                    file_out.write(line)
        finally:
            # close both files even if reading or writing fails
            file_in.close()
            file_out.close()

Should be pretty fast and not need any significant amount of memory.

HTH,
Tim

mcgint · Mar 21, 2007 · #11 · 2007-03-21T02:24+00:00

Mofi and pietzcker:

Thanks a lot for the advice. Judging from the speed/performance I've seen so far in UltraEdit, I'm guessing that Python would be a lot faster. Am I right?

pietzcker:

I've got a total of 12 record types in the files I'm using. Reading your program, I'm thinking I could just copy and paste the code 11 times, change the match_string in each copy and use the match_string in the name of the outgoing file:

file_out = open("rectype"+match_string+".out", "w")

Does that make sense? Also (and I'm pushing my luck here), what if I wanted to replace non-printing characters (I've found some in these files that can goof up the import) with spaces? Is there a handy Python cleaning function I can use on the line that's written out?

file_out.write(line)

Thanks again for the help.

pietzcker · Mar 21, 2007 · #12 · 2007-03-21T11:49+00:00

OK, so let's make it a little more versatile. Now you can define the 12 (or any number of) types in an array. Each source file will only be read once, and if characters 12/13 are in that array, the corresponding file will be written to. If one of the array entries doesn't occur in the source file, there will be an empty output file.

I guess that Python will indeed be a lot faster, especially in this scenario, but then Python is no text editor...

As for replacing non-printing characters: Shouldn't be a problem - what exactly would those non-printing characters be? Anything except letters and numbers? ASCII/ANSI/Unicode...?

Code:

# -*- coding: iso-8859-1 -*-
import os, glob
which_files="*.txt" # enter correct wildcard here
match_string=["01", "AB", "XY"]   # string(s) to match at pos. 12/13

def all_files():
    for filename in glob.glob(os.path.join(".", which_files)):
        yield filename

if __name__ == '__main__':
    for filename in all_files():
        print("Processing file", filename)
        file_in = open(filename, "r")
        # one output file per record type, keyed by the type string
        all_output_files={}
        for item in match_string:
            all_output_files[item] = open(filename+item+".out", "w")
        for line in file_in:
            record_type = line[11:13]
            # the membership test guarantees that the dictionary lookup succeeds
            if record_type in match_string:
                all_output_files[record_type].write(line)

        file_in.close()
        for item in match_string:
            all_output_files[item].close()

mcgint · Mar 21, 2007 · #13 · 2007-03-21T21:52+00:00

Regarding the ASCII/ANSI/Unicode question, I don't really know. The characters I encountered looked like squares (or maybe rectangles) in UltraEdit. The records are fixed-width, which is why I'd like to replace the bad characters with spaces. There's punctuation in the data, so I'd need to keep more than just numerals and letters.
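
One possible way to do that replacement in Python is sketched below. It assumes that "non-printing" means the ASCII control characters (codes 0 to 31 and 127) apart from the line endings CR and LF; which characters are really in the files is not known in this thread, so the table would need adjusting. The names CONTROL_TO_SPACE and clean_line are made up for this sketch.

```python
# Assumption: the bad characters are ASCII control characters (0-31 and 127).
# CR (13) and LF (10) are kept so that line endings survive.
CONTROL_TO_SPACE = {code: " " for code in list(range(32)) + [127]
                    if code not in (10, 13)}

def clean_line(line):
    """Replace each control character with one space, keeping the line width."""
    return line.translate(CONTROL_TO_SPACE)
```

Because every bad character is replaced by exactly one space, the fixed-width layout is preserved, and punctuation is untouched. Note that tabs (code 9) are also turned into spaces under this assumption. In the scripts above it would be used as file_out.write(clean_line(line)).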

One other thing: I have 13 separate files, each of which has 12 different record types within it. My ultimate goal is to take all 01 records from each of the 13 files and put them in one file (and so on for each record type). So my guess is I could take this line:

all_output_files[item] = open(filename+item+".out", "w")

and change it to:

all_output_files[item] = open("rectype"+item+".out", "w")

That way, when the first file is processed, it will create rectype01.out, and each successive file will open the same file. Or at least that's my guess. Will the "w" in Python automatically append data if the output file already exists?

Thanks again for your help.

Tom

pietzcker · Mar 22, 2007 · #14 · 2007-03-22T11:30+00:00

Hi,

Can you check in UE's hex mode which ASCII code corresponds to those rectangles (is it always the same one, or does it vary)?

As for collating all the results into one file for each record type, it makes more sense to do it the other way (open all the files for writing at the start and close them at the very end). If you open a file with "w", it overwrites the old version.
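
To see the difference between the "w" and "a" modes, here is a small self-contained sketch. The helper name write_twice is hypothetical, and the file is created in a temporary directory so nothing in your data directory is touched.

```python
import os
import tempfile

def write_twice(mode):
    """Open the same file twice with the given mode and return its final content."""
    path = os.path.join(tempfile.mkdtemp(), "rectype01.out")
    for text in ("first file\n", "second file\n"):
        # "w" truncates the file on every open, "a" appends to what is there
        with open(path, mode) as out:
            out.write(text)
    with open(path) as result:
        return result.read()
```

With "w" only the text from the second open survives; with "a" both pieces are kept. Opening each output file once at the start, as in the script below, avoids the question entirely.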

I've also included a possibility to insert a marker for each source file in the output files. It's commented out right now, but if you need it, just remove the hash signs.

Code:

# -*- coding: iso-8859-1 -*-
import os, glob
which_files="*.txt" # enter correct wildcard here
match_string=["01", "AB", "XY"]   # string(s) to match at pos. 12/13

def all_files():
    for filename in glob.glob(os.path.join(".", which_files)):
        yield filename

if __name__ == '__main__':
    # open one output file per record type once, before any source file is read
    all_output_files={}
    for item in match_string:
        all_output_files[item] = open("rectype"+item+".out", "w")

    for filename in all_files():
        print("Processing file", filename)
#        for item in match_string:
#            all_output_files[item].write("------- Output from file "+filename+" starts here -------\n")
        file_in = open(filename, "r")
        for line in file_in:
            record_type = line[11:13]
            # the membership test guarantees that the dictionary lookup succeeds
            if record_type in match_string:
                all_output_files[record_type].write(line)

        file_in.close()

    # close the output files only after all source files have been processed
    for item in match_string:
        all_output_files[item].close()

All in all, we're moving pretty far away from UE in this thread (although UE is great for developing and running Python scripts).

Tim

UltraBoG · Mar 23, 2007 · #15 · 2007-03-23T01:44+00:00

What I do when I have a similar problem is the following:

First, I go into column mode, and then "select" an insert point right in front of the first column of every record, all the way down the file. (Put cursor in first position of file (Ctrl-Home) and then "drag" to bottom with a Shift-Ctrl-End.) Then from the Column menu, select Insert Number. Clicking OK turns UltraEdit loose to insert line numbers all the way from beginning to end of the file. And, it intelligently makes them all the same width by including appropriate leading zeros. This first step might be unnecessary, but it permits me to ensure that all lines with the same record type stay in their original relative order during the second step.

From the File menu, select Sort -> Advanced Sort Options. Now, suppose that each number inserted in step one was 6 digits wide. That puts the record format indicator now into columns 18 and 19, rather than 12 and 13. So, fill out the sort column table with 18 and 19 on the first row, 1 and 6 on the second row and click Sort. For files of extreme size, you might click the checkbox "Alternate sort not using virtual memory." I haven't yet used this checkbox, so I can't say how well it works, but it is something to try if UltraEdit chokes on the size of your file.

The sort brings all the 01 records together at the top of the file, followed by all the 02 records, and so on. It is then a simple matter to copy/paste all the leading lines into another file, or to delete all lines which follow them, or... choose your approach. Column mode selection followed by delete will remove the inserted line numbers.

The reason the first step might be needed is that some sort algorithms do not guarantee to preserve the original order of records whose sort keys are all identical. I don't know what sort algorithm UltraEdit employs, so I mention the first step as a way to guarantee that the relative order of the resultant 01 records is preserved. Of course if you don't care, then the first step is entirely unnecessary.
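
The same "number the lines, then sort on record type and line number" idea can be sketched in Python. The function name group_by_record_type is made up, and the sketch works on a list in memory, so it only illustrates the principle and is not meant for multi-gigabyte files.

```python
def group_by_record_type(lines):
    """Sort lines by the record type at positions 12/13, keeping original order within a type."""
    # pair each line with its original position, like the inserted line numbers
    numbered = list(enumerate(lines))
    # primary key: record type; secondary key: original line number
    numbered.sort(key=lambda pair: (pair[1][11:13], pair[0]))
    # drop the numbers again, like the final column-mode delete
    return [line for _, line in numbered]
```

Python's built-in sort is stable, so the line number is not strictly needed as a secondary key there; it is included to mirror the steps described above, where the sort algorithm is unknown.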

"share and enjoy"