How to speed up Convert to Fixed Column?

XenonSurf · May 31, 2012, 17:00 UTC · #1

Hi,

I have a huge text document with 300,000 segments (lines) and 5 tab separators per line, like this:

Text...[TAB]...Text...[TAB]...Text...[TAB]...Text...[TAB]...Text...[TAB]...Text...[RETURN]

I want to select *single* instances of "Text...[TAB]" and put them in another doc to be replaced later on.
So, I chose Column > Convert to Fixed Column, where I first scan and then use ^t as separator.
The problem is that it has been running for about 20 minutes now, even while I am writing this message...
My questions are:

- Is my procedure right, or is there another way to select only the text from TAB to TAB?
- Is there a setting in the Configuration that would speed up the operation?

Thanks very much for any reply, as I must edit such files on a regular basis!

Salutations,
XenonS

Mofi · Jun 01, 2012, 05:55 UTC · #2

The scan feature reads the entire file to find the longest text in data column 1, 2, 3, ... You can speed up the whole process if you know the maximum length of every data column: specify the lengths directly and don't use the scan feature.
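If the maximum lengths are not known in advance, they could be determined once outside the editor. The following is only a minimal Python sketch of such a pass, not something UltraEdit provides: the function name `max_column_widths` and the sample lines are made up for illustration, and it counts characters, assuming that is the unit the column lengths are specified in.

```python
def max_column_widths(lines, separator="\t"):
    """Return the longest field length (in characters) seen in each column."""
    widths = []
    for line in lines:
        fields = line.rstrip("\r\n").split(separator)
        # Grow the list when a line has more columns than seen so far.
        widths.extend([0] * (len(fields) - len(widths)))
        for index, field in enumerate(fields):
            widths[index] = max(widths[index], len(field))
    return widths


# Hypothetical sample data standing in for the real file.
sample_lines = ["ab\tc\tdefg\n", "a\tcde\td\n"]
widths = max_column_widths(sample_lines)
print(widths)

# Rough guess only, ASSUMING every field is padded to its column maximum:
# each converted line then holds at least this many characters.
padded_chars_per_line = sum(widths)
print(padded_chars_per_line)
```

The function accepts any iterable of lines, so an open file object can be passed instead of the sample list and the file is read line by line rather than loaded as a whole.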

There is the power tip "Large file text editor", and some other articles about working with large/huge files can be found in the forum.

But Convert to Fixed Column results in millions of small changes to the file contents, and that takes time to finish. If you have a lot of RAM, you could use a RAM disk. Creating a RAM disk with a batch file, copying the large file onto it and opening it from there in UltraEdit without using a temporary file would surely speed up this conversion dramatically.

XenonSurf · Jun 01, 2012, 14:11 UTC · #3

Hi Mofi,
thanks a lot for the details.
Yes, I have to accept that this operation takes time. Although my txt file is 90 MB in size, I can see in the status bar at the lower right of UE that the file size is growing LARGE; in the Temp folder it quickly goes to 1 GB and more... I guess that's because of all the trailing spaces that must be added to fix the columns... I even wonder whether I have the necessary HD space to hold a modified doc like that.

Is there any other method to select text in multiple lines from TAB to TAB without converting to fixed columns?

Thanks,
XenonS

Mofi · Jun 02, 2012, 12:34 UTC · #4

XenonSurf wrote: Is there any other method to select text in multiple lines from TAB to TAB without converting to fixed columns?

No. But CSV files can be easily edited using regular expression replaces. For example, if you want just data columns 3 and 5 from a CSV file with 8 data columns, from line 500 to 1000, you can first copy lines 500 to 1000 into a new file and then use a tagged regular expression Replace All to delete everything except data columns 3 and 5. For the example I use the semicolon as separator character because tab characters are displayed as space(s) in browsers.

Code:

data col1;data col2;data col3;data col4;data col5;data col6;data col7;data col8
data col1;;data col3;;data col5;;data col7;data col8
data col1;data col2;;data col4;data col5;data col6;data col7;data col8

The Perl regular expression search string to delete everything except data columns 3 and 5 in the above CSV file with a semicolon as separator is ^(?:.*?;){2}(.*?;).*?;(.*?);.*$ and the replace string is \1\2. The result is:

Code:

data col3;data col5
data col3;data col5
;data col5
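For readers who want to check the result outside UltraEdit, here is a minimal sketch that applies the same search and replace strings with Python's `re` module. This is an assumption-laden stand-in, not the editor's own engine: Python's syntax is close to Perl's for the constructs used here, and the `re.MULTILINE` flag is needed so that ^ and $ match at every line rather than only at the start and end of the whole string.

```python
import re

# Search and replace strings copied from the post.
# re.MULTILINE makes ^ and $ match at the start and end of every line.
PATTERN = re.compile(r"^(?:.*?;){2}(.*?;).*?;(.*?);.*$", re.MULTILINE)
REPLACEMENT = r"\1\2"

sample = "\n".join([
    "data col1;data col2;data col3;data col4;data col5;data col6;data col7;data col8",
    "data col1;;data col3;;data col5;;data col7;data col8",
    "data col1;data col2;;data col4;data col5;data col6;data col7;data col8",
])

# Replace All: every line is reduced to column 3, a semicolon and column 5.
result = PATTERN.sub(REPLACEMENT, sample)
print(result)
```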

Explanation for the search string:

^ ... start search at beginning of a line.

.*? ... matches 0 or more characters of any value except the line terminating characters carriage return and line-feed. The . means any character except line terminating characters. * is the multiplier and means 0 or more. The question mark after .* tells the Perl regular expression engine to match as little as possible up to the next fixed character, which is a semicolon in this example. Because . also matches a semicolon, the question mark is needed to avoid matching up to one of the later semicolons in the line instead of the next one.

; is the fixed separator character.

(?:.*?;) ... the expression explained above is put into round brackets to build a group. This group is used only for repeating the expression as explained next. ?: immediately after the opening round bracket tells the Perl regular expression engine not to tag the string found by the expression inside the brackets. This is called a non-capturing or non-tagging group. This will become clearer after reading the complete explanation.

{2} ... means that the preceding expression in the non-tagging group should be applied two times. Therefore data columns 1 and 2 are matched by (?:.*?;){2}.

(.*?;) ... .*?; again matches 0 or more characters up to the next semicolon, as above, and this expression is also enclosed in round brackets. The difference is the missing ?: immediately after the opening round bracket. This means that the Perl regular expression engine should tag the string found by this part of the search expression. This part of the found string is referenced in the replace string by \1.

.*?; ... the by now well-known expression, here matching data column 4.

(.*?) ... matches data column 5, this time without the semicolon. Again the string found by this expression is tagged, and as this is the second tagged group, this part of the found string is referenced in the replace string by \2.

;.*$ ... matches the semicolon after the data of the fifth column and everything up to the end of the line, without matching the line terminating characters themselves.

So the search string always matches everything within a line, but after the replace only the data of column 3, the semicolon and the data of column 5 remain. Everything else is removed from all lines.

Instead of ; the expression \t must be used for a CSV file using the tab character as separator.
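As an illustration of the tab variant, here is a minimal Python sketch that applies the expression with \t line by line, so a large file never has to be held in memory as a whole. The function name `keep_columns_3_and_5` and the sample rows are made up for this example; Python's `re` module is assumed to behave like the Perl engine for these constructs, except that its . also matches a carriage return, which is why the line ending is stripped first.

```python
import re

# The search string from the post with \t in place of the semicolon.
TAB_PATTERN = re.compile(r"^(?:.*?\t){2}(.*?\t).*?\t(.*?)\t.*$")


def keep_columns_3_and_5(lines):
    """Yield each line reduced to data column 3, a tab and data column 5.

    Lines with fewer than 5 tabs do not match and are yielded unchanged.
    """
    for line in lines:
        # Strip the line ending first: Python's . would match a carriage return.
        text = line.rstrip("\r\n")
        yield TAB_PATTERN.sub(r"\1\2", text)


# Hypothetical rows with 6 data columns (5 tabs), as in the original question.
rows = ["a\tb\tc\td\te\tf\r\n", "a\t\tc\t\te\t\n", "too\tshort\n"]
reduced = list(keep_columns_3_and_5(rows))
print(reduced)
```

Because the function is a generator, an open file object can be passed in place of the sample list and the reduced lines can be written out one at a time.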

That is not very easy to read, I know. But after some practice you will find it very easy to apply the few regular expression patterns needed, in various ways, to reformat a CSV file according to your current requirements.

XenonSurf · Jun 02, 2012, 17:33 UTC · #5

Hi Mofi,

Wow,...
Thank you very much, Mofi, for all that. This is an exceptionally useful reply, especially for me, as I find more advanced RegEx handling challenging.
I will dedicate extensive time to go through all this!

Thanks for your time!
Greets,
XenonS