Find in files blocks (containing 1 or more words) and delete them
Finding and deleting blocks which always have the same content is no problem: just select the block before opening the Replace or Replace in Files dialog to search for this block and replace it with nothing.
Finding and deleting blocks with varying content, but a fixed number of lines, is also quite simple with a regular expression search. Such replaces are usually no big challenge even for users with little practice in regular expression finds and replaces.
But deleting blocks with varying content and an undetermined number of lines is very often a big problem, even for users with practice in regular expressions.
However, UltraEdit offers multiple methods for finding, selecting and deleting such blocks and I want to briefly explain them here.
Note 1:
All regular expressions below are for text files with DOS or UNIX line terminators. The expressions as written do not work on text files containing only carriage returns, i.e. older MAC files. But the expressions do work for files with MAC line terminators after replacing
^r++^n by just
^r respectively
\r*\n by just
\r.
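As a quick cross-check outside UltraEdit, here is a minimal Python sketch of what the Unix/Perl variant of the line terminator expression matches. Python's re module is only a stand-in here, not one of UltraEdit's engines, and the sample strings are made up for illustration.

```python
import re

# \r*\n matches a DOS (CR+LF) or UNIX (LF) line terminator.
line_end = re.compile(r"\r*\n")
# For old MAC files (CR only) the expression is reduced to just \r.
mac_line_end = re.compile(r"\r")

dos_text = "one\r\ntwo\r\n"
unix_text = "one\ntwo\n"
mac_text = "one\rtwo\r"

dos_count = len(line_end.findall(dos_text))            # terminators found in DOS text
unix_count = len(line_end.findall(unix_text))          # terminators found in UNIX text
mac_count = len(line_end.findall(mac_text))            # nothing matches without a line-feed
mac_fixed_count = len(mac_line_end.findall(mac_text))  # the reduced expression matches again

print(dos_count, unix_count, mac_count, mac_fixed_count)
```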
Note 2:
The
block starting string and
block ending string in the regular expression search strings below should not contain characters which have a special meaning for the regular expression engine in use. If such characters nevertheless exist in those 2 strings, escape them with either
^ (UltraEdit) or
\ (Unix/Perl) inserted to the left of each character with a special regular expression meaning.
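To illustrate why the escaping matters, here is a small sketch in Python. It assumes Python's re module as a stand-in for a backslash-escaping engine, and the string with the dot is a made-up example. re.escape does automatically what has to be done by hand in UltraEdit's Find dialog.

```python
import re

# Hypothetical block starting string containing '.', which is a special character.
start = '<item id="a.b">'

# re.escape inserts a backslash to the left of every special character.
escaped = re.escape(start)

# Unescaped, the dot matches any character, so a wrong line is found too.
unescaped_hits_wrong_line = re.search(start, '<item id="axb">') is not None

# Escaped, only the literal string is found.
escaped_hits_wrong_line = re.search(escaped, '<item id="axb">') is not None
escaped_hits_right_line = re.search(escaped, '<item id="a.b">') is not None

print(unescaped_hits_wrong_line, escaped_hits_wrong_line, escaped_hits_right_line)
```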
Find and select everything between two strings
First, there is the
Find Select feature. Holding the SHIFT key while pressing the button
Previous or
Next (or
Find Next) in the Find dialog executes a find upwards or downwards and selects everything from the current position of the caret to the end of the found string. The
Find Select feature also works with the commands
Find Next and
Find Previous when the SHIFT key is held while executing the command.
An existing selection is extended or reduced when using the
Find Select feature.
An existing selection is extended if the caret is blinking at the top of the selection because the selection was made upwards and the find is also executed upwards, or if the caret is blinking at the bottom of the selection because the selection was made downwards and the find is also executed downwards.
An existing selection is reduced, and perhaps completely replaced by a new selection, if the find is executed in the opposite direction to the one in which the selection was made. The new selection starts where the existing selection started.
The
Find Select feature can be used to select a block of any size. There is no limit in number of lines or number of characters (bytes).
Selecting and then deleting blocks multiple times in a file using the
Find Select feature should not be done manually. It is better to write a small macro for this task. Here is a template for the macro:
InsertMode
ColumnModeOff
UnixReOff
Top
Loop 0
Find "block starting string"
IfNotFound
ExitLoop
EndIf
Key HOME
IfColNumGt 1
Key HOME
EndIf
Find RegExp Select "block ending string*^r++^n"
IfSel
Delete
Else
ExitLoop
EndIf
EndLoop
Top
This little macro first runs in a loop a simple find for the
block starting string. The loop is exited if this string cannot be found anymore between the current position of the caret and the end of the file. The caret is moved to the beginning of the line containing the
block starting string. Next a second find is executed, this time an UltraEdit regular expression find searching for the
block ending string and also matching everything up to the end of the line including the DOS or UNIX line terminator. This find additionally selects everything from the beginning of the line with the
block starting string. The selected block is deleted if the
block ending string was found in the file. Otherwise the loop is exited and the caret is moved back to the top of the file.
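For readers who find the control flow easier to follow in a general purpose language, the loop can be sketched in Python as below. This is only an approximation of the macro's logic under my own assumptions, not UltraEdit code: the function name delete_blocks and its parameters are hypothetical, both strings are treated as plain text, and only DOS or UNIX line terminators are handled.

```python
def delete_blocks(text: str, start: str, end: str) -> str:
    """Hypothetical Python stand-in for the macro loop (LF or CRLF text only)."""
    pos = 0
    while True:
        # Find "block starting string"
        found = text.find(start, pos)
        if found == -1:
            break  # IfNotFound / ExitLoop
        # Key HOME: go back to the beginning of the line
        line_start = text.rfind("\n", 0, found) + 1
        # Find RegExp Select "block ending string*^r++^n"
        end_found = text.find(end, line_start)
        if end_found == -1:
            break  # nothing selected, so the loop is exited
        newline = text.find("\n", end_found + len(end))
        if newline == -1:
            break  # no line terminator after the ending string, no match
        # Delete the selection (from line start through the line terminator)
        text = text[:line_start] + text[newline + 1:]
        pos = line_start
    return text


sample = (
    "keep 1\r\n"
    "  BEGIN block\r\n"
    "  anything\r\n"
    "  END block\r\n"
    "keep 2\r\n"
)
remaining = delete_blocks(sample, "BEGIN", "END")
print(repr(remaining))
```

Because the block is cut out with plain string slicing, the size of a block does not matter here, which mirrors the one real advantage of the Find Select method.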
However, using the
Find Select feature has some disadvantages in comparison to the other methods explained below:
- Always creating a macro for finding and replacing blocks is time consuming.
- This method cannot be easily extended to run on all files of a directory or all opened files.
So there is really only one advantage in comparison to the other methods: unlimited block size.
Match everything between two strings - greedy
Most often the blocks to find and delete are small and have just a few kilobytes. In this case it is more efficient to run a regular expression replace all to find and delete the blocks.
With the
UltraEdit regular expression engine the search string is:
%*block starting string[~#]++block ending string*^r++^n
With the
Unix/Perl regular expression engine the search string is:
^.*block starting string[^#]*block ending string.*\r*\n
The replace string is simply an empty string.
The expression
[~#]++ respectively
[^#]* matches 0 or more characters not being
# which should not exist anywhere within the block to find and delete. It does not matter which character is used in this negative character set definition. So it can be also any other character which definitely never exists within the blocks to delete.
With such a regular expression replace it is much easier to find and delete blocks in multiple files.
But there is one big problem here: the expression
[~#]++ respectively
[^#]* is greedy.
What does greedy mean?
Let me explain this with a small XML like example.
<resource type="text">
<text>any text</text>
<lang>en</lang>
</resource>
<resource type="bitmap">
<bitmap>file.bmp</bitmap>
<lang>neutral</lang>
</resource>
<resource type="text">
<text>anderer Text</text>
<lang>de</lang>
</resource>
The target is to delete all resource blocks of type text. So for example with the UltraEdit regular expression engine the search string would be:
%*<resource type="text"[~#]++</resource>*^r++^n
If you first run a find in the active file to verify that the blocks are correctly identified and selected by this expression, you will most likely see something you did not expect. Instead of finding first the upper text resource block and next the second text resource block, the find selects everything. So the find does not stop on the first occurrence of the string
</resource>.
And that is what greedy means. A greedy expression matches as many characters as possible while still getting a positive result for the entire search expression.
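The effect can be reproduced outside UltraEdit with a few lines of Python. This is just a sketch using Python's re module as a stand-in for the Unix/Perl syntax; the (?m) flag is my addition because Python needs it for ^ to match at the beginning of every line.

```python
import re

sample = (
    '<resource type="text">\n'
    '<text>any text</text>\n'
    '<lang>en</lang>\n'
    '</resource>\n'
    '<resource type="bitmap">\n'
    '<bitmap>file.bmp</bitmap>\n'
    '<lang>neutral</lang>\n'
    '</resource>\n'
    '<resource type="text">\n'
    '<text>anderer Text</text>\n'
    '<lang>de</lang>\n'
    '</resource>\n'
)

# Greedy template: [^#]* runs to the end of the file and then backtracks
# only as far as the LAST </resource>.
greedy = re.compile(r'(?m)^.*<resource type="text"[^#]*</resource>.*\r*\n')

match = greedy.search(sample)
matched_everything = match is not None and match.group(0) == sample
print(matched_everything)
```

The single match spans all three blocks, so a replace all with an empty replace string would also delete the bitmap resource.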
Sometimes a greedy expression is wanted, for example on a file containing just a list of file names with full path where the path should be removed from all lines. With the greedy expression
%?++\ (UE) or
^.*\\ (Perl) it is very easy to remove all characters from all lines up to and including last backslash of the path.
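A short sketch of this useful kind of greediness, again with Python's re module standing in for the Perl syntax. The file names are invented, and (?m) is added so that ^ matches at the beginning of every line in Python.

```python
import re

# Hypothetical list of file names with full path, one per line.
paths = "C:\\Temp\\Test\\file1.txt\nC:\\Temp\\file2.txt\n"

# Greedy .* runs to the end of the line and backtracks to the LAST backslash.
names = re.sub(r"(?m)^.*\\", "", paths)
print(names)
```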
But very often a greedy expression is not wanted, especially not on finding blocks for deletion.
A non greedy expression is very often needed, one which matches as few characters as possible while still getting a positive result for the entire search expression.
Unfortunately, it is not possible with the legacy UltraEdit and Unix regular expression engines to define a non greedy expression for finding blocks.
So the expression templates written here can be safely used only if you are 100% sure that the block to find and delete exists only once in a file and that especially the
block ending string never exists more than once in a file.
Match everything between two strings - non greedy
Well, a non greedy expression for finding blocks is not possible with the legacy UltraEdit or Unix regular expression engines. But since UltraEdit version 12.00 there is a third one, the Perl compatible regular expression engine. And this regular expression engine offers a very simple method to change a greedy expression into a non greedy one: a
question mark must be appended on the greedy expression to change it to a non greedy expression.
With the
Perl regular expression engine the non greedy search string is:
^.*block starting string[^#]*?block ending string.*\r*\n
That's it, really?
Yes, that's it. But with the Perl regular expression engine it is possible to change the expression
[^#]*? to be usable more generally by using instead
.*? which matches any character except new line characters 0 or more times non greedy.
Except new line characters for matching a block?
Well, a point matches by default only all characters except a carriage return or a line-feed. But needed is here a different behavior as multiple lines should be matched by the expression and therefore also line ending characters must be matched by the expression
.*?
The good news: the behavior on what is matched by a point can be modified by using
(?s) in the search string. This is a special expression which tells the Perl regular expression engine to match also line terminating characters for every point meaning now really any character. Usually this special flag expression is used at beginning of the search string.
The non greedy search string could be therefore also:
(?s)^.*?block starting string.*?block ending string.*?\r*\n
As you can see there are two more
? in the search string now. This is necessary as all points match now also carriage returns and line-feeds and therefore the other two
.* for matching just the characters from beginning of a line to
block starting string and from
block ending string to end of the line would be greedy and the result would be a match of the entire file.
Safer against a wrong matching would be in this case with point matching also line terminating characters:
(?s)^[^\r\n]*block starting string.*?block ending string[^\r\n]*\r*\n
[^\r\n] matches any character except a carriage return or a line-feed and is therefore exactly what a point matches without the flag
(?s) at beginning of the search string.
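The following sketch runs two of the non greedy variants against the XML example from above. It uses Python's re module as a stand-in, which is an assumption on my side and not UltraEdit's Perl engine; in Python the m flag must be added so that ^ matches at the beginning of every line.

```python
import re

sample = (
    '<resource type="text">\n'
    '<text>any text</text>\n'
    '<lang>en</lang>\n'
    '</resource>\n'
    '<resource type="bitmap">\n'
    '<bitmap>file.bmp</bitmap>\n'
    '<lang>neutral</lang>\n'
    '</resource>\n'
    '<resource type="text">\n'
    '<text>anderer Text</text>\n'
    '<lang>de</lang>\n'
    '</resource>\n'
)

# Variant 1: negative character set made non greedy with an appended ?
lazy_set = re.compile(r'(?m)^.*<resource type="text"[^#]*?</resource>.*\r*\n')

# Variant 2: (?s) lets the point match line terminators as well; [^\r\n]*
# keeps the parts before and after the block on a single line.
lazy_dot = re.compile(
    r'(?sm)^[^\r\n]*<resource type="text".*?</resource>[^\r\n]*\r*\n'
)

result_set = lazy_set.sub("", sample)
result_dot = lazy_dot.sub("", sample)
print(result_dot)
```

Both variants stop at the first </resource> after each block start, so only the bitmap resource block is left.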
Find and delete only blocks containing a string
Especially for XML files there is often the task to find and delete blocks containing a specific string.
This additional requirement makes the block finding expression a real challenge as it must be avoided that the Perl regular expression engine starts the matching on beginning of block X not containing the specific string and ends the matching on end of block Y really containing the specific string.
The
Perl regular expression search string for finding a block containing a specific string is:
(?s)^[^\r\n]*block starting string(?:(?!block ending string).)*?specific string.*?block ending string[^\r\n]*\r*\n
Deleting all text resource blocks with German text on the XML example above could be done with the search string
(?s)^[^\r\n]*<resource type="text"(?:(?!</resource>).)*?<lang>de</lang>.*?</resource>[^\r\n]*\r*\n
and an empty replace string resulting in
<resource type="text">
<text>any text</text>
<lang>en</lang>
</resource>
<resource type="bitmap">
<bitmap>file.bmp</bitmap>
<lang>neutral</lang>
</resource>
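This result can be verified with a small Python sketch. Python's re module is used as a stand-in for the Perl engine, with the m flag added so that ^ matches at every line start. The naive expression without the negative lookahead is my own addition to show the wrong matching from block X to block Y described above.

```python
import re

sample = (
    '<resource type="text">\n'
    '<text>any text</text>\n'
    '<lang>en</lang>\n'
    '</resource>\n'
    '<resource type="bitmap">\n'
    '<bitmap>file.bmp</bitmap>\n'
    '<lang>neutral</lang>\n'
    '</resource>\n'
    '<resource type="text">\n'
    '<text>anderer Text</text>\n'
    '<lang>de</lang>\n'
    '</resource>\n'
)

# Naive: nothing stops .*? from running past the end of the first block.
naive = re.compile(
    r'(?sm)^[^\r\n]*<resource type="text".*?<lang>de</lang>'
    r'.*?</resource>[^\r\n]*\r*\n'
)

# With (?:(?!</resource>).)*? every consumed character is checked first:
# it must not be the beginning of the block ending string.
guarded = re.compile(
    r'(?sm)^[^\r\n]*<resource type="text"(?:(?!</resource>).)*?<lang>de</lang>'
    r'.*?</resource>[^\r\n]*\r*\n'
)

naive_result = naive.sub("", sample)      # starts in block 1, ends in block 3
guarded_result = guarded.sub("", sample)  # removes only the German block
print(guarded_result)
```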
It is even possible to use a non marking OR expression to find and delete blocks containing one string of a list of specific strings.
(?s)^[^\r\n]*block starting string(?:(?!block ending string).)*?(?:string 1|string 2|string 3).*?block ending string[^\r\n]*\r*\n
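When the strings change often, the template can be assembled by a small helper. The Python sketch below is one possible way to do that; the function build_block_pattern and its parameter names are hypothetical, and Python's re module again only stands in for the Perl engine. All strings are escaped automatically, so they may contain special characters.

```python
import re


def build_block_pattern(start, end, needles):
    """Hypothetical helper: fill the template with literal (escaped) strings."""
    any_needle = "|".join(re.escape(needle) for needle in needles)
    return re.compile(
        r"^[^\r\n]*" + re.escape(start)
        + r"(?:(?!" + re.escape(end) + r").)*?"  # stay inside the current block
        + r"(?:" + any_needle + r")"             # non marking OR expression
        + r".*?" + re.escape(end) + r"[^\r\n]*\r*\n",
        re.DOTALL | re.MULTILINE,                # same as (?s) plus ^ per line
    )


sample = (
    '<resource type="text">\n'
    '<text>any text</text>\n'
    '<lang>en</lang>\n'
    '</resource>\n'
    '<resource type="bitmap">\n'
    '<bitmap>file.bmp</bitmap>\n'
    '<lang>neutral</lang>\n'
    '</resource>\n'
    '<resource type="text">\n'
    '<text>anderer Text</text>\n'
    '<lang>de</lang>\n'
    '</resource>\n'
)

pattern = build_block_pattern(
    '<resource type="', '</resource>',
    ['<lang>de</lang>', '<lang>neutral</lang>'],
)
remaining = pattern.sub("", sample)
print(remaining)
```

Here every resource block is a candidate, and the two blocks containing one of the listed language strings are removed.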
I don't want to explain here how this regular expression string works although I know it. It is very difficult to understand even for regular expression experts and therefore not easy to explain to people not knowing much about Perl regular expressions.
ATTENTION:
The methods for selecting and deleting a block using a regular expression replace command do not work for blocks of unlimited size. None of the 3 offered regular expression engines supports matching very large blocks respectively strings. So use this method only for blocks of just some kilobytes. I once made several attempts to find out the limit, but could not even determine the limiting criteria as I saw different results depending on the search string and the content of the files. However, blocks smaller than 64 kilobytes are certainly never a problem.