Search and Delete text between two different words

cmaxnavy · May 21, 2009, 15:54 UTC · #1

I'm new to UltraEdit and I'm using v14.20. I have been through the many help files and forum comments and cannot solve my search problem.

I'm trying to search a large text doc and delete text between word1 and word2. In other words, I want to highlight and delete all of the text starting at word1 and ending at word2, including the searched words (word1, word2). I have to do this many times in the document. So, I think a recursive routine would be helpful if that's possible. Otherwise, I think a macro that I invoke multiple times to complete the tasks would help.

Any ideas?

Max

pietzcker · May 21, 2009, 20:11 UTC · #2

This sounds like you don't need a macro at all - just one single search and replace routine, using a Perl regular expression.

Open the "Replace" dialog, check the check box "Regular Expressions" and set the radio button "Perl regular expression" in the "Advanced" section of that dialog. Then search for

(?s)\bword1\b.*?\bword2\b

and replace all with nothing.

Caution: This fails if word1/word2 pairs can be nested, e. g. "word1 text text word1 text text word2 text text word2".

Also take care if your word1/word2 contains characters that are special in regular expressions, like .*[]\+? and a few others. In that case, please be more specific about your exact words.
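To illustrate the caution about special characters, here is a minimal sketch in Python rather than in UltraEdit itself. Python's `re` engine is not the engine UltraEdit uses, and the marker strings `[start]` and `a.b` as well as the helper `build_pattern` are made up for this example. In the Replace dialog the same effect is achieved by typing a backslash in front of each special character.

```python
import re

def build_pattern(word1: str, word2: str) -> re.Pattern:
    """Build a 'from word1 through word2' pattern with both markers escaped."""
    # re.escape puts a backslash in front of regex metacharacters such as . * [ ] \ + ?
    # \b is left out here because the hypothetical markers do not start and end
    # with word characters.
    return re.compile(re.escape(word1) + r".*?" + re.escape(word2), re.DOTALL)

text = "keep [start] drop axb drop a.b keep too"

escaped = build_pattern("[start]", "a.b")
escaped_result = escaped.sub("", text)

# Without escaping, the dot in "a.b" matches any character, so the match
# already ends at "axb" and leaves unwanted text behind.
unescaped = re.compile(r"\[start\].*?a.b", re.DOTALL)
unescaped_result = unescaped.sub("", text)

print(repr(escaped_result))
print(repr(unescaped_result))
```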

Explanation:

(?s) allows searches to span multiple lines, i.e. the dot also matches newline characters.

\b matches a word boundary, so if your word1 is cat, only cat will match and not advocate.

. matches any character including newlines, thanks to (?s) at beginning of search string.

* allows for any number of matches (including zero).

? makes the * lazy so that it will only match as much as is absolutely necessary. This is mandatory because otherwise, in the text "word1 delete this word2 don't delete this word1 delete this word2" the regular expression would match from the very first word1 to the very last word2, deleting everything in-between.
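The pieces explained above can be tried out in a small Python sketch. This assumes Python's `re` module, which accepts the same expression but is not the engine behind UltraEdit's Perl mode, so treat it as an illustration of the behavior rather than a test of UltraEdit.

```python
import re

# Same expression as in the post; the (?s) flag sits at the very start.
LAZY = re.compile(r"(?s)\bword1\b.*?\bword2\b")
# Same expression without the ?, for comparison only.
GREEDY = re.compile(r"(?s)\bword1\b.*\bword2\b")

sample = "word1 delete this word2 don't delete this word1 delete this word2"

lazy_result = LAZY.sub("", sample)      # removes each word1...word2 pair separately
greedy_result = GREEDY.sub("", sample)  # removes from the first word1 to the last word2

# (?s): the match may span a line break.
multiline_result = LAZY.sub("", "a word1 x\ny word2 b")

# \b: "cat" inside "advocate" is not matched as a whole word.
whole_word_hits = re.findall(r"\bcat\b", "advocate cat")
substring_hits = re.findall(r"cat", "advocate cat")

print(repr(lazy_result))
print(repr(greedy_result))
print(repr(multiline_result))
print(whole_word_hits, substring_hits)
```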

HTH,
Tim

ridgerunner · Jun 12, 2009, 22:50 UTC · #3

By adding a little negative lookahead to the ".*?" portion of pietzcker's Perl-style regex, you can match the innermost pair in nested instances such as "word1 word1 blah blah word2 word2" as follows:

(?s)\bword1\b(?:(?!\bword1\b).)*?\bword2\b

Then you can run this regex repeatedly (Replace All until nothing more is found) to remove nested word1-word2 instances from the inside out.
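The inside-out removal can be sketched in Python as a loop that repeats the replace until no match is left. This is only an illustration under the assumption that Python's `re` behaves like UltraEdit's Perl mode for this expression; the helper name `remove_nested` and the sample text are made up for the example.

```python
import re

# Expression from the post: the lookahead stops the match from running
# across another word1, so only an innermost word1...word2 pair can match.
INNERMOST = re.compile(r"(?s)\bword1\b(?:(?!\bword1\b).)*?\bword2\b")
# The simpler expression from earlier in the thread, for comparison.
SIMPLE = re.compile(r"(?s)\bword1\b.*?\bword2\b")

def remove_nested(text: str) -> str:
    """Repeat the replace until a pass removes nothing."""
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:
            return text

sample = "A word1 x word1 y word2 z word2 B"

nested_result = remove_nested(sample)  # both levels removed
simple_result = SIMPLE.sub("", sample)  # leaves a stray word2 behind

print(repr(nested_result))
print(repr(simple_result))
```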

pietzcker · Jun 13, 2009, 09:26 UTC · #4

Cool.

zrob · Oct 22, 2010, 15:31 UTC · #5

Hi,

I would like to delete everything between word1 and a semicolon. I've modified the Perl regular expression

(?s)\bword1\b.*?\bword2\b

which works fine with two 'real' words, into:

(?s)\bword1\b.*?\b;\b

But that doesn't work. What do I have to add to make it accept the semicolon?

Thanks

Rob

Bracket · Oct 22, 2010, 16:32 UTC · #6

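The reason your modification isn't working is that "\b" matches a word boundary, i.e. a position between a word character and a non-word character. A semicolon is itself a non-word character, so "\b;\b" only matches a semicolon that has a word character directly on both sides, and a semicolon followed by a space or a line break never qualifies. If you want to make this work, you need to remove the "\b" on either side of the semicolon, which means using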

(?s)\bword1\b.*?;
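How \b behaves next to a semicolon can be checked with a short Python sketch. Python's `re` is assumed here to treat \b the same way as UltraEdit's Perl mode for this case, and the sample strings are made up for the example.

```python
import re

BROKEN = re.compile(r"(?s)\bword1\b.*?\b;\b")
FIXED = re.compile(r"(?s)\bword1\b.*?;")

sample = "int word1 = 5; next"

# The semicolon is followed by a space, so there is no word boundary after it
# and the expression with \b;\b finds nothing.
broken_match = BROKEN.search(sample)

# Without the \b around the semicolon the text is removed as intended.
fixed_result = FIXED.sub("", sample)

# \b;\b only succeeds when word characters touch the semicolon on both sides.
squeezed_match = BROKEN.search("word1 a;b")

print(broken_match)
print(repr(fixed_result))
print(squeezed_match.group())
```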

Aguilucho · Aug 25, 2015, 22:53 UTC · #7

Could somebody tell me why not to try a simple UltraEdit expression for the whole document at once?

word1*word2 Replace with 'nothing' if the string is on a single line only, and
word1[~|]+word2 Replace with 'nothing' if the string spans any number of lines.

This assumes a character that does not occur in the document, like | or any other.
At least, that works for me.

Mofi · Aug 26, 2015, 17:56 UTC · #8

The main problem with UltraEdit syntax on this task is that there is no real control over matching behavior. With Perl, .* and .+ are by definition greedy, while .*? and .+? are by definition non-greedy.

What do greedy and non-greedy mean?

An example for demonstration:

word1 some other words word2 and more words word1 and four more words word2

UltraEdit regular expression word1*word2 is non greedy which means this expression first matches

word1 some other words word2

and second

word1 and four more words word2

and what remains after the replace is the string

and more words

UltraEdit regular expression word1?+word2 (single line only) or word1[~|]+word2 (multi-line) is greedy, which means this expression matches the complete line in one go

word1 some other words word2 and more words word1 and four more words word2

and leaves nothing behind after replace.

Non-greedy means match as little as possible to get a positive match for the entire expression.

Greedy means match as much as possible to get a positive match for the entire expression.

The multi-line UE greedy search string would really try to match the entire file if that is possible (file not too large) and the file starts with word1 and ends with word2.

Another example for greedy and non greedy matching behavior is removing paths from file names in a file containing one file name with path per line, for example:

C:\Directory 1\Directory 2\File Name.txt

Non greedy UE expression %*\ and non greedy Perl expression ^.*?\\ match only C:\ while greedy UE expression %?++\ and greedy Perl expression ^.*\\ match C:\Directory 1\Directory 2\ which is necessary to remove entire path and get only name of file.
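The two Perl expressions from the path example can be compared in a short Python sketch. Python's `re` stands in for UltraEdit's Perl mode here, which is an assumption, and the second sample with two file names is made up for the example.

```python
import re

line = r"C:\Directory 1\Directory 2\File Name.txt"

# Non-greedy: stops at the first backslash, so only "C:\" is removed.
lazy_result = re.sub(r"^.*?\\", "", line)

# Greedy: runs to the last backslash, so the whole path is removed.
greedy_result = re.sub(r"^.*\\", "", line)

# With one file name per line, re.MULTILINE lets ^ match at every line start.
# The dot does not cross line breaks here because (?s) is not used.
listing = "C:\\A\\one.txt\nD:\\B\\C\\two.txt"
names_only = re.sub(r"^.*\\", "", listing, flags=re.MULTILINE)

print(lazy_result)
print(greedy_result)
print(names_only)
```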

Also, with Perl there are \b (any word boundary), \< (beginning of word) and \> (end of word) to make sure that word1 and word2 are ignored when they do not match entire words, as Tim has already explained in his first post.