Deleting lines containing more than a specified number of words within a tag

greg1234 · May 10, 2012, 17:27 UTC · #1

Hi,
I'm trying to find and delete lines that contain more than a specified number of words.
For example, I would like to delete lines containing more than 5 words (separated by spaces) between the tags <tag> and </tag> from a file:

<tag>this is first sentence</tag><tag2>....</tag2>
<tag>this is the second sentence</tag><tag2>....</tag2>
<tag>this is the very last sentence</tag><tag2>....</tag2>

In the above example the second and third line should be deleted.

I was able to find and delete lines containing a fixed number of words (3 in this example) using the pattern:

<tag>^([0-9a-zA-Z/()]+^) ^([0-9a-zA-Z/()]+^) ^([0-9a-zA-Z/()]+^)</tag><tag2>....</tag2>

But how can I do this for lines containing, for example, three or more words?
Please advise,
Greg

Mofi · May 11, 2012, 06:12 UTC · #2

That can be done only with the Perl regular expression engine. The search string to use is ^.*<tag>(?:\<\w+ *){6,}</tag>.*\r\n
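For anyone who wants to try the idea outside UltraEdit, here is a minimal Python sketch. It is an assumed translation and not the search string above verbatim: Python's re module has no \< (start of word) token, so \b directly followed by \w+ is used in its place. The sample text and the name LONG_LINE are made up for illustration.

```python
import re

# Assumed translation of the search string: Python's re has no \< token,
# so \b directly followed by \w+ stands in for "start of word".
LONG_LINE = re.compile(r"^.*<tag>(?:\b\w+ *){6,}</tag>.*\r\n", re.MULTILINE)

# Hypothetical sample with DOS line endings, as the \r\n in the pattern expects.
sample = (
    "<tag>this is first sentence</tag><tag2>....</tag2>\r\n"
    "<tag>this is the second sentence</tag><tag2>....</tag2>\r\n"
    "<tag>this is the very last sentence</tag><tag2>....</tag2>\r\n"
)

# Replacing every match with an empty string deletes the matching lines.
result = LONG_LINE.sub("", sample)
print(result)
```

With {6,} only lines with six or more words inside <tag>...</tag> are removed, so the five-word second line stays. Changing the multiplier to {5,} would remove it as well.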

greg1234 · May 11, 2012, 11:24 UTC · #3

Works great.
Thanks a lot!

May 17, 2012, 16:06 UTC · #4

Hi,

this expression works great, but now I'm trying to match also lines containing non-alphanumeric characters, like: - ' , / % # .
Examples of lines that I'd like to be matched:

<tag>this can't be so simple</tag><tag2>....</tag2>
<tag>this is sentence with comma, but without dot</tag><tag2>....</tag2>
<tag>this is sentence with comma and dot.</tag><tag2>....</tag2>

Is it possible to modify this expression so that words containing the - symbol are treated as single words, and commas or other symbols are simply ignored in the search?
Or, maybe it will be simpler to modify the expression, assuming that word is anything separated by space?

Thanks,
Greg

Mofi · May 17, 2012, 17:31 UTC · #5

I quickly found the expression ^.*<tag> *(?:\<\w[^\s<]+ *){6,}</tag>.*\r\n

But I was not happy with it because every "word" must start with a word character. Therefore it does not work for something like

<tag>'this' 'is' 'an' 'example' 'with' '6' 'words'</tag><tag2>....</tag2>
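The limitation can be reproduced with a small Python sketch. As an assumption, \b in front of \w again stands in for the \< token, which Python's re module does not have. The names WORD_START, with_comma and quoted are only illustrative.

```python
import re

# Assumed translation: \b before \w stands in for the \< (start of word) token.
WORD_START = re.compile(
    r"^.*<tag> *(?:\b\w[^\s<]+ *){6,}</tag>.*\r\n", re.MULTILINE
)

# A line with a comma: every "word" still starts with a word character.
with_comma = (
    "<tag>this is sentence with comma, but without dot</tag>"
    "<tag2>....</tag2>\r\n"
)
# A line where every "word" starts with a quote character instead.
quoted = (
    "<tag>'this' 'is' 'an' 'example' 'with' '6' 'words'</tag>"
    "<tag2>....</tag2>\r\n"
)

matches_comma = WORD_START.search(with_comma) is not None
matches_quoted = WORD_START.search(quoted) is not None
print(matches_comma, matches_quoted)
```

The first line is found, the quoted one is not, because the group cannot start on the ' character.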

I needed more than 30 minutes to find the hopefully ultimate solution producing correct results:

^.*<tag>\s*(?:[^\s<]+\s+){5,}[^\s<]+\s*</tag>.*\r\n

This expression matches entire lines containing at least 6 whitespace separated strings within <tag>...</tag>. The value 5 in the expression is not a mistake: the group matches at least 5 strings each followed by whitespace, and one more string must follow, which makes at least 6. As \s also matches newline characters, there can now even be line breaks within <tag>...</tag> like in the following example

<tag>this is the very last sentence
with a line break</tag><tag2>....</tag2>

In this case both lines are completely matched by the expression.
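A minimal Python sketch of this behaviour follows. The expression itself needs no translation here, since it uses only syntax that Python's re module also understands, but the sample text and the names FINAL and kept are assumptions for illustration, and Python's engine is not the one built into UltraEdit.

```python
import re

FINAL = re.compile(
    r"^.*<tag>\s*(?:[^\s<]+\s+){5,}[^\s<]+\s*</tag>.*\r\n", re.MULTILINE
)

# Hypothetical sample with DOS line endings; the third entry has a line
# break inside <tag>...</tag>.
sample = (
    "<tag>this is first sentence</tag><tag2>....</tag2>\r\n"
    "<tag>'this' 'is' 'an' 'example' 'with' '6' 'words'</tag><tag2>....</tag2>\r\n"
    "<tag>this is the very last sentence\r\n"
    "with a line break</tag><tag2>....</tag2>\r\n"
    "<tag>short one</tag><tag2>....</tag2>\r\n"
)

# Deleting every match removes the quoted line and both physical lines
# of the entry with the line break.
kept = FINAL.sub("", sample)
print(kept)
```

Only the two short entries remain in kept.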

Please note that within <tag>...</tag> no other tag and no character < that is not encoded as &lt; (as HTML requires) is allowed, because the regular expression would not match such lines.

Explanation of the expression above:

^ ... start the search at beginning of a line.

.* ... matches 0 or more occurrences of any character except newline characters. This expression matches everything up to next string which is <tag> if the current line contains that string at all.

\s* ... matches 0 or more occurrences of whitespace characters. There could be a space, tab or line break after string <tag>. Whitespace characters are at least the horizontal tab character (0x09), line-feed (0x0A), the vertical tab character (0x0B, very rare in text files), the form-feed character (0x0C), carriage return (0x0D), the space character (0x20), the non-breaking space character (0xA0), and perhaps also other whitespace characters from the Unicode table (not tested by me).

(?:...) groups an expression. Usually everything in round brackets is also marked (tagged) for being back referenced in the search or replace string. ?: immediately after the opening round bracket tells the Perl engine not to mark the string found by the expression inside the round brackets, as here the group is just for applying the following multiplier expression.

[^\s<]+ ... is a negative character set definition. It matches all characters 1 or more times except whitespace characters and left angle bracket. In other words this expression matches a string surrounded by whitespace characters not containing character <.

\s+ ... next 1 or more whitespace characters must follow and not left angle bracket.

{5,} ... means the previous expression should match non whitespace strings with whitespace(s) following 5 or more times.

[^\s<]+ ... now an already well known expression. After at least 5 non whitespace strings with whitespace(s) following there must be one more non whitespace string.

\s* ... the next character can be now 0 or more whitespaces before next fixed string </tag>. But it is allowed that after word 6, 7, 8, ... there is no whitespace and </tag> immediately follows.

.*\r\n ... match 0 or more occurrences of any character up to carriage return and line-feed and match these 2 newline characters too.
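The parts explained above can also be laid out one per line with comments. The following is a sketch in Python using re.VERBOSE, under the assumption that Python's re module behaves like the Perl engine for this expression. The names FINAL_VERBOSE, has_five, has_six and all_ws are illustrative only, and the whitespace check at the end tests Python's \s, which may differ from what UltraEdit's engine accepts.

```python
import re

# The same expression, spread out with re.VERBOSE so each part carries a comment.
FINAL_VERBOSE = re.compile(
    r"""
    ^               # start of a line (needs re.MULTILINE)
    .*              # anything on the line before the opening tag
    <tag>
    \s*             # optional whitespace right after the opening tag
    (?:             # non-capturing group, only there for the multiplier
        [^\s<]+     # a string without whitespace and without a left angle bracket
        \s+         # followed by at least one whitespace character
    ){5,}           # five or more such strings
    [^\s<]+         # plus one more string, which makes at least six
    \s*             # optional whitespace before the closing tag
    </tag>
    .*\r\n          # rest of the line including carriage return and line-feed
    """,
    re.MULTILINE | re.VERBOSE,
)

# Five strings are not enough, six are.
has_five = FINAL_VERBOSE.search("<tag>one two three four five</tag>\r\n") is not None
has_six = FINAL_VERBOSE.search("<tag>one two three four five six</tag>\r\n") is not None

# Tab, line-feed, vertical tab, form-feed, carriage return, space, non-breaking space.
all_ws = all(re.fullmatch(r"\s", ch) for ch in "\t\n\x0b\x0c\r\x20\xa0")

print(has_five, has_six, all_ws)
```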

greg1234 · May 18, 2012, 06:54 UTC · #6

Hi Mofi,
You are master. I really, really appreciate your help and your time!
That expression works perfectly. Now, I'll take my time to understand what it does.
Thanks a lot.
Greg

Mofi · May 18, 2012, 14:52 UTC · #7

I have added an explanation to my previous post for the regular expression search string.