How to strip out text from HTML tags?

Gabarito · Aug 21, 2020 18:16 UTC · #1

I'm facing a task that is the exact opposite of what I already have.
I mean, I have a macro copied from this forum, written by Mofi, that strips the HTML tags out of HTML code.

Here it is:

InsertMode
ColumnModeOff
HexOff
PerlReOn
Top
Find MatchCase RegExp "\r\n"
IfFound
Top
Find MatchCase RegExp "<br[ /]*>(?![\r\n])"
Replace All "\r\n"
Else
Find MatchCase RegExp "\n"
IfFound
Top
Find MatchCase RegExp "<br[ /]*>(?![\r\n])"
Replace All "\n"
Else
Find MatchCase RegExp "\r"
IfFound
Top
Find MatchCase RegExp "<br[ /]*>(?![\r\n])"
Replace All "\r"
Else
Find MatchCase RegExp "<br[ /]*>(?![\r\n])"
Replace All "\r\n"
EndIf
EndIf
EndIf
Top
Find MatchCase RegExp "<[^>]+>"
Replace All ""
TrimLeadingSpaces
TrimTrailingSpaces
Top
Find MatchCase RegExp "(?:(?:\r\n){2}|\n{2}|\r{2})\K(?:(?:\r\n)+|\n+|\r+)"
Replace All ""
Top
Find MatchCase RegExp "\A\v+"
Replace ""
Find MatchCase RegExp "\v+\z"
Replace ""
Bottom
InsertLine
Top
Find MatchCase "&nbsp;"
Replace All " "
Find MatchCase "&thinsp;"
Replace All " "
Find MatchCase "&emsp;"
Replace All " "
Find MatchCase "&ensp;"
Replace All " "
Find MatchCase "&zwj;"
Replace All ""
Find MatchCase "&zwnj;"
Replace All ""
Find MatchCase "&lt;"
Replace All "<"
Find MatchCase "&gt;"
Replace All ">"
Find MatchCase "&amp;"
Replace All "&"
Find MatchCase "&quot;"
Replace All "\""
Find MatchCase "&mdash;"
Replace All "—"
Find MatchCase "&ndash;"
Replace All "–"
Find MatchCase "&shy;"
Replace All "-"
Find MatchCase "&circ;"
Replace All "ˆ"
Find MatchCase "&iexcl;"
Replace All "¡"
Find MatchCase "&brvbar;"
Replace All "¦"
Find MatchCase "&uml;"
Replace All "¨"
Find MatchCase "&macr;"
Replace All "¯"
Find MatchCase "&acute;"
Replace All "´"
Find MatchCase "&cedil;"
Replace All "¸"
Find MatchCase "&iquest;"
Replace All "¿"
Find MatchCase "&tilde;"
Replace All "˜"
Find MatchCase "&lsquo;"
Replace All "‘"
Find MatchCase "&rsquo;"
Replace All "’"
Find MatchCase "&sbquo;"
Replace All "‚"
Find MatchCase "&ldquo;"
Replace All "“"
Find MatchCase "&rdquo;"
Replace All "”"
Find MatchCase "&bdquo;"
Replace All "„"
Find MatchCase "&lsaquo;"
Replace All "‹"
Find MatchCase "&rsaquo;"
Replace All "›"
Find MatchCase "&lt;"
Replace All "<"
Find MatchCase "&gt;"
Replace All ">"
Find MatchCase "&plusmn;"
Replace All "±"
Find MatchCase "&laquo;"
Replace All "«"
Find MatchCase "&raquo;"
Replace All "»"
Find MatchCase "&times;"
Replace All "×"
Find MatchCase "&divide;"
Replace All "÷"
Find MatchCase "&cent;"
Replace All "¢"
Find MatchCase "&pound;"
Replace All "£"
Find MatchCase "&curren;"
Replace All "¤"
Find MatchCase "&yen;"
Replace All "¥"
Find MatchCase "&sect;"
Replace All "§"
Find MatchCase "&copy;"
Replace All "©"
Find MatchCase "&not;"
Replace All "¬"
Find MatchCase "&reg;"
Replace All "®"
Find MatchCase "&deg;"
Replace All "°"
Find MatchCase "&micro;"
Replace All "µ"
Find MatchCase "&para;"
Replace All "¶"
Find MatchCase "&middot;"
Replace All "·"
Find MatchCase "&dagger;"
Replace All "†"
Find MatchCase "&Dagger;"
Replace All "‡"
Find MatchCase "&permil;"
Replace All "‰"
Find MatchCase "&euro;"
Replace All "€"
Find MatchCase "&frac14;"
Replace All "¼"
Find MatchCase "&frac12;"
Replace All "½"
Find MatchCase "&frac34;"
Replace All "¾"
Find MatchCase "&sup1;"
Replace All "¹"
Find MatchCase "&sup2;"
Replace All "²"
Find MatchCase "&sup3;"
Replace All "³"
Find MatchCase "&aacute;"
Replace All "á"
Find MatchCase "&Aacute;"
Replace All "Á"
Find MatchCase "&acirc;"
Replace All "â"
Find MatchCase "&Acirc;"
Replace All "Â"
Find MatchCase "&agrave;"
Replace All "à"
Find MatchCase "&Agrave;"
Replace All "À"
Find MatchCase "&aring;"
Replace All "å"
Find MatchCase "&Aring;"
Replace All "Å"
Find MatchCase "&atilde;"
Replace All "ã"
Find MatchCase "&Atilde;"
Replace All "Ã"
Find MatchCase "&auml;"
Replace All "ä"
Find MatchCase "&Auml;"
Replace All "Ä"
Find MatchCase "&ordf;"
Replace All "ª"
Find MatchCase "&aelig;"
Replace All "æ"
Find MatchCase "&AElig;"
Replace All "Æ"
Find MatchCase "&ccedil;"
Replace All "ç"
Find MatchCase "&Ccedil;"
Replace All "Ç"
Find MatchCase "&eth;"
Replace All "ð"
Find MatchCase "&ETH;"
Replace All "Ð"
Find MatchCase "&eacute;"
Replace All "é"
Find MatchCase "&Eacute;"
Replace All "É"
Find MatchCase "&ecirc;"
Replace All "ê"
Find MatchCase "&Ecirc;"
Replace All "Ê"
Find MatchCase "&egrave;"
Replace All "è"
Find MatchCase "&Egrave;"
Replace All "È"
Find MatchCase "&euml;"
Replace All "ë"
Find MatchCase "&Euml;"
Replace All "Ë"
Find MatchCase "&fnof;"
Replace All "ƒ"
Find MatchCase "&iacute;"
Replace All "í"
Find MatchCase "&Iacute;"
Replace All "Í"
Find MatchCase "&icirc;"
Replace All "î"
Find MatchCase "&Icirc;"
Replace All "Î"
Find MatchCase "&igrave;"
Replace All "ì"
Find MatchCase "&Igrave;"
Replace All "Ì"
Find MatchCase "&iuml;"
Replace All "ï"
Find MatchCase "&Iuml;"
Replace All "Ï"
Find MatchCase "&ntilde;"
Replace All "ñ"
Find MatchCase "&Ntilde;"
Replace All "Ñ"
Find MatchCase "&oacute;"
Replace All "ó"
Find MatchCase "&Oacute;"
Replace All "Ó"
Find MatchCase "&ocirc;"
Replace All "ô"
Find MatchCase "&Ocirc;"
Replace All "Ô"
Find MatchCase "&ograve;"
Replace All "ò"
Find MatchCase "&Ograve;"
Replace All "Ò"
Find MatchCase "&ordm;"
Replace All "º"
Find MatchCase "&oslash;"
Replace All "ø"
Find MatchCase "&Oslash;"
Replace All "Ø"
Find MatchCase "&otilde;"
Replace All "õ"
Find MatchCase "&Otilde;"
Replace All "Õ"
Find MatchCase "&ouml;"
Replace All "ö"
Find MatchCase "&Ouml;"
Replace All "Ö"
Find MatchCase "&oelig;"
Replace All "œ"
Find MatchCase "&OElig;"
Replace All "Œ"
Find MatchCase "&scaron;"
Replace All "š"
Find MatchCase "&Scaron;"
Replace All "Š"
Find MatchCase "&szlig;"
Replace All "ß"
Find MatchCase "&thorn;"
Replace All "þ"
Find MatchCase "&THORN;"
Replace All "Þ"
Find MatchCase "&uacute;"
Replace All "ú"
Find MatchCase "&Uacute;"
Replace All "Ú"
Find MatchCase "&ucirc;"
Replace All "û"
Find MatchCase "&Ucirc;"
Replace All "Û"
Find MatchCase "&ugrave;"
Replace All "ù"
Find MatchCase "&Ugrave;"
Replace All "Ù"
Find MatchCase "&uuml;"
Replace All "ü"
Find MatchCase "&Uuml;"
Replace All "Ü"
Find MatchCase "&yacute;"
Replace All "ý"
Find MatchCase "&Yacute;"
Replace All "Ý"
Find MatchCase "&yuml;"
Replace All "ÿ"
Find MatchCase "&Yuml;"
Replace All "Ÿ"

But now the challenge is the opposite: to strip out the plain text of the page and keep all tags and JavaScript code.
I need to send other people a page saved from a WhatsApp Web conversation, but with the personal data and chat removed.
I plan to replace the page text with a fixed warning, like "Edited and removed personal data".

Because almost all HTML tags use "<" and ">" to begin and end a tag, I thought of a regular expression like this:
Search: ">.*?<"
Replace: ">Edited and removed personal data<"

But it's not working.
Problem: it matches every ">" followed by "<", even if there is no text between them,
so it replaces where there is no need to.

If I search for ">.+?<", the regular expression catches another tag inside, like this.
From this code
<span></span><span></span>
it selects
><span><
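
Both behaviours described above can be reproduced outside UltraEdit. The following is only a minimal sketch in Python (a language not used in this thread), with a made-up sample string that contains two empty gaps between tags and one real piece of text:

```python
import re

# Made-up sample: two empty gaps between tags and one real piece of text.
sample = '<span></span><span>Hello</span>'

# ">.*?<" accepts zero characters, so the empty gaps "><" match as well.
lazy_star = re.findall(r'>.*?<', sample)

# ">.+?<" needs at least one character; when "<" follows ">" directly,
# that first character is the "<" of the next tag, so a whole tag is caught.
lazy_plus = re.findall(r'>.+?<', sample)

print(lazy_star)  # ['><', '><', '>Hello<']
print(lazy_plus)  # ['></span><', '>Hello<']
```

In both cases the ">" and "<" themselves are part of the match, which is why they have to be repeated in the replacement string.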

I'm a newbie with regular expressions and I suspect that the solution could be far more complex than that.
So, I'm asking for some help to strip my chats out of the tags.
Maybe the solution is better achieved with a macro. Or with scripting.
What do you think?

Here, a small piece of code to test suggestions: RegEx test and debugger

Thanks.

fleggy · Aug 21, 2020 19:37 UTC · #2

Hi Gabarito,

you were very close :)
>\K[^<\r\n][^<]*+

The first character after ">" can be anything but "<" or a newline.
Then it matches everything until the nearest "<".

It assumes that all trailing spaces are trimmed, to eliminate "empty" matches like this:
</end_tag>spaces...
<next_tag>...

But it might not be sufficient if "<" can appear inside the text.
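
For readers who want to try the idea outside UltraEdit, here is a minimal sketch in Python. It is an adaptation, not the regex above verbatim: Python's built-in `re` module has no `\K`, so a lookbehind `(?<=>)` is assumed as a stand-in, and the possessive `*+` is written as a plain `*`. The names `TEXT_AFTER_TAG`, `NOTICE` and `redact_text` are made up for the illustration.

```python
import re

# Python's built-in "re" has no \K, so the ">" is checked with a lookbehind.
# The possessive "*+" is written as a plain "*": the pattern ends there, so
# nothing can force it to backtrack.
TEXT_AFTER_TAG = re.compile(r'(?<=>)[^<\r\n][^<]*')

NOTICE = "Edited and removed personal data"  # wording taken from the question


def redact_text(html: str) -> str:
    """Replace every run of text that directly follows a ">"."""
    # A function is used as replacement so NOTICE is inserted literally.
    return TEXT_AFTER_TAG.sub(lambda _match: NOTICE, html)


sample = '<div class="a"><span>Hi John</span></div>\n<p>Call me</p>'
print(redact_text(sample))
```

The newline after `</div>` is left alone because the first character after ">" must not be a line break. Untrimmed spaces after a closing tag would still be replaced (together with the line break that follows them), which is the reason for the trimming assumption above.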

BR, Fleggy

EDIT:
I am not sure if you really want to replace everything between any tags. This regex
>\K(?![\s\-]*+<)[^<]++
skips "empty" matches (spaces, tabs, newlines and "-") but it replaces even the content of the tag <style> in your example.
Shouldn't the replacement be limited to a list of particular tags? E.g. <span>, <title>, <div> and some others?
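
A minimal Python sketch of this second variant, again only an adaptation: the lookbehind `(?<=>)` is assumed in place of `\K` (which Python's `re` lacks) and the possessive quantifiers are written as ordinary ones. The placeholder text `[removed]` and the sample string are made up.

```python
import re

# "(?![\s\-]*<)" rejects positions where only whitespace or "-" separates
# the ">" from the next "<", so those gaps are never replaced.
GAP_AWARE = re.compile(r'(?<=>)(?![\s\-]*<)[^<]+')


def redact_text(html: str, notice: str = "[removed]") -> str:
    """Replace text between tags, skipping gaps of whitespace and "-"."""
    return GAP_AWARE.sub(lambda _match: notice, html)


sample = '<style>p { color: red }</style>\n<p>Hi</p> - <b>there</b>'
print(redact_text(sample))
```

In this sample the line break after `</style>` and the " - " between the tags stay untouched, while the CSS inside `<style>` is replaced like any other text, which is the side effect mentioned above.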

Gabarito · Aug 21, 2020 20:04 UTC · #3

Thank you very much, Fleggy!

Your solution works!

Later, I'll run it on a real test file to see how it performs and whether it needs some adjustments.

You are very advanced in RegExps. Congrats!
I would never have expected a solution that fast.

Thanks again, man!

Gabarito · Aug 21, 2020 20:05 UTC · #4

Saw your EDIT post now.
I'll apply both RegExps later and see results.
I'll be back to tell you which one worked better.

Gabarito · Aug 21, 2020 23:00 UTC · #5

Because I need to keep style tags, your first solution is the best for my problem.
This: ">\K[^<\r\n][^<]*+"

Applying the proposed solution to a real file, I found some other cases that could be handled better.
There are a lot of dates and times.

Like this:
<div class="m61XR">29/07/2020</div>

Or this:
<div class="m61XR">07:46</div>

This data can be kept. No need to remove it.
Is it possible to write an expression to skip such format?

------------------

Now, this is much more important.
Sometimes, I get text inside double quotes. Like this:

<div class="_3tBW6"><span class="_2iq-U" title="Text and more text ... (more text) ...
that can cross ...
more than ...
two lines"><div class="zFnXi">

I think that I can solve this by applying two different RegExps:
first, to remove text between ">" and "<" (this solution is almost ready);
second, to remove text inside double quotes from 'title=" text ...">'.

What do you think?

fleggy · Aug 22, 2020 07:07 UTC · #6

Well, I think that the correct regex should be context aware. But I know very little about HTML...
It is possible to use a negative lookahead to skip date/time (date can be in any common format):
>(?!(?:\d\d:\d\d|\d\d([-./])\d\d\1\d\d/d\d|\d\d\d\d([-./])\d\d\2\d\d)<)\K[^<\r\n][^<]*+

For the attribute title you can use:
title="\K[^"]++
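
As a rough illustration of the title regex, here is a Python sketch. It assumes a fixed-width lookbehind `(?<=title=")` as a substitute for `\K`, which Python's `re` does not support; the name `TITLE_VALUE` is made up, and the sample is shortened from the snippet in the question.

```python
import re

# '[^"]+' also matches line breaks, so a title that spans several lines
# is covered without any extra flag.
TITLE_VALUE = re.compile(r'(?<=title=")[^"]+')

sample = (
    '<div class="_3tBW6"><span class="_2iq-U" title="Text and more text\n'
    'that can cross\n'
    'two lines"><div class="zFnXi">'
)

result = TITLE_VALUE.sub(lambda _match: "Edited and removed personal data", sample)
print(result)
```

Note that the pattern looks only at the characters `title="`, so it would also fire on any other attribute whose name merely ends in "title", and it does not cover values in single quotes.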

BR, Fleggy

Gabarito · Aug 22, 2020 20:48 UTC · #7

fleggy wrote: ↑
Aug 22, 2020
It is possible to use a negative lookahead to skip date/time (date can be in any common format):
>(?!(?:\d\d:\d\d|\d\d([-./])\d\d\1\d\d/d\d|\d\d\d\d([-./])\d\d\2\d\d)<)\K[^<\r\n][^<]*+

Your solution quoted above has 3 little mistakes, IMHO.
I'm no expert in RegExp, so excuse me if I'm wrong, but I think the right expression is:
>(?!(?:\d\d:\d\d|\d\d([-./])\d\d\1\d\d/d\d|\d\d\d\d([-./])\d\d\2\d\d)<)\K[^<\r\n][^<]*+

Fixes are in red:
>(?!(?:\d\d:\d\d|\d\d([-\./])\d\d\1\d\d\/d\d|\d\d\d\d([-\./])\d\d\2\d\d)<)\K[^<\r\n][^<]*+

It took me a long time to realize that.
And I still don't fully understand the negative lookahead thing.

fleggy wrote: ↑
Aug 22, 2020
For the attribute title you can use:
title="\K[^"]++

This solution worked very well.

Thanks.

fleggy · Aug 23, 2020 06:41 UTC · #8

Oh, sorry. I overlooked that and did tests just for common delimiters.
Negative lookahead (?!test pattern) tests that the following text does not match the test pattern (a time or date in your case). A lookahead (or a lookaround, generally speaking) does not change the current position in the text. Thus, if there is no time/date, the regex \K[^<\r\n][^<]*+ can continue from the current position. Otherwise (time/date found), the lookahead fails, so the whole regex fails at that position and the regex engine begins a new attempt to match at the next position.
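
The behaviour can be checked with a small Python sketch. Two things in it are assumptions rather than the regex from this thread: the lookbehind `(?<=>)` stands in for `>\K` (Python's `re` has no `\K`), and the four-digit year is written as `\d\d\d\d`, which is a guess at what the `\d\d/d\d` part was meant to be. The names `SKIP_DATE_TIME` and `NOTICE` are made up.

```python
import re

SKIP_DATE_TIME = re.compile(
    r'(?<=>)'                          # stands in for ">\K"
    r'(?!(?:\d\d:\d\d'                 # hh:mm
    r'|\d\d([-./])\d\d\1\d\d\d\d'      # dd/mm/yyyy, same delimiter twice
    r'|\d\d\d\d([-./])\d\d\2\d\d)<)'   # yyyy-mm-dd, same delimiter twice
    r'[^<\r\n][^<]*'                   # the text to replace
)

NOTICE = "Edited and removed personal data"

sample = (
    '<div class="m61XR">29/07/2020</div>'
    '<div class="m61XR">07:46</div>'
    '<div>Hello there</div>'
    '<div>29/07-2020</div>'
)

result = SKIP_DATE_TIME.sub(lambda _match: NOTICE, sample)
print(result)
```

The date and the time are kept because the lookahead finds them right before "<" and therefore fails. "Hello there" is replaced, and so is "29/07-2020", since the backreference requires the same delimiter in both places.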

BR, Fleggy