I'm facing a task that is the exact opposite of what I already have: instead of stripping the HTML tags out of a page, I now need to strip out the plain text and keep all the tags and JavaScript code.
What I already have is a macro copied from this forum, written by Mofi, that strips the HTML tags out of HTML code.
Here it is:
InsertMode
ColumnModeOff
HexOff
PerlReOn
Top
Find MatchCase RegExp "\r\n"
IfFound
Top
Find MatchCase RegExp "<br[ /]*>(?![\r\n])"
Replace All "\r\n"
Else
Find MatchCase RegExp "\n"
IfFound
Top
Find MatchCase RegExp "<br[ /]*>(?![\r\n])"
Replace All "\n"
Else
Find MatchCase RegExp "\r"
IfFound
Top
Find MatchCase RegExp "<br[ /]*>(?![\r\n])"
Replace All "\r"
Else
Find MatchCase RegExp "<br[ /]*>(?![\r\n])"
Replace All "\r\n"
EndIf
EndIf
EndIf
Top
Find MatchCase RegExp "<[^>]+>"
Replace All ""
TrimLeadingSpaces
TrimTrailingSpaces
Top
Find MatchCase RegExp "(?:(?:\r\n){2}|\n{2}|\r{2})\K(?:(?:\r\n)+|\n+|\r+)"
Replace All ""
Top
Find MatchCase RegExp "\A\v+"
Replace ""
Find MatchCase RegExp "\v+\z"
Replace ""
Bottom
InsertLine
Top
Find MatchCase "&nbsp;"
Replace All " "
Find MatchCase "&ensp;"
Replace All " "
Find MatchCase "&emsp;"
Replace All " "
Find MatchCase "&thinsp;"
Replace All " "
Find MatchCase "&zwj;"
Replace All ""
Find MatchCase "&zwnj;"
Replace All ""
Find MatchCase "<"
Replace All "<"
Find MatchCase ">"
Replace All "<"
Find MatchCase "&amp;"
Replace All "&"
Find MatchCase "&quot;"
Replace All "\""
Find MatchCase "&mdash;"
Replace All "—"
Find MatchCase "&ndash;"
Replace All "–"
Find MatchCase "&shy;"
Replace All "-"
Find MatchCase "&circ;"
Replace All "ˆ"
Find MatchCase "&iexcl;"
Replace All "¡"
Find MatchCase "&brvbar;"
Replace All "¦"
Find MatchCase "&uml;"
Replace All "¨"
Find MatchCase "&macr;"
Replace All "¯"
Find MatchCase "&acute;"
Replace All "´"
Find MatchCase "&cedil;"
Replace All "¸"
Find MatchCase "&iquest;"
Replace All "¿"
Find MatchCase "&tilde;"
Replace All "˜"
Find MatchCase "&lsquo;"
Replace All "‘"
Find MatchCase "&rsquo;"
Replace All "’"
Find MatchCase "&sbquo;"
Replace All "‚"
Find MatchCase "&ldquo;"
Replace All "“"
Find MatchCase "&rdquo;"
Replace All "”"
Find MatchCase "&bdquo;"
Replace All "„"
Find MatchCase "&lsaquo;"
Replace All "‹"
Find MatchCase "&rsaquo;"
Replace All "›"
Find MatchCase "&lt;"
Replace All "<"
Find MatchCase "&gt;"
Replace All ">"
Find MatchCase "&plusmn;"
Replace All "±"
Find MatchCase "&laquo;"
Replace All "«"
Find MatchCase "&raquo;"
Replace All "»"
Find MatchCase "&times;"
Replace All "×"
Find MatchCase "&divide;"
Replace All "÷"
Find MatchCase "&cent;"
Replace All "¢"
Find MatchCase "&pound;"
Replace All "£"
Find MatchCase "&curren;"
Replace All "¤"
Find MatchCase "&yen;"
Replace All "¥"
Find MatchCase "&sect;"
Replace All "§"
Find MatchCase "&copy;"
Replace All "©"
Find MatchCase "&not;"
Replace All "¬"
Find MatchCase "&reg;"
Replace All "®"
Find MatchCase "&deg;"
Replace All "°"
Find MatchCase "&micro;"
Replace All "µ"
Find MatchCase "&para;"
Replace All "¶"
Find MatchCase "&middot;"
Replace All "·"
Find MatchCase "&dagger;"
Replace All "†"
Find MatchCase "&Dagger;"
Replace All "‡"
Find MatchCase "&permil;"
Replace All "‰"
Find MatchCase "&euro;"
Replace All "€"
Find MatchCase "&frac14;"
Replace All "¼"
Find MatchCase "&frac12;"
Replace All "½"
Find MatchCase "&frac34;"
Replace All "¾"
Find MatchCase "&sup1;"
Replace All "¹"
Find MatchCase "&sup2;"
Replace All "²"
Find MatchCase "&sup3;"
Replace All "³"
Find MatchCase "&aacute;"
Replace All "á"
Find MatchCase "&Aacute;"
Replace All "Á"
Find MatchCase "&acirc;"
Replace All "â"
Find MatchCase "&Acirc;"
Replace All "Â"
Find MatchCase "&agrave;"
Replace All "à"
Find MatchCase "&Agrave;"
Replace All "À"
Find MatchCase "&aring;"
Replace All "å"
Find MatchCase "&Aring;"
Replace All "Å"
Find MatchCase "&atilde;"
Replace All "ã"
Find MatchCase "&Atilde;"
Replace All "Ã"
Find MatchCase "&auml;"
Replace All "ä"
Find MatchCase "&Auml;"
Replace All "Ä"
Find MatchCase "&ordf;"
Replace All "ª"
Find MatchCase "&aelig;"
Replace All "æ"
Find MatchCase "&AElig;"
Replace All "Æ"
Find MatchCase "&ccedil;"
Replace All "ç"
Find MatchCase "&Ccedil;"
Replace All "Ç"
Find MatchCase "&eth;"
Replace All "ð"
Find MatchCase "&ETH;"
Replace All "Ð"
Find MatchCase "&eacute;"
Replace All "é"
Find MatchCase "&Eacute;"
Replace All "É"
Find MatchCase "&ecirc;"
Replace All "ê"
Find MatchCase "&Ecirc;"
Replace All "Ê"
Find MatchCase "&egrave;"
Replace All "è"
Find MatchCase "&Egrave;"
Replace All "È"
Find MatchCase "&euml;"
Replace All "ë"
Find MatchCase "&Euml;"
Replace All "Ë"
Find MatchCase "&fnof;"
Replace All "ƒ"
Find MatchCase "&iacute;"
Replace All "í"
Find MatchCase "&Iacute;"
Replace All "Í"
Find MatchCase "&icirc;"
Replace All "î"
Find MatchCase "&Icirc;"
Replace All "Î"
Find MatchCase "&igrave;"
Replace All "ì"
Find MatchCase "&Igrave;"
Replace All "Ì"
Find MatchCase "&iuml;"
Replace All "ï"
Find MatchCase "&Iuml;"
Replace All "Ï"
Find MatchCase "&ntilde;"
Replace All "ñ"
Find MatchCase "&Ntilde;"
Replace All "Ñ"
Find MatchCase "&oacute;"
Replace All "ó"
Find MatchCase "&Oacute;"
Replace All "Ó"
Find MatchCase "&ocirc;"
Replace All "ô"
Find MatchCase "&Ocirc;"
Replace All "Ô"
Find MatchCase "&ograve;"
Replace All "ò"
Find MatchCase "&Ograve;"
Replace All "Ò"
Find MatchCase "&ordm;"
Replace All "º"
Find MatchCase "&oslash;"
Replace All "ø"
Find MatchCase "&Oslash;"
Replace All "Ø"
Find MatchCase "&otilde;"
Replace All "õ"
Find MatchCase "&Otilde;"
Replace All "Õ"
Find MatchCase "&ouml;"
Replace All "ö"
Find MatchCase "&Ouml;"
Replace All "Ö"
Find MatchCase "&oelig;"
Replace All "œ"
Find MatchCase "&OElig;"
Replace All "Œ"
Find MatchCase "&scaron;"
Replace All "š"
Find MatchCase "&Scaron;"
Replace All "Š"
Find MatchCase "&szlig;"
Replace All "ß"
Find MatchCase "&thorn;"
Replace All "þ"
Find MatchCase "&THORN;"
Replace All "Þ"
Find MatchCase "&uacute;"
Replace All "ú"
Find MatchCase "&Uacute;"
Replace All "Ú"
Find MatchCase "&ucirc;"
Replace All "û"
Find MatchCase "&Ucirc;"
Replace All "Û"
Find MatchCase "&ugrave;"
Replace All "ù"
Find MatchCase "&Ugrave;"
Replace All "Ù"
Find MatchCase "&uuml;"
Replace All "ü"
Find MatchCase "&Uuml;"
Replace All "Ü"
Find MatchCase "&yacute;"
Replace All "ý"
Find MatchCase "&Yacute;"
Replace All "Ý"
Find MatchCase "&yuml;"
Replace All "ÿ"
Find MatchCase "&Yuml;"
Replace All "Ÿ"
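For anyone who wants to try the idea outside UltraEdit, the core steps of the macro (turn <br> into line breaks, delete every tag, collapse runs of blank lines, decode entities) can be approximated in Python. This is only a rough sketch of mine, not Mofi's code: the function name strip_tags is made up, it assumes "\n" line endings instead of detecting them, and it uses html.unescape, which decodes every named entity rather than the fixed list in the macro (so &shy; becomes a soft hyphen, not "-").

```python
import html
import re


def strip_tags(source: str) -> str:
    """Approximate the macro: keep the text, drop the tags (hypothetical helper)."""
    # A <br> that is not already followed by a line break becomes a newline.
    text = re.sub(r"<br[ /]*>(?![\r\n])", "\n", source)
    # Delete every remaining tag.
    text = re.sub(r"<[^>]+>", "", text)
    # Trim leading and trailing whitespace on each line.
    text = "\n".join(line.strip() for line in text.splitlines())
    # Allow at most one blank line in a row.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Remove blank lines at the very top and bottom, then end with one newline.
    text = text.strip("\n")
    # Decode entities last (assumption: html.unescape replaces the macro's list).
    return html.unescape(text) + "\n"


if __name__ == "__main__":
    print(strip_tags("<p>Tom &amp; Jerry<br>caf&eacute;</p>"), end="")
```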
The reason is that I need to send other people a page saved from a WhatsApp Web conversation, with the personal data and chat text removed.
I plan to put a fixed notice, such as "Edited and removed personal data", wherever the page text was.
Because almost every HTML tag begins with "<" and ends with ">", I thought of a regular expression like this:
Search: ">.*?<"
Replace: ">Edited and removed personal data<"
But it's not working.
Problem: it matches every ">" followed by "<", even when there is no text between them, so it also replaces where there is nothing to replace.
If I search for ">.+?<" instead, the regular expression swallows a whole tag, like this.
From this code
<span></span><span></span>
it selects
><span><
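To make both failures easy to reproduce outside the editor, here is a small Python sketch. The sample string and the helper name redact_between_tags are made up for illustration, and the third pattern is only a candidate I have not tried in UltraEdit: it requires at least one character that is neither "<" nor ">" between the two brackets.

```python
import re

NOTICE = "Edited and removed personal data"  # wording from this post

# Hypothetical sample: one empty span and one span with text.
sample = "<div><span></span><span>Hello</span></div>"

# ">.*?<" also matches the empty gap between two adjacent tags.
empty_gaps = re.findall(r">.*?<", sample)

# ">.+?<" needs at least one character, so at an empty gap it runs on
# through the whole next tag.
swallowed = re.findall(r">.+?<", sample)


def redact_between_tags(html_text: str) -> str:
    """Candidate fix: only match runs that contain no '<' or '>'."""
    return re.sub(r">[^<>]+<", ">" + NOTICE + "<", html_text)


if __name__ == "__main__":
    print(empty_gaps)
    print(swallowed)
    print(redact_between_tags(sample))
```

This candidate is still naive: it also replaces whitespace-only gaps such as line breaks between tags, and it would rewrite JavaScript that happens to contain ">" followed later by "<".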
I'm a newbie with regular expressions, and I suspect the solution could be far more complex than that.
So I'm asking for some help with stripping my chats out from between the tags.
Maybe the solution is better achieved with a macro or a script.
What do you think?
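To make the scripting idea concrete, here is a rough Python sketch that uses the standard html.parser module instead of a regular expression. The class and function names (TextScrubber, scrub) are made up, and the choices are assumptions of mine: text inside <script> and <style> is copied unchanged, whitespace-only text is kept, and every other text node is replaced by the notice.

```python
from html.parser import HTMLParser

NOTICE = "Edited and removed personal data"  # wording from this post
RAW_TEXT_TAGS = {"script", "style"}          # content copied unchanged


class TextScrubber(HTMLParser):
    """Rebuild the page, replacing visible text nodes with NOTICE."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []         # pieces of the rebuilt page
        self.raw_tag = None   # set while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text
        if tag in RAW_TEXT_TAGS:
            self.raw_tag = tag

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag == self.raw_tag:
            self.raw_tag = None

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_data(self, data):
        # Keep script/style content and pure whitespace; redact the rest.
        if self.raw_tag or not data.strip():
            self.out.append(data)
        else:
            self.out.append(NOTICE)


def scrub(page: str) -> str:
    parser = TextScrubber()
    parser.feed(page)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


if __name__ == "__main__":
    print(scrub('<div class="msg"><span>Hi Bob</span><br/></div>'))
```

Note that this sketch leaves attribute values and HTML comments untouched, so names or phone numbers stored in attributes such as title or alt would still need separate handling.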
Here is a small piece of code to test suggestions: RegEx test and debugger
Thanks.