How to strip out text from HTML tags?

    Aug 21, 2020

    I'm facing a task that is the exact opposite of one I have already solved.
    I have a macro copied from this forum, written by Mofi, that strips the HTML tags out of HTML code.

    Here it is:

    Find MatchCase RegExp "\r\n"
    Find MatchCase RegExp "<br[ /]*>(?![\r\n])"
    Replace All "\r\n"
    Find MatchCase RegExp "\n"
    Find MatchCase RegExp "<br[ /]*>(?![\r\n])"
    Replace All "\n"
    Find MatchCase RegExp "\r"
    Find MatchCase RegExp "<br[ /]*>(?![\r\n])"
    Replace All "\r"
    Find MatchCase RegExp "<br[ /]*>(?![\r\n])"
    Replace All "\r\n"
    Find MatchCase RegExp "<[^>]+>"
    Replace All ""
    Find MatchCase RegExp "(?:(?:\r\n){2}|\n{2}|\r{2})\K(?:(?:\r\n)+|\n+|\r+)"
    Replace All ""
    Find MatchCase RegExp "\A\v+"
    Replace ""
    Find MatchCase RegExp "\v+\z"
    Replace ""
    Find MatchCase "&nbsp;"
    Replace All " "
    Find MatchCase "&thinsp;"
    Replace All " "
    Find MatchCase "&emsp;"
    Replace All " "
    Find MatchCase "&ensp;"
    Replace All " "
    Find MatchCase "&zwj;"
    Replace All ""
    Find MatchCase "&zwnj;"
    Replace All ""
    Find MatchCase "&lt;"
    Replace All "<"
    Find MatchCase "&gt;"
    Replace All ">"
    Find MatchCase "&amp;"
    Replace All "&"
    Find MatchCase "&quot;"
    Replace All "\""
    Find MatchCase "&mdash;"
    Replace All "—"
    Find MatchCase "&ndash;"
    Replace All "–"
    Find MatchCase "&shy;"
    Replace All "-"
    Find MatchCase "&circ;"
    Replace All "ˆ"
    Find MatchCase "&iexcl;"
    Replace All "¡"
    Find MatchCase "&brvbar;"
    Replace All "¦"
    Find MatchCase "&uml;"
    Replace All "¨"
    Find MatchCase "&macr;"
    Replace All "¯"
    Find MatchCase "&acute;"
    Replace All "´"
    Find MatchCase "&cedil;"
    Replace All "¸"
    Find MatchCase "&iquest;"
    Replace All "¿"
    Find MatchCase "&tilde;"
    Replace All "˜"
    Find MatchCase "&lsquo;"
    Replace All "‘"
    Find MatchCase "&rsquo;"
    Replace All "’"
    Find MatchCase "&sbquo;"
    Replace All "‚"
    Find MatchCase "&ldquo;"
    Replace All "“"
    Find MatchCase "&rdquo;"
    Replace All "”"
    Find MatchCase "&bdquo;"
    Replace All "„"
    Find MatchCase "&lsaquo;"
    Replace All "‹"
    Find MatchCase "&rsaquo;"
    Replace All "›"
    Find MatchCase "&lt;"
    Replace All "<"
    Find MatchCase "&gt;"
    Replace All ">"
    Find MatchCase "&plusmn;"
    Replace All "±"
    Find MatchCase "&laquo;"
    Replace All "«"
    Find MatchCase "&raquo;"
    Replace All "»"
    Find MatchCase "&times;"
    Replace All "×"
    Find MatchCase "&divide;"
    Replace All "÷"
    Find MatchCase "&cent;"
    Replace All "¢"
    Find MatchCase "&pound;"
    Replace All "£"
    Find MatchCase "&curren;"
    Replace All "¤"
    Find MatchCase "&yen;"
    Replace All "¥"
    Find MatchCase "&sect;"
    Replace All "§"
    Find MatchCase "&copy;"
    Replace All "©"
    Find MatchCase "&not;"
    Replace All "¬"
    Find MatchCase "&reg;"
    Replace All "®"
    Find MatchCase "&deg;"
    Replace All "°"
    Find MatchCase "&micro;"
    Replace All "µ"
    Find MatchCase "&para;"
    Replace All "¶"
    Find MatchCase "&middot;"
    Replace All "·"
    Find MatchCase "&dagger;"
    Replace All "†"
    Find MatchCase "&Dagger;"
    Replace All "‡"
    Find MatchCase "&permil;"
    Replace All "‰"
    Find MatchCase "&euro;"
    Replace All "€"
    Find MatchCase "&frac14;"
    Replace All "¼"
    Find MatchCase "&frac12;"
    Replace All "½"
    Find MatchCase "&frac34;"
    Replace All "¾"
    Find MatchCase "&sup1;"
    Replace All "¹"
    Find MatchCase "&sup2;"
    Replace All "²"
    Find MatchCase "&sup3;"
    Replace All "³"
    Find MatchCase "&aacute;"
    Replace All "á"
    Find MatchCase "&Aacute;"
    Replace All "Á"
    Find MatchCase "&acirc;"
    Replace All "â"
    Find MatchCase "&Acirc;"
    Replace All "Â"
    Find MatchCase "&agrave;"
    Replace All "à"
    Find MatchCase "&Agrave;"
    Replace All "À"
    Find MatchCase "&aring;"
    Replace All "å"
    Find MatchCase "&Aring;"
    Replace All "Å"
    Find MatchCase "&atilde;"
    Replace All "ã"
    Find MatchCase "&Atilde;"
    Replace All "Ã"
    Find MatchCase "&auml;"
    Replace All "ä"
    Find MatchCase "&Auml;"
    Replace All "Ä"
    Find MatchCase "&ordf;"
    Replace All "ª"
    Find MatchCase "&aelig;"
    Replace All "æ"
    Find MatchCase "&AElig;"
    Replace All "Æ"
    Find MatchCase "&ccedil;"
    Replace All "ç"
    Find MatchCase "&Ccedil;"
    Replace All "Ç"
    Find MatchCase "&eth;"
    Replace All "ð"
    Find MatchCase "&ETH;"
    Replace All "Ð"
    Find MatchCase "&eacute;"
    Replace All "é"
    Find MatchCase "&Eacute;"
    Replace All "É"
    Find MatchCase "&ecirc;"
    Replace All "ê"
    Find MatchCase "&Ecirc;"
    Replace All "Ê"
    Find MatchCase "&egrave;"
    Replace All "è"
    Find MatchCase "&Egrave;"
    Replace All "È"
    Find MatchCase "&euml;"
    Replace All "ë"
    Find MatchCase "&Euml;"
    Replace All "Ë"
    Find MatchCase "&fnof;"
    Replace All "ƒ"
    Find MatchCase "&iacute;"
    Replace All "í"
    Find MatchCase "&Iacute;"
    Replace All "Í"
    Find MatchCase "&icirc;"
    Replace All "î"
    Find MatchCase "&Icirc;"
    Replace All "Î"
    Find MatchCase "&igrave;"
    Replace All "ì"
    Find MatchCase "&Igrave;"
    Replace All "Ì"
    Find MatchCase "&iuml;"
    Replace All "ï"
    Find MatchCase "&Iuml;"
    Replace All "Ï"
    Find MatchCase "&ntilde;"
    Replace All "ñ"
    Find MatchCase "&Ntilde;"
    Replace All "Ñ"
    Find MatchCase "&oacute;"
    Replace All "ó"
    Find MatchCase "&Oacute;"
    Replace All "Ó"
    Find MatchCase "&ocirc;"
    Replace All "ô"
    Find MatchCase "&Ocirc;"
    Replace All "Ô"
    Find MatchCase "&ograve;"
    Replace All "ò"
    Find MatchCase "&Ograve;"
    Replace All "Ò"
    Find MatchCase "&ordm;"
    Replace All "º"
    Find MatchCase "&oslash;"
    Replace All "ø"
    Find MatchCase "&Oslash;"
    Replace All "Ø"
    Find MatchCase "&otilde;"
    Replace All "õ"
    Find MatchCase "&Otilde;"
    Replace All "Õ"
    Find MatchCase "&ouml;"
    Replace All "ö"
    Find MatchCase "&Ouml;"
    Replace All "Ö"
    Find MatchCase "&oelig;"
    Replace All "œ"
    Find MatchCase "&OElig;"
    Replace All "Œ"
    Find MatchCase "&scaron;"
    Replace All "š"
    Find MatchCase "&Scaron;"
    Replace All "Š"
    Find MatchCase "&szlig;"
    Replace All "ß"
    Find MatchCase "&thorn;"
    Replace All "þ"
    Find MatchCase "&THORN;"
    Replace All "Þ"
    Find MatchCase "&uacute;"
    Replace All "ú"
    Find MatchCase "&Uacute;"
    Replace All "Ú"
    Find MatchCase "&ucirc;"
    Replace All "û"
    Find MatchCase "&Ucirc;"
    Replace All "Û"
    Find MatchCase "&ugrave;"
    Replace All "ù"
    Find MatchCase "&Ugrave;"
    Replace All "Ù"
    Find MatchCase "&uuml;"
    Replace All "ü"
    Find MatchCase "&Uuml;"
    Replace All "Ü"
    Find MatchCase "&yacute;"
    Replace All "ý"
    Find MatchCase "&Yacute;"
    Replace All "Ý"
    Find MatchCase "&yuml;"
    Replace All "ÿ"
    Find MatchCase "&Yuml;"
    Replace All "Ÿ"
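
For readers without UltraEdit, the same idea (turn `<br>` into line breaks, drop every tag, then decode character entities) can be sketched in Python. This is only an approximation of the macro above, not a port of it: the function name `strip_tags` is made up for illustration, and `html.unescape` from the standard library stands in for the long list of entity replacements. Note that it decodes `&nbsp;` to a no-break space rather than a normal space.

```python
import html
import re

def strip_tags(source: str) -> str:
    """Rough Python approximation of the tag-stripping macro (hypothetical helper)."""
    # A <br> that is not already followed by a line break becomes a newline.
    text = re.sub(r"<br[ /]*>(?![\r\n])", "\n", source)
    # Remove every remaining tag.
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Decode entities such as &eacute; or &amp;, then trim line breaks at both ends.
    return html.unescape(text).strip("\r\n")

print(strip_tags("<p>Caf&eacute; &amp; bar<br>next line</p>"))
```

Decoding the entities only after the tags are gone means that an escaped `&lt;b&gt;` in the text is not mistaken for a real tag.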

    But now the challenge is the opposite: to strip out the plain text of the page and keep all tags and JavaScript code.
    That's because I need to send other people a page saved from a WhatsApp Web conversation, but with the personal data and the chat removed.
    I plan to put a fixed notice, like "Edited and removed personal data", wherever the page text was.

    Because almost all HTML tags begin with "<" and end with ">", I thought of a regular expression like this:
    Search: ">.*?<"
    Replace: ">Edited and removed personal data<"

    But it's not working.
    Problem: it matches every ">" followed by "<", even if there is no text between them,
    so it replaces where there is no need to.

    If I search for ">.+?<" instead, the regular expression also catches other tags inside, like this.
    From this code
    it selects
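
Both behaviours described above can be reproduced with a tiny test. The fragment below is a hypothetical stand-in, not the page from the original post, and Python's `re` module is used only because the two patterns behave the same way there as in a Perl-style engine.

```python
import re

# Hypothetical fragment, not the page from the original post.
sample = "<div><span>Hi</span></div>"

# ".*?" may match nothing, so adjacent tags ("><") are matched too.
lazy_star = re.findall(r">.*?<", sample)

# ".+?" needs at least one character; after "<div>" that character is the
# "<" of the next tag, so the match swallows "<span>" as well.
lazy_plus = re.findall(r">.+?<", sample)

print(lazy_star)
print(lazy_plus)
```

The first list contains the empty `><` pairs along with `>Hi<`; the second contains a single match that runs from the end of `<div>` through `<span>` to the text.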

    I'm a newbie with regular expressions and I suspect that the solution could be far more complex than that.
    So I'm asking for some help to strip my chats out from between the tags.
    Maybe the solution is better achieved by a macro. Or by scripting.
    What do you think?

    Here is a small piece of code for testing suggestions: RegEx test and debugger



      Aug 21, 2020

      Hi Gabarito,

      you were very close :)

      Search: ">\K[^<\r\n][^<]*+"

      The first character after ">" can be anything but "<" or a newline.
      Then it matches everything up to the nearest "<".
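
A minimal sketch of that rule in Python, for anyone who wants to try it outside the editor. It is based on the expression `>\K[^<\r\n][^<]*+` quoted later in the thread, with two stated substitutions: Python's `re` module has no `\K`, so a lookbehind for ">" is used instead, and the possessive `*+` is written as a plain `*`. The names `NOTICE` and `redact` are illustrative only.

```python
import re

NOTICE = "Edited and removed personal data"

# (?<=>) stands in for ">\K": the ">" must precede the match but is not part of it.
# For a negated class like [^<], greedy and possessive repetition match the same text.
TEXT_BETWEEN_TAGS = re.compile(r"(?<=>)[^<\r\n][^<]*")

def redact(page: str) -> str:
    """Replace every text run that directly follows a tag (hypothetical helper)."""
    return TEXT_BETWEEN_TAGS.sub(NOTICE, page)

print(redact("<div><span>Hi there</span></div>"))
```

Tags that are followed directly by another tag or by a line break are left alone, because the first character class rejects "<", "\r" and "\n".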

      It assumes that all trailing spaces are trimmed, to eliminate "empty" matches like this:

      But it might not be sufficient if "<" can appear inside the text.

      BR, Fleggy

      I am not sure if you really want to replace everything between any tags. This regex
      skips "empty" matches (spaces, tabs, newlines and "-") but it replaces even the content of the <style> tag in your example.
      Shouldn't the replacement be applied only to a list of particular tags? E.g. <span>, <title>, <div> and some others?
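
One way the "only particular tags" idea could look, sketched in Python under assumptions: the tag list is just the three names mentioned above, and `redact_listed_tags` is a made-up name. It is not the expression from this post.

```python
import re

NOTICE = "Edited and removed personal data"

# Assumed list of tags whose text should be replaced; adjust as needed.
TAGS = ("span", "title", "div")

# Group 1 keeps the opening tag; the rest is the text that directly follows it.
PATTERN = re.compile(
    r"(<(?:%s)\b[^>]*>)[^<\r\n][^<]*" % "|".join(TAGS),
    re.IGNORECASE,
)

def redact_listed_tags(page: str) -> str:
    """Replace text only when it directly follows one of the listed opening tags."""
    return PATTERN.sub(lambda m: m.group(1) + NOTICE, page)

print(redact_listed_tags('<style>p{color:red}</style><span class="a">secret</span>'))
```

The content of `<style>` survives because "style" is not in the list. A limitation of this sketch: text that follows a nested closing tag, such as the tail in `<div><b>x</b> tail</div>`, is not touched.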

        Aug 21, 2020

        Thank you very much, Fleggy!

        Your solution works! 🙂🙂🙂.

        Later, I'll try it on a real file to see how it performs and whether it needs some adjustments.

        You are very advanced in RegExps. Congrats!
        I would never have expected a solution that fast.

        Thanks again, man!


          Aug 21, 2020

          Saw your EDIT post now.
          I'll apply both RegExps later and see results.
          I'll be back to tell you which one worked better.

            Aug 21, 2020

            Because I need to keep style tags, your first solution is the best for my problem.
            This: ">\K[^<\r\n][^<]*+"

            Applying the proposed solution to a real file, I found some other situations that could be handled better.
            There are a lot of dates and times.

            Like this:
            <div class="m61XR">29/07/2020</div>

            Or this:
            <div class="m61XR">07:46</div>

            This data can be kept. There is no need to remove it.
            Is it possible to write an expression that skips such formats?


            Now, this is much more important.
            Sometimes, I get text inside double quotes. Like this:

            <div class="_3tBW6"><span class="_2iq-U" title="Text and more text ... (more text) ...
            that can cross ...
            more than ...
            two lines"><div class="zFnXi">

            I think that I can solve this by applying two different RegExps:
            first, to remove the text between ">" and "<" (this solution is almost ready);
            second, to remove the text inside the double quotes of 'title=" text ...">'.
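
The second pass could be sketched like this in Python. It is an illustration under assumptions, not an expression posted in the thread: it only handles double-quoted `title` values, and the names `NOTICE` and `redact_titles` are made up.

```python
import re

NOTICE = "Edited and removed personal data"

# [^"] also matches line breaks, so a title that spans several lines is covered.
TITLE_VALUE = re.compile(r'(?<=title=")[^"]+')

def redact_titles(page: str) -> str:
    """Replace the value of every double-quoted title attribute (hypothetical helper)."""
    return TITLE_VALUE.sub(NOTICE, page)

sample = '<span class="_2iq-U" title="Text and more text\nthat can cross\ntwo lines"><div class="zFnXi">'
print(redact_titles(sample))
```

The `class` values are left untouched because the lookbehind requires the literal text `title="` right before the match. An attribute whose name merely ends in "title" would also be caught, so a real page may need a stricter pattern.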

            What do you think?


              Aug 22, 2020

              Well, I think that the correct regex should be context aware. But I know very little about HTML...
              It is possible to use a negative lookahead to skip date/time (date can be in any common format):

              For the attribute title you can use:

              BR, Fleggy

                Aug 22, 2020

                fleggy wrote (Aug 22, 2020):
                "It is possible to use a negative lookahead to skip date/time (date can be in any common format):"

                Your solution stated above has 3 little mistakes, IMHO.
                I'm no expert in RegExp, so excuse me if I'm wrong, but I think the right expression is:

                Fixes are in red:

                It took me a long time to realize that.
                And I still don't fully understand the negative lookahead thingie.

                fleggy wrote (Aug 22, 2020):
                "For the attribute title you can use:"

                This solution worked very well.



                  Aug 23, 2020

                  Oh, sorry. I overlooked that and did tests just for common delimiters.
                  A negative lookahead (?!test pattern) tests whether the following text does not match the test pattern (a time or date in your case). A lookahead (or lookaround, generally speaking) does not change the current position in the text. Thus, if there is no time/date, the regex \K[^<\r\n][^<]*+ can continue from the current position. Otherwise (time/date found) the lookahead fails, so the whole regex fails and the regex engine begins a new attempt to match at the next position.
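
To see the lookahead at work, here is a small Python sketch. The date and time formats in `KEEP` are an illustrative guess based on the two examples given earlier (29/07/2020 and 07:46), not the expression posted in this thread, and the lookbehind again stands in for `>\K`, which Python's `re` module does not support.

```python
import re

NOTICE = "Edited and removed personal data"

# Assumed formats: HH:MM, and D/M/Y with "/", "." or "-" as separator.
# The trailing "<" makes sure the date or time is the whole text of the element.
KEEP = r"\d{1,2}:\d{2}<|\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}<"

# The lookahead is tested right after ">" and consumes nothing; if it finds a
# date or time, the attempt at this position fails and the text is left alone.
PATTERN = re.compile(r"(?<=>)(?!" + KEEP + r")[^<\r\n][^<]*")

def redact_except_dates(page: str) -> str:
    """Replace text between tags unless it is only a date or a time."""
    return PATTERN.sub(NOTICE, page)

print(redact_except_dates('<div class="m61XR">07:46</div><span>hello</span>'))
```

After a failed attempt the engine moves on one character, but there the required ">" no longer precedes the position, so the rest of the date is not matched either.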

                  BR, Fleggy