How to strip out text from HTML tags?

    Aug 21, 2020#1

    I'm facing a task that is the exact opposite of one I have already solved.
    I have a macro copied from this forum, written by Mofi, that strips the HTML tags out of HTML code.

    Here it is:

    InsertMode
    ColumnModeOff
    HexOff
    PerlReOn
    Top
    Find MatchCase RegExp "\r\n"
    IfFound
    Top
    Find MatchCase RegExp "<br[ /]*>(?![\r\n])"
    Replace All "\r\n"
    Else
    Find MatchCase RegExp "\n"
    IfFound
    Top
    Find MatchCase RegExp "<br[ /]*>(?![\r\n])"
    Replace All "\n"
    Else
    Find MatchCase RegExp "\r"
    IfFound
    Top
    Find MatchCase RegExp "<br[ /]*>(?![\r\n])"
    Replace All "\r"
    Else
    Find MatchCase RegExp "<br[ /]*>(?![\r\n])"
    Replace All "\r\n"
    EndIf
    EndIf
    EndIf
    Top
    Find MatchCase RegExp "<[^>]+>"
    Replace All ""
    TrimLeadingSpaces
    TrimTrailingSpaces
    Top
    Find MatchCase RegExp "(?:(?:\r\n){2}|\n{2}|\r{2})\K(?:(?:\r\n)+|\n+|\r+)"
    Replace All ""
    Top
    Find MatchCase RegExp "\A\v+"
    Replace ""
    Find MatchCase RegExp "\v+\z"
    Replace ""
    Bottom
    InsertLine
    Top
    Find MatchCase "&nbsp;"
    Replace All " "
    Find MatchCase "&thinsp;"
    Replace All " "
    Find MatchCase "&emsp;"
    Replace All " "
    Find MatchCase "&ensp;"
    Replace All " "
    Find MatchCase "&zwj;"
    Replace All ""
    Find MatchCase "&zwnj;"
    Replace All ""
    Find MatchCase "&lt;"
    Replace All "<"
    Find MatchCase "&gt;"
    Replace All ">"
    Find MatchCase "&amp;"
    Replace All "&"
    Find MatchCase "&quot;"
    Replace All "\""
    Find MatchCase "&mdash;"
    Replace All "—"
    Find MatchCase "&ndash;"
    Replace All "–"
    Find MatchCase "&shy;"
    Replace All "-"
    Find MatchCase "&circ;"
    Replace All "ˆ"
    Find MatchCase "&iexcl;"
    Replace All "¡"
    Find MatchCase "&brvbar;"
    Replace All "¦"
    Find MatchCase "&uml;"
    Replace All "¨"
    Find MatchCase "&macr;"
    Replace All "¯"
    Find MatchCase "&acute;"
    Replace All "´"
    Find MatchCase "&cedil;"
    Replace All "¸"
    Find MatchCase "&iquest;"
    Replace All "¿"
    Find MatchCase "&tilde;"
    Replace All "˜"
    Find MatchCase "&lsquo;"
    Replace All "‘"
    Find MatchCase "&rsquo;"
    Replace All "’"
    Find MatchCase "&sbquo;"
    Replace All "‚"
    Find MatchCase "&ldquo;"
    Replace All "“"
    Find MatchCase "&rdquo;"
    Replace All "”"
    Find MatchCase "&bdquo;"
    Replace All "„"
    Find MatchCase "&lsaquo;"
    Replace All "‹"
    Find MatchCase "&rsaquo;"
    Replace All "›"
    Find MatchCase "&lt;"
    Replace All "<"
    Find MatchCase "&gt;"
    Replace All ">"
    Find MatchCase "&plusmn;"
    Replace All "±"
    Find MatchCase "&laquo;"
    Replace All "«"
    Find MatchCase "&raquo;"
    Replace All "»"
    Find MatchCase "&times;"
    Replace All "×"
    Find MatchCase "&divide;"
    Replace All "÷"
    Find MatchCase "&cent;"
    Replace All "¢"
    Find MatchCase "&pound;"
    Replace All "£"
    Find MatchCase "&curren;"
    Replace All "¤"
    Find MatchCase "&yen;"
    Replace All "¥"
    Find MatchCase "&sect;"
    Replace All "§"
    Find MatchCase "&copy;"
    Replace All "©"
    Find MatchCase "&not;"
    Replace All "¬"
    Find MatchCase "&reg;"
    Replace All "®"
    Find MatchCase "&deg;"
    Replace All "°"
    Find MatchCase "&micro;"
    Replace All "µ"
    Find MatchCase "&para;"
    Replace All "¶"
    Find MatchCase "&middot;"
    Replace All "·"
    Find MatchCase "&dagger;"
    Replace All "†"
    Find MatchCase "&Dagger;"
    Replace All "‡"
    Find MatchCase "&permil;"
    Replace All "‰"
    Find MatchCase "&euro;"
    Replace All "€"
    Find MatchCase "&frac14;"
    Replace All "¼"
    Find MatchCase "&frac12;"
    Replace All "½"
    Find MatchCase "&frac34;"
    Replace All "¾"
    Find MatchCase "&sup1;"
    Replace All "¹"
    Find MatchCase "&sup2;"
    Replace All "²"
    Find MatchCase "&sup3;"
    Replace All "³"
    Find MatchCase "&aacute;"
    Replace All "á"
    Find MatchCase "&Aacute;"
    Replace All "Á"
    Find MatchCase "&acirc;"
    Replace All "â"
    Find MatchCase "&Acirc;"
    Replace All "Â"
    Find MatchCase "&agrave;"
    Replace All "à"
    Find MatchCase "&Agrave;"
    Replace All "À"
    Find MatchCase "&aring;"
    Replace All "å"
    Find MatchCase "&Aring;"
    Replace All "Å"
    Find MatchCase "&atilde;"
    Replace All "ã"
    Find MatchCase "&Atilde;"
    Replace All "Ã"
    Find MatchCase "&auml;"
    Replace All "ä"
    Find MatchCase "&Auml;"
    Replace All "Ä"
    Find MatchCase "&ordf;"
    Replace All "ª"
    Find MatchCase "&aelig;"
    Replace All "æ"
    Find MatchCase "&AElig;"
    Replace All "Æ"
    Find MatchCase "&ccedil;"
    Replace All "ç"
    Find MatchCase "&Ccedil;"
    Replace All "Ç"
    Find MatchCase "&eth;"
    Replace All "ð"
    Find MatchCase "&ETH;"
    Replace All "Ð"
    Find MatchCase "&eacute;"
    Replace All "é"
    Find MatchCase "&Eacute;"
    Replace All "É"
    Find MatchCase "&ecirc;"
    Replace All "ê"
    Find MatchCase "&Ecirc;"
    Replace All "Ê"
    Find MatchCase "&egrave;"
    Replace All "è"
    Find MatchCase "&Egrave;"
    Replace All "È"
    Find MatchCase "&euml;"
    Replace All "ë"
    Find MatchCase "&Euml;"
    Replace All "Ë"
    Find MatchCase "&fnof;"
    Replace All "ƒ"
    Find MatchCase "&iacute;"
    Replace All "í"
    Find MatchCase "&Iacute;"
    Replace All "Í"
    Find MatchCase "&icirc;"
    Replace All "î"
    Find MatchCase "&Icirc;"
    Replace All "Î"
    Find MatchCase "&igrave;"
    Replace All "ì"
    Find MatchCase "&Igrave;"
    Replace All "Ì"
    Find MatchCase "&iuml;"
    Replace All "ï"
    Find MatchCase "&Iuml;"
    Replace All "Ï"
    Find MatchCase "&ntilde;"
    Replace All "ñ"
    Find MatchCase "&Ntilde;"
    Replace All "Ñ"
    Find MatchCase "&oacute;"
    Replace All "ó"
    Find MatchCase "&Oacute;"
    Replace All "Ó"
    Find MatchCase "&ocirc;"
    Replace All "ô"
    Find MatchCase "&Ocirc;"
    Replace All "Ô"
    Find MatchCase "&ograve;"
    Replace All "ò"
    Find MatchCase "&Ograve;"
    Replace All "Ò"
    Find MatchCase "&ordm;"
    Replace All "º"
    Find MatchCase "&oslash;"
    Replace All "ø"
    Find MatchCase "&Oslash;"
    Replace All "Ø"
    Find MatchCase "&otilde;"
    Replace All "õ"
    Find MatchCase "&Otilde;"
    Replace All "Õ"
    Find MatchCase "&ouml;"
    Replace All "ö"
    Find MatchCase "&Ouml;"
    Replace All "Ö"
    Find MatchCase "&oelig;"
    Replace All "œ"
    Find MatchCase "&OElig;"
    Replace All "Œ"
    Find MatchCase "&scaron;"
    Replace All "š"
    Find MatchCase "&Scaron;"
    Replace All "Š"
    Find MatchCase "&szlig;"
    Replace All "ß"
    Find MatchCase "&thorn;"
    Replace All "þ"
    Find MatchCase "&THORN;"
    Replace All "Þ"
    Find MatchCase "&uacute;"
    Replace All "ú"
    Find MatchCase "&Uacute;"
    Replace All "Ú"
    Find MatchCase "&ucirc;"
    Replace All "û"
    Find MatchCase "&Ucirc;"
    Replace All "Û"
    Find MatchCase "&ugrave;"
    Replace All "ù"
    Find MatchCase "&Ugrave;"
    Replace All "Ù"
    Find MatchCase "&uuml;"
    Replace All "ü"
    Find MatchCase "&Uuml;"
    Replace All "Ü"
    Find MatchCase "&yacute;"
    Replace All "ý"
    Find MatchCase "&Yacute;"
    Replace All "Ý"
    Find MatchCase "&yuml;"
    Replace All "ÿ"
    Find MatchCase "&Yuml;"
    Replace All "Ÿ"
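
For comparison, here is a rough Python sketch of what the tag-stripping part of the macro does. It is not a port of the macro. It assumes Python's standard `html.unescape` is an acceptable stand-in for the long entity list, it always uses "\n" for `<br>`, and `strip_tags` is a made-up name. One difference worth noting: `html.unescape` decodes in a single pass, so `&amp;lt;` becomes the text `&lt;`, whereas the macro, as far as I can tell, replaces `&amp;` before its second `&lt;` pass and would turn it into `<`.

```python
import html
import re

def strip_tags(source):
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    # Turn <br> into a line break unless a line break already follows it.
    text = re.sub(r"<br[ /]*>(?![\r\n])", "\n", source)
    # Remove every remaining tag.
    text = re.sub(r"<[^>]+>", "", text)
    # Decode named and numeric character references in one pass.
    return html.unescape(text)

print(strip_tags("<p>Tom &amp; Jerry<br>caf&eacute; &gt; tea</p>"))
```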
    
    But now the challenge is the opposite: to strip out the plain text of the page and keep all tags and JavaScript code.
    I need to send other people a page saved from a WhatsApp Web conversation, with the personal data and chat removed.
    I plan to put a fixed notice, like "Edited and removed personal data", wherever the page text was.

    Because HTML tags begin with "<" and end with ">", I thought of a regular expression like this:
    Search: ">.*?<"
    Replace: ">Edited and removed personal data<"

    But it's not working.
    Problem: it matches every ">" followed by "<", even when there is no text between them,
    so it replaces where there is nothing to replace.

    If I search for ">.+?<", the regular expression catches another tag in between, like this.
    From this code
    <span></span><span></span>
    it selects
    ><span><

    I'm a newbie with regular expressions, and I suspect that the solution could be far more complex than that.
    So I'm asking for some help to strip my chats out from between the tags.
    Maybe the solution is better achieved with a macro. Or with a script.
    What do you think?

    Here is a small piece of code for testing suggestions: RegEx test and debugger

    Thanks.
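
The two attempts described in this post can be reproduced outside the editor. This is a small Python sketch, assuming Python's `re` module treats these two simple patterns the same way as the editor's Perl engine; the sample string is made up.

```python
import re

sample = "<span></span><span>secret</span>"

# ">.*?<" is lazy but may match zero characters, so the empty gap "><" matches too.
lazy_star = re.findall(r">.*?<", sample)

# ">.+?<" needs at least one character, and "." is allowed to be "<",
# so the match runs across a whole tag until the next "<".
lazy_plus = re.findall(r">.+?<", sample)

print(lazy_star)
print(lazy_plus)
```

Both patterns fail for the same reason: "." is allowed to match "<", and nothing in the pattern requires real text after the ">".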

      Aug 21, 2020#2

      Hi Gabarito,

      you were very close :)
      >\K[^<\r\n][^<]*+

      The first character after ">" can be anything but "<" or a newline.
      Then it matches everything until the nearest "<".

      It assumes that all trailing spaces have been trimmed, to eliminate "empty" matches like this:
      </end_tag>spaces...
      <next_tag>...

      But it might not be sufficient if "<" can appear inside the text.
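
A minimal Python sketch of the same idea, for trying it outside the editor. Python's `re` module has no `\K`, so a lookbehind `(?<=>)` is swapped in, and a plain greedy `*` is used instead of the possessive `*+`, which should match the same text here. The name `redact` and the sample strings are made up.

```python
import re

NOTICE = "Edited and removed personal data"

# Right after a ">", take one character that is neither "<" nor a line break,
# then everything up to the nearest "<".
TEXT_AFTER_TAG = re.compile(r"(?<=>)[^<\r\n][^<]*")

def redact(html):
    """Replace the text that follows a tag with a fixed notice."""
    return TEXT_AFTER_TAG.sub(NOTICE, html)

print(redact("<div><span></span><span>Hello, John</span></div>\n<p>Bye</p>"))

# Untrimmed trailing spaces count as text, which is the "empty" match mentioned above.
print(redact("<b></b>  \n<i></i>"))
```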

      BR, Fleggy

      EDIT:
      I am not sure if you really want to replace everything between any tags. This regex
      >\K(?![\s\-]*+<)[^<]++
      skips "empty" matches (spaces, tabs, newlines and "-"), but it also replaces the content of the <style> tag in your example.
      Shouldn't the replacement be applied only to a list of particular tags? E.g. <span>, <title>, <div> and some others?
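
Both points can be sketched in Python. `ANY_TAG` is the regex from the EDIT with a lookbehind swapped in for `\K` and plain greedy quantifiers instead of the possessive ones. `SOME_TAGS` is only a guess at how a tag-restricted variant might look: the tag list is an assumption, and it keeps the opening tag by capturing it and putting it back in the replacement.

```python
import re

NOTICE = "Edited and removed personal data"

# Text after any ">", unless only whitespace or "-" precedes the next "<".
ANY_TAG = re.compile(r"(?<=>)(?![\s\-]*<)[^<]+")

# Hypothetical variant: only text directly after <span>, <div> or <title>.
SOME_TAGS = re.compile(r"(<(?:span|div|title)\b[^>]*>)(?![\s\-]*<)[^<]+")

sample = "<style>p { color: red }</style>\n<div> - </div><span>Hi</span>"

any_result = ANY_TAG.sub(NOTICE, sample)
some_result = SOME_TAGS.sub(lambda m: m.group(1) + NOTICE, sample)

print(any_result)   # the <style> content is replaced as well
print(some_result)  # the <style> content is left alone
```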

        Aug 21, 2020#3

        Thank you very much, Fleggy!

        Your solution works! 🙂🙂🙂.

        Later, I'll run it on real test code to see how it performs and whether it needs any adjustments.

        You are very advanced in RegExps. Congrats!
        I would never have expected a solution that fast.

        Thanks again, man!

        🤝👏👍👊🤝

          Aug 21, 2020#4

          Saw your EDIT post now.
          I'll apply both RegExps later and see results.
          I'll be back to tell you which one worked better.

            Aug 21, 2020#5

            Because I need to keep style tags, your first solution is the best for my problem.
            This: ">\K[^<\r\n][^<]*+"


            Applying the proposed solution to a real file, I found some other situations that could be handled better.
            There are a lot of dates and times.

            Like this:
            <div class="m61XR">29/07/2020</div>

            Or this:
            <div class="m61XR">07:46</div>

            This data can be kept. There is no need to remove it.
            Is it possible to write an expression that skips such formats?

            ------------------

            Now, this is much more important.
            Sometimes, I get text inside double quotes. Like this:

            <div class="_3tBW6"><span class="_2iq-U" title="Text and more text ... (more text) ...
            that can cross ...
            more than ...
            two lines"><div class="zFnXi">

            I think I can solve this by applying two different RegExps:
            first, to remove text between ">" and "<" ➡️ solution is almost ready
            second, to remove text inside double quotes from ➡️ 'title=" text ...">'

            What do you think?

              Aug 22, 2020#6

              Well, I think that the correct regex should be context aware. But I know very little about HTML...
              It is possible to use a negative lookahead to skip date/time (date can be in any common format):
              >(?!(?:\d\d:\d\d|\d\d([-./])\d\d\1\d\d/d\d|\d\d\d\d([-./])\d\d\2\d\d)<)\K[^<\r\n][^<]*+

              For the attribute title you can use:
              title="\K[^"]++
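
The title regex can be tried the same way in Python. This is only a sketch: the lookbehind `(?<=title=")` stands in for `title="\K`, a greedy `+` replaces the possessive `++`, and the sample is shortened from the post above. Since `[^"]` also matches line breaks, a title that spans several lines is replaced as a whole.

```python
import re

NOTICE = "Edited and removed personal data"

# Everything between title=" and the closing double quote, line breaks included.
TITLE_VALUE = re.compile(r'(?<=title=")[^"]+')

sample = ('<span class="_2iq-U" title="Text and more text\n'
          'that can cross\ntwo lines"><div class="zFnXi">')

result = TITLE_VALUE.sub(NOTICE, sample)
print(result)
```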

              BR, Fleggy

                Aug 22, 2020#7

                fleggy wrote:
                Aug 22, 2020
                It is possible to use a negative lookahead to skip date/time (date can be in any common format):
                >(?!(?:\d\d:\d\d|\d\d([-./])\d\d\1\d\d/d\d|\d\d\d\d([-./])\d\d\2\d\d)<)\K[^<\r\n][^<]*+
                Your solution stated above has 3 little mistakes, IMHO.
                I'm no expert in RegExp, so excuse me if I'm wrong, but I think the right expression is this one (the three fixes are the added backslashes, two before "." and one before "/"):
                >(?!(?:\d\d:\d\d|\d\d([-\./])\d\d\1\d\d\/d\d|\d\d\d\d([-\./])\d\d\2\d\d)<)\K[^<\r\n][^<]*+

                It took me far too long to realize that.
                And I still don't fully understand the negative lookahead thingie.



                fleggy wrote:
                Aug 22, 2020
                For the attribute title you can use:
                title="\K[^"]++
                This solution worked very well.

                Thanks.

                  Aug 23, 2020#8

                  Oh, sorry. I overlooked that and did tests just for common delimiters.
                  A negative lookahead (?!test pattern) tests that the following text does not match a test pattern (a time or date in your case). A lookahead (or lookaround, generally speaking) does not change the current position in the text. Thus, if there is no time/date, the regex \K[^<\r\n][^<]*+ can continue from the current position. Otherwise (time/date found) the lookahead fails, so the whole regex fails and the regex engine begins a new attempt to match at the next position.
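
A small Python sketch of this behaviour, under two assumptions. First, the `\d\d/d\d` in the pattern posted earlier looks like a typo for `\d\d\d\d` (a four-digit year), and the sketch uses the latter. Second, Python's `re` has no `\K`, so the lookbehind `(?<=>)` is swapped in and a greedy `*` replaces the possessive `*+`. The names `redact` and `probe` are made up.

```python
import re

NOTICE = "Edited and removed personal data"

# Assumption: the posted "\d\d/d\d" was meant as "\d\d\d\d" (four-digit year).
DATE_OR_TIME = r"(?:\d\d:\d\d|\d\d([-./])\d\d\1\d\d\d\d|\d\d\d\d([-./])\d\d\2\d\d)"

# "(?<=>)" replaces ">\K"; the negative lookahead is tested right after the ">".
PATTERN = re.compile(r"(?<=>)(?!" + DATE_OR_TIME + r"<)[^<\r\n][^<]*")

def redact(html):
    """Replace text after a tag unless it is exactly one date or time."""
    return PATTERN.sub(NOTICE, html)

print(redact('<div class="m61XR">29/07/2020</div>'))  # date is kept
print(redact('<div class="m61XR">07:46</div>'))       # time is kept
print(redact('<div class="x">Hello</div>'))           # text is replaced

# A lookahead consumes nothing: this match ends right after the ">".
probe = re.compile(r">(?!\d\d:\d\d<)")
print(probe.search(">hello<").end())

# At the first ">" the lookahead fails, so the engine retries further along
# and succeeds at the ">" that closes "</div>".
print(probe.search(">07:46</div><p>x").start())
```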

                  BR, Fleggy