Stripping HTML tags

    Mar 02, 2018#1

    Hello to everyone,

    I studied the help files and searched the forum, but I cannot find a menu command to strip all HTML tags from a file in one fell swoop.
    IDM's website has a bullet point under features for journalists/writers:
    • Delete HTML and extra white space characters
    If there is no such direct command (similar to trim trailing spaces) available in UE 24.20, can anyone here point me to a regular expression that I can use in UE to strip all HTML tags via Find & Replace?

      Mar 03, 2018#2

      I asked myself the same question after reading your post, as I also don't know of any command to strip or remove all HTML tags from a file. In general it is best to open the HTML file in a browser, press Ctrl+A and Ctrl+C, and paste the copied text into a file.

      However, I asked IDM support by email about this point on their page. And here is the reply:
      IDM support wrote: We had to investigate as well, to be plainly honest.

      This webpage is an older page that has not been updated in some time, and we actually plan to remove this line now that we have looked at it. But we believe it actually refers to a user-submitted macro found on our macros page:

      HTML Strip Macros by Gabe Anguiano

      But this was submitted in 1999. As you recently reported, there is currently an issue with older macros not being converted correctly. So at the moment it is likely that this macro will not work correctly. The good news is that we have corrected this internally, and the fix will be a part of UltraEdit v25 (and the UEStudio counterpart).

      There is one more piece of good news for you. A long time ago I wrote a macro for myself to remove HTML tags. I never published it, as it was not designed for general usage. I have now quickly updated and enhanced it. It is still not perfect for general usage, but it does quite a good job on well-formatted HTML files.

      Here is the code of my macro to strip HTML tags:

      InsertMode
      ColumnModeOff
      HexOff
      PerlReOn
      Top
      Find MatchCase RegExp "\r\n"
      IfFound
      Top
      Find RegExp "<br[ /]*>(?![\r\n])"
      Replace All "\r\n"
      Else
      Find MatchCase RegExp "\n"
      IfFound
      Top
      Find RegExp "<br[ /]*>(?![\r\n])"
      Replace All "\n"
      Else
      Find MatchCase RegExp "\r"
      IfFound
      Top
      Find RegExp "<br[ /]*>(?![\r\n])"
      Replace All "\r"
      Else
      Find RegExp "<br[ /]*>(?![\r\n])"
      Replace All "\r\n"
      EndIf
      EndIf
      EndIf
      Top
      Find MatchCase RegExp "<[^>]+>"
      Replace All ""
      TrimLeadingSpaces
      TrimTrailingSpaces
      Top
      Find MatchCase RegExp "(?:(?:\r\n){2}|\n{2}|\r{2})\K(?:(?:\r\n)+|\n+|\r+)"
      Replace All ""
      Top
      Find MatchCase RegExp "\A\v+"
      Replace ""
      Find MatchCase RegExp "\v+\z"
      Replace ""
      Bottom
      InsertLine
      Top
      Find MatchCase "&nbsp;"
      Replace All " "
      Find MatchCase "&thinsp;"
      Replace All " "
      Find MatchCase "&emsp;"
      Replace All " "
      Find MatchCase "&ensp;"
      Replace All " "
      Find MatchCase "&zwj;"
      Replace All ""
      Find MatchCase "&zwnj;"
      Replace All ""
      Find MatchCase "&lt;"
      Replace All "<"
      Find MatchCase "&gt;"
      Replace All ">"
      Find MatchCase "&amp;"
      Replace All "&"
      Find MatchCase "&quot;"
      Replace All """
      Find MatchCase "&mdash;"
      Replace All "—"
      Find MatchCase "&ndash;"
      Replace All "–"
      Find MatchCase "&shy;"
      Replace All "-"
      Find MatchCase "&circ;"
      Replace All "ˆ"
      Find MatchCase "&iexcl;"
      Replace All "¡"
      Find MatchCase "&brvbar;"
      Replace All "¦"
      Find MatchCase "&uml;"
      Replace All "¨"
      Find MatchCase "&macr;"
      Replace All "¯"
      Find MatchCase "&acute;"
      Replace All "´"
      Find MatchCase "&cedil;"
      Replace All "¸"
      Find MatchCase "&iquest;"
      Replace All "¿"
      Find MatchCase "&tilde;"
      Replace All "˜"
      Find MatchCase "&lsquo;"
      Replace All "‘"
      Find MatchCase "&rsquo;"
      Replace All "’"
      Find MatchCase "&sbquo;"
      Replace All "‚"
      Find MatchCase "&ldquo;"
      Replace All "“"
      Find MatchCase "&rdquo;"
      Replace All "”"
      Find MatchCase "&bdquo;"
      Replace All "„"
      Find MatchCase "&lsaquo;"
      Replace All "‹"
      Find MatchCase "&rsaquo;"
      Replace All "›"
      Find MatchCase "&lt;"
      Replace All "<"
      Find MatchCase "&gt;"
      Replace All ">"
      Find MatchCase "&plusmn;"
      Replace All "±"
      Find MatchCase "&laquo;"
      Replace All "«"
      Find MatchCase "&raquo;"
      Replace All "»"
      Find MatchCase "&times;"
      Replace All "×"
      Find MatchCase "&divide;"
      Replace All "÷"
      Find MatchCase "&cent;"
      Replace All "¢"
      Find MatchCase "&pound;"
      Replace All "£"
      Find MatchCase "&curren;"
      Replace All "¤"
      Find MatchCase "&yen;"
      Replace All "¥"
      Find MatchCase "&sect;"
      Replace All "§"
      Find MatchCase "&copy;"
      Replace All "©"
      Find MatchCase "&not;"
      Replace All "¬"
      Find MatchCase "&reg;"
      Replace All "®"
      Find MatchCase "&deg;"
      Replace All "°"
      Find MatchCase "&micro;"
      Replace All "µ"
      Find MatchCase "&para;"
      Replace All "¶"
      Find MatchCase "&middot;"
      Replace All "·"
      Find MatchCase "&dagger;"
      Replace All "†"
      Find MatchCase "&Dagger;"
      Replace All "‡"
      Find MatchCase "&permil;"
      Replace All "‰"
      Find MatchCase "&euro;"
      Replace All "€"
      Find MatchCase "&frac14;"
      Replace All "¼"
      Find MatchCase "&frac12;"
      Replace All "½"
      Find MatchCase "&frac34;"
      Replace All "¾"
      Find MatchCase "&sup1;"
      Replace All "¹"
      Find MatchCase "&sup2;"
      Replace All "²"
      Find MatchCase "&sup3;"
      Replace All "³"
      Find MatchCase "&aacute;"
      Replace All "á"
      Find MatchCase "&Aacute;"
      Replace All "Á"
      Find MatchCase "&acirc;"
      Replace All "â"
      Find MatchCase "&Acirc;"
      Replace All "Â"
      Find MatchCase "&agrave;"
      Replace All "à"
      Find MatchCase "&Agrave;"
      Replace All "À"
      Find MatchCase "&aring;"
      Replace All "å"
      Find MatchCase "&Aring;"
      Replace All "Å"
      Find MatchCase "&atilde;"
      Replace All "ã"
      Find MatchCase "&Atilde;"
      Replace All "Ã"
      Find MatchCase "&auml;"
      Replace All "ä"
      Find MatchCase "&Auml;"
      Replace All "Ä"
      Find MatchCase "&ordf;"
      Replace All "ª"
      Find MatchCase "&aelig;"
      Replace All "æ"
      Find MatchCase "&AElig;"
      Replace All "Æ"
      Find MatchCase "&ccedil;"
      Replace All "ç"
      Find MatchCase "&Ccedil;"
      Replace All "Ç"
      Find MatchCase "&eth;"
      Replace All "ð"
      Find MatchCase "&ETH;"
      Replace All "Ð"
      Find MatchCase "&eacute;"
      Replace All "é"
      Find MatchCase "&Eacute;"
      Replace All "É"
      Find MatchCase "&ecirc;"
      Replace All "ê"
      Find MatchCase "&Ecirc;"
      Replace All "Ê"
      Find MatchCase "&egrave;"
      Replace All "è"
      Find MatchCase "&Egrave;"
      Replace All "È"
      Find MatchCase "&euml;"
      Replace All "ë"
      Find MatchCase "&Euml;"
      Replace All "Ë"
      Find MatchCase "&fnof;"
      Replace All "ƒ"
      Find MatchCase "&iacute;"
      Replace All "í"
      Find MatchCase "&Iacute;"
      Replace All "Í"
      Find MatchCase "&icirc;"
      Replace All "î"
      Find MatchCase "&Icirc;"
      Replace All "Î"
      Find MatchCase "&igrave;"
      Replace All "ì"
      Find MatchCase "&Igrave;"
      Replace All "Ì"
      Find MatchCase "&iuml;"
      Replace All "ï"
      Find MatchCase "&Iuml;"
      Replace All "Ï"
      Find MatchCase "&ntilde;"
      Replace All "ñ"
      Find MatchCase "&Ntilde;"
      Replace All "Ñ"
      Find MatchCase "&oacute;"
      Replace All "ó"
      Find MatchCase "&Oacute;"
      Replace All "Ó"
      Find MatchCase "&ocirc;"
      Replace All "ô"
      Find MatchCase "&Ocirc;"
      Replace All "Ô"
      Find MatchCase "&ograve;"
      Replace All "ò"
      Find MatchCase "&Ograve;"
      Replace All "Ò"
      Find MatchCase "&ordm;"
      Replace All "º"
      Find MatchCase "&oslash;"
      Replace All "ø"
      Find MatchCase "&Oslash;"
      Replace All "Ø"
      Find MatchCase "&otilde;"
      Replace All "õ"
      Find MatchCase "&Otilde;"
      Replace All "Õ"
      Find MatchCase "&ouml;"
      Replace All "ö"
      Find MatchCase "&Ouml;"
      Replace All "Ö"
      Find MatchCase "&oelig;"
      Replace All "œ"
      Find MatchCase "&OElig;"
      Replace All "Œ"
      Find MatchCase "&scaron;"
      Replace All "š"
      Find MatchCase "&Scaron;"
      Replace All "Š"
      Find MatchCase "&szlig;"
      Replace All "ß"
      Find MatchCase "&thorn;"
      Replace All "þ"
      Find MatchCase "&THORN;"
      Replace All "Þ"
      Find MatchCase "&uacute;"
      Replace All "ú"
      Find MatchCase "&Uacute;"
      Replace All "Ú"
      Find MatchCase "&ucirc;"
      Replace All "û"
      Find MatchCase "&Ucirc;"
      Replace All "Û"
      Find MatchCase "&ugrave;"
      Replace All "ù"
      Find MatchCase "&Ugrave;"
      Replace All "Ù"
      Find MatchCase "&uuml;"
      Replace All "ü"
      Find MatchCase "&Uuml;"
      Replace All "Ü"
      Find MatchCase "&yacute;"
      Replace All "ý"
      Find MatchCase "&Yacute;"
      Replace All "Ý"
      Find MatchCase "&yuml;"
      Replace All "ÿ"
      Find MatchCase "&Yuml;"
      Replace All "Ÿ"
      
      Note 1: The space character in the replace string for &nbsp; is a no-break space with decimal code value 160 (hexadecimal A0) in Windows-1252 and Unicode. All other spaces in replace strings are normal spaces. Your browser most likely copies the no-break space as a normal space.
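
To make the distinction in Note 1 concrete, here is a minimal sketch in Python. Python is used only because it can be run outside the editor; this is not UltraEdit macro code, and the function name `replace_nbsp_entity` is made up for illustration. It shows that the no-break space and the normal space are different characters, and that the no-break space is the single byte A0 in Windows-1252.

```python
NBSP = "\u00a0"   # no-break space: decimal 160, hexadecimal A0
SPACE = " "       # normal space: decimal 32, hexadecimal 20


def replace_nbsp_entity(text: str) -> str:
    """Replace the &nbsp; entity with a real no-break space character."""
    return text.replace("&nbsp;", NBSP)


sample = replace_nbsp_entity("10&nbsp;km")

print(ord(NBSP), ord(SPACE))       # code values of the two characters
print(NBSP.encode("cp1252"))       # the byte used in Windows-1252
print(sample == "10" + SPACE + "km")  # False: the result holds a no-break space
```

If the pasted macro contains a normal space in that one replace string, `&nbsp;` simply ends up as a normal space, which many users will find acceptable.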

      Note 2: The macro command line Replace All """ must be Replace All "\"" for versions of UltraEdit for Windows > v24.20 and versions of UEStudio > v18.00.

      I wrote this macro to run on HTML files using the Windows-1252 code page, so it does not contain HTML entity replacements for the full Unicode range. It also does not convert characters that are URL encoded or encoded as decimal or hexadecimal HTML character references, e.g. %C3%A7, &#231; or &#xE7; for the character ç.
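
For readers who need the encodings the macro leaves out, the following is a small sketch outside UltraEdit, again in Python. It relies on the standard library functions `html.unescape` and `urllib.parse.unquote`; the wrapper names `decode_entities` and `decode_url_escapes` are hypothetical, and treating URL escapes as UTF-8 is an assumption that fits the %C3%A7 example.

```python
import html
from urllib.parse import unquote


def decode_entities(text: str) -> str:
    """Decode named, decimal and hexadecimal HTML character references."""
    return html.unescape(text)


def decode_url_escapes(text: str) -> str:
    """Decode %XX sequences, interpreting the resulting bytes as UTF-8."""
    return unquote(text, encoding="utf-8")


print(decode_entities("&ccedil; &#231; &#xE7;"))  # ç ç ç
print(decode_url_escapes("%C3%A7"))               # ç
```

URL decoding should only be applied to text that is known to be URL encoded, since a literal percent sign followed by two hex digits in ordinary text would otherwise be changed.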

      It would be possible to convert this UltraEdit macro into an UltraEdit script. A script would simplify the first part, which puts a line ending in place of each <br> or <br /> that is not already followed by one, as well as the Perl regular expressions that delete multiple blank lines. A scripting solution could also convert other specially encoded characters and choose the real characters depending on the encoding of the file. But for my own purposes this macro, enhanced today for files with UNIX or MAC line endings, has always been enough.
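
As a rough illustration of how the same steps look in a general-purpose language, here is a sketch in Python rather than an actual UltraEdit script. It mirrors the macro's order of operations (detect the line ending, replace `<br>` tags, remove remaining tags, trim lines, collapse blank lines, end with one line ending) but leaves out the entity replacements. The function names are invented for this example.

```python
import re


def detect_line_ending(text: str) -> str:
    """Return the line ending used in text, defaulting to CRLF if none exists."""
    for eol in ("\r\n", "\n", "\r"):
        if eol in text:
            return eol
    return "\r\n"


def strip_tags(text: str) -> str:
    """Strip HTML tags from well-formed HTML, roughly like the macro does."""
    eol = detect_line_ending(text)
    # Replace each <br> or <br /> not already followed by a line ending.
    text = re.sub(r"<br[ /]*>(?![\r\n])", eol, text, flags=re.IGNORECASE)
    # Remove all remaining tags.
    text = re.sub(r"<[^>]+>", "", text)
    # Trim leading and trailing spaces and tabs on every line.
    lines = [line.strip(" \t") for line in re.split(r"\r\n|\n|\r", text)]
    # Keep at most one blank line in a row.
    kept = []
    for line in lines:
        if line == "" and kept and kept[-1] == "":
            continue
        kept.append(line)
    # Drop blank lines at the top and bottom of the file.
    while kept and kept[0] == "":
        kept.pop(0)
    while kept and kept[-1] == "":
        kept.pop()
    # Finish with exactly one line ending, as the macro's InsertLine does.
    return eol.join(kept) + eol


print(repr(strip_tags("<p>Hello<br>World</p>\n\n\n\n<p>  Bye </p>\n")))
```

Working on a list of lines avoids the three nested line-ending branches and the `\K` expression that the macro needs.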

      Please note that this macro has no error detection like web browsers have. Any < or > in the text of an HTML/XHTML file that is not correctly encoded as &lt; or &gt; would therefore cause the HTML tags to be stripped incorrectly.
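
The effect of this limitation can be reproduced with the same tag pattern the macro uses, `<[^>]+>`. The sketch below is Python, not macro code, and `naive_strip` is a made-up name; it compares a correctly encoded line with one that contains raw < and > characters.

```python
import re

# Same pattern as in the macro: "<", then anything except ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def naive_strip(text: str) -> str:
    """Remove everything that looks like a tag, with no error detection."""
    return TAG_PATTERN.sub("", text)


good = "<p>1 &lt; 2 and 3 &gt; 2</p>"
bad = "<p>1 < 2 and 3 > 2</p>"

print(repr(naive_strip(good)))  # text survives, entities still to be replaced
print(repr(naive_strip(bad)))   # everything between the raw < and > is lost
```

In the second case the pattern treats `< 2 and 3 >` as a tag and deletes it along with the real tags.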
      Best regards from an UC/UE/UES for Windows user from Austria

        Mar 03, 2018#3

        Mofi wrote: I asked myself the same question after reading your post, as I also don't know of any command to strip or remove all HTML tags from a file. In general it is best to open the HTML file in a browser, press Ctrl+A and Ctrl+C, and paste the copied text into a file.
        Thank you so much! I would call that a solution of sublime simplicity. It will work perfectly for me as I use this only for HTML files that I wrote myself.
        Like you indicated, I may lose my non-breaking spaces in the process, but I can easily live with that.

        And thank you for the other good news. 🙂
        However, for now, all I require is what you suggested above – open the HTML file in a browser, copy the text and paste it back into a fresh plain text file.

        And then of course you provided your own macro for stripping HTML. I copied that for now and hope to use it once I get more familiar with macros in general. I am also grateful that you took the time and care to explain in detail what I have to be aware of when running the macro.


        Here is a short introduction of myself, as opening this thread is my first post on the forum:

        After using an awful lot of editors I always find myself returning to UE. I started off with version 12, then 18, and I just ordered the upgrade to 24.20.
        (Running the trial version right now while I am waiting for the activation code from my online software merchant here in Germany.)
        My requirements are somewhat basic: mainly general writing and translation tasks, cleaning up quite large plain text files, and coding in HTML and CSS for ePUB creation. I must say, though, that I am learning to appreciate UE more and more each day.

        Atma