I asked myself the same question after reading your post, as I also don't know of any command to strip or remove all HTML tags from a file. In general, it is best to open the HTML file in a browser, press Ctrl+A and Ctrl+C, and paste the displayed and copied text into a file.
However, I asked IDM support by email about this point on their page. And here is the reply:
IDM support wrote:We had to investigate as well, to be plainly honest.
This webpage is an older page that has not been updated in some time, and we actually plan to remove this line now that we have looked at it. But we believe it actually refers to a user-submitted macro found on our macros page: HTML Strip Macros by Gabe Anguiano
But this was submitted in 1999. As you recently reported, there is currently an issue with older macros not being converted correctly, so at the moment it is likely this macro will not work correctly. The good news is that we have this corrected internally, and the fix will be a part of UltraEdit v25 (and the UEStudio counterpart).
There is one more piece of good news for you. A long time ago I wrote a macro for myself to remove HTML tags. I never published it because it was not designed for general usage. I have now quickly updated and enhanced it. It is still not perfect for general usage, but it does quite a good job on well-formatted HTML files.
Here is the code of my macro to strip HTML tags:
InsertMode
ColumnModeOff
HexOff
PerlReOn
Top
Find MatchCase RegExp "\r\n"
IfFound
Top
Find RegExp "<br[ /]*>(?![\r\n])"
Replace All "\r\n"
Else
Find MatchCase RegExp "\n"
IfFound
Top
Find RegExp "<br[ /]*>(?![\r\n])"
Replace All "\n"
Else
Find MatchCase RegExp "\r"
IfFound
Top
Find RegExp "<br[ /]*>(?![\r\n])"
Replace All "\r"
Else
Find RegExp "<br[ /]*>(?![\r\n])"
Replace All "\r\n"
EndIf
EndIf
EndIf
Top
Find MatchCase RegExp "<[^>]+>"
Replace All ""
TrimLeadingSpaces
TrimTrailingSpaces
Top
Find MatchCase RegExp "(?:(?:\r\n){2}|\n{2}|\r{2})\K(?:(?:\r\n)+|\n+|\r+)"
Replace All ""
Top
Find MatchCase RegExp "\A\v+"
Replace ""
Find MatchCase RegExp "\v+\z"
Replace ""
Bottom
InsertLine
Top
Find MatchCase "&nbsp;"
Replace All " "
Find MatchCase "&ensp;"
Replace All " "
Find MatchCase "&emsp;"
Replace All " "
Find MatchCase "&thinsp;"
Replace All " "
Find MatchCase "&zwj;"
Replace All ""
Find MatchCase "&zwnj;"
Replace All ""
Find MatchCase "&lt;"
Replace All "<"
Find MatchCase "&gt;"
Replace All ">"
Find MatchCase "&amp;"
Replace All "&"
Find MatchCase "&quot;"
Replace All """
Find MatchCase "&mdash;"
Replace All "—"
Find MatchCase "&ndash;"
Replace All "–"
Find MatchCase "&shy;"
Replace All "-"
Find MatchCase "&circ;"
Replace All "ˆ"
Find MatchCase "&iexcl;"
Replace All "¡"
Find MatchCase "&brvbar;"
Replace All "¦"
Find MatchCase "&uml;"
Replace All "¨"
Find MatchCase "&macr;"
Replace All "¯"
Find MatchCase "&acute;"
Replace All "´"
Find MatchCase "&cedil;"
Replace All "¸"
Find MatchCase "&iquest;"
Replace All "¿"
Find MatchCase "&tilde;"
Replace All "˜"
Find MatchCase "&lsquo;"
Replace All "‘"
Find MatchCase "&rsquo;"
Replace All "’"
Find MatchCase "&sbquo;"
Replace All "‚"
Find MatchCase "&ldquo;"
Replace All "“"
Find MatchCase "&rdquo;"
Replace All "”"
Find MatchCase "&bdquo;"
Replace All "„"
Find MatchCase "&lsaquo;"
Replace All "‹"
Find MatchCase "&rsaquo;"
Replace All "›"
Find MatchCase "&lt;"
Replace All "<"
Find MatchCase "&gt;"
Replace All ">"
Find MatchCase "&plusmn;"
Replace All "±"
Find MatchCase "&laquo;"
Replace All "«"
Find MatchCase "&raquo;"
Replace All "»"
Find MatchCase "&times;"
Replace All "×"
Find MatchCase "&divide;"
Replace All "÷"
Find MatchCase "&cent;"
Replace All "¢"
Find MatchCase "&pound;"
Replace All "£"
Find MatchCase "&curren;"
Replace All "¤"
Find MatchCase "&yen;"
Replace All "¥"
Find MatchCase "&sect;"
Replace All "§"
Find MatchCase "&copy;"
Replace All "©"
Find MatchCase "&not;"
Replace All "¬"
Find MatchCase "&reg;"
Replace All "®"
Find MatchCase "&deg;"
Replace All "°"
Find MatchCase "&micro;"
Replace All "µ"
Find MatchCase "&para;"
Replace All "¶"
Find MatchCase "&middot;"
Replace All "·"
Find MatchCase "&dagger;"
Replace All "†"
Find MatchCase "&Dagger;"
Replace All "‡"
Find MatchCase "&permil;"
Replace All "‰"
Find MatchCase "&euro;"
Replace All "€"
Find MatchCase "&frac14;"
Replace All "¼"
Find MatchCase "&frac12;"
Replace All "½"
Find MatchCase "&frac34;"
Replace All "¾"
Find MatchCase "&sup1;"
Replace All "¹"
Find MatchCase "&sup2;"
Replace All "²"
Find MatchCase "&sup3;"
Replace All "³"
Find MatchCase "&aacute;"
Replace All "á"
Find MatchCase "&Aacute;"
Replace All "Á"
Find MatchCase "&acirc;"
Replace All "â"
Find MatchCase "&Acirc;"
Replace All "Â"
Find MatchCase "&agrave;"
Replace All "à"
Find MatchCase "&Agrave;"
Replace All "À"
Find MatchCase "&aring;"
Replace All "å"
Find MatchCase "&Aring;"
Replace All "Å"
Find MatchCase "&atilde;"
Replace All "ã"
Find MatchCase "&Atilde;"
Replace All "Ã"
Find MatchCase "&auml;"
Replace All "ä"
Find MatchCase "&Auml;"
Replace All "Ä"
Find MatchCase "&ordf;"
Replace All "ª"
Find MatchCase "&aelig;"
Replace All "æ"
Find MatchCase "&AElig;"
Replace All "Æ"
Find MatchCase "&ccedil;"
Replace All "ç"
Find MatchCase "&Ccedil;"
Replace All "Ç"
Find MatchCase "&eth;"
Replace All "ð"
Find MatchCase "&ETH;"
Replace All "Ð"
Find MatchCase "&eacute;"
Replace All "é"
Find MatchCase "&Eacute;"
Replace All "É"
Find MatchCase "&ecirc;"
Replace All "ê"
Find MatchCase "&Ecirc;"
Replace All "Ê"
Find MatchCase "&egrave;"
Replace All "è"
Find MatchCase "&Egrave;"
Replace All "È"
Find MatchCase "&euml;"
Replace All "ë"
Find MatchCase "&Euml;"
Replace All "Ë"
Find MatchCase "&fnof;"
Replace All "ƒ"
Find MatchCase "&iacute;"
Replace All "í"
Find MatchCase "&Iacute;"
Replace All "Í"
Find MatchCase "&icirc;"
Replace All "î"
Find MatchCase "&Icirc;"
Replace All "Î"
Find MatchCase "&igrave;"
Replace All "ì"
Find MatchCase "&Igrave;"
Replace All "Ì"
Find MatchCase "&iuml;"
Replace All "ï"
Find MatchCase "&Iuml;"
Replace All "Ï"
Find MatchCase "&ntilde;"
Replace All "ñ"
Find MatchCase "&Ntilde;"
Replace All "Ñ"
Find MatchCase "&oacute;"
Replace All "ó"
Find MatchCase "&Oacute;"
Replace All "Ó"
Find MatchCase "&ocirc;"
Replace All "ô"
Find MatchCase "&Ocirc;"
Replace All "Ô"
Find MatchCase "&ograve;"
Replace All "ò"
Find MatchCase "&Ograve;"
Replace All "Ò"
Find MatchCase "&ordm;"
Replace All "º"
Find MatchCase "&oslash;"
Replace All "ø"
Find MatchCase "&Oslash;"
Replace All "Ø"
Find MatchCase "&otilde;"
Replace All "õ"
Find MatchCase "&Otilde;"
Replace All "Õ"
Find MatchCase "&ouml;"
Replace All "ö"
Find MatchCase "&Ouml;"
Replace All "Ö"
Find MatchCase "&oelig;"
Replace All "œ"
Find MatchCase "&OElig;"
Replace All "Œ"
Find MatchCase "&scaron;"
Replace All "š"
Find MatchCase "&Scaron;"
Replace All "Š"
Find MatchCase "&szlig;"
Replace All "ß"
Find MatchCase "&thorn;"
Replace All "þ"
Find MatchCase "&THORN;"
Replace All "Þ"
Find MatchCase "&uacute;"
Replace All "ú"
Find MatchCase "&Uacute;"
Replace All "Ú"
Find MatchCase "&ucirc;"
Replace All "û"
Find MatchCase "&Ucirc;"
Replace All "Û"
Find MatchCase "&ugrave;"
Replace All "ù"
Find MatchCase "&Ugrave;"
Replace All "Ù"
Find MatchCase "&uuml;"
Replace All "ü"
Find MatchCase "&Uuml;"
Replace All "Ü"
Find MatchCase "&yacute;"
Replace All "ý"
Find MatchCase "&Yacute;"
Replace All "Ý"
Find MatchCase "&yuml;"
Replace All "ÿ"
Find MatchCase "&Yuml;"
Replace All "Ÿ"
Note 1: The space character in the replace string for replacing all occurrences of &nbsp; is a no-break space with decimal code value 160 (hexadecimal A0) in Windows-1252 and Unicode. All other spaces in replace strings are normal spaces. Your browser most likely copies the no-break space as a normal space.
Note 2: The macro command line Replace All """ must be Replace All "\"" for versions of UltraEdit for Windows > v24.20 and versions of UEStudio > v18.00.
I wrote this macro to run on HTML files using the Windows-1252 code page, so it does not contain HTML entity replaces for the full Unicode range. It also does not contain replaces or conversions for characters that are URL encoded or decimal or hexadecimal HTML encoded, i.e. %C3%A7 or &#231; or &#xE7; for the character ç.
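For anyone who needs those encodings handled outside of UltraEdit, here is a minimal sketch in Python, not part of my macro. It assumes the tags have already been stripped and that percent-encoded sequences represent UTF-8 bytes. The function name decode_encoded_text is made up for this example; html.unescape and urllib.parse.unquote are standard library functions.

```python
import html
from urllib.parse import unquote


def decode_encoded_text(text: str) -> str:
    """Decode HTML character references, then percent-encoded UTF-8 bytes.

    Hypothetical helper for illustration only.
    """
    # html.unescape handles named (&ccedil;), decimal (&#231;)
    # and hexadecimal (&#xE7;) character references.
    decoded = html.unescape(text)
    # unquote turns %XX sequences into characters, reading the bytes as UTF-8.
    return unquote(decoded)


print(decode_encoded_text("%C3%A7 or &#231; or &#xE7; or &ccedil;"))
```

Be aware that unquote also decodes any literal percent sequence such as %20 in normal text, so this is only suitable when the text is known to contain URL-encoded data.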
It would be possible to convert this UltraEdit macro to an UltraEdit script, which would make the first part, inserting a line ending after each <br> or <br /> that has no line ending, and the Perl regular expressions to delete multiple blank lines easier. A scripting solution could also convert other specially encoded characters to real characters depending on the encoding of the file. But for my own use, this macro, enhanced today for files with UNIX or MAC line endings, was always enough.
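To illustrate why these two steps become simpler with a real programming language, here is a rough sketch in Python rather than in UltraEdit's scripting language. It assumes the file uses one line ending type throughout; the function names are invented for this example. It detects the line ending once instead of repeating the Find/Replace in nested IfFound blocks, and it limits runs of line endings to two without needing \K.

```python
import re


def detect_line_ending(text: str) -> str:
    """Return the line ending used in text, defaulting to DOS (CR+LF)."""
    for eol in ("\r\n", "\n", "\r"):
        if eol in text:
            return eol
    return "\r\n"


def break_lines_and_squeeze(text: str) -> str:
    """Hypothetical helper mirroring two steps of the macro."""
    eol = detect_line_ending(text)
    # Replace each <br> or <br /> not already followed by a line ending
    # with a line ending (case-insensitive like the macro's Find).
    text = re.sub(r"<br[ /]*>(?![\r\n])", lambda m: eol, text, flags=re.IGNORECASE)
    # Reduce three or more consecutive line endings to two,
    # which leaves at most one blank line between text blocks.
    return re.sub("(?:" + re.escape(eol) + "){3,}", lambda m: eol * 2, text)


print(repr(break_lines_and_squeeze("a<br>b<br />\nc\n\n\n\nd")))
```

A <br /> that is already followed by a line ending is left alone here, as in the macro, because the later tag-stripping step removes it.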
Please note that this macro has no error detection like web browsers have. So any < or > in the text of an HTML/XHTML file that is not correctly encoded as &lt; and &gt; would result in the HTML tags being stripped wrongly.
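The effect can be reproduced with the same tag expression <[^>]+> as used in the macro. The following small Python sketch is for demonstration only, with strip_tags being a name I made up. It shows that correctly encoded text survives, while text with an unencoded < and > loses everything between the two characters.

```python
import re

# Same expression as in the macro: "<", one or more non-">" characters, ">".
TAG = re.compile(r"<[^>]+>")


def strip_tags(text: str) -> str:
    """Remove everything that looks like a tag (hypothetical helper)."""
    return TAG.sub("", text)


good = strip_tags("<p>if a &lt; b and b &gt; c</p>")
bad = strip_tags("<p>if a < b and b > c</p>")
print(repr(good))
print(repr(bad))
```

In the second case the string "< b and b >" is treated as one tag and deleted together with the real tags.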