Script to convert special characters to HTML code

arminus · Dec 05, 2017 · #1 · 2017-12-05T13:26+00:00

I've searched around for a while but haven't (yet) found a solution for replacing, say, all German umlauts in a file with their respective HTML entities (numeric or named). I suppose I could try to write a script, but before I attempt that, I wonder whether a solution for this already exists?

Mofi · Dec 05, 2017 · #2 · 2017-12-05T20:06+00:00

Let me first ask two questions:

What character encoding do you use for HTML files?

Do you use ASCII, ISO-8859-1, Windows-1252, UTF-8 or any other character encoding?

I myself use the following macro for German text:

IfExtIs "html"
Else
IfExtIs "htm"
Else
ExitMacro
EndIf
EndIf
InsertMode
ColumnModeOff
HexOff
UltraEditReOn
IfSel
Find MatchCase SelectText "ä"
Replace All "&auml;"
Find MatchCase SelectText "ö"
Replace All "&ouml;"
Find MatchCase SelectText "ü"
Replace All "&uuml;"
Find MatchCase SelectText "Ä"
Replace All "&Auml;"
Find MatchCase SelectText "Ö"
Replace All "&Ouml;"
Find MatchCase SelectText "Ü"
Replace All "&Uuml;"
Find MatchCase SelectText "ß"
Replace All "&szlig;"
Find MatchCase SelectText "„"
Replace All "&bdquo;"
Find MatchCase SelectText "“"
Replace All "&ldquo;"
Find MatchCase SelectText "€"
Replace All "&euro;"
Else
Find MatchCase "ä"
Replace All "&auml;"
Find MatchCase "ö"
Replace All "&ouml;"
Find MatchCase "ü"
Replace All "&uuml;"
Find MatchCase "Ä"
Replace All "&Auml;"
Find MatchCase "Ö"
Replace All "&Ouml;"
Find MatchCase "Ü"
Replace All "&Uuml;"
Find MatchCase "ß"
Replace All "&szlig;"
Find MatchCase "„"
Replace All "&bdquo;"
Find MatchCase "“"
Replace All "&ldquo;"
Find MatchCase "€"
Replace All "&euro;"
EndIf
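
For readers who prefer a script over a macro, the same table-driven idea can be sketched in plain JavaScript. This is only an illustration that works on a string, not on an UltraEdit document. The names `GERMAN_ENTITIES` and `replaceGermanChars` are made up for this example, and only a subset of the characters handled by the macro is included.

```javascript
// Hypothetical lookup table: character -> named HTML entity.
// \u escapes are used so the source stays valid in any file encoding.
var GERMAN_ENTITIES = {
  "\u00E4": "&auml;",  // ä
  "\u00F6": "&ouml;",  // ö
  "\u00FC": "&uuml;",  // ü
  "\u00C4": "&Auml;",  // Ä
  "\u00D6": "&Ouml;",  // Ö
  "\u00DC": "&Uuml;",  // Ü
  "\u00DF": "&szlig;", // ß
  "\u20AC": "&euro;"   // €
};

// Replace every character that has a table entry in one case-sensitive pass.
function replaceGermanChars(text) {
  var pattern = /[\u00E4\u00F6\u00FC\u00C4\u00D6\u00DC\u00DF\u20AC]/g;
  return text.replace(pattern, function (ch) {
    return GERMAN_ENTITIES[ch];
  });
}
```

Calling `replaceGermanChars` on a selected string instead of the whole text would correspond to the `IfSel` branch of the macro.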

I also use macros, all stored in the same macro file, with the following hotkeys:

ß ... &szlig;
ä ... &auml;
ö ... &ouml;
ü ... &uuml;
Shift+ä ... &Auml;
Shift+ö ... &Ouml;
Shift+ü ... &Uuml;
Ctrl+Shift+Q ... &quot;
Alt+Ctrl+E ... € ... &euro;
Alt+Space ... &nbsp;
Alt+RETURN ... <br>

The code of the macro for ä can be seen in my post "German umlaut problems after converting UTF-8 to ASCII". The others are written similarly.

arminus · Dec 06, 2017 · #3 · 2017-12-06T07:00+00:00

That depends on the project. But IMHO, that's not relevant for my use case. Basically, I'd like to replace, let's say, 'ä' with '&auml;', 'ö' with '&ouml;', and so forth. Or in broader terms, a script (template) that searches for one set of characters/strings and replaces them with another set might already help.

Mofi · Dec 07, 2017 · #4 · 2017-12-07T12:32+00:00

In reality, all that really matters when writing HTML, XHTML and XML is the character encoding / code page used for the file and the appropriate charset/encoding declaration.

The character set in the browser's memory is always Unicode, regardless of how the characters are encoded in the file that is uploaded to a server and downloaded by the browsers. All browsers have to interpret the bytes in a file according to the charset/encoding declaration in the loaded file and convert the characters to Unicode when reading the file into memory. The text displayed in the browser window is also always Unicode encoded (UTF-16 or UTF-32, most likely always little endian, depending on the programming language and string library used by the browser code).

The charset/encoding used for the file says nothing about the text finally displayed by the browsers. It is possible to use the ASCII character set for an HTML file whose text nevertheless contains characters from many languages, including languages that use a character set other than the Latin alphabet.

For example, suppose an HTML file with German text contains the characters € (Euro sign) and ß (Latin small letter sharp S). The HTML file could be encoded as:

ASCII:
Those two characters must be encoded in the HTML file as &euro; / &szlig; or &#8364; / &#223; or &#x20AC; / &#xDF;.
This means the HTML file contains for those two characters the bytes (hexadecimal): 26 65 75 72 6F 3B / 26 73 7A 6C 69 67 3B or 26 23 38 33 36 34 3B / 26 23 32 32 33 3B or 26 23 78 32 30 41 43 3B / 26 23 78 44 46 3B.
These HTML-specific multi-byte encodings for the characters € and ß can of course also be used with any of the encodings listed below.
ISO-8859-1:
Character € is not available in ISO-8859-1 and must be encoded as in an ASCII encoded HTML file when ISO-8859-1 is used for the HTML file.
Character ß is available with code point value 223 in this internationally standardized code page. So the HTML file can contain a single byte with hexadecimal value DF for this character.
Windows-1252:
Character € is available with code point value 128 in this code page defined by Microsoft. So the HTML file can contain a single byte with hexadecimal value 80 for this character.
Character ß is available with code point value 223 in code page Windows-1252. So the HTML file can contain a single byte with hexadecimal value DF for this character.
UTF-8:
Character € is available with code point value 8364 in Unicode. So the HTML file encoded with UTF-8 can contain the three bytes with the hexadecimal values E2 82 AC for this character.
Character ß is available with code point value 223 in Unicode. So the HTML file encoded with UTF-8 can contain the two bytes with the hexadecimal values C3 9F for this character.
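
The byte values listed above can be double-checked with a few lines of JavaScript. This is a minimal sketch. The helper names `utf8Bytes`, `asciiBytes` and `toHex` are hypothetical and not part of any UltraEdit API.

```javascript
// Encode one Unicode code point as UTF-8 and return the bytes as an array.
function utf8Bytes(cp) {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  if (cp < 0x10000) {
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
          0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
}

// Bytes of an ASCII-only string such as an HTML entity.
function asciiBytes(s) {
  var bytes = [];
  for (var i = 0; i < s.length; i++) bytes.push(s.charCodeAt(i));
  return bytes;
}

// Format bytes as two-digit upper-case hexadecimal values separated by spaces.
function toHex(bytes) {
  var out = [];
  for (var i = 0; i < bytes.length; i++) {
    var h = bytes[i].toString(16).toUpperCase();
    out.push(h.length < 2 ? "0" + h : h);
  }
  return out.join(" ");
}

toHex(utf8Bytes(0x20AC));       // "E2 82 AC"  (€ in UTF-8)
toHex(utf8Bytes(0xDF));         // "C3 9F"     (ß in UTF-8)
toHex(asciiBytes("&euro;"));    // "26 65 75 72 6F 3B"
toHex(asciiBytes("&#x20AC;"));  // "26 23 78 32 30 41 43 3B"
```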

So the smallest HTML files containing mainly German text can be produced using either Windows-1252 or ISO-8859-1. UTF-8 is often used by HTML writers who do not really know anything about character encoding. But for HTML files with mainly German text, UTF-8 encoding usually results in slightly larger files than Windows-1252 or ISO-8859-1, depending on how many non-ASCII characters (code point value greater than 127) the text contains.
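
The size difference can be estimated with a short sketch like the following. It assumes that every character of the text exists in the single-byte code page, so the single-byte size equals the number of characters. The function name `utf8Length` is made up for this example.

```javascript
// Count the bytes a string would occupy when encoded as UTF-8.
// Assumes well-formed UTF-16 input (no unpaired surrogates).
function utf8Length(text) {
  var n = 0;
  for (var i = 0; i < text.length; i++) {
    var c = text.charCodeAt(i);
    if (c < 0x80) {
      n += 1;                      // ASCII
    } else if (c < 0x800) {
      n += 2;                      // e.g. ä ö ü ß
    } else if (c >= 0xD800 && c <= 0xDBFF) {
      n += 4;                      // surrogate pair, skip the low half
      i++;
    } else {
      n += 3;                      // e.g. €
    }
  }
  return n;
}

var sample = "Gr\u00FC\u00DFe aus K\u00F6ln"; // "Grüße aus Köln"
sample.length;       // 14 bytes in Windows-1252 or ISO-8859-1
utf8Length(sample);  // 17 bytes in UTF-8
```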

The conversion from the file's encoding to Unicode in the browser's memory should be faster with a code page using a fixed one byte per character, because as far as I know a quickly accessible table is used for this conversion. Converting UTF-8 to UTF-16/UTF-32 requires a small piece of code executed on each byte in the file. But I have never run performance measurements to test which encoding the browsers can convert to Unicode fastest. I'm quite sure that the time difference is not noticeable for the user. Other conditions have a much bigger effect on how fast the browsers can process an HTML file than the character encoding.
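
To make the two conversion styles concrete, here is a rough sketch of both decoders. It is not how any particular browser does it, and it is not a benchmark. The table `CP1252_PARTIAL` is deliberately incomplete: only byte 0x80 (€) is mapped, and every other byte is treated as in ISO-8859-1. All names are hypothetical.

```javascript
// Partial single-byte table: index = byte value, value = Unicode code point.
var CP1252_PARTIAL = [];
for (var b = 0; b < 256; b++) CP1252_PARTIAL[b] = b;
CP1252_PARTIAL[0x80] = 0x20AC; // Euro sign in Windows-1252

// Single-byte code page: exactly one table lookup per byte.
function decodeSingleByte(bytes) {
  var out = "";
  for (var i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(CP1252_PARTIAL[bytes[i]]);
  }
  return out;
}

// UTF-8: the lead byte decides how many continuation bytes follow.
// Assumes well-formed input, no validation is done.
function decodeUtf8(bytes) {
  var out = "";
  var i = 0;
  while (i < bytes.length) {
    var b0 = bytes[i];
    var cp, extra;
    if (b0 < 0x80) { cp = b0; extra = 0; }
    else if (b0 < 0xE0) { cp = b0 & 0x1F; extra = 1; }
    else if (b0 < 0xF0) { cp = b0 & 0x0F; extra = 2; }
    else { cp = b0 & 0x07; extra = 3; }
    for (var k = 1; k <= extra; k++) {
      cp = (cp << 6) | (bytes[i + k] & 0x3F);
    }
    i += extra + 1;
    if (cp > 0xFFFF) {
      // Outside the BMP: store as a UTF-16 surrogate pair.
      cp -= 0x10000;
      out += String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
    } else {
      out += String.fromCharCode(cp);
    }
  }
  return out;
}
```

Both `decodeSingleByte([0x80, 0xDF])` and `decodeUtf8([0xE2, 0x82, 0xAC, 0xC3, 0x9F])` yield the string "€ß".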

I quickly created a script that runs Perl regular expression replaces to convert all non-ASCII characters in an opened file to HTML entities. I used the "Named character references" list of HTML5 stored in a JSON file and converted the JSON file to the attached script with some regular expression replaces and a few added lines of scripting code.
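
The attached script is not reproduced here. As a rough illustration of the general idea only, the following sketch converts non-ASCII characters in a string to entities in a single pass, which is a different technique from the Perl regular expression replaces the script runs. The table `NAMED_ENTITIES` is a tiny hypothetical subset, and characters without a name fall back to a decimal numeric reference.

```javascript
// Tiny illustrative subset: code point -> entity name (without & and ;).
var NAMED_ENTITIES = { 228: "auml", 223: "szlig", 8364: "euro" };

// Convert every character with a code point above 127 to an HTML entity.
// ASCII characters, including < & >, are left untouched.
function toEntities(text) {
  var out = "";
  for (var i = 0; i < text.length; i++) {
    var cp = text.charCodeAt(i);
    // Combine a UTF-16 surrogate pair into one code point.
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.length) {
      var low = text.charCodeAt(i + 1);
      if (low >= 0xDC00 && low <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        i++;
      }
    }
    if (cp < 128) {
      out += String.fromCharCode(cp);
    } else if (NAMED_ENTITIES[cp]) {
      out += "&" + NAMED_ENTITIES[cp] + ";";
    } else {
      out += "&#" + cp + ";";
    }
  }
  return out;
}

toEntities("K\u00E4se 5 \u20AC"); // "K&auml;se 5 &euro;"
toEntities("\u00F1");             // "&#241;" (no name in the subset table)
```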

The script should work for any version of UltraEdit and UEStudio supporting scripts.

Please note that the script does not convert ASCII characters to HTML entities. Characters like < & > must still be encoded manually if used in text. The script as-is has no HTML language intelligence, as can be seen by looking at the code. This also means that an ä inside a URL is converted to &auml; instead of being URL encoded as %C3%A4, as it should be in this case. UltraEdit v24.xx and UEStudio v17.xx have a built-in HTML URI encode/decode function which can be used to encode/decode a selected URL. In toolbar/menu mode, this feature is in the HTML toolbar, which is not visible by default after switching from ribbon mode to toolbar/menu mode.
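
The two cases not covered by the script can be sketched in plain JavaScript. `encodeURIComponent` is a standard JavaScript function, while `escapeMarkup` is a hypothetical helper written for this example. It is meant for text content only and would double-encode text that already contains entities.

```javascript
// Encode the three ASCII characters with special meaning in HTML.
// The ampersand must be replaced first, otherwise the "&" of the
// entities inserted for < and > would be encoded again.
function escapeMarkup(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

escapeMarkup("a < b & c > d");     // "a &lt; b &amp; c &gt; d"

// Inside a URL, non-ASCII characters are percent-encoded as UTF-8 bytes.
encodeURIComponent("K\u00E4se");   // "K%C3%A4se"
```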

Update: First version of script deleted. See my next post for an enhanced version of the script.

arminus · Dec 07, 2017 · #5 · 2017-12-07T13:30+00:00

Mofi wrote: I quickly created a script that runs Perl regular expression replaces to convert all non-ASCII characters in an opened file to HTML entities. I used the "Named character references" list of HTML5 stored in a JSON file and converted the JSON file to the attached script with some regular expression replaces and a few added lines of scripting code.

Stellar, this helps a lot, thank you very much!
The only thing I noticed: with a file in Windows-1252 I get "Invalid regular expression" errors. If the file is in UTF-8/16, everything is fine.

Well, I do have a follow-up question: As you said, running this on an entire file might be risky, so I tried to change the script so that it only searches and replaces within the current selection by adding

   UltraEdit.activeDocument.findReplace.selectText=true;

in line 2063. In that case nothing gets replaced, though.

Mofi · Dec 10, 2017 · #6 · 2017-12-10T14:09+00:00

I have enhanced the script to detect whether the active file is a Unicode file and, if the file is not Unicode encoded, to run only the replaces for characters with a code point value less than 256. It is impossible to search for a character with a code point value greater than 255 in a non-Unicode file, which is what caused the error message about an invalid Perl regular expression.
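
The filtering step described here can be sketched as follows. How the script actually detects a Unicode file is not shown in this thread, so the sketch simply takes that result as the parameter `isUnicodeFile`. The function name and the shape of the entity records are assumptions for illustration.

```javascript
// Return the entity records that are safe to search for in the file.
// Each record is assumed to look like { codePoint: 228, name: "&auml;" }.
function entitiesForFile(entities, isUnicodeFile) {
  var usable = [];
  for (var i = 0; i < entities.length; i++) {
    // Non-Unicode files: keep only code points below 256.
    if (isUnicodeFile || entities[i].codePoint < 256) {
      usable.push(entities[i]);
    }
  }
  return usable;
}

var sampleEntities = [
  { codePoint: 228, name: "&auml;" },
  { codePoint: 8364, name: "&euro;" },
  { codePoint: 223, name: "&szlig;" }
];

entitiesForFile(sampleEntities, false).length; // 2 (&auml; and &szlig;)
entitiesForFile(sampleEntities, true).length;  // 3
```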

The JSON file contains many cases of multiple entities for the same code point value, as I found out when sorting the list by code point value for the above enhancement. It does not make sense to have multiple entities for the same character in the entities string array. It was very time consuming to decide which entity to keep for a character with multiple entities: I first looked in "Character entity references in HTML 4" and, if the character was not present in that list at all, opened https://www.fileformat.info/info/unicode/char/xxxx/index.htm for each of these characters and studied the page. I hope I have made the best decision for each character with multiple entities in the JSON file.
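
Finding the duplicates can be automated, even if choosing between them cannot. The sketch below assumes the JSON has the shape of the HTML5 named character references file, where each key is an entity name and each value has a `codepoints` array. The function names and the `preferred` lookup (for example, names taken from the HTML 4 list) are hypothetical.

```javascript
// Group entity names by code point. References that expand to more
// than one code point are skipped in this sketch.
function groupByCodePoint(json) {
  var groups = {};
  for (var name in json) {
    if (!json.hasOwnProperty(name)) continue;
    var cps = json[name].codepoints;
    if (cps.length !== 1) continue;
    if (!groups[cps[0]]) groups[cps[0]] = [];
    groups[cps[0]].push(name);
  }
  return groups;
}

// Pick the first name found in the preferred set, else the first name.
function pickEntity(names, preferred) {
  for (var i = 0; i < names.length; i++) {
    if (preferred[names[i]]) return names[i];
  }
  return names[0];
}

var sampleJson = {
  "&mldr;":   { codepoints: [8230] },
  "&hellip;": { codepoints: [8230] },
  "&auml;":   { codepoints: [228] }
};
var groups = groupByCodePoint(sampleJson);
pickEntity(groups[8230], { "&hellip;": true }); // "&hellip;"
```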

I have also added a second version of the script that runs the replaces only on selected text.

arminus · Dec 11, 2017 · #7 · 2017-12-11T10:27+00:00

Thanks again, highly appreciated!

TXWizard · Dec 12, 2017 · #8 · 2017-12-12T01:44+00:00

This looks very nice, and is probably usable with little or no work in Node.js as well.