Scripts for converting non-ASCII characters to hexadecimal entities and entities back to characters

Samir · Post #1 · 2024-07-30T12:16+00:00

Hi all,

I need to create an UltraEdit script to convert characters within an open file.

Option 1 is Unicode character to hexadecimal entity, like:

Unicode "α" to "&#x03B1;" Hexa
Unicode "β" to "&#x03B2;" Hexa
Unicode "γ" to "&#x03B3;" Hexa
Unicode "δ" to "&#x03B4;" Hexa
Unicode "ε" to "&#x03B5;" Hexa
Unicode "ζ" to "&#x03B6;" Hexa
Unicode "À" to "&#x00C0;" Hexa
Unicode "Á" to "&#x00C1;" Hexa

Option 2 is hexadecimal entity to Unicode character, like:

Hexa "&#x03B1;" to "α" Unicode
Hexa "&#x03B2;" to "β" Unicode
Hexa "&#x03B3;" to "γ" Unicode
Hexa "&#x03B4;" to "δ" Unicode
Hexa "&#x03B5;" to "ε" Unicode
Hexa "&#x03B6;" to "ζ" Unicode
Hexa "&#x00C0;" to "À" Unicode
Hexa "&#x00C1;" to "Á" Unicode

List of characters to ignore during conversion:

ignore_chars = ["&amp;", "&#x0026;", "&#38;", "&lt;", "&#x003C;", "&#60;", "&gt;", "&#x003E;", "&#62;", "&#39;", "&#x0027;", "&apos;"]

Can anyone help me?

Mofi · Post #2 · 2024-07-30T19:24+00:00

I do not understand why those characters should be converted to HTML entities in hexadecimal notation in a UTF-8 encoded HTML/XHTML/XML file.

I suggest reading the forum topic Script to convert special characters to HTML code. There is also a ZIP file with two scripts to convert all characters in a file or just in a selection to named HTML entities. You can replace the named entities by the corresponding entities in hexadecimal notation.

For the reverse conversion, just exchange the search and replace strings, i.e. use asHtml5Entities[nEntityIndex+1],asHtml5Entities[nEntityIndex] instead of asHtml5Entities[nEntityIndex],asHtml5Entities[nEntityIndex+1], with the +1 simply moved from the replace string to the search string.
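The swap of search and replace string can be sketched as follows. This is only an illustration on a plain JavaScript string, while the real scripts run the replaces on the active file. The three pairs in asHtml5Entities, the function name replaceAllPairs and the parameter bReverse are assumptions made for this example and are not taken from the ZIP file.

```javascript
// Hypothetical excerpt of the list: character first, entity second, pair by pair.
var asHtml5Entities = [
    "\u03B1", "&#x03B1;",   // Greek small letter alpha
    "\u03B2", "&#x03B2;",   // Greek small letter beta
    "\u00C0", "&#x00C0;"    // Latin capital letter A with grave
];

// Hypothetical helper: replaces every occurrence of each search string.
// With bReverse set to true, the index offset +1 moves from the replace
// string to the search string, which reverses the conversion.
function replaceAllPairs(textToConvert, bReverse) {
    for (var nEntityIndex = 0; nEntityIndex < asHtml5Entities.length; nEntityIndex += 2) {
        var sSearch = bReverse ? asHtml5Entities[nEntityIndex + 1] : asHtml5Entities[nEntityIndex];
        var sReplace = bReverse ? asHtml5Entities[nEntityIndex] : asHtml5Entities[nEntityIndex + 1];
        // split and join replaces all occurrences without regular expressions.
        textToConvert = textToConvert.split(sSearch).join(sReplace);
    }
    return textToConvert;
}
```

Calling replaceAllPairs(text, false) converts the listed characters to entities and replaceAllPairs(text, true) converts the entities back to characters.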

Samir · Post #3 · 2024-07-31T08:58+00:00

Hello Mofi,

Thank you very much Mofi for your attention to this matter.

I have used the "ToHtmlEntities.zip" script you provided before and still do.

Now I have some tasks that require entities in hexadecimal notation, for which I need a script like this.

I have tried to write such a script, which I am attaching together with a sample file containing the input and the output. The script is not working properly: some characters are converted correctly and some are converted wrongly. If you can solve this, it will be greatly appreciated.

Mofi · Post #4 · 2024-07-31T18:43+00:00

The attached ZIP file contains the revised script. It works with UltraEdit for Windows v2024.0.0.35 on the input test file for both conversions.

The corrections are:

The definitions of the functions hexDecimalToUnicode and unicodeToHexDecimal are moved from inside the main code to above the main code, which makes the code more readable in my opinion. The two replace functions are still defined as anonymous functions inside these two functions. In my own script I would also define them outside of the two functions, as I do not like anonymous functions and function definitions inside functions, but it is your script.
The function parameter text of these two functions is renamed to textToConvert so that it is no longer syntax highlighted as a known keyword, which was a false positive. The name is also better in my opinion, as it describes what is done with the string passed to the functions.
The list of entity strings to ignore is reduced to those strings from your list which can be matched by the regular expression in the replace function at all.
The regular expression for finding entities in hexadecimal notation was wrong for the further processing in the anonymous replace function. Because of the marking round brackets, just the four hexadecimal digits were passed to the anonymous replace function when an entity in hexadecimal notation with exactly four digits was found. The entire entity with &#x at the beginning, the hexadecimal digits, and the semicolon at the end must be passed to the anonymous replace function to be processed correctly.
The code of the anonymous replace function for converting an entity in hexadecimal notation to the appropriate Unicode encoded character is reduced to a minimum. More is not necessary.
char was defined in editions 1 to 3 of the ECMAScript specification as a keyword reserved for future use, as can be read at Lexical grammar - Keywords. That is the reason why it is syntax highlighted as a keyword by the wordfile javascript.uew of UltraEdit. It is still recommended not to use the word char as a variable name or function parameter name, especially if it is a completely wrong name for what is passed to the anonymous replace function of the function hexDecimalToUnicode.
BTW: You see here why anonymous functions are ugly: it is hard to describe in which anonymous function something has to be changed or was changed by another developer.
The names hexEntity and nonAsciiChar are definitely better for every reader of the code.
Lines in main code not needed at all are removed from main code.
UltraEdit.getString is now used instead of UltraEdit.getValue to prompt the user for the type of conversion, as in this case nothing has to be deleted first, like the 0 when using UltraEdit.getValue.
The script now checks first whether the user has really entered 1 or 2 before doing anything with the active file. This gives the script user the possibility to cancel a script execution started by mistake by entering nothing or anything other than 1 or 2.
A text file that is not UTF-8, UTF-16 LE or UTF-16 BE encoded must be explicitly converted from non-Unicode to UTF-8. The encoding property of an UltraEdit document object is a read-only property, as described in the help of UltraEdit on the page Scripting commands. It is not possible to assign a different integer value to this property and expect that this results in a conversion of the characters in the entire file.
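
The points about the two conversion functions can be illustrated with a minimal sketch. It is not the code from the attached ZIP file. The names hexDecimalToUnicode, unicodeToHexDecimal, textToConvert, hexEntity and nonAsciiChar are the ones mentioned above, while the regular expressions, the object ignoredEntities, the padding to four digits and the handling of surrogate pairs are assumptions made for this example. It works on a JavaScript string and therefore needs a version of UltraEdit or UEStudio that uses UTF-16 encoded JavaScript strings.

```javascript
// Assumption: only the hexadecimal entities of the ignore list are kept,
// as the other forms cannot be matched by the regular expression below.
var ignoredEntities = {
    "&#x0026;": true,  // ampersand
    "&#x003C;": true,  // less-than sign
    "&#x003E;": true,  // greater-than sign
    "&#x0027;": true   // apostrophe
};

function unicodeToHexDecimal(textToConvert) {
    // A surrogate pair is matched first so that a character outside the
    // Basic Multilingual Plane becomes one entity instead of two.
    return textToConvert.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\x00-\x7F]/g,
        function (nonAsciiChar) {
            var codePoint = nonAsciiChar.charCodeAt(0);
            if (nonAsciiChar.length === 2) {
                codePoint = 0x10000 + ((codePoint - 0xD800) << 10) +
                            (nonAsciiChar.charCodeAt(1) - 0xDC00);
            }
            var hexDigits = codePoint.toString(16).toUpperCase();
            while (hexDigits.length < 4) {
                hexDigits = "0" + hexDigits;  // pad to at least four digits
            }
            return "&#x" + hexDigits + ";";
        });
}

function hexDecimalToUnicode(textToConvert) {
    // No marking round brackets: the entire entity, from &#x to the
    // semicolon, is passed as first argument to the replace function.
    return textToConvert.replace(/&#x[0-9A-Fa-f]{1,6};/g, function (hexEntity) {
        if (ignoredEntities[hexEntity] === true) {
            return hexEntity;  // entity must stay an entity
        }
        var codePoint = parseInt(hexEntity.substring(3, hexEntity.length - 1), 16);
        if (codePoint === 0 || codePoint > 0x10FFFF) {
            return hexEntity;  // not a valid code point, keep as is
        }
        if (codePoint > 0xFFFF) {
            codePoint -= 0x10000;  // encode as UTF-16 surrogate pair
            return String.fromCharCode(0xD800 + (codePoint >> 10),
                                       0xDC00 + (codePoint & 0x3FF));
        }
        return String.fromCharCode(codePoint);
    });
}
```

The lookup in ignoredEntities compares the exact string. Other spellings of the same entity, such as &#x26; or lowercase hexadecimal digits, would have to be added to the list if they can occur in the file.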

Samir · Post #5 · 2024-08-02T05:04+00:00

I used "UEStudio v12.20.0.1004" and checked with the given input sample file: the output of the updated script is the same as before. I have attached the input and output files. The correct output should look like the files "correct_output1.xml" and "correct_output2.xml".

Mofi · Post #6 · 2024-08-02T08:19+00:00

UEStudio v12.20.0.1004 is not fully Unicode aware. This version of UEStudio does not use wide character arrays for JavaScript strings (= UTF-16 encoded strings) as UltraEdit for Windows v24.00, UEStudio v17.00 and newer versions do. With a version of UEStudio from 2012, the conversion must be done as my script AllToHtmlEntities.js does it: by running Perl regular expressions on the active file instead of loading the entire file contents as a JavaScript string into the memory of the JavaScript engine embedded in UEStudio v12.20.0.1004. The topic UltraEdit.clipboardContent not supporting Chinese characters? has more detailed information about working with scripts on Unicode encoded files with a version of UltraEdit for Windows < 24.00 and a version of UEStudio < v17.00.