The JavaScript interpreter always interprets a script file as a char stream, which means 1 byte is 1 character. As all JavaScript keywords, escape sequences, etc. consist only of ASCII characters with a code value between 0 and 127, the JavaScript interpreter does not even need to know which code page the author of the script used for characters with a code value >= 128 (char is unsigned) or < 0 (char is signed).
Using UTF-8 encoding or ASCII Escaped Unicode encoding makes it possible to use Unicode characters in JavaScript scripts, but the script developer must take into account that those characters are also interpreted as a sequence of char values without conversion to Unicode when the script runs.
Since JavaScript 1.5 a JavaScript String object can store characters as a char stream or as a wchar stream (wide character - Unicode), as documented on the page Values, variables, and literals. What I could not find out up to now is how to define the type of the character stream in JavaScript. In C/C++/C# and other programming languages this is very easy, as those languages require a type for a variable, array or object. But JavaScript is different: variables and arrays have no type until a value is assigned. In my point of view this is problematic for strings, as it is not really possible to define the type of the character array: char or wchar.
I don't know if it is possible at all to get from a JavaScript String object the information whether the string value is stored as a char or a wchar array. In the few scripts I have written so far which must work on ANSI files as well as on Unicode files, I checked the type of a string by using the charCodeAt() function on every character of the string. If one character has a value greater than 255, I know the string is a Unicode string; otherwise the string is an ANSI string.
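That check can be written as a small helper function (the function name is my own choice, it is not part of any API):

```javascript
// Heuristic described above: scan every character code of the string.
// If any code value is greater than 255, the string cannot be a plain
// char (ANSI) stream and must contain wide (Unicode) characters.
function isUnicodeString(sString) {
   for (var i = 0; i < sString.length; i++) {
      if (sString.charCodeAt(i) > 255) return true;  // wchar content found
   }
   return false;  // all codes fit into a single byte -> ANSI char stream
}
```

Note that this is only a heuristic: a wchar string which happens to contain only characters below U+0100 is indistinguishable from a char stream by this test.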
The same problem exists with all functions of UltraEdit that take string arguments. It is not really possible to find out which encoding was used for the characters in the JavaScript String object. It looks like UltraEdit always interprets the string values as a char stream with the same encoding as the active file.
The clipboard is something special as there are several standard clipboard formats like CF_LOCALE, CF_OEMTEXT, CF_TEXT and CF_UNICODETEXT.
The application has to know which encoding is used for the text when copying it to the clipboard in order to use the right format. UltraEdit most likely (I don't know the code) uses the format CF_UNICODETEXT when copying text from a Unicode file (UTF-16 LE, UTF-16 BE, UTF-8 and ASCII Escaped Unicode opened in Unicode mode), and CF_TEXT + CF_LOCALE when copying text from an ANSI, i.e. single-byte encoded, file.
On the other hand, on paste the application has to know in which format the clipboard contains text, and again which encoding the target file has, in order to run the appropriate conversion before inserting the bytes from the clipboard into the file.
Now with all that information, let us look at
Code: Select all
var sText = UltraEdit.clipboardContent;
UltraEdit.clipboardContent = "my string";
There is a problem here: the format of the clipboard - mainly CF_UNICODETEXT versus CF_TEXT - and the code page (CF_LOCALE) in case of CF_TEXT cannot be evaluated by the script developer with additional code when assigning the clipboard content to a string, nor can format/locale be set when copying a string to the clipboard.
What I could see so far is that sText is a string of type wchar if the clipboard is of format CF_UNICODETEXT. And the format is just CF_TEXT, and therefore CF_LOCALE is according to the current input language (most likely, not really tested), when copying something to the clipboard by assigning a string value to UltraEdit.clipboardContent.
JackTing wrote:An interesting trick on ANSI/ASCII file and 'New' file is:
If this is the only contents, save and reopen it, UltraEdit will treat it as UTF-8 no BOM, and show them correctly.
This is right. An initially ASCII/ANSI file into which UTF-8 encoded characters are written is interpreted as a UTF-8 encoded file on next opening. But that works only if the ASCII/ANSI file did not already contain characters with a code value greater than 127, i.e. non-ASCII characters, as otherwise the text file would contain characters of mixed encodings after writing or pasting the UTF-8 encoded characters.
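This re-detection presumably works because valid multi-byte UTF-8 sequences follow a strict bit pattern, while a stray ANSI byte above 127 usually violates it. A simplified sketch of such a check (my own illustration, not UltraEdit's actual detection code; overlong sequences and surrogates are not rejected here):

```javascript
// Checks whether an array of byte values (0..255) forms valid UTF-8.
// A lead byte 0xC2..0xF4 must be followed by the right number of
// continuation bytes 0x80..0xBF; a lone byte >= 0x80 (typical for an
// ANSI accented character like 0xE9) makes the whole stream invalid.
function isValidUtf8(aBytes) {
   var i = 0;
   while (i < aBytes.length) {
      var nByte = aBytes[i++];
      var nCont;
      if (nByte <= 0x7F) continue;                    // plain ASCII byte
      else if (nByte >= 0xC2 && nByte <= 0xDF) nCont = 1;
      else if (nByte >= 0xE0 && nByte <= 0xEF) nCont = 2;
      else if (nByte >= 0xF0 && nByte <= 0xF4) nCont = 3;
      else return false;                              // invalid lead byte
      while (nCont-- > 0) {
         if (i >= aBytes.length) return false;        // truncated sequence
         var nNext = aBytes[i++];
         if (nNext < 0x80 || nNext > 0xBF) return false; // not a continuation byte
      }
   }
   return true;
}
```

An ANSI file containing for example the byte 0xE9 followed by an ASCII letter fails this check, which is why mixing an existing ANSI umlaut with newly written UTF-8 sequences leaves the file in a state no single encoding can describe.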
What could be helpful for all users of UltraEdit writing scripts for languages using mainly Unicode files and therefore Unicode characters would be:
- For command write:
An optional number parameter specifying the encoding of the string argument like 65001 for UTF-8.
With this additional information UltraEdit could correctly convert the string to the encoding of the file. And this additional parameter would make it possible to define Unicode characters in JavaScript strings in a script file saved as UTF-8 encoded file without BOM, as the script author would just need to add the value 65001 as second parameter in addition to the string on usage of UltraEdit.activeDocument.write() and UltraEdit.outputWindow.write().
- For better clipboardContent support:
A special function like UltraEdit.copyToClipboard(sString,nEncoding) which copies the string to the active clipboard encoded according to the encoding parameter: format CF_UNICODETEXT when nEncoding has the value 1200 (UTF-16 LE), 1201 (UTF-16 BE), 65000 (UTF-7) or 65001 (UTF-8), of course with the appropriate conversion to Unicode as required by the Unicode clipboard format, and otherwise CF_TEXT with CF_LOCALE according to nEncoding.
A special function like UltraEdit.getFromClipboard(nEncoding) which returns a string from the active clipboard (format/locale can be determined by UE automatically) with a conversion according to the specified encoding, so that the JavaScript string contains, for example, the UTF-8 byte stream of the Unicode clipboard content when nEncoding has the value 65001.
I think those 4 enhancements would already be a big help in writing scripts which must work with Unicode characters. Of course there are some other commands like UltraEdit.activeDocument.columnInsert which also have a string as function argument and therefore would need a similar extension by an optional encoding parameter to be able to use UTF-8 encoded strings.
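For illustration, the conversion such an encoding parameter would imply internally - turning a wchar JavaScript string into a UTF-8 char stream - can be sketched in plain JavaScript (the function name is my own, not an existing command; surrogate pairs are omitted for brevity, so only code points up to U+FFFF are handled):

```javascript
// Converts a JavaScript (wchar) string into a string whose character
// codes are all below 256, i.e. the UTF-8 byte stream of the input.
function toUtf8CharStream(sText) {
   var sBytes = "";
   for (var i = 0; i < sText.length; i++) {
      var nCode = sText.charCodeAt(i);
      if (nCode <= 0x7F) {                 // 1 byte: plain ASCII
         sBytes += String.fromCharCode(nCode);
      } else if (nCode <= 0x7FF) {         // 2 bytes: 110xxxxx 10xxxxxx
         sBytes += String.fromCharCode(0xC0 | (nCode >> 6),
                                       0x80 | (nCode & 0x3F));
      } else {                             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
         sBytes += String.fromCharCode(0xE0 | (nCode >> 12),
                                       0x80 | ((nCode >> 6) & 0x3F),
                                       0x80 | (nCode & 0x3F));
      }
   }
   return sBytes;
}
```

For example, the Euro sign U+20AC becomes the three char values 0xE2 0x82 0xAC, which is exactly the byte sequence a write command with nEncoding 65001 would have to insert into a UTF-8 encoded file.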