The JavaScript interpreter always interprets a script file as a char stream, which means 1 byte is 1 character. As all JavaScript keywords, escape sequences, etc. consist only of ASCII characters with a code value between 0 and 127, the JavaScript interpreter does not even need to know which code page the author of the script used for characters with a code value >= 128 (char is unsigned) or < 0 (char is signed).
Using UTF-8 encoding or ASCII Escaped Unicode encoding makes it possible to use Unicode characters in JavaScript scripts, but the script developer must take into account that those characters are also interpreted as a sequence of char values without conversion to Unicode when the script runs.
Since JavaScript 1.5 a JavaScript String object can store characters as a char stream or as a wchar stream (wide character - Unicode), as documented on the page Values, variables, and literals. What I could not find out up to now is how to define the type of the character stream in JavaScript. In C/C++/C# and other programming languages this is very easy, as those languages require a type for a variable, array or object. But JavaScript is different: variables and arrays have no type until a value is assigned. In my point of view this is problematic for strings, as it is not really possible to define the type of the character array: char or wchar.
I don't know if it is possible at all to get from a JavaScript String object the information whether the string value is stored as a char or a wchar array. In the few scripts I have written so far which must work on ANSI files as well as on Unicode files, I checked the type of a string by using the charCodeAt() function on every character of the string. If one character has a value greater than 255, I know the string is a Unicode string; otherwise the string is an ANSI string.
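That check can be written as a small helper function (the function name is my own choice, it is not part of any API):

```javascript
// Heuristic described above: scan every character code of the string.
// If any code value is greater than 255, the string cannot be a plain
// char (ANSI) stream and must contain wide (Unicode) characters.
function isUnicodeString(sString) {
   for (var i = 0; i < sString.length; i++) {
      if (sString.charCodeAt(i) > 255) return true;  // wchar content found
   }
   return false;  // all codes fit into a single byte -> ANSI char stream
}
```

Note that this is only a heuristic: a wchar string which happens to contain only characters below U+0100 is indistinguishable from a char stream by this test.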
The same problem exists with all functions of UltraEdit that take string arguments. It is not really possible to find out which encoding was used for the characters in the JavaScript String object. It looks like UltraEdit always interprets the string values as a char stream with the same encoding as the active file.
The clipboard is something special as there are several standard clipboard formats like CF_LOCALE, CF_OEMTEXT, CF_TEXT and CF_UNICODETEXT.
The application has to know which encoding is used for the text when copying it to the clipboard in order to use the right format. UltraEdit most likely (I don't know the code) uses the format CF_UNICODETEXT when copying text from a Unicode file (UTF-16 LE, UTF-16 BE, UTF-8 and ASCII Escaped Unicode opened in Unicode mode), and CF_TEXT + CF_LOCALE when copying text from an ANSI, i.e. single-byte encoded, file.
On the other hand, on paste the application has to know in which format the clipboard contains text, and again which encoding the target file has, in order to run the appropriate conversion before inserting the bytes from the clipboard into the file.
Now with all that information, let us look at
Code: Select all
var sText = UltraEdit.clipboardContent;
UltraEdit.clipboardContent = "my string";
There is a problem here: the format of the clipboard - mainly CF_UNICODETEXT versus CF_TEXT - and the code page (CF_LOCALE) in case of CF_TEXT cannot be evaluated by the script developer with additional code when assigning the clipboard content to a string, nor can format/locale be set when copying a string to the clipboard.
What I could see so far is that sText is a string of type wchar if the clipboard is of format CF_UNICODETEXT. And the format is just CF_TEXT, and therefore CF_LOCALE is according to the current input language (most likely, not really tested), when copying something to the clipboard by assigning a string value to UltraEdit.clipboardContent.
JackTing wrote:An interesting trick on ANSI/ASCII file and 'New' file is:
If this is the only contents, save and reopen it, UltraEdit will treat it as UTF-8 no BOM, and show them correctly.
This is right. An initially ASCII/ANSI file into which UTF-8 encoded characters are written is interpreted as a UTF-8 encoded file on next opening. But that works only if the ASCII/ANSI file did not already contain characters with a code value greater than 127, i.e. non-ASCII characters, as otherwise the text file would contain characters of mixed encodings after writing or pasting the UTF-8 encoded characters.
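This re-detection presumably works because valid multi-byte UTF-8 sequences follow a strict bit pattern, while a stray ANSI byte above 127 usually violates it. A simplified sketch of such a check (my own illustration, not UltraEdit's actual detection code; overlong sequences and surrogates are not rejected here):

```javascript
// Checks whether an array of byte values (0..255) forms valid UTF-8.
// A lead byte 0xC2..0xF4 must be followed by the right number of
// continuation bytes 0x80..0xBF; a lone byte >= 0x80 (typical for an
// ANSI accented character like 0xE9) makes the whole stream invalid.
function isValidUtf8(aBytes) {
   var i = 0;
   while (i < aBytes.length) {
      var nByte = aBytes[i++];
      var nCont;
      if (nByte <= 0x7F) continue;                    // plain ASCII byte
      else if (nByte >= 0xC2 && nByte <= 0xDF) nCont = 1;
      else if (nByte >= 0xE0 && nByte <= 0xEF) nCont = 2;
      else if (nByte >= 0xF0 && nByte <= 0xF4) nCont = 3;
      else return false;                              // invalid lead byte
      while (nCont-- > 0) {
         if (i >= aBytes.length) return false;        // truncated sequence
         var nNext = aBytes[i++];
         if (nNext < 0x80 || nNext > 0xBF) return false; // not a continuation byte
      }
   }
   return true;
}
```

An ANSI file containing for example the byte 0xE9 followed by an ASCII letter fails this check, which is why mixing an existing ANSI umlaut with newly written UTF-8 sequences leaves the file in a state no single encoding can describe.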
What could be helpful for all users of UltraEdit writing scripts for languages using mainly Unicode files and therefore Unicode characters would be:
- For command write:
An optional number parameter specifying the encoding of the string argument like 65001 for UTF-8.
With this additional information UltraEdit could correctly convert the string to the encoding of the file. And this additional parameter would make it possible to define Unicode characters in JavaScript strings in a script file saved as UTF-8 encoded file without BOM, as the script author would just need to add the value 65001 as second parameter in addition to the string on usage of UltraEdit.activeDocument.write() and UltraEdit.outputWindow.write().
- For better clipboardContent support:
A special function like UltraEdit.copyToClipboard(sString,nEncoding) which copies the string to the active clipboard encoded according to the encoding parameter: format CF_UNICODETEXT when nEncoding has the value 1200 (UTF-16 LE), 1201 (UTF-16 BE), 65000 (UTF-7) or 65001 (UTF-8), of course with the appropriate conversion to Unicode as required by the Unicode clipboard format, and otherwise CF_TEXT with CF_LOCALE according to nEncoding.
A special function like UltraEdit.getFromClipboard(nEncoding) which returns a string from the active clipboard (format/locale can be determined by UE automatically) with a conversion according to the specified encoding, so that the JavaScript string contains, for example, the UTF-8 byte stream of the Unicode clipboard content when nEncoding has the value 65001.
I think those 4 enhancements would already be a big help in writing scripts which must work with Unicode characters. Of course there are some other commands like UltraEdit.activeDocument.columnInsert which also have a string as function argument and therefore would need a similar extension by an optional encoding parameter to be able to use UTF-8 encoded strings.
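For illustration, the conversion such an encoding parameter would imply internally - turning a wchar JavaScript string into a UTF-8 char stream - can be sketched in plain JavaScript (the function name is my own, not an existing command; surrogate pairs are omitted for brevity, so only code points up to U+FFFF are handled):

```javascript
// Converts a JavaScript (wchar) string into a string whose character
// codes are all below 256, i.e. the UTF-8 byte stream of the input.
function toUtf8CharStream(sText) {
   var sBytes = "";
   for (var i = 0; i < sText.length; i++) {
      var nCode = sText.charCodeAt(i);
      if (nCode <= 0x7F) {                 // 1 byte: plain ASCII
         sBytes += String.fromCharCode(nCode);
      } else if (nCode <= 0x7FF) {         // 2 bytes: 110xxxxx 10xxxxxx
         sBytes += String.fromCharCode(0xC0 | (nCode >> 6),
                                       0x80 | (nCode & 0x3F));
      } else {                             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
         sBytes += String.fromCharCode(0xE0 | (nCode >> 12),
                                       0x80 | ((nCode >> 6) & 0x3F),
                                       0x80 | (nCode & 0x3F));
      }
   }
   return sBytes;
}
```

For example, the Euro sign U+20AC becomes the three char values 0xE2 0x82 0xAC, which is exactly the byte sequence a write command with nEncoding 65001 would have to insert into a UTF-8 encoded file.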