Unicode data corrupted when sent to an array

thesleeve · Feb 17, 2012 · #1 · 2012-02-17T17:01+00:00

Hi everyone,

This is my first post on an UltraEdit forum. I'm writing a script to take a long excerpt of Unicode text (in Japanese), break it into sentences, and load each sentence as a string into an array. I will process the array later.

Everything works just fine, up until the point when I check the contents of the array. It seems that some of the characters are getting corrupted. I'm guessing there's some sort of formatting issue, but I really don't know how to solve the problem.

First, here's my JavaScript code:

Code:

// Ask the user how each sentence entry ends (typically, this is the Japanese period character)
var strEntryTerminator = UltraEdit.getString("What ends each sentence?",1);

// Report what the user inputted to the debug window
UltraEdit.outputWindow.write("Entry terminator is" + strEntryTerminator);

// Use DOS-style line terminator for Windows Notepad Unicode .txt files
var lineTerminator = "\r\n";

// Move to the top of the file and set up a replace-all in the current file
UltraEdit.activeDocument.top();
UltraEdit.activeDocument.findReplace.mode=0; //Replace all in current file
UltraEdit.activeDocument.findReplace.replaceAll=true; //Replace all instances

// Remove all line terminators in the file, making the data one continuous line of text
UltraEdit.activeDocument.findReplace.replace(lineTerminator, "");

// Collapse runs of spaces: nine replace-all passes, each turning double spaces into single spaces
var SpaceDeletion = 1;
while (SpaceDeletion < 10) {
    UltraEdit.activeDocument.findReplace.replace("  ", " ");
    SpaceDeletion++;
}

// Replace a period plus a space with a period (removing leading spaces from entries)
UltraEdit.activeDocument.findReplace.replace(strEntryTerminator + " ", strEntryTerminator);

// Replace a period with a period plus a terminator, which will put each sentence on its own line
UltraEdit.activeDocument.findReplace.replace(strEntryTerminator, strEntryTerminator + lineTerminator);

// Select all data in the document
UltraEdit.activeDocument.selectAll();

// Selection becomes variable
var mySelection = UltraEdit.activeDocument.selection;

// Split lines at lineTerminator and load them into an array
var resultArr = mySelection.split(lineTerminator);

// Display total number of records in debug window
var resultLength = resultArr.length;
UltraEdit.outputWindow.write(resultLength + " total entries");

// Write array values in debug window
for (var i = 0; i < resultArr.length; i++) {
    UltraEdit.outputWindow.write("Value: " + i + " \"" + resultArr[i]);
}

Everything seems to work just fine. However, some characters are corrupted in the process.

For example, if my input text is this:

Code:

ブラックホール（英語：black hole）とは、きわめて高密度で大質量で、きわめて強い重力のために、物質だけでなく光さえも脱出できない天体のこと[1]。
きわめて強い重力のために光さえも抜け出せなくなった時空の領域、とされている。
「ブラック・ホール」（黒い穴）という名は、アメリカの物理学者ジョン・ホイーラーが1967年にこうした天体を呼ぶために編み出した[2]。
それ以前は「collapsar[3] コラプサー」（崩壊した星）などと呼ばれていた。

... the output window shows this:

Code:

Running script: C:\Program Files\IDM Computer Solutions\UltraEdit\scripts\JapaneseDocumentToSRS.js
========================================================================================================
Entry terminator is縲・
5 total entries
Value: 0 "・ｽu・ｽ・ｽ・ｽb・ｽN・ｽz・ｽ[・ｽ・ｽ・ｽi・ｽp・ｽ・ｽFblack hole・ｽj・ｽﾆは、・ｽ・ｽ・ｽ・ｽﾟて搾ｿｽ・ｽ・ｽ・ｽx・ｽﾅ大質・ｽﾊで、・ｽ・ｽ・ｽ・ｽﾟて具ｿｽ・ｽ・ｽ・ｽd・ｽﾍのゑｿｽ・ｽﾟに、・ｽ・ｽ・ｽ・ｽ・ｽ・ｽ・ｽ・ｽ・ｽﾅなゑｿｽ・ｽ・ｽ・ｽ・ｽ・ｽ・ｽ・ｽ・ｽE・ｽo・ｽﾅゑｿｽ・ｽﾈゑｿｽ・ｽV・ｽﾌのゑｿｽ・ｽ・ｽ[1]・ｽB
Value: 1 "・ｽ・ｽ・ｽ・ｽﾟて具ｿｽ・ｽ・ｽ・ｽd・ｽﾍのゑｿｽ・ｽﾟに鯉ｿｽ・ｽ・ｽ・ｽ・ｽ・ｽ・ｽ・ｽ・ｽ・ｽ・ｽo・ｽ・ｽ・ｽﾈゑｿｽ・ｽﾈゑｿｽ・ｽ・ｽ・ｽ・ｽ・ｽ・ｽﾌ領茨ｿｽA・ｽﾆゑｿｽ・ｽ・ｽﾄゑｿｽ・ｽ・ｽB
Value: 2 "縲後ヶ繝ｩ繝・け繝ｻ繝帙・繝ｫ縲搾ｼ磯ｻ偵＞遨ｴ・峨→縺・≧蜷阪・縲√い繝｡繝ｪ繧ｫ縺ｮ迚ｩ逅・ｭｦ閠・ず繝ｧ繝ｳ繝ｻ繝帙う繝ｼ繝ｩ繝ｼ縺・967蟷ｴ縺ｫ縺薙≧縺励◆螟ｩ菴薙ｒ蜻ｼ縺ｶ縺溘ａ縺ｫ邱ｨ縺ｿ蜃ｺ縺励◆[2]縲・
Value: 3 "・ｽ・ｽ・ｽ・ｽﾈ前・ｽﾍ「collapsar[3] ・ｽR・ｽ・ｽ・ｽv・ｽT・ｽ[・ｽv・ｽi・ｽ・ｽ・ｽｵゑｿｽ・ｽ・ｽ・ｽj・ｽﾈどと呼ばゑｿｽﾄゑｿｽ・ｽ・ｽ・ｽB
Value: 4 "

Any ideas?

Feb 17, 2012 · #2 · 2012-02-17T18:27+00:00

OK, I've greatly narrowed down the cause of the problem.

I created a very simple script just to see if the data is being stored correctly as a string.

Code:

var mySelection = UltraEdit.activeDocument.selection;
UltraEdit.outputWindow.write(typeof(mySelection) + mySelection);

In essence, this takes the highlighted text in the document and shows the user the type and the value of the data.

When I highlight あいうえお, the output window shows:

Code:

stringあいうえお

Success! No problem there.
However, when I try some other characters, such as なにぬねの, the output window shows something like:

Code:

stringåŒ–ã

Obviously, some of the characters are being encoded correctly, and some are not. I have no idea what is happening.

Any clues?

Feb 17, 2012 · #3 · 2012-02-17T18:35+00:00

I'm currently thinking that the document may use one Unicode encoding format while the JavaScript interface uses a different one, and that at some point in the data transfer bits are being truncated or something similar. That would make some characters "out of range", so they are corrupted in the process.
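
One mismatch that produces output like the "åŒ–" shown earlier is UTF-8 bytes being displayed through a single-byte code page. The following is a minimal sketch in plain JavaScript for Node.js or a browser. It does not use the UltraEdit API, and whether this is what actually happens inside UltraEdit is an assumption. The helper names `utf8Bytes` and `readAsCp1252` are made up for illustration.

```javascript
// Illustration only: shows how the UTF-8 bytes of one kanji look when
// they are read as Windows-1252. Not UltraEdit-specific.
function utf8Bytes(text) {
  return Array.from(new TextEncoder().encode(text));
}

// Only the two Windows-1252 specials needed for this demo. All other
// bytes are treated as Latin-1, which matches Windows-1252 outside 0x80-0x9F.
var CP1252_SPECIALS = { 0x8c: "\u0152", 0x96: "\u2013" };

function readAsCp1252(bytes) {
  return bytes
    .map(function (b) { return CP1252_SPECIALS[b] || String.fromCharCode(b); })
    .join("");
}

var kanjiBytes = utf8Bytes("化");
var kanjiHex = kanjiBytes.map(function (b) { return b.toString(16); }).join(" ");
var garbled = readAsCp1252(kanjiBytes);

console.log(kanjiHex); // the three UTF-8 bytes in hex
console.log(garbled);  // the same bytes shown as three Windows-1252 characters
```

If a similar pattern shows up in the output window, the text was probably converted to bytes in one encoding and shown in another somewhere along the way.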

Any idea on how we can tell UltraEdit's scripting interface how it should handle and store string data? That would probably solve the problem.

Mofi · Feb 18, 2012 · #4 · 2012-02-18T17:38+00:00

UltraEdit converts all Unicode file formats to UTF-16 Little Endian on load. So every character of any Unicode file is kept in UltraEdit's memory with 2 bytes per character or, more precisely, as an unsigned 16-bit value (unsigned short int). String variables, however, are managed entirely by the JavaScript engine. Unfortunately, the documentation of the String object on the Mozilla Developer Network is rather thin regarding Unicode strings.
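
As a small illustration of the 16-bit representation, here is a sketch in standard JavaScript that lists the code units of a string. It is meant to be run in Node.js or a browser; that UltraEdit's embedded engine returns the same values for text taken from a document is an assumption, not something verified here.

```javascript
// Each hiragana character below lies in the Basic Multilingual Plane,
// so it occupies exactly one 16-bit code unit in a JavaScript string.
var sample = "なにぬねの";
var codeUnits = [];
for (var i = 0; i < sample.length; i++) {
  codeUnits.push(sample.charCodeAt(i).toString(16));
}
var codeUnitList = codeUnits.join(" ");

console.log(sample.length);  // number of 16-bit code units
console.log(codeUnitList);   // hexadecimal value of each code unit
```

Writing these values to the output window instead of the string itself could help tell whether a string variable is already damaged or whether only its display is wrong.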

More about Unicode support by JavaScript can be found at Values, variables, and literals - Unicode, but nothing String related.

As you can read on the very technical page Mozilla internal string guide for C++ programmers, JavaScript supports 8-bit (ANSI) and 16-bit (Unicode) strings. But how to use the Unicode variant in scripts is not explained on that page.

I have never needed to write a script for myself that works on a Unicode file. Several times I tried, for people asking here, to find out how to deal with Unicode strings in JavaScript scripts, but the results were poor. The only script function with which I have had success on Unicode strings is the HexCopy function, which is written for binary data streams and also produces the correct result for Unicode strings with a 16-bit value for every character. But this function is of no use for modifying text in Unicode encoded text files.

In summary: I don't have any idea how to reformat a Unicode text file with a JavaScript script when string variables must be used too.

Correction: I suddenly had an idea how to work with Unicode strings within a script, see Script or macro to identify unicode codepage data. But I still don't know how to get Unicode strings into string variables without conversion to ANSI strings.

However, perhaps you can work on a Unicode file with the user clipboards and replaces, as I did in the referenced topic. I looked at your code and I think it can be written using a user clipboard and normal replaces plus one UltraEdit regular expression replace.

Code:

if (UltraEdit.document.length > 0)
{
   UltraEdit.insertMode();
   UltraEdit.columnModeOff();
   UltraEdit.activeDocument.hexOff();
   UltraEdit.activeDocument.top();

   // Ask the user how each sentence entry ends (typically, this is the Japanese period character)
   // Entered character is inserted at top of the file and cut to user clipboard 9.
   UltraEdit.getString("What ends each sentence?",0);
   UltraEdit.selectClipboard(9);
   UltraEdit.activeDocument.selectToTop();
   UltraEdit.activeDocument.cut();

   // Use DOS-style line terminator for Windows Notepad Unicode .txt files
   var sLineTerminator = "^p";

   // Define all properties for the replace commands below.
   UltraEdit.activeDocument.findReplace.mode=0;
   UltraEdit.activeDocument.findReplace.matchCase=true;
   UltraEdit.activeDocument.findReplace.matchWord=false;
   UltraEdit.activeDocument.findReplace.regExp=false;
   UltraEdit.activeDocument.findReplace.searchDown=true;
   UltraEdit.activeDocument.findReplace.searchInColumn=false;
   UltraEdit.activeDocument.findReplace.preserveCase=false;
   UltraEdit.activeDocument.findReplace.replaceAll=true;
   UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
   UltraEdit.ueReOn();

   // Remove all line terminators in the file, making the data one continuous line of text
   UltraEdit.activeDocument.findReplace.replace(sLineTerminator, "");

   // Collapse any run of two or more spaces into a single space (UltraEdit regular expression)
   UltraEdit.activeDocument.findReplace.regExp=true;
   UltraEdit.activeDocument.findReplace.replace("  +", " ");

   // Replace a period plus a space with a period (removing leading spaces from entries).
   // ^c stands for the content of the active clipboard, here user clipboard 9.
   UltraEdit.activeDocument.findReplace.regExp=false;
   UltraEdit.activeDocument.findReplace.replace("^c ", "^c");

   // Replace a period with a period plus a terminator, which will put each sentence on its own line
   UltraEdit.activeDocument.findReplace.replace("^c", "^c" + sLineTerminator);

   UltraEdit.clearClipboard();
   UltraEdit.selectClipboard(0);
}
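
For comparison, the same steps can be sketched on a plain JavaScript string, which also yields the array of sentences asked for in the first post. This is a minimal sketch that assumes an engine which handles Unicode strings correctly, such as Node.js or a browser. The function name `splitIntoSentences` is hypothetical, and the code does not use the UltraEdit API.

```javascript
// Sketch of the replace steps above, applied to an in-memory string.
function splitIntoSentences(text, terminator) {
  // Remove all DOS line terminators, making the data one continuous line.
  var oneLine = text.split("\r\n").join("");
  // Collapse any run of two or more spaces into a single space.
  oneLine = oneLine.replace(/ {2,}/g, " ");
  // Remove the space that follows a terminator.
  oneLine = oneLine.split(terminator + " ").join(terminator);

  var parts = oneLine.split(terminator);
  var sentences = [];
  for (var i = 0; i < parts.length; i++) {
    var isLast = i === parts.length - 1;
    if (!isLast) {
      // Every part before the last was followed by a terminator.
      sentences.push(parts[i] + terminator);
    } else if (parts[i].length > 0) {
      // Trailing text that does not end with the terminator.
      sentences.push(parts[i]);
    }
  }
  return sentences;
}

var demoSentences = splitIntoSentences("一。  二。\r\n三。", "。");
console.log(demoSentences.length, demoSentences.join(" | "));
```

Inside an UltraEdit script this may still run into the string conversion problem described in this thread, which is why the script above keeps all text in the document and in a user clipboard.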

thesleeve · Feb 21, 2012 · #5 · 2012-02-21T16:31+00:00

Thanks, Mofi.

That's extremely helpful. I had figured that the JavaScript engine was basically encoding the strings in its own way, but I was hoping to have some control over that process in order to work around this issue. Oh well, it looks like that's all abstracted from the user and it's not possible to influence how strings are stored in memory.

Your idea of using the user clipboard is great! Thanks very much for putting in the time to write up that example script. I'm going to give this a shot and see if it works to solve my problem.

Thanks again!
TheSleeve