Unicode data corrupted when sent to an array


    Feb 17, 2012 #1

    Hi everyone,

    This is my first post on an UltraEdit forum. I'm writing a script to take a long excerpt of Unicode text (in Japanese), break it into sentences, and load each sentence as a string into an array. I will process the array later.

    Everything works just fine, up until the point when I check the contents of the array. It seems that some of the characters are getting corrupted. I'm guessing there's some sort of formatting issue, but I really don't know how to solve the problem.

    First, here's my JavaScript code:

    Code:

    // Ask the user how each sentence entry ends (typically, this is the Japanese period character)
    var strEntryTerminator = UltraEdit.getString("What ends each sentence?",1);
    
    // Report what the user inputted to the debug window
    UltraEdit.outputWindow.write("Entry terminator is" + strEntryTerminator);
    
    // Use DOS-style line terminator for Windows Notepad Unicode .txt files
    var lineTerminator = "\r\n";
    
    // Establish our search string for the loop condition
    UltraEdit.activeDocument.top();
    UltraEdit.activeDocument.findReplace.mode=0; //Replace all in current file
    UltraEdit.activeDocument.findReplace.replaceAll=true; //Replace all instances
    
    // Remove all line terminators in the file, making the data one continuous line of text
    UltraEdit.activeDocument.findReplace.replace(lineTerminator, "");
    
    // Remove multiple spaces (up to ten) so that there is a maximum of one space between text
    var SpaceDeletion = 1;
    while (SpaceDeletion < 10) {
        UltraEdit.activeDocument.findReplace.replace("  ", " ");
        SpaceDeletion++;
    }
    
    // Replace a period plus a space with a period (removing leading spaces from entries)
    UltraEdit.activeDocument.findReplace.replace(strEntryTerminator + " ", strEntryTerminator);
    
    // Replace a period with a period plus a terminator, which will put each sentence on its own line
    UltraEdit.activeDocument.findReplace.replace(strEntryTerminator, strEntryTerminator + lineTerminator);
    
    // Select all data in the document
    UltraEdit.activeDocument.selectAll();
    
    // Selection becomes variable
    var mySelection = UltraEdit.activeDocument.selection;
    
    // Split lines at lineTerminator and load them into an array
    var resultArr = new Array();
    resultArr = mySelection.split(lineTerminator);
    
    // Display total number of records in debug window
    var resultLength = resultArr.length;
    UltraEdit.outputWindow.write(resultLength + " total entries");
    
    // Write array values in debug window
    for (var i = 0; i < resultArr.length; i++) {
        UltraEdit.outputWindow.write("Value: " + i + " \"" + resultArr[i]);
    }
    
    Everything seems to work just fine. However, some characters are corrupted in the process.

    For example, if my input text is this:

    Code:

    ブラックホール(英語:black hole)とは、きわめて高密度で大質量で、きわめて強い重力のために、物質だけでなく光さえも脱出できない天体のこと[1]。
    きわめて強い重力のために光さえも抜け出せなくなった時空の領域、とされている。
    「ブラック・ホール」(黒い穴)という名は、アメリカの物理学者ジョン・ホイーラーが1967年にこうした天体を呼ぶために編み出した[2]。
    それ以前は「collapsar[3] コラプサー」(崩壊した星)などと呼ばれていた。
    ... the output window shows this:

    Code:

    Running script: C:\Program Files\IDM Computer Solutions\UltraEdit\scripts\JapaneseDocumentToSRS.js
    ========================================================================================================
    Entry terminator is縲・
    5 total entries
    Value: 0 "・スu・ス・ス・スb・スN・スz・ス[・ス・ス・スi・スp・ス・スFblack hole・スj・スニは、・ス・ス・ス・ス゚て搾ソス・ス・ス・スx・スナ大質・スハで、・ス・ス・ス・ス゚て具ソス・ス・ス・スd・スヘのゑソス・ス゚に、・ス・ス・ス・ス・ス・ス・ス・ス・スナなゑソス・ス・ス・ス・ス・ス・ス・ス・スE・スo・スナゑソス・スネゑソス・スV・スフのゑソス・ス・ス[1]・スB
    Value: 1 "・ス・ス・ス・ス゚て具ソス・ス・ス・スd・スヘのゑソス・ス゚に鯉ソス・ス・ス・ス・ス・ス・ス・ス・ス・ス・スo・ス・ス・スネゑソス・スネゑソス・ス・ス・ス・ス・ス・スフ領茨ソスA・スニゑソス・ス・ストゑソス・ス・スB
    Value: 2 "縲後ヶ繝ゥ繝・け繝サ繝帙・繝ォ縲搾シ磯サ偵>遨エ・峨→縺・≧蜷阪・縲√い繝。繝ェ繧ォ縺ョ迚ゥ逅・ュヲ閠・ず繝ァ繝ウ繝サ繝帙う繝シ繝ゥ繝シ縺・967蟷エ縺ォ縺薙≧縺励◆螟ゥ菴薙r蜻シ縺カ縺溘a縺ォ邱ィ縺ソ蜃コ縺励◆[2]縲・
    Value: 3 "・ス・ス・ス・スネ前・スヘ「collapsar[3] ・スR・ス・ス・スv・スT・ス[・スv・スi・ス・ス・スオゑソス・ス・ス・スj・スネどと呼ばゑソストゑソス・ス・ス・スB
    Value: 4 "
    
    Any ideas?

      Feb 17, 2012 #2

      OK, I've greatly narrowed down the cause of the problem.

      I created a very simple script just to see if the data is being stored correctly as a string.

      Code:

      var mySelection = UltraEdit.activeDocument.selection;
      UltraEdit.outputWindow.write(typeof(mySelection) + mySelection);
      In essence, this takes the highlighted text in the document and shows the user its data type and its value.

      When I highlight あいうえお, the output window shows:

      Code:

      stringあいうえお
      Success! No problem there.
      However, when I try some other characters... なにぬねの,
      the output window shows something like:

      Code:

      string化ã
      Obviously, some of the characters are encoding correctly, and some are not. I have no idea what is happening.
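
      One way to see exactly what the script received is to print each UTF-16 code unit of the string as hex and compare it with the expected code points. The following is a minimal sketch in plain JavaScript; codeUnitsToHex is a hypothetical helper name, not part of the UltraEdit scripting API.

```javascript
// Hypothetical helper (not an UltraEdit API function): list the UTF-16
// code units of a string as 4-digit uppercase hex values.
function codeUnitsToHex(text) {
  var parts = [];
  for (var i = 0; i < text.length; i++) {
    var hex = text.charCodeAt(i).toString(16).toUpperCase();
    // Left-pad to 4 digits so every code unit has the same width.
    while (hex.length < 4) {
      hex = "0" + hex;
    }
    parts.push(hex);
  }
  return parts.join(" ");
}
```

      In an UltraEdit script the result could be passed to UltraEdit.outputWindow.write(), as in the test script above. For あい the expected values are 3042 3044, so anything else means the string was changed before the script saw it.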

      Any clues?

        Feb 17, 2012 #3

        I'm currently thinking maybe the document is using a certain Unicode encoding format, but the JavaScript interface somehow is using a different Unicode encoding format, and at some point in the data transfer, bits are being truncated or something. This makes some characters "out of range" so they are corrupted in the process.
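
        To illustrate the kind of mismatch guessed at here, the sketch below encodes a string as UTF-8 bytes and then misreads every byte as a character of its own. This is only an assumption about how such corruption can arise, not a confirmed description of what UltraEdit does; utf8Bytes and misreadBytesAsChars are hypothetical helper names.

```javascript
// Hypothetical helper: return the UTF-8 bytes of a string as an array of numbers.
// encodeURIComponent() percent-encodes non-ASCII characters as their UTF-8 bytes.
// (It throws a URIError for a lone surrogate, which this sketch does not handle.)
function utf8Bytes(text) {
  var encoded = encodeURIComponent(text);
  var bytes = [];
  for (var i = 0; i < encoded.length; i++) {
    if (encoded.charAt(i) === "%") {
      bytes.push(parseInt(encoded.substring(i + 1, i + 3), 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(encoded.charCodeAt(i)); // character left unescaped (plain ASCII)
    }
  }
  return bytes;
}

// Hypothetical helper: treat each byte as one character (a Latin-1 style misreading).
function misreadBytesAsChars(bytes) {
  var out = "";
  for (var i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}
```

        The character な (U+306A) is the three bytes E3 81 AA in UTF-8, so misreadBytesAsChars(utf8Bytes("な")) yields three characters instead of one, the first of which is ã.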

        Any idea on how we can tell UltraEdit's scripting interface how it should handle and store string data? That would probably solve the problem.


          Feb 18, 2012 #4

          UltraEdit converts all Unicode file formats to UTF-16 Little Endian on load. So every character of a Unicode file is held in UltraEdit's memory as 2 bytes, or more precisely as an unsigned 16-bit value (unsigned short int). String variables are managed entirely by the JavaScript engine. Unfortunately, the documentation of the String object on the Mozilla Developer Network says little about Unicode strings.
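
          As a small illustration of the in-memory layout described here, this sketch lists the bytes a string would occupy as UTF-16 Little Endian, low byte first. It is plain JavaScript and does not touch the UltraEdit API; utf16leBytesHex and toHexByte are hypothetical helper names.

```javascript
// Hypothetical helper: format one byte (0-255) as two uppercase hex digits.
function toHexByte(value) {
  var hex = value.toString(16).toUpperCase();
  return hex.length < 2 ? "0" + hex : hex;
}

// Hypothetical helper: list the UTF-16 Little Endian bytes of a string.
// Each 16-bit code unit is emitted as its low byte followed by its high byte.
function utf16leBytesHex(text) {
  var out = [];
  for (var i = 0; i < text.length; i++) {
    var unit = text.charCodeAt(i);
    out.push(toHexByte(unit & 0xFF)); // low byte first (little endian)
    out.push(toHexByte(unit >> 8));   // then the high byte
  }
  return out.join(" ");
}
```

          For あ (U+3042) this gives 42 30. Characters outside the Basic Multilingual Plane are stored as two 16-bit units, so they occupy 4 bytes rather than 2.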

          More about Unicode support in JavaScript can be found at Values, variables, and literals - Unicode, but that page has nothing specific to strings.

          As you can read on the very technical page Mozilla internal string guide for C++ programmers, JavaScript supports 8-bit (ANSI) and 16-bit (Unicode) strings. But how to use the Unicode variant in scripts is not explained on that page.

          I have never needed to write a script that works on a Unicode file myself. I tried several times, on behalf of people asking here, to find out how to deal with Unicode strings in JavaScript scripts, but the results were poor. The only script function with which I have had success on Unicode strings is the HexCopy function, which is written for binary data streams and also produces the correct result for Unicode strings with a 16-bit value for every character. But this function is of no use for modifying text in Unicode encoded text files.

          In summary: I don't have any idea how to reformat a Unicode text file with a JavaScript script when string variables must be used too.


          Correction: I suddenly had an idea how to work with Unicode strings within a script, see Script or macro to identify unicode codepage data. But I still don't know how to get Unicode strings into string variables without conversion to ANSI strings.

          However, perhaps you can work on a Unicode file with the user clipboards and replaces, as I did in the referenced topic. I looked at your code and I think it can be written using a user clipboard and normal replaces plus 1 UltraEdit regular expression replace.

          Code:

          if (UltraEdit.document.length > 0)
          {
             UltraEdit.insertMode();
             UltraEdit.columnModeOff();
             UltraEdit.activeDocument.hexOff();
             UltraEdit.activeDocument.top();
          
             // Ask the user how each sentence entry ends (typically, this is the Japanese period character)
             // Entered character is inserted at top of the file and cut to user clipboard 9.
             UltraEdit.getString("What ends each sentence?",0);
             UltraEdit.selectClipboard(9);
             UltraEdit.activeDocument.selectToTop();
             UltraEdit.activeDocument.cut();
          
             // Use DOS-style line terminator for Windows Notepad Unicode .txt files
             var sLineTerminator = "^p";
          
             // Define all properties for the replace commands below.
             UltraEdit.activeDocument.findReplace.mode=0;
             UltraEdit.activeDocument.findReplace.matchCase=true;
             UltraEdit.activeDocument.findReplace.matchWord=false;
             UltraEdit.activeDocument.findReplace.regExp=false;
             UltraEdit.activeDocument.findReplace.searchDown=true;
             UltraEdit.activeDocument.findReplace.searchInColumn=false;
             UltraEdit.activeDocument.findReplace.preserveCase=false;
             UltraEdit.activeDocument.findReplace.replaceAll=true;
             UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
             UltraEdit.ueReOn();
          
             // Remove all line terminators in the file, making the data one continuous line of text
             UltraEdit.activeDocument.findReplace.replace(sLineTerminator, "");
          
             // Collapse every run of two or more spaces into a single space
             UltraEdit.activeDocument.findReplace.regExp=true;
             UltraEdit.activeDocument.findReplace.replace("  +", " ");
          
             // Replace a period plus a space with a period (removing leading spaces from entries)
             UltraEdit.activeDocument.findReplace.regExp=false;
             UltraEdit.activeDocument.findReplace.replace("^c ", "^c");
          
             // Replace a period with a period plus a terminator, which will put each sentence on its own line
             UltraEdit.activeDocument.findReplace.replace("^c", "^c" + sLineTerminator);
          
             UltraEdit.clearClipboard();
             UltraEdit.selectClipboard(0);
          }


            Feb 21, 2012 #5

            Thanks, mofi.

            That's extremely helpful. I had figured that the JavaScript engine was basically encoding the strings in its own way, but I was hoping to have some control over that process in order to work around this issue. Oh well, it looks like that's all abstracted from the user and it's not possible to influence how strings are stored in memory.

            Your idea of using the user clipboard is great! Thanks very much for putting in the time to write up that example script. I'm going to give this a shot and see if it works to solve my problem.

            Thanks again!
            TheSleeve