Script or macro to identify unicode codepage data

dictdoc · Feb 16, 2012, 13:17 UTC · #1

Hello,
I have a large UTF-8 file with more than 200 thousand strings in different scripts (code blocks/code pages): Latin, Arabic, Devanagari, Chinese, Japanese.

I need to extract from the file only the following:

All strings having basic Latin characters: 0021-007E,
all strings in the Devanagari range: 0900 to 097F,
and store them in two separate files.

Many thanks in advance. I have never tried character identification in UltraEdit, hence the request. At present I sort the file (around 2,000,000 records), which is very painful.

A sample file is given below:

Code:

wanavati
wanowrie
wapcos
warada
warangal
ward no
warishnagar
warispura
warlees
warnali
waroda
warshiya
warud
wasdi
washermenpet
washim
wasimal
wathar
wayangwade
wazirpur
webworld
wecors
wester
westgodavari
wests
wharnsby
wheelers
whitefield
whitefields
winchester
wind
windermere
winze
wireles
wkshp
वनवाडी
वयांगवाडे
वरोडा
वर्कशॉप
वसडी
वसिमल
वानावती
वायरलेस
वारंगल
वारधा
वारिसनगर
वारिसपूरा
वारूड
वार्नाली
वार्शिया
वाशरमेनपेट
वाशिम
विंड
विनचेस्टर
विन्झे
विन्डरमियर
वॅस्टगोदावरी
वॅस्टर
वेकोर्स
वेधर
वेपकोस
वेबवर्ल्ड
वेस्ट्स
वॉर्ड नं
व्हाइटफील्ड्स
व्हाइटफ़ील्ड
व्हार्न्सबी
व्हीलर्ज़
वज़ीरपुर

Many thanks for any help.

Mofi · Feb 18, 2012, 19:25 UTC · #2

First, please don't write subjects completely in upper case letters. So far I have always converted them to lower case. Nobody else on this board writes subjects completely in upper case.

Second, this is also the first time I have written a script that evaluates character values. Here is a script which worked for your example. It is most likely not very fast because it analyzes the words line by line.

I have not added code that opens a new edit window and makes it active at script start, which would avoid display updates on the active document while the script runs with maximized document windows. You can add that if you want; it would make the script much faster. I wanted to see what is going on, and the display updates were no problem with your small example file.

I had no better idea for working with Unicode strings than using the clipboards; see the topic "Unicode data corrupted when sent to an array" for the reason.

Code:

if (UltraEdit.document.length > 0)
{
   UltraEdit.insertMode();
   UltraEdit.columnModeOff();
   UltraEdit.activeDocument.hexOff();
   // Make sure that last line of file has a line termination.
   UltraEdit.activeDocument.bottom();
   if (UltraEdit.activeDocument.isColNumGt(1))
   {
      UltraEdit.activeDocument.insertLine();
      if (UltraEdit.activeDocument.isColNumGt(1))
      {
         UltraEdit.activeDocument.deleteToStartOfLine();
      }
   }
   UltraEdit.activeDocument.top();

   UltraEdit.selectClipboard(7);  // For Devanagari words.
   UltraEdit.clearClipboard();
   UltraEdit.selectClipboard(8);  // For basic Latin words.
   UltraEdit.clearClipboard();
   UltraEdit.selectClipboard(9);  // For analyzing the character values of a word.

   // Analyze the words in the lines line by line from first to last line.
   while (!UltraEdit.activeDocument.isEof())
   {
      // Get number of active line because needed later.
      var nLineNum = UltraEdit.activeDocument.currentLineNum;

      // Select everything in the current line.
      UltraEdit.activeDocument.startSelect();
      UltraEdit.activeDocument.key("END");
      UltraEdit.activeDocument.endSelect();
      // There should be no blank line in the file or the script
      // would fail because of no check for nothing selected.

      // Copy selected word (or string) to user clipboard 9.
      UltraEdit.activeDocument.copy();

      // Get character value of first character.
      var nCharValue = UltraEdit.clipboardContent.charCodeAt(0);

      if (nCharValue >= 0x0900 && nCharValue <= 0x097F)      // Devanagari character?
      {
         // Are all other characters also Devanagari characters?
         for (var nCharIndex = 1; nCharIndex < UltraEdit.clipboardContent.length; nCharIndex++)
         {
            nCharValue = UltraEdit.clipboardContent.charCodeAt(nCharIndex);
            if (nCharValue < 0x0900 || nCharValue > 0x097F) break;
            // if ((nCharValue != 0x0020) && (nCharValue < 0x0900 || nCharValue > 0x097F)) break;
         }
         if (nCharIndex == UltraEdit.clipboardContent.length)
         {
            // Yes, select the entire line with line termination
            // and append this line to user clipboard 7.
            UltraEdit.activeDocument.selectLine();
            UltraEdit.selectClipboard(7);
            UltraEdit.activeDocument.copyAppend();
            UltraEdit.selectClipboard(9);
         }
      }
      else if (nCharValue >= 0x0021 && nCharValue <= 0x007E) // Basic Latin character?
      {
         // Are all other characters also basic Latin characters?
         for (var nCharIndex = 1; nCharIndex < UltraEdit.clipboardContent.length; nCharIndex++)
         {
            nCharValue = UltraEdit.clipboardContent.charCodeAt(nCharIndex);
            if (nCharValue < 0x0021 || nCharValue > 0x007E) break;
            // if (nCharValue < 0x0020 || nCharValue > 0x007E) break;
         }
         if (nCharIndex == UltraEdit.clipboardContent.length)
         {
            // Yes, select the entire line with line termination
            // and append this line to user clipboard 8.
            UltraEdit.activeDocument.selectLine();
            UltraEdit.selectClipboard(8);
            UltraEdit.activeDocument.copyAppend();
            UltraEdit.selectClipboard(9);
         }
      }
      // Move caret to start of next line with discarding selection.
      UltraEdit.activeDocument.gotoLine(++nLineNum,1);
   }
   UltraEdit.clearClipboard();  // Clear user clipboard 9.
   UltraEdit.activeDocument.top();

   UltraEdit.selectClipboard(8);
   if (UltraEdit.clipboardContent.length)  // Any basic Latin word found?
   {
      UltraEdit.newFile();                 // Create new file and paste
      UltraEdit.activeDocument.paste();    // all lines with basic Latin
      UltraEdit.clearClipboard();          // words into this file. Then
      UltraEdit.activeDocument.top();      // delete the user clipboard 8.
   }
   else UltraEdit.messageBox("No basic Latin word found.");

   UltraEdit.selectClipboard(7);           // Same as above for Devanagari.
   if (UltraEdit.clipboardContent.length)
   {
      UltraEdit.newFile();
      UltraEdit.activeDocument.ASCIIToUnicode();
      UltraEdit.activeDocument.paste();
      UltraEdit.clearClipboard();
      UltraEdit.activeDocument.top();
   }
   else UltraEdit.messageBox("No Devanagari word found.");

   UltraEdit.selectClipboard(0);  // Switch back to Windows clipboard.
}

By the way: there is one Devanagari string and one basic Latin string containing a space character. Those lines are ignored by this script according to your rules. Inside the two FOR loops I have added, as comments, two alternative IF conditions which allow spaces in both character ranges, except as the first character of a line.
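The range check with and without spaces can also be isolated into a small helper, which makes it easy to try outside UltraEdit. This is only a sketch: `isAllInRange` and its `bAllowSpace` parameter are names invented here, not part of the script above or of the UltraEdit API, and the function works on a plain JavaScript string instead of `UltraEdit.clipboardContent`.

```javascript
// Hypothetical helper (not part of the UltraEdit API): returns true if every
// UTF-16 code unit of sText lies in the range nLow to nHigh. With bAllowSpace
// set, a space (0x0020) is also accepted, except as the first character.
function isAllInRange(sText, nLow, nHigh, bAllowSpace) {
   if (sText.length === 0) return false;   // An empty line matches nothing.
   for (var nCharIndex = 0; nCharIndex < sText.length; nCharIndex++) {
      var nCharValue = sText.charCodeAt(nCharIndex);
      if (bAllowSpace && nCharIndex > 0 && nCharValue === 0x0020) continue;
      if (nCharValue < nLow || nCharValue > nHigh) return false;
   }
   return true;
}

console.log(isAllInRange("warangal", 0x0021, 0x007E, false));  // true
console.log(isAllInRange("ward no", 0x0021, 0x007E, false));   // false
console.log(isAllInRange("ward no", 0x0021, 0x007E, true));    // true
console.log(isAllInRange("वॉर्ड नं", 0x0900, 0x097F, true));     // true
console.log(isAllInRange("वारंगल", 0x0021, 0x007E, true));      // false
```

With `bAllowSpace` set to false the helper follows the strict rules from the request; with true it corresponds to the commented IF conditions.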

Mofi · Feb 18, 2012, 20:01 UTC · #3

I suddenly had an idea. The Perl regular expression engine supports character ranges with hexadecimal values within a character set. With \xdd an ANSI character can be defined by value, and with \x{dddd} a Unicode character can be defined by value. Used inside square brackets, these make it easy to search for strings within a specific character value range. With some additional expressions, the Perl regular expression can find 1 or more lines containing only characters of a specified set.
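The character set idea can be tried on single strings in plain JavaScript as well. Note the assumption: JavaScript's own RegExp syntax writes a character value as \uXXXX, where the Perl engine uses \x{XXXX}, so this sketch is an analogue and not the expression used in UltraEdit. The name `classifyLine` is invented for this illustration.

```javascript
// Plain JavaScript analogue of a character set defined by value ranges.
// A space is allowed in both sets, as in the Perl expressions.
var reDevanagariLine = /^[ \u0900-\u097F]+$/;
var reBasicLatinLine = /^[\u0020-\u007E]+$/;

// Hypothetical helper: name the character range a single line belongs to.
function classifyLine(sLine) {
   if (reDevanagariLine.test(sLine)) return "devanagari";
   if (reBasicLatinLine.test(sLine)) return "latin";
   return "other";
}

console.log(classifyLine("washim"));       // latin
console.log(classifyLine("वाशिम"));        // devanagari
console.log(classifyLine("wash वाशिम"));   // other
```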

Using a Perl regular expression Find to locate the lines containing only characters of a specified value range, a script (or macro) can do the task much faster than the script above. In this script the space character is valid in both code pages because it is included in both Perl regular expression character set definitions.

Code:

if (UltraEdit.document.length > 0)
{
   UltraEdit.insertMode();
   UltraEdit.columnModeOff();
   UltraEdit.activeDocument.hexOff();
   // Make sure that last line of file has a line termination.
   UltraEdit.activeDocument.bottom();
   if (UltraEdit.activeDocument.isColNumGt(1))
   {
      UltraEdit.activeDocument.insertLine();
      if (UltraEdit.activeDocument.isColNumGt(1))
      {
         UltraEdit.activeDocument.deleteToStartOfLine();
      }
   }
   UltraEdit.activeDocument.top();

   UltraEdit.perlReOn();
   UltraEdit.activeDocument.findReplace.mode=0;
   UltraEdit.activeDocument.findReplace.matchCase=true;
   UltraEdit.activeDocument.findReplace.matchWord=false;
   UltraEdit.activeDocument.findReplace.regExp=true;
   UltraEdit.activeDocument.findReplace.searchDown=true;
   UltraEdit.activeDocument.findReplace.searchInColumn=false;

   UltraEdit.selectClipboard(7);  // For Devanagari words.
   UltraEdit.clearClipboard();

   // Find 1 or more lines with Devanagari words (spaces allowed).
   var sSearchExp = "(?:^[ \\x{0900}-\\x{097F}]+\\r\\n){1,}";
   while( UltraEdit.activeDocument.findReplace.find(sSearchExp))
   {
      UltraEdit.activeDocument.copyAppend();
   }

   UltraEdit.activeDocument.top();
   UltraEdit.selectClipboard(8);  // For basic Latin words.
   UltraEdit.clearClipboard();
   // Find 1 or more lines with basic Latin words (spaces allowed).
   sSearchExp = "(?:^[\\x{0020}-\\x{007E}]+\\r\\n){1,}";
   while( UltraEdit.activeDocument.findReplace.find(sSearchExp))
   {
      UltraEdit.activeDocument.copyAppend();
   }
   UltraEdit.activeDocument.top();

   if (UltraEdit.clipboardContent.length)  // Any basic Latin word found?
   {
      UltraEdit.newFile();                 // Create new file and paste
      UltraEdit.activeDocument.paste();    // all lines with basic Latin
      UltraEdit.clearClipboard();          // words into this file. Then
      UltraEdit.activeDocument.top();      // delete the user clipboard 8.
   }
   else UltraEdit.messageBox("No basic Latin word found.");

   UltraEdit.selectClipboard(7);           // Same as above for Devanagari.
   if (UltraEdit.clipboardContent.length)
   {
      UltraEdit.newFile();
      UltraEdit.activeDocument.ASCIIToUnicode();
      UltraEdit.activeDocument.paste();
      UltraEdit.clearClipboard();
      UltraEdit.activeDocument.top();
   }
   else UltraEdit.messageBox("No Devanagari word found.");

   UltraEdit.selectClipboard(0);  // Switch back to Windows clipboard.
}

dictdoc · Feb 19, 2012, 06:21 UTC · #4

Dear Mofi,
The second solution worked brilliantly and really fast, and I can now segregate English and Devanagari. With a bit of tweaking I can assign code blocks and pipe them out to different files, which means the tool could easily segregate different code pages into different files.
I will test it on the huge file and get back to you with the speed.
Many thanks once more for the brilliant solution. You are the best.
Sorry for UpperCasing. Had forgotten the rules.
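
The tweaking mentioned above could start from a table of code blocks from which the search expressions are generated. The following is only a sketch under assumptions: the table `aCodeBlocks` and the functions `toHex4` and `buildSearchExp` are invented names, the Arabic entry uses the Unicode Arabic block U+0600 to U+06FF as an example, and DOS line terminations (\r\n) are assumed as in the script above.

```javascript
// Hypothetical table of code blocks; extend it with further ranges as needed.
var aCodeBlocks = [
   { sName: "Basic Latin", nLow: 0x0021, nHigh: 0x007E },
   { sName: "Devanagari",  nLow: 0x0900, nHigh: 0x097F },
   { sName: "Arabic",      nLow: 0x0600, nHigh: 0x06FF }
];

// Format a character value as hexadecimal string with at least 4 digits.
function toHex4(nValue) {
   var sHex = nValue.toString(16).toUpperCase();
   while (sHex.length < 4) sHex = "0" + sHex;
   return sHex;
}

// Build a Perl regular expression string matching 1 or more consecutive
// lines which consist only of spaces and characters of the given range.
function buildSearchExp(nLow, nHigh) {
   return "(?:^[ \\x{" + toHex4(nLow) + "}-\\x{" + toHex4(nHigh) + "}]+\\r\\n)+";
}

for (var nBlock = 0; nBlock < aCodeBlocks.length; nBlock++) {
   var oBlock = aCodeBlocks[nBlock];
   console.log(oBlock.sName + ": " + buildSearchExp(oBlock.nLow, oBlock.nHigh));
}
```

Each generated string could then be passed to `UltraEdit.activeDocument.findReplace.find()` in a loop like the ones in the second script, with one user clipboard per code block.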