Script or macro to identify unicode codepage data


    Feb 16, 2012#1

    Hello,
    I have a large UTF-8 file with more than 200 thousand strings in different scripts (code blocks/code pages): Latin, Arabic, Devanagari, Chinese, Japanese.

    I need to extract from the file only the following:

    All strings having basic Latin characters: 0021-007E,
    all strings in the Devanagari range: 0900 to 097F,
    and store them in two separate files.

    Many thanks in advance. I have never tried character identification in UltraEdit, hence the request. At present I sort the file (which is around 2,000,000 records), and that is very painful.

    A sample file is given below:

    Code:

    wanavati
    wanowrie
    wapcos
    warada
    warangal
    ward no
    warishnagar
    warispura
    warlees
    warnali
    waroda
    warshiya
    warud
    wasdi
    washermenpet
    washim
    wasimal
    wathar
    wayangwade
    wazirpur
    webworld
    wecors
    wester
    westgodavari
    wests
    wharnsby
    wheelers
    whitefield
    whitefields
    winchester
    wind
    windermere
    winze
    wireles
    wkshp
    वनवाडी
    वयांगवाडे
    वरोडा
    वर्कशॉप
    वसडी
    वसिमल
    वानावती
    वायरलेस
    वारंगल
    वारधा
    वारिसनगर
    वारिसपूरा
    वारूड
    वार्नाली
    वार्शिया
    वाशरमेनपेट
    वाशिम
    विंड
    विनचेस्टर
    विन्झे
    विन्डरमियर
    वॅस्टगोदावरी
    वॅस्टर
    वेकोर्स
    वेधर
    वेपकोस
    वेबवर्ल्ड
    वेस्ट्स
    वॉर्ड नं
    व्हाइटफील्ड्स
    व्हाइटफ़ील्ड
    व्हार्न्सबी
    व्हीलर्ज़
    वज़ीरपुर
    Many thanks for any help.


      Feb 18, 2012#2

      First, please don't write subjects completely in upper case letters. I always convert them to lower case. Nobody else on this board writes subjects completely in upper case.

      Second, this is also the first time that I have written a script evaluating character values. However, here is a script which worked for your example. It is most likely not very fast because it analyzes the words line by line.

      I have not added code that opens a new edit window which automatically becomes active. Such a window would avoid display updates on the active document while the script runs with maximized document windows. You can add that if you want; it would make the script much faster. But I wanted to see what is going on, and the display updates were no problem on your small example file.

      I did not have a better idea for working with Unicode strings than using the clipboards; see the topic "Unicode data corrupted when sent to an array" for the reason.

      Code:

      if (UltraEdit.document.length > 0)
      {
         UltraEdit.insertMode();
         UltraEdit.columnModeOff();
         UltraEdit.activeDocument.hexOff();
         // Make sure that last line of file has a line termination.
         UltraEdit.activeDocument.bottom();
         if (UltraEdit.activeDocument.isColNumGt(1))
         {
            UltraEdit.activeDocument.insertLine();
            if (UltraEdit.activeDocument.isColNumGt(1))
            {
               UltraEdit.activeDocument.deleteToStartOfLine();
            }
         }
         UltraEdit.activeDocument.top();
      
         UltraEdit.selectClipboard(7);  // For Devanagari words.
         UltraEdit.clearClipboard();
         UltraEdit.selectClipboard(8);  // For basic Latin words.
         UltraEdit.clearClipboard();
         UltraEdit.selectClipboard(9);  // For analyzing the character values of a word.
      
         // Analyze the words in the lines line by line from first to last line.
         while (!UltraEdit.activeDocument.isEof())
         {
            // Get number of active line because needed later.
            var nLineNum = UltraEdit.activeDocument.currentLineNum;
      
            // Select everything in the current line.
            UltraEdit.activeDocument.startSelect();
            UltraEdit.activeDocument.key("END");
            UltraEdit.activeDocument.endSelect();
            // There should be no blank line in the file or the script
            // would fail because of no check for nothing selected.
      
            // Copy selected word (or string) to user clipboard 9.
            UltraEdit.activeDocument.copy();
      
            // Get character value of first character.
            var nCharValue = UltraEdit.clipboardContent.charCodeAt(0);
      
            if (nCharValue >= 0x0900 && nCharValue <= 0x097F)      // Devanagari character?
            {
               // Are all other characters also Devanagari characters?
               for (var nCharIndex = 1; nCharIndex < UltraEdit.clipboardContent.length; nCharIndex++)
               {
                  nCharValue = UltraEdit.clipboardContent.charCodeAt(nCharIndex);
                  if (nCharValue < 0x0900 || nCharValue > 0x097F) break;
                  // if ((nCharValue != 0x0020) && (nCharValue < 0x0900 || nCharValue > 0x097F)) break;
               }
               if (nCharIndex == UltraEdit.clipboardContent.length)
               {
                  // Yes, select the entire line with line termination
                  // and append this line to user clipboard 7.
                  UltraEdit.activeDocument.selectLine();
                  UltraEdit.selectClipboard(7);
                  UltraEdit.activeDocument.copyAppend();
                  UltraEdit.selectClipboard(9);
               }
            }
            else if (nCharValue >= 0x0021 && nCharValue <= 0x007E) // Basic Latin character?
            {
               // Are all other characters also basic Latin characters?
               for (var nCharIndex = 1; nCharIndex < UltraEdit.clipboardContent.length; nCharIndex++)
               {
                  nCharValue = UltraEdit.clipboardContent.charCodeAt(nCharIndex);
                  if (nCharValue < 0x0021 || nCharValue > 0x007E) break;
                  // if (nCharValue < 0x0020 || nCharValue > 0x007E) break;
               }
               if (nCharIndex == UltraEdit.clipboardContent.length)
               {
                  // Yes, select the entire line with line termination
                  // and append this line to user clipboard 8.
                  UltraEdit.activeDocument.selectLine();
                  UltraEdit.selectClipboard(8);
                  UltraEdit.activeDocument.copyAppend();
                  UltraEdit.selectClipboard(9);
               }
            }
            // Move caret to start of next line with discarding selection.
            UltraEdit.activeDocument.gotoLine(++nLineNum,1);
         }
         UltraEdit.clearClipboard();  // Clear user clipboard 9.
         UltraEdit.activeDocument.top();
      
         UltraEdit.selectClipboard(8);
         if (UltraEdit.clipboardContent.length)  // Any basic Latin word found?
         {
            UltraEdit.newFile();                 // Create new file and paste
            UltraEdit.activeDocument.paste();    // all lines with basic Latin
            UltraEdit.clearClipboard();          // words into this file. Then
            UltraEdit.activeDocument.top();      // delete the user clipboard 8.
         }
         else UltraEdit.messageBox("No basic Latin word found.");
      
         UltraEdit.selectClipboard(7);           // Same as above for Devanagari.
         if (UltraEdit.clipboardContent.length)
         {
            UltraEdit.newFile();
            UltraEdit.activeDocument.ASCIIToUnicode();
            UltraEdit.activeDocument.paste();
            UltraEdit.clearClipboard();
            UltraEdit.activeDocument.top();
         }
         else UltraEdit.messageBox("No Devanagari word found.");
      
         UltraEdit.selectClipboard(0);  // Switch back to Windows clipboard.
      }
      By the way: there is one Devanagari string and one basic Latin string containing a space character. Those lines are ignored by this script according to your rules. I have added as comments the 2 alternative IF conditions inside the 2 FOR loops, which would allow spaces in both character ranges except as the first character of a line.
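
      The range check and the optional space rule described above can also be tried outside UltraEdit on plain strings. The following is only a minimal sketch: classifyLine and its bAllowSpace parameter are hypothetical names that are not part of the UltraEdit scripting API, and the function expects a single line without its line termination.

```javascript
// Hypothetical helper, not part of the UltraEdit scripting API.
// Returns "devanagari", "latin" or null for one line without line termination.
function classifyLine(sLine, bAllowSpace) {
   if (sLine.length === 0) return null;   // A blank line matches neither range.
   var aRanges = [
      { name: "devanagari", min: 0x0900, max: 0x097F },
      { name: "latin",      min: 0x0021, max: 0x007E }
   ];
   for (var nRange = 0; nRange < aRanges.length; nRange++) {
      var oRange = aRanges[nRange];
      // The first character must always be inside the range (no leading space).
      var nFirst = sLine.charCodeAt(0);
      if (nFirst < oRange.min || nFirst > oRange.max) continue;
      var nCharIndex;
      for (nCharIndex = 1; nCharIndex < sLine.length; nCharIndex++) {
         var nCharValue = sLine.charCodeAt(nCharIndex);
         if (bAllowSpace && nCharValue === 0x0020) continue;   // Space tolerated.
         if (nCharValue < oRange.min || nCharValue > oRange.max) break;
      }
      // The loop ran to the end only if every character was accepted.
      if (nCharIndex === sLine.length) return oRange.name;
   }
   return null;
}
```

      With bAllowSpace set to false, a line such as "ward no" is rejected, as in the script above; with true it is classified as basic Latin.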

        Feb 18, 2012#3

        I suddenly had an idea. The Perl regular expression engine supports character ranges with hexadecimal values within a character set. With \xhh an ANSI character can be defined by its hexadecimal value, and with \x{hhhh} a Unicode character can be defined by its hexadecimal value. Used within square brackets, these make it easy to search for strings within a specific character value range. With some additional expressions, the Perl regular expression can be used to find 1 or more lines containing only characters of a specified set.

        By using a Perl regular expression Find to locate the lines containing only characters of a specified value range, a script (or macro) can do the task much faster than the script above. In this script the space character is valid in both code pages because it is included in both Perl regular expression character set definitions.

        Code:

        if (UltraEdit.document.length > 0)
        {
           UltraEdit.insertMode();
           UltraEdit.columnModeOff();
           UltraEdit.activeDocument.hexOff();
           // Make sure that last line of file has a line termination.
           UltraEdit.activeDocument.bottom();
           if (UltraEdit.activeDocument.isColNumGt(1))
           {
              UltraEdit.activeDocument.insertLine();
              if (UltraEdit.activeDocument.isColNumGt(1))
              {
                 UltraEdit.activeDocument.deleteToStartOfLine();
              }
           }
           UltraEdit.activeDocument.top();
        
           UltraEdit.perlReOn();
           UltraEdit.activeDocument.findReplace.mode=0;
           UltraEdit.activeDocument.findReplace.matchCase=true;
           UltraEdit.activeDocument.findReplace.matchWord=false;
           UltraEdit.activeDocument.findReplace.regExp=true;
           UltraEdit.activeDocument.findReplace.searchDown=true;
           UltraEdit.activeDocument.findReplace.searchInColumn=false;
        
           UltraEdit.selectClipboard(7);  // For Devanagari words.
           UltraEdit.clearClipboard();
        
           // Find 1 or more lines with Devanagari words (spaces allowed).
           var sSearchExp = "(?:^[ \\x{0900}-\\x{097F}]+\\r\\n){1,}";
           while( UltraEdit.activeDocument.findReplace.find(sSearchExp))
           {
              UltraEdit.activeDocument.copyAppend();
           }
        
           UltraEdit.activeDocument.top();
           UltraEdit.selectClipboard(8);  // For basic Latin words.
           UltraEdit.clearClipboard();
           // Find 1 or more lines with basic Latin words (spaces allowed).
           sSearchExp = "(?:^[\\x{0020}-\\x{007E}]+\\r\\n){1,}";
           while( UltraEdit.activeDocument.findReplace.find(sSearchExp))
           {
              UltraEdit.activeDocument.copyAppend();
           }
           UltraEdit.activeDocument.top();
        
           if (UltraEdit.clipboardContent.length)  // Any basic Latin word found?
           {
              UltraEdit.newFile();                 // Create new file and paste
              UltraEdit.activeDocument.paste();    // all lines with basic Latin
              UltraEdit.clearClipboard();          // words into this file. Then
              UltraEdit.activeDocument.top();      // delete the user clipboard 8.
           }
           else UltraEdit.messageBox("No basic Latin word found.");
        
           UltraEdit.selectClipboard(7);           // Same as above for Devanagari.
           if (UltraEdit.clipboardContent.length)
           {
              UltraEdit.newFile();
              UltraEdit.activeDocument.ASCIIToUnicode();
              UltraEdit.activeDocument.paste();
              UltraEdit.clearClipboard();
              UltraEdit.activeDocument.top();
           }
           else UltraEdit.messageBox("No Devanagari word found.");
        
           UltraEdit.selectClipboard(0);  // Switch back to Windows clipboard.
        }
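
        The two character sets can also be tested in plain JavaScript without UltraEdit. This is only a sketch under assumptions: JavaScript's own RegExp writes a code point as \uhhhh instead of Perl's \x{hhhh}, the basic Latin set below stops at U+007E as in the original request, and splitByScript is a hypothetical name that does not exist in the UltraEdit scripting API.

```javascript
// Hypothetical stand-alone sketch; it does not use the UltraEdit object.
var rDevanagari = /^[ \u0900-\u097F]+$/;   // Only spaces and U+0900 to U+097F.
var rBasicLatin = /^[\u0020-\u007E]+$/;    // Only U+0020 to U+007E.

// Splits a text into lines and sorts each line into one of two lists.
// Lines matching neither character set are dropped.
function splitByScript(sText) {
   var oResult = { devanagari: [], latin: [] };
   var asLines = sText.split(/\r?\n/);
   for (var nLine = 0; nLine < asLines.length; nLine++) {
      var sLine = asLines[nLine];
      if (rDevanagari.test(sLine)) oResult.devanagari.push(sLine);
      else if (rBasicLatin.test(sLine)) oResult.latin.push(sLine);
   }
   return oResult;
}
```

        As in the script above, a space is accepted anywhere in a line, so a line consisting only of spaces would land in the Devanagari list because that set is tested first.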


          Feb 19, 2012#4

          Dear Mofi,
          The second solution worked brilliantly and really fast, and I can now segregate English and Devanagari. With a bit of tweaking I can assign further code blocks and pipe them out to different files, which means the tool could easily separate different code pages into different files.
          I will test it on the huge file and get back to you with the speed.
          Many thanks once more for the brilliant solution. You are the best.
          Sorry for the upper case subject. I had forgotten the rules.
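
          One way such tweaking for further code blocks might look is to build the Perl search expression from a table of ranges instead of writing each string by hand. This is only a sketch: buildSearchExp, toHex4 and the aBlocks table are hypothetical names, and the Arabic range U+0600 to U+06FF is an assumption added here, not something tested in this thread.

```javascript
// Hypothetical helpers for generating one search string per code block.
function toHex4(nValue) {
   var sHex = nValue.toString(16).toUpperCase();
   while (sHex.length < 4) sHex = "0" + sHex;   // Pad to 4 hexadecimal digits.
   return sHex;
}

// Returns a Perl regular expression string matching 1 or more complete
// lines (with \r\n termination) of spaces and characters from nMin to nMax.
function buildSearchExp(nMin, nMax) {
   return "(?:^[ \\x{" + toHex4(nMin) + "}-\\x{" + toHex4(nMax) + "}]+\\r\\n){1,}";
}

// Assumed block table; extend it with the ranges actually needed.
var aBlocks = [
   { name: "Devanagari", min: 0x0900, max: 0x097F },
   { name: "Arabic",     min: 0x0600, max: 0x06FF }
];

var asSearchExps = [];
for (var nBlock = 0; nBlock < aBlocks.length; nBlock++) {
   asSearchExps.push(buildSearchExp(aBlocks[nBlock].min, aBlocks[nBlock].max));
}
```

          Each generated string could then be passed to UltraEdit.activeDocument.findReplace.find() in a loop like the one in the second script, with one user clipboard per block.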