Scripts don't work on (with | in) files with Cyrillic names

11327
Master

    Apr 16, 2015 #1

    For me, scripts don't work on files whose names contain Cyrillic characters.
    Scenario:

    Windows 7 SP1 Eng
    Location Russia
    Language for non-Unicode Programs - Russian
    Date/Time format - English (USA)
    Code page/Locale in UE - 1251

    What I tried:

    Changing the location, booting in safe mode, and loading only basic services and system drivers via msconfig had no effect. Only changing Language for non-Unicode Programs to English helped, but that isn't acceptable. I have asked IDM support, but they can't reproduce this behavior. Has anyone had the same problem? Any suggestions would be appreciated.
    It's impossible to lead us astray for we don't care even to choose the way.

    6,602548
    Grand Master

      Apr 17, 2015 #2

      Could you post a small example script and the name of the file(s) with Cyrillic letters?
      Best regards from an UC/UE/UES for Windows user from Austria

      11327
      Master

        Apr 17, 2015 #3

        Dear Mofi!

        Thanks for the reply! The requested info is in the attachment.
        Test.zip (405 Bytes)

        6,602548
        Grand Master

          Apr 18, 2015 #4

          I could reproduce this issue and now know when it occurs. But I can neither suggest a real workaround that always works nor offer IDM Computer Solutions, Inc. a really good idea for fixing it so that the problem does not occur anymore in the future.

          Below I describe what I did to reproduce the issue and why a script does not work on an active file whose name contains Unicode characters that are not present in the code page currently selected in UltraEdit. Please also refer to the attached image, a collage of several screenshots.
          1. First I extracted the files from the ZIP file on my German Windows XP to C:\Windows\Temp (NTFS partition).

            Next, in the region and language settings of Windows XP, I selected Russian as the language for non-Unicode-aware applications and restarted Windows as requested after this change. I changed no other region and language settings, so they were still set for German.

            Then I started UltraEdit and opened the text file with the unreadable file name by drag and drop from my favorite file manager, Total Commander.

            I executed the UltraEdit script and it worked.

            As can be seen in the attached image by comparing the file name on the file tab with the one in the title bar of the UE main window, the short file name was passed to UltraEdit because the file name contained at least one character that does not exist in the Windows system code page for non-Unicode-aware applications. UltraEdit takes the short file name and also uses it internally, but determines the long file name in Unicode and displays that on the file tab.
          2. I can't read Cyrillic, but I was quite sure that the displayed file name was not a Cyrillic word.

            So I opened the ZIP file once again. This time, with Russian instead of German set for all non-Unicode-aware applications, the file name was displayed differently in the ZIP file. I extracted the text file with the Cyrillic name once again to C:\Windows\Temp and now had it with the correct name on my hard disk.

            I opened the file directly from within UltraEdit with File - Open, which resulted in the same text being displayed on the file tab and in the title bar.

            I executed the script and it worked again.
          3. I opened C:\Windows\Temp also in Windows Explorer to see how Explorer displays the file names.

            The text file with the Cyrillic name, stored in the ZIP file with the hexadecimal values 92 A5 E1 E2 2E 74 78 74 using the OEM 866 code page (Russian in console windows), was first extracted with code page OEM 850 (Latin I in console windows). The result was a file named ÆÑßÔ.txt in the Windows 1252 code page (Latin I in GUI windows), consisting of the bytes C6 D1 DF D4 2E 74 78 74.

            Later the same file was extracted using the OEM 866 code page with which it was compressed into the archive. It is displayed in Windows GUI applications with the Windows 1251 code page (Russian) as Тест.txt, consisting of the bytes D2 E5 F1 F2 2E 74 78 74.
          4. Now I opened in UltraEdit Advanced - Set Code Page/Locale and changed code page to 1252 (ANSI - Latin I) and locale to English US. This means that for UltraEdit only the code page is the one for Western European and North American countries while for all other Windows applications the code page for single byte encoded characters is 1251 (ANSI - Cyrillic).

            Next I used File - Revert to Saved and executed the script once again.

            This time it did not change anything in the file. The output window did not display any error either.

            JavaScript strings can be single-byte encoded strings (a char string in C/C++) or Unicode strings (a wchar string such as CString or QString, depending on the library used, in C/C++). All I have found so far about Unicode strings in JavaScript is that they are supported. I have found nothing about how to work with them.

            How can a Unicode string be defined in a JavaScript file? How does conversion from a code page to Unicode and back work in JavaScript? How can this conversion be controlled? If there is a JavaScript guru reading this, please enlighten me. All I know is that Unicode characters can be assigned to a JavaScript string by using the fromCharCode() function of the JavaScript String object.

            Why am I writing about JavaScript strings and supported character encodings?

            Well, as I could see that the script does not work anymore, I added to script as first line the command

            Code:

            UltraEdit.messageBox(UltraEdit.activeDocument.path);
            Now I could see which file name was stored for active file in the JavaScript documents array: C:\Windows\Temp\Oano.txt instead of C:\Windows\Temp\Тест.txt.

            As there is no file Oano.txt opened in UltraEdit, the file related commands do nothing.
          5. Where does Oano.txt as file name come from?

            The string Тест.txt consists of 8 bytes with the hexadecimal values D2 E5 F1 F2 2E 74 78 74 with Windows 1251 code page as defined in Windows language settings.

            This can be seen by using as display font in UltraEdit for example Arial with script (code page) Cyrillic selected.
          6. But the same 8 bytes, interpreted with the Windows 1252 code page that is now selected just for UltraEdit, form the string Òåñò.txt.

            This can be seen by opening View - Set Font once again and now selecting Western as script (code page).

            This string converted to ASCII (just bytes with values 0 to 127 decimal) results in file name Oano.txt.
          7. So I closed C:\Windows\Temp\Тест.txt without saving in UltraEdit and opened this file again by drag and drop from Total Commander.

            As it is visible in title of main window of UltraEdit, the file was opened using short file name which consists always only of ASCII characters. Now the script worked again on file without changing anything else.
          8. I now restored my standard configuration: I selected German again in the Windows language settings for non-Unicode-aware applications (a really bad setting name, as in reality it selects the system code pages for single-byte encoded strings in Windows GUI and console applications), restarted Windows, and restored uedit32.ini from a backup to reset all font, code page and locale changes I had made in UltraEdit.

            I wanted to know if this issue is reproducible also with the standard settings for Western European and North American countries which would make it easier for IDM Computer Solutions, Inc. to reproduce it.

            So I opened the file C:\Windows\Temp\ÆÑßÔ.txt using File - Open in UltraEdit.

            With this opening method, the same string is displayed as the file name on the file tab and in the title bar of the main UltraEdit window.
          9. I executed the script and as expected there was no problem as code page of file name matches with code page used for UltraEdit (1252).
          10. Then I selected at Advanced - Set Code Page/Locale the code page 1251 (ANSI - Cyrillic) and locale Russian.

            Running the script again changed nothing in the file, and the message box displayed the file name ?N?O.txt, which of course is not even a valid file name.

            The characters Æ and ß do not exist in code page 1251 and are therefore replaced by a question mark, while Ñ and Ô are converted to the ASCII characters N and O.
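
          The byte-level effect described in steps 3 to 6 and 10 can be replayed outside UltraEdit. The following is only a sketch, not what UltraEdit or Windows does internally. It assumes a JavaScript runtime whose TextDecoder supports legacy code pages (for example Node.js built with full ICU, or a current browser), and foldToAscii is a hypothetical stand-in for the conversion to ASCII, approximated here by stripping diacritics and replacing everything else outside ASCII with a question mark.

```javascript
// Sketch only: reinterpret the same file-name bytes under different code pages.
// Assumption: TextDecoder knows "ibm866", "windows-1251" and "windows-1252".

// "Тест" as stored in the ZIP archive (OEM 866) and on disk (Windows 1251).
const oem866Bytes = Uint8Array.from([0x92, 0xA5, 0xE1, 0xE2]);
const ansi1251Bytes = Uint8Array.from([0xD2, 0xE5, 0xF1, 0xF2]);

// Decode a byte sequence with the given code page label.
function decodeBytes(bytes, codePageLabel) {
  return new TextDecoder(codePageLabel).decode(bytes);
}

// Hypothetical approximation of the conversion to ASCII:
// decompose accented letters, drop the combining marks,
// and replace whatever is still outside ASCII with "?".
function foldToAscii(text) {
  return text
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .replace(/[^\x00-\x7F]/g, "?");
}

const fromArchive = decodeBytes(oem866Bytes, "ibm866");        // Cyrillic name from the archive
const asCyrillic = decodeBytes(ansi1251Bytes, "windows-1251"); // same name, code page 1251
const asWestern = decodeBytes(ansi1251Bytes, "windows-1252");  // same bytes, code page 1252

console.log(asCyrillic, asWestern, foldToAscii(asWestern) + ".txt");
// The name from steps 8 to 10, written with escapes: Æ Ñ ß Ô
console.log(foldToAscii("\u00C6\u00D1\u00DF\u00D4") + ".txt");
```

          Under these assumptions the first line prints Тест, Òåñò and Oano.txt, and the second prints ?N?O.txt, which matches the names seen in the message boxes.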
          Summary:
          1. There is never a problem with script execution on files with file name (with path) consisting only of ASCII characters as the characters with code values 0 to 127 decimal (0 to 7F hexadecimal) are identical in all code pages and all text encodings, i.e. file name is C:\Windows\Temp\Test.txt

            That's one reason why I use only ASCII characters for directory and file names and additionally avoid spaces, any brackets and some other characters listed on last help page output in a command prompt window on running command cmd /? simply because I want my life with computers as easy as possible.
          2. There is never a problem with script execution on files opened with short 8.3 file name because the path and name of the file in this format consists also only of ASCII characters.

            When a file name is passed via drag and drop to an already running application, Windows automatically uses the short file name if the name or path contains a character that does not exist in the system code page configured for non-Unicode-aware applications.

            C:\Windows\Temp\D70C~1.TXT instead of Windows 1252 or Unicode file name C:\Windows\Temp\ÆÑßÔ.txt
            C:\Windows\Temp\5C5C~1.TXT instead of Windows 1251 or Unicode file name C:\Windows\Temp\Тест.txt

            But this solution is not good if the script should write the name of the file into the file. Who wants to see D70C~1.TXT or 5C5C~1.TXT in a file? On the other hand, if the path property of the file contained the long file name in Unicode, inserting the file name into a file that is not encoded as Unicode would also be a problem.
          3. There is never a problem with script execution on files having characters in file name with path existing in code page selected in Windows region and language settings and this code page is also set for UltraEdit.

            It does not matter which code page is set for the contents of active file via encoding/status bar item or using View - Set Code Page. This is not important here on this issue with file name conversion from Unicode to ANSI/ASCII. Just the code page for UltraEdit application itself is of importance as set at Advanced - Set Code Page/Locale.

            C:\Windows\Temp\ÆÑßÔ.txt with Windows 1252 code page set directly in UltraEdit or indirectly via the Windows region and language setting.
            C:\Windows\Temp\Тест.txt with Windows 1251 code page set directly in UltraEdit or indirectly via the Windows region and language setting.
          4. But if a file is opened in UltraEdit using its Unicode file name and the file name (with path) contains characters that do not exist in the Windows system code page, or rather in the code page currently selected for the UltraEdit application, a script with commands for this file won't do anything.

            Using UltraEdit.document[x] instead of UltraEdit.activeDocument also does not work in this case.
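
          On the question raised in step 4 of how to define a Unicode string in a script: per the ECMAScript specification, a JavaScript string is a sequence of 16-bit UTF-16 code units, and both \uXXXX escapes and String.fromCharCode() are standard ways to write characters independently of the encoding of the script file. A minimal sketch follows. isAsciiOnly is a hypothetical helper for summary item 1, and whether UltraEdit's script engine passes such strings to file commands unchanged is not verified here.

```javascript
// Two standard ways to build the Cyrillic name without relying on
// the encoding in which the script file itself is saved.
var viaEscapes = "\u0422\u0435\u0441\u0442.txt";
var viaCharCodes = String.fromCharCode(0x0422, 0x0435, 0x0441, 0x0442) + ".txt";

// Hypothetical helper: true when every character of the path has a
// code value from 0 to 127, the case that never caused a problem.
function isAsciiOnly(path) {
  for (var i = 0; i < path.length; i++) {
    if (path.charCodeAt(i) > 0x7F) {
      return false;
    }
  }
  return true;
}

var longName = "C:\\Windows\\Temp\\" + viaEscapes;   // Unicode long file name
var shortName = "C:\\Windows\\Temp\\5C5C~1.TXT";     // 8.3 short file name

console.log(viaEscapes === viaCharCodes); // both spellings give the same string
console.log(isAsciiOnly(longName), isAsciiOnly(shortName));
```

          A script could use such a check to warn before it runs file-related commands on a document whose path is not plain ASCII.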
          You should report this by email to IDM support. A link to this topic should be enough for IDM.

          This issue can be reproduced easily by
          1. extracting file ÆÑßÔ.txt (Тест.txt with Russian locales) and file test.js to a temporary folder,
          2. opening the text file in UltraEdit with File - Open,
          3. selecting in UltraEdit at Advanced - Set Code Page/Locale the code page 1251 (ANSI - Cyrillic) and locale Russian (while code page 1252 and English US are the defaults for IDM),
          4. adding the script file to the list of scripts,
          5. executing the script, best after adding as first line additionally the command

            Code:

            UltraEdit.messageBox(UltraEdit.activeDocument.path);
            to better see what is the problem here.
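
          To make the mismatch easier to spot than in a message box, the path of every open document could be written to the output window with non-ASCII characters shown as code values. This is a sketch that assumes the usual UltraEdit scripting objects UltraEdit.document and UltraEdit.outputWindow. describePath is a hypothetical helper, and the UltraEdit part is skipped when the script runs anywhere else.

```javascript
// Hypothetical helper: return the path with every non-ASCII character
// replaced by its code value in the form <U+XXXX>.
function describePath(path) {
  var out = "";
  for (var i = 0; i < path.length; i++) {
    var code = path.charCodeAt(i);
    if (code > 0x7F) {
      var hex = code.toString(16).toUpperCase();
      while (hex.length < 4) {
        hex = "0" + hex;
      }
      out += "<U+" + hex + ">";
    } else {
      out += path.charAt(i);
    }
  }
  return out;
}

// Runs only inside UltraEdit, where the UltraEdit object exists.
if (typeof UltraEdit !== "undefined") {
  UltraEdit.outputWindow.showWindow(true);
  for (var n = 0; n < UltraEdit.document.length; n++) {
    // One line per open document: index and path as the script sees it.
    UltraEdit.outputWindow.write(n + ": " + describePath(UltraEdit.document[n].path));
  }
}
```

          A path that reaches the script already folded to ASCII, such as Oano.txt, appears without any code values, while a path that still contains Cyrillic letters shows their code values.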
          As I have written at the beginning, I have no suggestion how to deal with this most likely very rare case of file name containing characters not existing in code page/locale selected for UltraEdit.

          PS: It took me 45 minutes to make all the tests and screenshots, but about 4 hours to make the collage, write this text and verify everything once again. I hope it is not too difficult to understand for programmers who know the basics of code pages and text encoding. I do not expect that non-programmers who have never read anything about how text is stored on computers can understand what I wrote. For those users I suggest reading the power tip Working with Unicode in UltraEdit/UEStudio, which has a brief introduction to text encoding, and the two articles referenced in it, written by Tim Bray and Joel Spolsky.
          TestCyrillicScript.png (25.7KiB)
          Collage of screen shots made on examining this issue.

          11327
          Master

            Apr 19, 2015 #5

            Thank you, Mofi, for diving so deep into this problem! But I noticed something strange: for me, scripts work on files with Cyrillic names only when I set Advanced -> Set Code Page/Locale -> System Installed Code Pages to 1252 (Latin 1). They do not work if code page 1251 is set, even if I create the file with the Cyrillic name in UE. I found that out after reading your post (about 1252).

            6,602548
            Grand Master

              Apr 20, 2015 #6

              Hm, that is interesting. I have to check whether the behavior with a Cyrillic file name, individual code page settings, and Russian Windows region and language settings differs on Windows 7/8 x64 compared with Windows XP x86.

              Some extra information regarding Advanced - Set Code Page/Locale:
              1. If either of the two settings is set to "System Default" Locale/CodePage, UltraEdit loads the code page as well as the locale setting from Windows region and language settings on closing the dialog with button OK. Therefore on next opening of this dialog both settings are set according to Windows region and language settings, not just the one explicitly set to use system default.
              2. If at least one of the two settings is set to "C" Default Locale/CodePage - Previously Used and the other is not "System Default" Locale/CodePage, both settings are set to "C" Default Locale/CodePage - Previously Used on closing the dialog with button OK.
              3. Setting code page and locale manually to individual values works only when neither of them is set to one of the first two list items. A manual selection of both settings also makes it possible to define a combination that is not really compatible, like code page 1252 with locale Russian.
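
              The three rules can be written down as a small model. This is a sketch of the behavior as described in this post, not of UltraEdit's actual implementation. The function name, the constants and the shape of the windowsSettings object are all hypothetical.

```javascript
// Labels standing in for the first two list items of the dialog (hypothetical constants).
var SYSTEM_DEFAULT = "System Default";
var C_DEFAULT = "C Default - Previously Used";

// Model of what both settings become after the dialog is closed with OK.
// windowsSettings: { codePage: ..., locale: ... } from Windows region and language settings.
function resolveCodePageLocale(codePage, locale, windowsSettings) {
  // Rule 1: one system default pulls both values from Windows.
  if (codePage === SYSTEM_DEFAULT || locale === SYSTEM_DEFAULT) {
    return { codePage: windowsSettings.codePage, locale: windowsSettings.locale };
  }
  // Rule 2: one "C" default (and no system default) sets both to the "C" default.
  if (codePage === C_DEFAULT || locale === C_DEFAULT) {
    return { codePage: C_DEFAULT, locale: C_DEFAULT };
  }
  // Rule 3: only two manual selections are kept as chosen, compatible or not.
  return { codePage: codePage, locale: locale };
}

var windowsRussian = { codePage: "1251", locale: "Russian" };
var mixed = resolveCodePageLocale("1252", "Russian", windowsRussian);
console.log(mixed.codePage, mixed.locale); // manual pair kept: 1252 Russian
```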

              11327
              Master

                Apr 20, 2015 #7

                One more strange thing:

                I have Windows 7 SP1 x64 English version without Russian interface (with English one), but with Russian keyboard.

                If I set Advanced - Set Code Page/Locale to "System Default" Locale/CodePage, UE sets the code page to 1252 and the locale to English-US, although in the Windows regional settings the locale is set to Russian and the location to Russia.