Auto-detect encoding of files while find and replace in files

    Oct 19, 2018 #1

    Hi,

    Is it possible to set the Auto-detect option for the encoding of the file(s) that I want to run a find and replace on? I mean in a script, of course. If yes, how? I can't find the script command parameter.

    Also, can the script file be a UTF-8 file? I want to find and replace special characters which cannot be represented in an ASCII file.

    Thanks in advance for your help.

    Piotr

      Oct 20, 2018 #2

      I will answer the second question first.

      UltraEdit for Windows v24.00, UEStudio v17.00 and all later versions are fully Unicode-aware applications that support scripts encoded as:
      1. ANSI using the system code page for GUI applications, or
      2. UTF-8 with or without BOM, or
      3. UTF-16 Little Endian with or without BOM.
      This also means that on execution of a script, Unicode characters
      • are read correctly from the script file into JavaScript strings, independent of the encoding of the script file, and
      • can be read from an opened file into a JavaScript string, and
      • can be written from a JavaScript string into an opened file, and
      • can be copied from a JavaScript string to the active clipboard, and
      • can be read from the active clipboard into a JavaScript string.
      It is, however, also possible to run finds, replaces, Find in Files and Replace in Files with characters not supported by the system code page in earlier versions of UltraEdit, which support only ANSI encoded scripts. This is done by using a Perl regular expression and encoding those characters with their hexadecimal values. I wrote the script UnicodeStringToPerlRegExp.js for converting a Unicode character sequence, a UTF-8 byte stream or even an ANSI character stream into a Perl regular expression string. So a Perl regular expression Replace in Files can search for the bytes of UTF-8 encoded characters and replace them with the bytes of a UTF-8 encoded replace string. For example, UTF-8 encoded ä can be searched for with \xC3\xA4 and replaced by UTF-8 encoded Ω using \xE2\x84\xA6 without using the Use encoding option at all.
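
      The hexadecimal escapes in the example can be derived mechanically. The following is only a minimal sketch of the idea and not the UnicodeStringToPerlRegExp.js script mentioned above; the function name utf8ToPerlHex is made up for this illustration. It assumes nothing beyond the standard JavaScript function encodeURIComponent, which percent-encodes the UTF-8 bytes of a string:

```javascript
// Hypothetical helper (not part of UltraEdit): converts a string into a Perl
// regular expression string matching the UTF-8 bytes of that string.
function utf8ToPerlHex(sText) {
   // encodeURIComponent yields %XX for every UTF-8 byte it escapes and
   // leaves some ASCII characters (letters, digits, a few signs) unchanged.
   var sEncoded = encodeURIComponent(sText);
   var sRegExp = "";
   for (var nIndex = 0; nIndex < sEncoded.length; nIndex++) {
      var sHex;
      if (sEncoded.charAt(nIndex) === "%") {
         // Take over the two hexadecimal digits of the escaped byte.
         sHex = sEncoded.substr(nIndex + 1, 2);
         nIndex += 2;
      } else {
         // Unescaped ASCII character: its code value is its byte value.
         sHex = sEncoded.charCodeAt(nIndex).toString(16);
      }
      sRegExp += "\\x" + sHex.toUpperCase();
   }
   return sRegExp;
}

var sFind = utf8ToPerlHex("\u00E4");    // ä
var sReplace = utf8ToPerlHex("\u2126"); // Ω (Ohm sign, U+2126)
```

      With these two calls, sFind holds \xC3\xA4 and sReplace holds \xE2\x84\xA6, the same byte sequences as in the example above. ASCII characters are converted to hexadecimal as well, so they cannot be misinterpreted as regular expression operators.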

      A script file can also be UTF-8 encoded with UE < v24.00 and UES < v17.00, provided it has no BOM (byte order mark) and UTF-8 encoded characters exist only in comments or in find/replace strings. Please see the topic UltraEdit.clipboardContent not supporting Chinese characters? for more details on what is possible regarding Unicode characters in UltraEdit scripts with UE < v24.00 and UES < v17.00.

      The first question was interesting, as nobody had asked it before. I quickly wrote a script which, on execution, creates an ANSI, a UTF-8 and a UTF-16 encoded file, each with just a few very short lines and two characters with a code value greater than decimal 127. Then I ran several simple, non-regular-expression Replace in Files manually with Use encoding set to Auto-detect to see how those replaces work on the three differently encoded files.

      I was astonished to see with UE v25.20.0.88 that characters in a very small ANSI encoded file were replaced by UTF-8 encoded characters, so that the file finally contained both ANSI and UTF-8 encoded characters. But this happens only with a very small ANSI encoded file of just 208 bytes. The same characters in the same ANSI encoded character block in a larger ANSI encoded file of 119 KB were replaced correctly and are ANSI encoded after the replace. It looks like the Auto-detect encoding setting of Find/Replace in Files needs a certain number of bytes in a file to correctly detect that a file is ANSI and not UTF-8 encoded.
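
      Why a file with both ANSI and UTF-8 encoded characters is a problem can be illustrated outside of UltraEdit. The following is only a simplified sketch and not the detection algorithm of UltraEdit, which is not documented in this topic. The function name isValidUtf8 is hypothetical, and the check ignores overlong encodings and surrogates. It shows that ä stored as the single Windows-1252 byte E4 does not form a valid UTF-8 sequence:

```javascript
// Hypothetical helper: simplified structural check whether an array of byte
// values is a sequence of complete UTF-8 encoded characters.
function isValidUtf8(aBytes) {
   var nIndex = 0;
   while (nIndex < aBytes.length) {
      var nByte = aBytes[nIndex];
      var nFollow; // Number of continuation bytes expected after the lead byte.
      if (nByte <= 0x7F) nFollow = 0;
      else if (nByte >= 0xC2 && nByte <= 0xDF) nFollow = 1;
      else if (nByte >= 0xE0 && nByte <= 0xEF) nFollow = 2;
      else if (nByte >= 0xF0 && nByte <= 0xF4) nFollow = 3;
      else return false; // Continuation byte or invalid byte in lead position.
      // The character must not be cut off at the end of the data.
      if (nIndex + nFollow >= aBytes.length) return false;
      for (var nOffset = 1; nOffset <= nFollow; nOffset++) {
         // Every continuation byte has the bit pattern 10xxxxxx.
         if ((aBytes[nIndex + nOffset] & 0xC0) !== 0x80) return false;
      }
      nIndex += nFollow + 1;
   }
   return true;
}

var aUtf8 = [0x42, 0xC3, 0xA4, 0x75, 0x6D, 0x65]; // "Bäume" UTF-8 encoded
var aAnsi = [0x42, 0xE4, 0x75, 0x6D, 0x65];       // "Bäume" Windows-1252 encoded
var aMixed = aAnsi.concat(aUtf8);                 // Both encodings in one file
```

      isValidUtf8(aUtf8) returns true, while isValidUtf8(aAnsi) and isValidUtf8(aMixed) return false. A file in the mixed state is therefore displayed correctly neither as ANSI nor as UTF-8.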

      The encoding of the UTF-8 encoded file with just 211 bytes and no BOM was always detected correctly, and the file was updated correctly by the Replace in Files I executed manually.

      The UTF-16 LE encoded file with BOM was also always updated correctly by all Replace in Files.

      I have to find out with more experiments how many bytes UltraEdit requires to detect ANSI encoding in small ANSI encoded files when running a Find/Replace in Files with Use encoding enabled and set to Auto-detect.

      It was also interesting to me that the small ANSI encoded file of just 208 bytes was always opened as ANSI and never as UTF-8 encoded. So the encoding auto-detection of Replace in Files works a bit differently from the encoding auto-detection on opening a file, which I did not expect.

      Next, I recorded the manually executed Replace in Files with UE v25.20.0.88 into a macro and played the recorded macro after restoring the differently encoded files to their original contents. That worked as expected and produced the same file contents as the manually executed Replace in Files before.

      I looked at the macro code and saw the value -2 for the option Auto-detect.

      So I modified the initially created script and added the Replace in Files with exactly the same options as used manually before and recorded into the macro. I wrote the two encoding options into the script file as:

      Code:

      UltraEdit.frInFiles.useEncoding=true;
      UltraEdit.frInFiles.encoding=-2;

      But that was not a good idea, because UltraEdit crashed on executing the first UltraEdit.frInFiles.replace with those parameters. I restarted UltraEdit and executed the script again, and UltraEdit crashed again. Of course, I will report this crash by email to IDM support.

      Conclusion: The encoding option Auto-detect in combination with the option Use encoding can currently not be used in an UltraEdit script, only in an UltraEdit macro or manually.

        Jan 23, 2019 #3

        The issue resulting in a crash of UltraEdit when using the code below in an UltraEdit/UEStudio script is fixed in UltraEdit for Windows v25.20.0.156 and UEStudio v18.20.0.40.

        Code:

        UltraEdit.frInFiles.useEncoding=true;
        UltraEdit.frInFiles.encoding=-2;

        But a Replace in Files with search string Baeume haben Aeste (English: trees have branches) and replace string Bäume haben Äste still does not produce the correct results when it is run with auto-detection of encoding from within a UTF-8 or UTF-16 encoded script on Windows-1252 (ANSI), UTF-8 and UTF-16 encoded files containing the searched string. The replace is done only in the UTF-16 encoded file. Nothing is replaced in the Windows-1252 and UTF-8 encoded files in this case.

        The same Replace in Files on the same three differently encoded files works with UltraEdit v25.20.0.156 and UEStudio v18.20.0.40 when it is run manually or from within a macro, for example with:

        Code:

        PerlReOn
        ReplInFiles Log UseEncoding -2 "C:\Temp\\" "EncodingTest_*.txt" "Baeume haben Aeste" "Bäume haben Äste"
        
        UltraEdit for Windows < v25.20.0.156 and UEStudio < v18.20.0.40 do not perform this ASCII to ANSI/UTF-8/UTF-16 replace correctly for ANSI and UTF-8 encoded files, even when this Replace in Files is executed manually or from within a macro.
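
        The version limits named in this post are four-part build numbers, and comparing them as plain strings would give a wrong order (as strings, 25.20.0.88 sorts after 25.20.0.156). The following is a minimal sketch of a numeric comparison. The function name compareVersions is hypothetical, and how the version of the running UltraEdit is obtained is not covered here; the version string is assumed to be supplied by the user:

```javascript
// Hypothetical helper: compares two dotted version strings part by part as
// numbers. Returns -1, 0 or 1 for A older than, equal to or newer than B.
function compareVersions(sVersionA, sVersionB) {
   var asPartsA = sVersionA.split(".");
   var asPartsB = sVersionB.split(".");
   var nCount = Math.max(asPartsA.length, asPartsB.length);
   for (var nIndex = 0; nIndex < nCount; nIndex++) {
      // Missing parts are interpreted as 0, so "25.00" equals "25.0.0.0".
      var nPartA = parseInt(asPartsA[nIndex] || "0", 10);
      var nPartB = parseInt(asPartsB[nIndex] || "0", 10);
      if (nPartA !== nPartB) return (nPartA < nPartB) ? -1 : 1;
   }
   return 0;
}

// Assumption: the user entered the version of the installed UltraEdit.
var sInstalled = "25.20.0.88";
var bCrashFixed = compareVersions(sInstalled, "25.20.0.156") >= 0;
```

        For the version 25.20.0.88 used in the first tests, bCrashFixed is false, so a script could skip setting the encoding to -2 instead of crashing.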

        Here is my UTF-8 or UTF-16 encoded test script, which produces the three differently encoded files and runs Replace in Files to convert strings containing only ASCII characters to ANSI or Unicode characters. Please note that this script can be executed only with UltraEdit for Windows v25.20.0.156, UEStudio v18.20.0.40 or newer versions. Older versions of UltraEdit or UEStudio crash on executing it.

        Code:

        var g_sDirectory="C:\\Temp\\";
        var g_sFileName="EncodingTest_";
        var g_sFileExt=".txt";
        
        var g_sCorrect = "correct";
        var g_sLine2 = "Line 2: ";
        var g_sLine3 = "Line 3: ";
        var g_sLine4 = "Line 4: ";
        var g_sLine5 = "Line 5: ";
        var g_sResult = "Results for file: ";
        var g_sWrong = "wrong";
        
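        // Write the five test lines into the active file, save the file with
        // the encoding name as part of the file name and close it.
        function WriteSaveClose (sEncoding)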
        {
           UltraEdit.activeDocument.unixMacToDos();
           var sArticle = (sEncoding == "ANSI") ? "an " : "a ";
           UltraEdit.activeDocument.write("This is " + sArticle + sEncoding + " encoded file with DOS line endings.\r\n");
           UltraEdit.activeDocument.write("» Baeume haben Aeste. (Trees have branches.)«\r\n");
           UltraEdit.activeDocument.write("» One quarter is 1/4 and one half is 1/2. «\r\n");
           UltraEdit.activeDocument.write("Ohm sign 'Omega' has Unicode code value U+2126.\r\n");
           UltraEdit.activeDocument.write("(c) by Mofi for free usage by UltraEdit/UEStudio users.\r\n");
           UltraEdit.saveAs(g_sDirectory + g_sFileName + sEncoding + g_sFileExt);
           UltraEdit.closeFile(UltraEdit.activeDocument.path,2);
        }
        
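        // Open the test file with the given encoding name and write into the
        // output window for lines 2 to 5 whether the replace result is as expected.
        function EvaluateReplaces (sEncoding)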
        {
           var sFileName = g_sDirectory + g_sFileName + sEncoding + g_sFileExt;
           UltraEdit.open(sFileName);
           UltraEdit.activeDocument.top();
           UltraEdit.activeDocument.findReplace.mode=0;
           UltraEdit.activeDocument.findReplace.matchCase=true;
           UltraEdit.activeDocument.findReplace.matchWord=false;
           UltraEdit.activeDocument.findReplace.regExp=false;
           UltraEdit.activeDocument.findReplace.searchDown=true;
           UltraEdit.activeDocument.findReplace.searchInColumn=false;
           var sResultLine2 = g_sLine2 + ((UltraEdit.activeDocument.findReplace.find("» Bäume haben Äste. (Trees have branches.)«")) ? g_sCorrect : g_sWrong);
           var sResultLine3 = g_sLine3 + ((UltraEdit.activeDocument.findReplace.find(">> One quarter is ¼ and one half is ½. <<")) ? g_sCorrect : g_sWrong);
           var sResultLine4 = g_sLine4;
           if (sEncoding == "ANSI")
           {
              sResultLine4 += (UltraEdit.activeDocument.findReplace.find("Ω")) ? g_sWrong : g_sCorrect;
           }
           else
           {
              sResultLine4 += (UltraEdit.activeDocument.findReplace.find("Ω")) ? g_sCorrect : g_sWrong;
           }
           var sResultLine5 = g_sLine5 + ((UltraEdit.activeDocument.findReplace.find("©")) ? g_sCorrect : g_sWrong);
           UltraEdit.outputWindow.write(g_sResult + sFileName);
           UltraEdit.outputWindow.write(sResultLine2);
           UltraEdit.outputWindow.write(sResultLine3);
           UltraEdit.outputWindow.write(sResultLine4);
           UltraEdit.outputWindow.write(sResultLine5);
        }
        
        // Define environment for this script.
        UltraEdit.insertMode();
        if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
        else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
        
        // Close all files except this script file with file extension .js.
        for (var nDocIndex = UltraEdit.document.length - 1; nDocIndex >= 0; nDocIndex--)
        {
           if (UltraEdit.document[nDocIndex].path.search(/\.js$/i) < 0)
           {
              UltraEdit.closeFile(UltraEdit.document[nDocIndex].path,2);
           }
        }
        
        // Create an ANSI encoded file using code page 1252 in my case, save and close it.
        UltraEdit.newFile();
        if(UltraEdit.activeDocument.codePage == 65001)
        {
           UltraEdit.activeDocument.UTF8ToASCII();
        }
        else if(UltraEdit.activeDocument.codePage == 1200)
        {
           UltraEdit.activeDocument.unicodeToASCII();
        }
        WriteSaveClose("ANSI");
        
        // Create a UTF-8 encoded file, save and close it.
        UltraEdit.newFile();
        if(UltraEdit.activeDocument.codePage != 65001)
        {
           if(UltraEdit.activeDocument.codePage == 1200)
           {
              UltraEdit.activeDocument.unicodeToASCII();
           }
           UltraEdit.activeDocument.ASCIIToUTF8();
        }
        WriteSaveClose("UTF-8");
        
        
        // Create a UTF-16 encoded file, save and close it.
        UltraEdit.newFile();
        if(UltraEdit.activeDocument.codePage != 1200)
        {
           if(UltraEdit.activeDocument.codePage == 65001)
           {
              UltraEdit.activeDocument.UTF8ToASCII();
           }
           UltraEdit.activeDocument.ASCIIToUnicode();
        }
        WriteSaveClose("UTF-16");
        
        UltraEdit.ueReOn();
        UltraEdit.frInFiles.useEncoding=true;
        UltraEdit.frInFiles.encoding=-2;
        UltraEdit.frInFiles.directoryStart=g_sDirectory;
        UltraEdit.frInFiles.searchInFilesTypes=g_sFileName + "*" + g_sFileExt;
        UltraEdit.frInFiles.filesToSearch=0;
        UltraEdit.frInFiles.ignoreHiddenSubs=false;
        UltraEdit.frInFiles.logChanges=true;
        UltraEdit.frInFiles.matchCase=true;
        UltraEdit.frInFiles.matchWord=false;
        UltraEdit.frInFiles.openMatchingFiles=false;
        UltraEdit.frInFiles.preserveCase=false;
        UltraEdit.frInFiles.searchSubs=false;
        UltraEdit.frInFiles.regExp=false;
        UltraEdit.frInFiles.replace("Baeume haben Aeste","Bäume haben Äste");
        UltraEdit.frInFiles.replace("» One quarter is 1/4 and one half is 1/2. «",">> One quarter is ¼ and one half is ½. <<");
        UltraEdit.frInFiles.replace("Omega","Ω");
        UltraEdit.frInFiles.replace("(c)","©");
        
        UltraEdit.outputWindow.clear();
        EvaluateReplaces("ANSI");
        UltraEdit.outputWindow.write("");
        EvaluateReplaces("UTF-8");
        UltraEdit.outputWindow.write("");
        EvaluateReplaces("UTF-16");
        UltraEdit.outputWindow.showStatus=false;
        UltraEdit.outputWindow.showWindow(true);
        
        I reported this script-specific issue of Replace in Files with auto-detection of encoding to IDM support by email.

        I think this issue is caused by the fact that the searched and found string contains only ASCII characters. So UltraEdit can't determine the character encoding used in an ANSI or UTF-8 (without BOM) encoded file from the searched/found string alone. The encoding detection is much easier for a UTF-16 encoded file (with BOM). However, I did not expect the difference between manual/macro execution and script execution. The four Replace in Files also work for a UTF-8 encoded file if it has a BOM. But UTF-8 encoded files usually have no BOM.
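
        The point about ASCII-only strings can be checked with a few lines of JavaScript. This is only an illustration of the byte values and makes no statement about what UltraEdit does internally. The function names are hypothetical, and toAnsiBytes covers only characters for which Windows-1252 byte values equal the Unicode code points (below U+0080 and from U+00A0 to U+00FF):

```javascript
// Hypothetical helper: returns the UTF-8 byte values of a string.
function toUtf8Bytes(sText) {
   var sEncoded = encodeURIComponent(sText); // %XX for each escaped byte
   var aBytes = [];
   for (var nIndex = 0; nIndex < sEncoded.length; nIndex++) {
      if (sEncoded.charAt(nIndex) === "%") {
         aBytes.push(parseInt(sEncoded.substr(nIndex + 1, 2), 16));
         nIndex += 2;
      } else {
         aBytes.push(sEncoded.charCodeAt(nIndex));
      }
   }
   return aBytes;
}

// Hypothetical helper: returns the Windows-1252 byte values of a string for
// the subset of characters where byte value and code point are identical.
function toAnsiBytes(sText) {
   var aBytes = [];
   for (var nIndex = 0; nIndex < sText.length; nIndex++) {
      var nCode = sText.charCodeAt(nIndex);
      if (nCode > 0xFF || (nCode >= 0x80 && nCode <= 0x9F)) {
         throw new Error("Character not covered by this sketch: U+" + nCode.toString(16));
      }
      aBytes.push(nCode);
   }
   return aBytes;
}

// True if the string is stored with identical bytes in both encodings.
function hasSameBytes(sText) {
   return toUtf8Bytes(sText).join(",") === toAnsiBytes(sText).join(",");
}

var bSearchSame = hasSameBytes("Baeume haben Aeste");
var bReplaceSame = hasSameBytes("B\u00E4ume haben \u00C4ste");
```

        bSearchSame is true and bReplaceSame is false. The searched string has the same bytes in an ANSI and in a UTF-8 encoded file, so only the rest of the file content can reveal which bytes must be written for the replace string.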

        The script can be used to just create the three EncodingTest_*.txt files in directory C:\Temp by commenting out, with a block comment, all lines from the line containing UltraEdit.ueReOn(); to the end of the file.

        Then a macro with the following lines can be executed with UltraEdit for Windows v25.00 or any newer version:

        Code:

        InsertMode
        ColumnModeOff
        HexOff
        UltraEditReOn
        ReplInFiles MatchCase Log UseEncoding -2 "C:\Temp\\" "EncodingTest_*.txt" "Baeume haben Aeste" "Bäume haben Äste"
        ReplInFiles MatchCase Log UseEncoding -2 "C:\Temp\\" "EncodingTest_*.txt" "» One quarter is 1/4 and one half is 1/2. «" ">> One quarter is ¼ and one half is ½. <<"
        ReplInFiles MatchCase Log UseEncoding -2 "C:\Temp\\" "EncodingTest_*.txt" "Omega" "Ω"
        ReplInFiles MatchCase Log UseEncoding -2 "C:\Temp\\" "EncodingTest_*.txt" "(c)" "©"
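
        The ReplInFiles lines of this macro all follow one pattern, so they could also be generated for a longer list of replaces. The following sketch only concatenates text in the form shown above. The function name buildReplInFilesLines is hypothetical, and it is an assumption that the find and replace strings contain no double quote, because escaping in macro strings is not covered in this topic:

```javascript
// Hypothetical helper: builds one ReplInFiles macro line per find/replace
// pair with the options used in the macro of this post.
// Assumption: none of the strings contains a double quote character.
function buildReplInFilesLines(sDirectory, sFileMask, aPairs) {
   var asLines = [];
   for (var nIndex = 0; nIndex < aPairs.length; nIndex++) {
      asLines.push('ReplInFiles MatchCase Log UseEncoding -2 "' + sDirectory +
                   '" "' + sFileMask +
                   '" "' + aPairs[nIndex][0] +
                   '" "' + aPairs[nIndex][1] + '"');
   }
   return asLines;
}

// The directory is written exactly as in the recorded macro: C:\Temp\\
var asMacroLines = buildReplInFilesLines("C:\\Temp\\\\", "EncodingTest_*.txt", [
   ["Baeume haben Aeste", "B\u00E4ume haben \u00C4ste"],
   ["Omega", "\u2126"],
   ["(c)", "\u00A9"]
]);
```

        The generated lines would still have to be pasted into a macro together with the mode commands at the top of the macro above.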
        
        The script can be used to verify the results of the macro execution by restoring its initial content and commenting out the lines from the for loop closing files to the line after the last UltraEdit.frInFiles.replace.

        The results written by the script for the Replace in Files executed by the macro are:

        Code:

        Results for file: C:\Temp\EncodingTest_ANSI.txt
        Line 2: correct
        Line 3: correct
        Line 4: correct
        Line 5: correct
        
        Results for file: C:\Temp\EncodingTest_UTF-8.txt
        Line 2: correct
        Line 3: correct
        Line 4: correct
        Line 5: correct
        
        Results for file: C:\Temp\EncodingTest_UTF-16.txt
        Line 2: correct
        Line 3: correct
        Line 4: correct
        Line 5: correct
        
        So everything is fine on execution from within a macro or manually. I created the macro by quick recording the four Replace in Files during manual execution and also verified the results after the manual execution.
        Best regards from an UC/UE/UES for Windows user from Austria