Using Perl regular expression replace fails on UTF-8 documents (fixed)

Basic User

    Nov 22, 2009#1

    Hello!

    I've recently tried to dig into UTF-8 format (thanks to this thread, was a great read) and everything seems to work fine, except using RegEx.

    When saving documents I run a script that determines whether the file is one of my own script documents, i.e. whether it contains a particular header string stating when the file was created and last edited.
    The script works fine on ASCII/ANSI documents, but the regex replace fails on UTF-8 documents and I fail to see why.

    The header part looks like this:

    Code: Select all

    // File written by Jochen "Khuri" Höhmann <mailadress>
    // Copyright 2009
    //
    // File        : test.php
    // Begin       : 2009.11.22 13:45:32
    // Last Update : 2009.11.22 13:47:54
    This header (and some possible variations) is added using templates.
    Now when I save a document, the following script is executed.

    Code: Select all

    var cline = UltraEdit.activeDocument.currentLineNum;
    var crow = UltraEdit.activeDocument.currentColumnNum;
    if (typeof(UltraEdit.activeDocumentIdx) == "undefined") crow++;
    UltraEdit.activeDocument.top();
    UltraEdit.activeDocument.findReplace.mode = 0;
    var is_privdoc = UltraEdit.activeDocument.findReplace.find("File written by Jochen \"Khuri\" Höhmann <mailadress>");
    if(is_privdoc == true) {
    	var time = new Date();
    	var currdate = time.getFullYear()+'.'+(((time.getMonth() +1).toString().length > 1) ? (time.getMonth() +1) : '0'+(time.getMonth() +1))+'.'+((time.getDate().toString().length > 1) ? time.getDate() : '0'+time.getDate())+' '+((time.getHours().toString().length > 1) ? time.getHours() : '0'+time.getHours())+':'+((time.getMinutes().toString().length > 1) ? time.getMinutes() : '0'+time.getMinutes())+':'+((time.getSeconds().toString().length > 1) ? time.getSeconds() : '0'+time.getSeconds());
    	UltraEdit.perlReOn();
    	UltraEdit.activeDocument.findReplace.regExp = true;
    	if(UltraEdit.activeDocument.findReplace.find("// Copyright "+time.getFullYear()) == false) {
    		UltraEdit.activeDocument.findReplace.replace("\\/\\/ Copyright (?:\\d{4})","// Copyright "+time.getFullYear());
    	}
    	if(UltraEdit.activeDocument.findReplace.replace("\\/\\/ Last Update : [\n\r]","// Last Update : "+currdate+"\n") == false) {
    		UltraEdit.activeDocument.findReplace.replace("\\/\\/ Last Update : (?:\\d{4}\\.\\d{2}\\.\\d{2} \\d{2}:\\d{2}:\\d{2})","// Last Update : "+currdate);
    	}
    }
    UltraEdit.activeDocument.gotoLine(cline,crow);
    UltraEdit.save();
    Both lines

    Code: Select all

    UltraEdit.activeDocument.findReplace.replace("\\/\\/ Copyright (?:\\d{4})","// Copyright "+time.getFullYear());
    and
    UltraEdit.activeDocument.findReplace.replace("\\/\\/ Last Update : (?:\\d{4}\\.\\d{2}\\.\\d{2} \\d{2}:\\d{2}:\\d{2})","// Last Update : "+currdate);
    fail to work on UTF-8 documents.
    I assume it might have to do with the way digits are handled in UTF-8 files? So far I have not been able to figure out what's wrong here...

    Any help is greatly appreciated. Thanks in advance! :)
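
As an aside, the long timestamp expression in the script above could be written more compactly. The following is only a minimal sketch: `pad2` and `formatTimestamp` are hypothetical helper names, not part of the UltraEdit scripting API, and the code sticks to plain string concatenation on the assumption that the script engine may not support newer JavaScript features.

```javascript
// Hypothetical helper: pad a number to two digits, e.g. 7 -> "07".
function pad2(n) {
  return (n < 10 ? "0" : "") + n;
}

// Hypothetical helper: build "YYYY.MM.DD hh:mm:ss" from a Date,
// the same layout as the "Begin" and "Last Update" header lines.
function formatTimestamp(d) {
  return d.getFullYear() + "." + pad2(d.getMonth() + 1) + "." + pad2(d.getDate()) +
         " " + pad2(d.getHours()) + ":" + pad2(d.getMinutes()) + ":" + pad2(d.getSeconds());
}

// Example with a fixed local date (month index 10 is November).
var currdate = formatTimestamp(new Date(2009, 10, 22, 13, 47, 54));
// currdate is "2009.11.22 13:47:54"
```

In the save script, `formatTimestamp(new Date())` would take the place of the one-line expression that builds `currdate`.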

    Master

      Nov 22, 2009#2

      I don't use the scripting engine of UE, so I'm not sure what's going on.

      Two things to consider, though: First, there is no need to escape the forward slash. It has no special meaning in a regex when the pattern is passed as a JavaScript string; it only needs escaping inside a JavaScript regex literal, where the slash is the delimiter. I don't think that this is breaking your regex, but try that first.

      Second, try [0-9] instead of \d and see if that changes things. If it does, then there might well be something wrong with the regex engine in UE's JavaScript implementation. Bego is the JavaScript expert around here, so I'm curious about what he thinks about this.
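
Both suggestions can be tried outside UltraEdit first. The sketch below uses JavaScript's own `RegExp` object, not UltraEdit's Perl regex engine, so it only illustrates how the string literals are interpreted and says nothing about the behavior of the replace command itself.

```javascript
// Two spellings of the same search, written as JavaScript string literals.
// In a string literal each backslash must be doubled to reach the regex.
var escaped = "\\/\\/ Copyright (?:\\d{4})"; // regex text: \/\/ Copyright (?:\d{4})
var plain   = "// Copyright [0-9]{4}";       // regex text: // Copyright [0-9]{4}

var line = "// Copyright 2008";
var a = new RegExp(escaped).exec(line);
var b = new RegExp(plain).exec(line);
// Both a[0] and b[0] hold the full text "// Copyright 2008".
```

If the two patterns behave differently inside UltraEdit on a UTF-8 file, that would point at the editor's regex engine rather than at the pattern.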

      Grand Master

        Nov 23, 2009#3

        Something strange is going on here which I must analyze further. It looks like the replace command finds the strings, but does not replace them correctly in Unicode files. In the meantime you can use this script.

        Code: Select all

        var cline = UltraEdit.activeDocument.currentLineNum;
        var crow = UltraEdit.activeDocument.currentColumnNum;
        if (typeof(UltraEdit.activeDocumentIdx) == "undefined") crow++;
        UltraEdit.activeDocument.top();
        UltraEdit.perlReOn();
        UltraEdit.activeDocument.findReplace.mode=0;
        UltraEdit.activeDocument.findReplace.matchCase=false;
        UltraEdit.activeDocument.findReplace.matchWord=false;
        UltraEdit.activeDocument.findReplace.regExp=false;
        UltraEdit.activeDocument.findReplace.searchAscii=false;
        UltraEdit.activeDocument.findReplace.searchDown=true;
        UltraEdit.activeDocument.findReplace.searchInColumn=false;
        var is_privdoc = UltraEdit.activeDocument.findReplace.find("File written by Jochen \"Khuri\" Höhmann <mailadress>");
        if(is_privdoc == true) {
           var time = new Date();
           var currdate = time.getFullYear()+'.'+(((time.getMonth() +1).toString().length > 1) ? (time.getMonth() +1) : '0'+(time.getMonth() +1))+'.'+((time.getDate().toString().length > 1) ? time.getDate() : '0'+time.getDate())+' '+((time.getHours().toString().length > 1) ? time.getHours() : '0'+time.getHours())+':'+((time.getMinutes().toString().length > 1) ? time.getMinutes() : '0'+time.getMinutes())+':'+((time.getSeconds().toString().length > 1) ? time.getSeconds() : '0'+time.getSeconds());
           var curryear = currdate.substr(0,4);
           UltraEdit.activeDocument.findReplace.regExp=true;
           UltraEdit.activeDocument.findReplace.preserveCase=false;
           UltraEdit.activeDocument.findReplace.replaceAll=false;
           UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
           if(UltraEdit.activeDocument.findReplace.find("// Copyright "+curryear) == false) {
              UltraEdit.activeDocument.findReplace.find("// Copyright \\d{4}");
              UltraEdit.activeDocument.write("// Copyright "+curryear);
           }
           if(UltraEdit.activeDocument.findReplace.replace("// Last Update : [\\n\\r]","// Last Update : "+currdate+"\n") == false) {
              UltraEdit.activeDocument.findReplace.find("// Last Update : \\d{4}\\.\\d{2}\\.\\d{2} \\d{2}:\\d{2}:\\d{2}");
              UltraEdit.activeDocument.write("// Last Update : "+currdate);
           }
        }
        UltraEdit.activeDocument.gotoLine(cline,crow);
        UltraEdit.save();
        Edit: After analyzing the problem in more detail, I reported it by email and have already received a reply that IDM support could reproduce the issue.
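
The workaround in the script above avoids the replace command: a regex find selects the match, and a following write overwrites the selection. The pattern could be wrapped once, as in this minimal sketch. `replaceViaFindAndWrite` is a hypothetical name, and the function assumes the caller has already switched on `perlReOn()` and `findReplace.regExp` as the script does.

```javascript
// Hypothetical helper: replace the first match of `pattern` by selecting it
// with find() and overwriting the selection with write().
// `doc` is expected to be a document object such as UltraEdit.activeDocument.
function replaceViaFindAndWrite(doc, pattern, text) {
  if (doc.findReplace.find(pattern)) {
    doc.write(text);   // overwrites the text selected by find()
    return true;
  }
  return false;        // nothing found, document left unchanged
}

// Intended use inside an UltraEdit script (not executed here):
//   replaceViaFindAndWrite(UltraEdit.activeDocument,
//                          "// Copyright \\d{4}", "// Copyright " + curryear);
```

The return value mirrors what `find` reports, so the helper can be used in the same `if` conditions as the original replace calls.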

        Basic User

          Nov 25, 2009#4

          Whew, kinda glad it's a bug then and not me. I spent quite some time trying it over and over again, but the normal UE replace function is not working either, as you stated, Mofi. I have to keep that in mind as I use regular expressions quite often. So let's hope IDM fixes this soon.

          Anyhow, thanks for your script replacement Mofi, it works fine :)

          Grand Master

            Jun 02, 2013#5

            While the original script by Khuri still failed to make the replaces when executed with UE v18.20.0.1028, it works with UE v19.00.0.1026 and later versions.

            In UE v19.00 the Perl regular expression engine inside UltraEdit was updated, and it looks like this update fixed many Perl regular expression Find and Replace issues like this one.

            PS: The first public release of UE v19.00 was v19.00.0.1022, which does not yet perform the replace completely correctly. With the next hotfix release, v19.00.0.1026, the original script by Khuri produces the same correct output as my script with the workaround.