Weird behavior with Unix regular expression

Basic User

    Jul 18, 2010 #1

    I usually use Regex Coach, which helps visualize a regex. But sometimes in UEStudio I get different and/or wrong results with the same regex.
    For example, this regex:

    ^( *)([ a-zA-Z$()]+(\)|else)) *{$
    and the code to test it against:

    <?php
    $blah = 0;
    if ($blah) {
    }
    else{
    }
        if ($blah) {
        }
        else{
        }
    ?>
    It finds all the "if" and "else" lines OK, but sometimes it doesn't select the whole line.
    And when I try to use this regex in a replace, the result is even weirder.
    Replacement string:

    \1\2\n\1{
    (the idea is that it should move "{" from the end of the line to a new line and add leading spaces if necessary)
    The expected result:

    <?php
    $blah = 0;
    if ($blah)
    {
    }
    else
    {
    }
      if ($blah)
      {
      }
      else
      {
      }
    ?>
    Actual result:

    <?php
    $blah = 0;
    if ($blah)
    {
    }
    else{
    
    {
    }
     if ($blah)
     {
        }
      else{
     {
        }
    ?>
    Now, when I try to put the entire line in a group:

    ^(( *)([ a-zA-Z$()]+(\)|else))) *{$
    UES doesn't find anything at all.


    EDIT
    When I tested it using the instructions above, I hit a weird bug after using "replace all" and then "undo": it totally destroyed the original code. This has happened to me several times on different occasions, but I blamed it on incorrect handling of Unicode text pasted into a non-Unicode file followed by an undo. This time it is a totally different bug, yet with the same result.

    Grand Master

      Jul 18, 2010 #2

      First, when writing about a possible bug, it is always extremely important to state which version of the application you are using. So which version of UEStudio do you use?

      Second, is your regular expression string executed with the legacy Unix or the Perl compatible regular expression engine? That is not clear at all from what you have written. I suppose that your expression is executed with the legacy Unix regex engine, because the Perl engine would return the error message "You have entered an invalid regular expression!" when you try to execute the replace with the regex string below.

      I see lots of problems when I look at the regular expression string ^( *)([ a-zA-Z$()]+(\)|else)) *{$

      The first problem is ( *) followed by ([ a-zA-Z$()]+(\)|else)). The first expression should match 0 or more spaces and tag that string for use in the replace string. Of course, "0 or more" means the tagged string can also be empty. The second expression starts with a character range definition that also includes the space character, and this string must have at least 1 character. So you have defined an expression with overlapping character ranges, and the first part is allowed to take as few characters (spaces) as possible. Yes, * here can end up matching "as few as" and not "as many as" possible, because the following expression can take the spaces too. Therefore I would not be surprised if the first tagged string is always empty and the second tagged string contains all the leading spaces of a line.
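The overlap can be made visible by printing what each tagged group captures. The following is only a sketch in Python's `re` module, which is a Perl-style backtracking engine and not the UltraEdit legacy engine, so the legacy engine may split the groups differently. The sample lines are modelled on the test code from the first post, `{` is escaped as the Perl engine requires, and `split_groups` is a name made up for this illustration.

```python
import re

# Original search string from the first post, with "{" escaped.
ORIGINAL = re.compile(r"^( *)([ a-zA-Z$()]+(\)|else)) *\{$")

def split_groups(line):
    """Return (group 1, group 2) for a matching line, or None if there is no match."""
    m = ORIGINAL.match(line)
    return (m.group(1), m.group(2)) if m else None

for line in ["    if ($blah) {", "    else{"]:
    # repr() makes the leading spaces inside each group visible.
    print(repr(line), "->", split_groups(line))
```

In this approximation the `if` line keeps all 4 spaces in the first group, but on the `else{` line the second group needs at least one character in front of `else`, so it takes one of the leading spaces away from the first group. A replace with \1 in front of `{` would then indent the brace by only 3 spaces.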

      The second problem is (\)|else). You actually just want an OR expression, but you put this OR expression into tagging parentheses. But there is already a pair of parentheses around that part of the expression to tag the found string. Tag a string inside a tagged string? Sorry, no, that's not a good idea.

      The third problem is the OR expression itself. The expression [ a-zA-Z$()]+ matches the space, every letter in either case (a-z or A-Z alone would be enough when running a replace with option Match Case not checked), the dollar sign and both parentheses. The OR expression \)|else is for a closing parenthesis OR the word else. But all the characters of these two strings are already matched by the preceding character range expression. So what is this OR expression for?

      The fourth problem is once again the expression for 0 or more spaces. The second part of the expression already matches spaces, too.

      For the legacy Unix engine that would be all. But for the Perl engine { is not acceptable here and is the reason why the Perl engine returns that this is an invalid regex string. { has a special meaning and therefore must be escaped with a backslash.

      In other words, the Unix regular expression string ^[ a-zA-Z$()]+{$ matches essentially the same lines as your expression with 4 overlapping expressions. Of course you don't want this expression, because you need the leading spaces and also want to delete the trailing spaces when moving { to the next line.
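A quick way to compare the two search strings is to run both over the sample lines and list what each one matches. This sketch again uses Python's `re` as a stand-in (an assumption, since the legacy Unix engine may behave differently), escapes `{`, and uses sample lines modelled on the first post. The names `ORIGINAL`, `SIMPLIFIED` and `matched` are made up for the illustration.

```python
import re

ORIGINAL = re.compile(r"^( *)([ a-zA-Z$()]+(\)|else)) *\{$")
SIMPLIFIED = re.compile(r"^[ a-zA-Z$()]+\{$")

LINES = [
    "<?php", "$blah = 0;",
    "if ($blah) {", "}", "else{", "}",
    "    if ($blah) {", "    }", "    else{", "    }",
    "?>",
]

def matched(pattern):
    """Return the sample lines that the pattern matches as a whole line."""
    return [line for line in LINES if pattern.match(line)]

# Lines found by the simplified string but not by the original one.
only_simplified = [l for l in matched(SIMPLIFIED) if l not in matched(ORIGINAL)]
print(matched(ORIGINAL))
print(matched(SIMPLIFIED))
print(only_simplified)
```

In this approximation both strings agree on the `if` lines and on the indented `else{`. They differ on an `else{` without any indent: the original string requires at least one character in front of `else`, so only the simplified string matches that line.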

      Well, when you use ^( *)([a-z][ a-z$()]+){$ as search string and \1\2\r\n\1{ as replace string (not case sensitive), you can move every { at the end of a line to the next line. But you additionally have to use Format - Trim Trailing Spaces to delete the remaining trailing spaces.
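Why the extra trim step is needed can be seen in a small sketch. It uses Python's `re` as an approximation of the search string above (with `{` escaped and the case-insensitive flag standing in for Match Case not checked), and a second regex as a stand-in for Format - Trim Trailing Spaces. The function name `move_then_trim` and the `eol` parameter are assumptions made for this example.

```python
import re

# Approximation of the Unix search string; IGNORECASE mimics "Match Case" unchecked.
SEARCH = re.compile(r"^( *)([a-z][ a-z$()]+)\{$", re.IGNORECASE | re.MULTILINE)
# Stand-in for Format - Trim Trailing Spaces.
TRAILING = re.compile(r"[ \t]+$", re.MULTILINE)

def move_then_trim(text, eol="\n"):
    """Move a trailing "{" to its own line, then trim trailing whitespace.

    Returns both the intermediate and the trimmed text so the difference is visible.
    """
    moved = SEARCH.sub(
        lambda m: m.group(1) + m.group(2) + eol + m.group(1) + "{", text
    )
    return moved, TRAILING.sub("", moved)

moved, trimmed = move_then_trim("  if ($blah) {\n  }\n  else{\n  }\n")
print(repr(moved))
print(repr(trimmed))
```

The second tagged string includes the space in front of `{`, so the `if` line keeps a trailing space after the replace until the trim step removes it.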

      Or alternatively, you switch to the more powerful Perl regular expression engine and use ^( *)(\w.*?) *\{$ as search string and again \1\2\r\n\1{ as replace string. That does exactly the replace you want. Why?

      ^ ... start of line.

      ( *) ... 0 or more spaces and tag that string.

      (\w.*?) ... after the spaces a word character must follow. This makes sure that the previous expression matches all leading spaces. After the word character, any character except the newline characters can follow 0 or more times, and that string is tagged. The special trick here is the question mark. It tells the Perl regex engine that the preceding expression should match as few characters as possible (= non greedy) while still fulfilling the next part of the expression. That question mark is the reason why this part of the expression does not match the spaces preceding { at the end of the line. That's a special Perl feature. The legacy Unix engine can't do that.

       *\{$ ... 0 or more spaces preceding { at end of a line.
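The Perl search and replace strings explained above can be tried outside the editor. This is a minimal sketch with Python's `re` module, whose syntax is close to the Perl engine for this expression. It is not UltraEdit itself, the sample text is modelled on the first post with LF-only line endings, and `move_braces` with its `eol` parameter is a name invented for the example.

```python
import re

SAMPLE = (
    "<?php\n"
    "$blah = 0;\n"
    "if ($blah) {\n"
    "}\n"
    "else{\n"
    "}\n"
    "    if ($blah) {\n"
    "    }\n"
    "    else{\n"
    "    }\n"
    "?>\n"
)

# MULTILINE lets ^ and $ match at every line, as in the editor.
BRACE_AT_EOL = re.compile(r"^( *)(\w.*?) *\{$", re.MULTILINE)

def move_braces(text, eol="\n"):
    """Equivalent of replace string \\1\\2<eol>\\1{ : put "{" on its own line."""
    return BRACE_AT_EOL.sub(
        lambda m: m.group(1) + m.group(2) + eol + m.group(1) + "{", text
    )

result = move_braces(SAMPLE)
print(result)
```

Because `.*?` is non greedy, the spaces in front of `{` are consumed by the following ` *` and dropped, so no separate trim step is needed here.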

      Finally, be careful when inserting line terminators with a replace command. The line terminator type in the replace string must match the line terminator type currently used by the opened file. Normally DOS files are edited. Sometimes UNIX or MAC files are also edited with a temporary conversion to DOS while loaded in UltraEdit. For such files \r\n must be used in Unix and Perl regular expression search/replace strings, and ^p in non regex or UltraEdit regular expression strings. Only when a UNIX or MAC file is loaded without temporary conversion of all line terminators in the file from UNIX (LF only) or MAC (CR only) to DOS (CR and LF) is it correct to use just \n or just \r in the Unix or Perl regex strings.
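When in doubt about which terminator a file on disk uses, it can be checked with a few lines of code. The sketch below is an assumption-laden helper, not an UltraEdit feature: `detect_eol` and `replace_string_for` are hypothetical names, and the code only sees the text it is given, so it cannot know whether UltraEdit has converted the file to DOS in memory.

```python
# Hypothetical helpers; nothing here is part of UltraEdit or UEStudio.
ESCAPED = {"\r\n": r"\r\n", "\n": r"\n", "\r": r"\r"}

def detect_eol(text):
    """Return the most frequent terminator: CRLF (DOS), LF (UNIX) or CR (MAC)."""
    # CRLF pairs must not be counted again as a lone CR or a lone LF.
    crlf = text.count("\r\n")
    counts = {
        "\r\n": crlf,
        "\n": text.count("\n") - crlf,
        "\r": text.count("\r") - crlf,
    }
    # On a tie (including text without any line break) the first key, DOS, wins.
    return max(counts, key=counts.get)

def replace_string_for(text):
    """Build the Unix/Perl replace string from the post with the matching terminator."""
    return r"\1\2" + ESCAPED[detect_eol(text)] + r"\1{"

print(replace_string_for("if ($blah) {\r\n}\r\n"))
print(replace_string_for("if ($blah) {\n}\n"))
```

When reading a real file for this check, open it with `open(path, newline="")` so that Python does not translate the line terminators before they are counted.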

      PS: Have you ever looked at the Artistic Style Formatter tool available in the tools toolbar and documented in the help of UEStudio? It could be very helpful for you when reformatting code lines to the coding style you prefer.
      Best regards from an UC/UE/UES for Windows user from Austria

      Basic User

        Jul 18, 2010 #3

        Thank you very much for such a detailed explanation. It sure helped a lot in understanding regex and the differences between the Unix and Perl types.

        Yes, I was using the "Unix" type of regex, not "Perl".
        Using UES v10.10.0.1012 on Windows 7 Ultimate x64.
        Also worth mentioning: I use UNIX type line breaks in all new documents.

        Perhaps the whole line wasn't highlighted because the "legacy" library ("Unix" type) doesn't like overlapping in the expressions. With the Perl engine, the same string with an escaped { worked flawlessly.
        It also appears that the bug that destroys the original text after "replace all" + "undo" can only be reproduced when using the "Unix" type of regex.

        As for the "Artistic Style Formatter" tool: I really didn't know about it. It truly has the potential to save me a lot of time on formatting. I had it on the toolbar, but like the remaining 90% of the icons, it has never been pressed :D

        Thank you again for all your help.

        Grand Master

          Jul 19, 2010 #4

          Depending on the settings at File Handling - DOS/Unix/Mac Handling, even new UNIX files can be DOS files while being edited.

          If configuration setting Default file type for new files is set to Unix and setting Unix/Mac file detection/conversion is set to either Never prompt to convert files to DOS format or Prompt to convert if file is not DOS format, the new Unix file really has only LF as line terminator. But with Automatically convert to DOS format selected and of course Save file as input format (Unix/Mac/DOS) checked the new Unix file has DOS line endings and is just saved on disk with Unix line endings.

          Because I rarely edit really large files (and those are DOS files when I do), I nearly always edit files with a temporary file, so I use the settings Automatically convert to DOS format and Save file as input format (Unix/Mac/DOS). So even my Unix files have DOS line terminators when opened in UltraEdit, and that makes replaces, scripts, macros and data exchange via clipboard with other Windows applications easier. The files are always DOS files when opened in UltraEdit.