Perl regex replace picky about surrounding characters?

    Nov 14, 2007#1

    I was doing some Perl regex replacement tonight and noticed something odd. Perhaps I've missed something here...

    I start with a file containing the following on one line:

    Joe Goodman

    I do a replace using Perl regex:
    Find what: (\w*)\s(\w*)
    Replace with: \2 \1

    The result is Goodman Joe as expected.

    Now I undo, and try this replacement using Perl regex:
    Find what: (\w*)\s(\w*)
    Replace with: "\2" "\1"

    The result is "Goodman" "Joe""" """" "", not expected.

    So I undo and try this replacement using Perl regex:
    Find what: (\w*)\s(\w*)
    Replace with: x\2x x\1x

    The result is xGoodmanx xJoex, again as expected.

    Now I undo and try this one:
    Find what: (\w*)\s(\w*)
    Replace with: (\2) (\1)

    The result is (Goodman) (Joe)() ()() (), not expected. Perhaps the leading and trailing parentheses in each replacement need to be escaped?

    So I undo and try this one:
    Find what: (\w*)\s(\w*)
    Replace with: \(\2\) \(\1\)

    The result is (Goodman) (Joe)() ()() (), not expected.

    What am I doing wrong? I'm currently using UE 13.20+2. Any suggestions are welcome.

    Thanks,
    Tom

      Nov 14, 2007#2

      You do nothing wrong as far as I see. I can confirm the error. Please report it to IDM support (e-mail address at the top of this page).

      In the meantime, switch to the legacy Unix regular expression engine. It will handle your expressions above as expected.

        Nov 15, 2007#3

        jorrasdk, thanks for checking this out. I'll report it as a bug and switch to plain Unix regex as you suggested.

        Tom

          Nov 17, 2007#4

          jorrasdk, the folks at IDM pointed me to the root cause of the issue: in Perl compatible regex, \s doesn't just comprise normal whitespace -- it also includes CR and LF characters. By changing my search from:

          (\w*)\s(\w*)
          (any number of word characters)(Perl whitespace)(any number of word characters)

          to

          (\w+)\s(\w+)
          (one or more word characters)(Perl whitespace)(one or more word characters)

          the Replace All search works as intended. I could have also specified tabs and spaces instead of \s, but either way works fine.
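The difference between the two searches can be sketched outside UltraEdit. This is a minimal Python illustration, assuming the file holds the single line with a CRLF ending; Python's `re` module is not the engine UltraEdit uses, but it treats `\s`, `*` and `+` the same way for this input.

```python
import re

text = "Joe Goodman\r\n"  # assumed: one line with a DOS (CRLF) line ending
replacement = r'"\2" "\1"'

# Zero-or-more: CR and LF each satisfy \s on their own, with both groups empty.
star_result = re.sub(r"(\w*)\s(\w*)", replacement, text)

# One-or-more: a match needs a word character on both sides of the whitespace.
plus_result = re.sub(r"(\w+)\s(\w+)", replacement, text)

print(repr(star_result))  # '"Goodman" "Joe""" """" ""'
print(repr(plus_result))  # '"Goodman" "Joe"\r\n'
```

With `+`, the line ending is never matched, so it survives the replace untouched.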

          Just wanted to post this in case it is helpful to anyone else.

          Tom

            Nov 21, 2007#5

            Curious. I'm running UE v13.20+2 and do not get different results depending on what literal text I use in the replacement. When my replacement text is "x\2x x\1x" I get

            xGoodmanx xJoexxx xxxx xx

            which is consistent, and what one should expect (for a DOS file with CRLF line endings).
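The dependence on line endings can be checked with a short Python sketch. The helper name `swap` and the sample strings are assumptions made for illustration, and Python's `re` stands in for UltraEdit's engine here.

```python
import re

pattern = re.compile(r"(\w*)\s(\w*)")

def swap(text, template):
    """Apply the thread's find expression with the given replacement template."""
    return pattern.sub(template, text)

crlf_line = "Joe Goodman\r\n"  # DOS line ending: two extra \s matches
lf_line = "Joe Goodman\n"      # Unix line ending: one extra \s match

print(repr(swap(crlf_line, r"x\2x x\1x")))  # 'xGoodmanx xJoexxx xxxx xx'
print(repr(swap(lf_line, r"x\2x x\1x")))    # 'xGoodmanx xJoexxx xx'
print(repr(swap(crlf_line, r"\2 \1")))      # 'Goodman Joe  '
```

With the plain `\2 \1` template the extra matches only leave single spaces behind, which may be why that first replacement looked correct on screen.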

              Nov 22, 2007#6

              One thing that seems to be overlooked is that (\w*)\s(\w*) essentially says: match a whitespace character with optional leading and/or trailing \w characters. In other words, it will match a bare space, or a space with any non-word characters on either side of it.
              For example:

              Joe Goodman & @

              Search for (\w*)\s(\w*)
              Replace with "\2" "\1"

              Replace all gives:
              "Goodman" "Joe""" ""&"" ""@

              The first search selects Joe Goodman
              and does the expected replace to give

              "Goodman" "Joe" & @
              The regex next matches " &
              - Note neither of the adjacent characters is a \w but that doesn't matter, they are optional.
              The replace continues through the string until it runs out of spaces.

              Or simply try the string & @ and it will match,
              or even a single space on a line and again it will match.
              However, if you use (\w+)\s(\w+) these bad matches no longer occur because you insist on at least one \w leading and following.
              I think the \s matching \r\n is a bit of a red herring.
              The thing to watch out for is that * as a quantifier means zero or more, so matching nothing at all is perfectly acceptable.
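The example above can be reproduced in a few lines of Python. This is only a sketch: Python's `re` module is used as a stand-in for the editor's Perl regex engine, and the variable names are illustrative.

```python
import re

text = "Joe Goodman & @"
replacement = r'"\2" "\1"'

# With *, every space matches, even when no word characters touch it.
star_result = re.sub(r"(\w*)\s(\w*)", replacement, text)

# With +, only whitespace flanked by word characters matches.
plus_result = re.sub(r"(\w+)\s(\w+)", replacement, text)

# A single space on its own is also a complete match for the * version.
lone_space = re.sub(r"(\w*)\s(\w*)", replacement, " ")

print(star_result)  # "Goodman" "Joe""" ""&"" ""@
print(plus_result)  # "Goodman" "Joe" & @
print(lone_space)   # "" ""
```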

              Cheers,
              Jane

                Nov 22, 2007#7

                "Goodman" "Joe" & @
                The regex next matches " &
                With apologies for pedantry, you mean to say that the next match occurs at the space character between the " and & characters. The characters themselves are not included in the match.

                I agree with you that the real problem is the use of the * (zero-or-more) quantifier, not that \s matches newlines. The behavior of \s just exposed the error in the find expression.
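To see exactly which characters each match covers, the match spans can be listed. This is a Python sketch run against the original line rather than the partially replaced one; Replace All scans the original text, so the positions are the ones that matter.

```python
import re

text = "Joe Goodman & @"

# Record (span, matched text, captured groups) for every match in order.
matches = [
    (m.span(), m.group(0), m.groups())
    for m in re.finditer(r"(\w*)\s(\w*)", text)
]

for span, matched, groups in matches:
    print(span, repr(matched), groups)
# (0, 11) 'Joe Goodman' ('Joe', 'Goodman')
# (11, 12) ' ' ('', '')
# (13, 14) ' ' ('', '')
```

The second and third matches are one character long: only the space itself, with both groups empty. The & and @ are never part of a match.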

                  Nov 22, 2007#8

                  Well spotted (Mr. pedantic :wink: ). Actually I had previewed/submitted the post and went back and had changed
                  The regex next matches to The regex next matches at
                  as well as a couple of other things, but when I went to submit I had timed out (it took me a while to pound that in and make sure all the quotes were right), and my system was hanging so I left it.
                  One further thought on the \s would be to just use [ \t], which is sometimes called "horizontal whitespace" (GNU extension), rather than \s, which means [ \t\r\n\f].
                  I think [ \t] is also equivalent to [[:blank:]] as a character set in the UltraEdit Perl regex implementation.
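A quick Python sketch of the horizontal-whitespace variant follows. It uses the explicit set `[ \t]` because Python's `re` module does not support POSIX bracket classes; the sample strings are assumptions for illustration.

```python
import re

find_blank = r"(\w*)[ \t](\w*)"  # space or tab only, never CR or LF
replacement = r'"\2" "\1"'

# The CRLF line ending is no longer matched, so it is left in place.
simple = re.sub(find_blank, replacement, "Joe Goodman\r\n")

# Bare spaces still match, because the * quantifiers allow empty groups.
mixed = re.sub(find_blank, replacement, "Joe Goodman & @\r\n")

print(repr(simple))  # '"Goodman" "Joe"\r\n'
print(repr(mixed))   # '"Goodman" "Joe""" ""&"" ""@\r\n'
```

So restricting the whitespace class protects the line ending, but the `+` quantifier is still what prevents matches on isolated spaces.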

                  Cheers,
                  Jane