Perl regex replace picky about surrounding characters?

tsmith35 · Nov 14, 2007, 06:44 UTC · #1

I was doing some Perl regex replacement tonight and noticed something odd. Perhaps I've missed something here...

I start with a file containing the following on one line:

Joe Goodman

I do a replace using Perl regex:
Find what: (\w*)\s(\w*)
Replace with: \2 \1

The result is Goodman Joe as expected.

Now I undo, and try this replacement using Perl regex:
Find what: (\w*)\s(\w*)
Replace with: "\2" "\1"

The result is "Goodman" "Joe""" """" "", not expected.

So I undo and try this replacement using Perl regex:
Find what: (\w*)\s(\w*)
Replace with: x\2x x\1x

The result is xGoodmanx xJoex, again as expected.

Now I undo and try this one:
Find what: (\w*)\s(\w*)
Replace with: (\2) (\1)

The result is (Goodman) (Joe)() ()() (), not expected. Perhaps the leading and trailing parentheses for each replacement need to be escaped?

So I undo and try this one:
Find what: (\w*)\s(\w*)
Replace with: \(\2\) \(\1\)

The result is (Goodman) (Joe)() ()() (), not expected.

What am I doing wrong? I'm currently using UE 13.20+2. Any suggestions are welcome.

Thanks,
Tom

jorrasdk · Nov 14, 2007, 12:58 UTC · #2

You do nothing wrong as far as I see. I can confirm the error. Please report it to IDM support (e-mail address at the top of this page).

In the meantime, switch to the legacy Unix regular expression engine. It will handle your expressions above as expected.

tsmith35 · Nov 15, 2007, 06:45 UTC · #3

jorrasdk, thanks for checking this out. I'll report it as a bug and switch to plain Unix regex as you suggested.

Tom

tsmith35 · Nov 17, 2007, 02:58 UTC · #4

jorrasdk, the folks at IDM pointed me to the root cause of the issue: in Perl compatible regex, \s doesn't just comprise normal whitespace -- it also includes CR and LF characters. By changing my search from:

(\w*)\s(\w*)
(any number of word characters)(Perl whitespace)(any number of word characters)

to

(\w+)\s(\w+)
(one or more word characters)(Perl whitespace)(one or more word characters)

the Replace All search works as intended. I could have also specified tabs and spaces instead of \s, but either way works fine.
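
For anyone who wants to check this outside UltraEdit, the difference between the two expressions can be sketched with Python's re module, which handles \w, \s, * and + the same way for this input. It is only an approximation of UltraEdit's Perl engine, and the CRLF line ending on the sample text is an assumption about how the file was saved.

```python
import re

# Assumed sample: the one-line file from the first post, saved with
# DOS (CRLF) line endings.
text = "Joe Goodman\r\n"

# Zero-or-more version: also matches the bare \r and the bare \n.
star = re.sub(r"(\w*)\s(\w*)", r'"\2" "\1"', text)

# One-or-more version: needs a word character on both sides of the whitespace.
plus = re.sub(r"(\w+)\s(\w+)", r'"\2" "\1"', text)

print(repr(star))
print(repr(plus))
```

With the starred version, each line-ending character produces one extra pair of empty quoted groups. With the plus version the line ending is left untouched.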

Just wanted to post this in case it is helpful to anyone else.

Tom

mjcarman · Nov 21, 2007, 23:22 UTC · #5

Curious. I'm running UE v13.20+2 and do not get different results depending on what literal text I use in the replacement. When my replacement text is "x\2x x\1x" I get

xGoodmanx xJoexxx xxxx xx

which is consistent, and what one should expect (for a DOS file with CRLF line endings).
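
A rough way to see the dependence on line endings is to run the same find/replace pair over the sample line with different terminators. The sketch below uses Python's re module as a stand-in for UltraEdit's engine, and the three line-ending variants are hypothetical test inputs, not files from the thread.

```python
import re

# The find/replace pair from the thread.
pattern = re.compile(r"(\w*)\s(\w*)")
replacement = r"x\2x x\1x"

# Hypothetical inputs: the same line with no terminator, LF, and CRLF.
results = {
    "none": pattern.sub(replacement, "Joe Goodman"),
    "LF": pattern.sub(replacement, "Joe Goodman\n"),
    "CRLF": pattern.sub(replacement, "Joe Goodman\r\n"),
}

for ending, out in results.items():
    print(ending, repr(out))
```

Each line-ending character is matched on its own by \s with two empty groups, so it is replaced by "xx xx". The CRLF case reproduces the output quoted above.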

Jane · Nov 22, 2007, 00:50 UTC · #6

One thing that seems to be overlooked is that (\w*)\s(\w*) essentially says: match a space with optional leading and/or trailing \w characters. In other words, it will match a naked space, whatever the characters next to it happen to be, as well as a space with word characters on either side.
For example:

Joe Goodman & @

Search for (\w*)\s(\w*)
Replace with "\2" "\1"

Replace all gives:
"Goodman" "Joe""" ""&"" ""@

The first search selects Joe Goodman
and does the expected replace to give

"Goodman" "Joe" & @
The regex next matches " &
- Note neither of the adjacent characters is a \w but that doesn't matter, they are optional.
The replace continues through the string until it runs out of spaces.

Or try simply the string & @ and it will match,
or even a single space on a line and again it will match.
However, if you use (\w+)\s(\w+) these bad matches no longer occur because you insist on at least one \w leading and following.
I think the \s matching \r\n is a bit of a red herring.
The thing to watch out for is that * as a quantifier means zero or more, so matching nothing at all is OK.
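
The example above can be replayed with a short sketch in Python's re module, used here as an assumed stand-in for UltraEdit's Perl engine. The variable names are only for illustration.

```python
import re

star = re.compile(r"(\w*)\s(\w*)")  # zero or more word characters
plus = re.compile(r"(\w+)\s(\w+)")  # one or more word characters
swap = r'"\2" "\1"'

text = "Joe Goodman & @"
star_result = star.sub(swap, text)
plus_result = plus.sub(swap, text)
print(star_result)
print(plus_result)

# The starred pattern also finds a match in inputs with no word characters.
star_hits = [bool(star.search(s)) for s in ("& @", " ")]
plus_hits = [bool(plus.search(s)) for s in ("& @", " ")]
print(star_hits, plus_hits)
```

There is no line ending in the test string, yet the starred pattern still produces the empty quoted pairs, which supports the point that the quantifier is the real cause.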

Cheers,
Jane

mjcarman · Nov 22, 2007, 01:13 UTC · #7

"Goodman" "Joe" & @
The regex next matches " &

With apologies for pedantry, you mean to say that the next match occurs at the space character between the " and & characters. The characters themselves are not included in the match.

I agree with you that the real problem is the use of the * (zero-or-more) quantifier, not that \s matches newlines. The behavior of \s just exposed the error in the find expression.
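
To make the distinction concrete, here is a small sketch that lists what each match actually covers. It uses Python's re module on Jane's sample string as an approximation of the UltraEdit behavior. The spans are Python's zero-based offsets, not anything UltraEdit reports.

```python
import re

text = "Joe Goodman & @"

# Record the span, the matched text, and the two captured groups per match.
found = [
    (m.span(), m.group(0), m.groups())
    for m in re.finditer(r"(\w*)\s(\w*)", text)
]

for span, matched, groups in found:
    print(span, repr(matched), groups)
```

The second and third matches consist of a single space with two empty groups. The & and @ characters sit next to those matches but are never part of them.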

Jane · Nov 22, 2007, 02:31 UTC · #8

Well spotted (Mr. pedantic). Actually, I had previewed the post and gone back to change
"The regex next matches" to "The regex next matches at"
as well as a couple of other things, but when I went to submit I had timed out (it took me a while to pound that in and make sure all the quotes were right), and my system was hanging, so I left it.
One further thought on the \s would be to just use [ \t], which is sometimes called "horizontal whitespace" (GNU extension), rather than \s, which means [ \t\r\n\f].
I think [ \t] is also equivalent to [[:blank:]] as a character set in the UltraEdit Perl regex implementation.
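
A hedged sketch of that idea, again in Python's re module rather than UltraEdit itself. Python's re has no [[:blank:]] class, so the set is spelled out as [ \t]. The CRLF sample line is an assumption about the original file.

```python
import re

# Horizontal whitespace only: a space or a tab, never \r or \n.
horizontal = re.compile(r"(\w*)[ \t](\w*)")
swap = r'"\2" "\1"'

crlf_result = horizontal.sub(swap, "Joe Goodman\r\n")
symbols_result = horizontal.sub(swap, "Joe Goodman & @")

print(repr(crlf_result))
print(repr(symbols_result))
```

Restricting the class keeps the line ending out of the match, but with * still in place the naked spaces around & and @ are matched just as before. Changing the quantifier to + is what removes those.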

Cheers,
Jane