How to find and remove duplicates in same line?

sitrucz · Dec 18, 2015 · #1 · 2015-12-18T17:04+00:00

I need to replace this line:

MOA***M15*M15*M15~

with:

MOA***M15~

How can I write a generic expression, given that M15 could change? It could be N121*N121*N121*N121 and so forth. The count could also change: the value could be repeated 2, 3, 4, or more times.

Thanks.

Ovg · Dec 18, 2015 · #2 · 2015-12-18T17:40+00:00

Try:

Find what: MOA\*{3}(([A-Z]\d+)\*?){1,}~
Replace with: MOA***\1~

Regular expressions - Perl

sitrucz · Dec 18, 2015 · #3 · 2015-12-18T17:54+00:00

Thanks for looking into it.

It matches every line in my test file below, but the only line that should match is the last one.

MOA***N657~
MOA***M15~
MOA***N657~
MOA***N657*M15~
MOA***N30~
MOA***N30~
MOA***N30~
MOA***M15*M15*M15~

Do you have any other suggestions?

Ovg · Dec 18, 2015 · #4 · 2015-12-18T18:14+00:00

Find what: MOA\*{3}(((M|N)\d+)\**){3,}~

Mofi · Dec 18, 2015 · #5 · 2015-12-18T19:31+00:00

Use as Perl regular expression search string (\*[MN]\d+)(?:\1)+~ and as replace string \1~

Explanation:

(...) ... a capturing group. The string found by this expression is back-referenced twice: once with \1 in the search string and once more with \1 in the replace string.

The Perl regular expression engine supports back-referencing in both the search string and the replace string, while the UltraEdit and Unix regular expression engines support it only in the replace string.

\* ... an asterisk interpreted as a literal character because it is escaped with a backslash.

[MN] ... either M or N specified in a character set; case depends on Match case option.

\d+ ... any digit 1 or more times.

(?:\1)+ ... a non-capturing group whose content must match 1 or more times. Here the content is just the back-reference \1, i.e. the exact string already found by the expression \*[MN]\d+ in the capturing group.

~ ... simply this character.

The non-capturing group is not really needed in this case; it just makes the expression easier to read.
The search string (\*[MN]\d+)\1+~ works, too.
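For readers who want to try the expression outside UltraEdit, here is a minimal sketch using Python's re module, which also accepts back-references in the search pattern. The function name collapse_repeats and the sample list are illustrative and not from the thread, and it is an assumption that Python's engine behaves like UltraEdit's Perl engine for this simple pattern.

```python
import re

# Same search expression as above: a captured "*<M or N><digits>" segment,
# followed by one or more exact repetitions of itself, then "~".
PATTERN = re.compile(r"(\*[MN]\d+)(?:\1)+~")


def collapse_repeats(line: str) -> str:
    """Replace a run of identical segments before "~" with one segment."""
    return PATTERN.sub(r"\1~", line)


# Lines taken from the test file posted earlier in the thread.
SAMPLE = [
    "MOA***N657~",
    "MOA***N657*M15~",
    "MOA***N30~",
    "MOA***M15*M15*M15~",
]

if __name__ == "__main__":
    for line in SAMPLE:
        print(line, "->", collapse_repeats(line))
```

Only the last sample line is changed; the others contain no repeated segment, so the back-reference cannot match and they are left alone.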

Mofi · Dec 18, 2015 · #6 · 2015-12-18T19:46+00:00

Ovg, some hints:

Never use a capturing group within a capturing group. The behavior on back-referencing is undefined in this case. One or more capturing groups that are not nested are okay within a non-capturing group, as are several non-capturing groups, even nested ones, within a capturing group. But don't use a search string with a capturing group within a capturing group.
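To see what can go wrong with a nested, repeated capturing group, the sketch below runs the first expression suggested in this thread through Python's re module. This shows Python's behavior only: there, groups are numbered by their opening parenthesis and a repeated group keeps just its last iteration. Whether UltraEdit's Perl engine behaves identically is an assumption, and the helper name last_iteration is made up for illustration.

```python
import re

# The first expression suggested in the thread: group 2 is nested inside
# group 1, and group 1 is repeated by {1,}.
NESTED = re.compile(r"MOA\*{3}(([A-Z]\d+)\*?){1,}~")


def last_iteration(line: str):
    """Return the contents of groups 1 and 2 after a match, else None."""
    m = NESTED.fullmatch(line)
    return (m.group(1), m.group(2)) if m else None


# In Python both groups hold only the final repetition, so replacing with
# "MOA***\1~" silently drops the first segment of a line without duplicates.
groups = last_iteration("MOA***N657*M15~")
replaced = NESTED.sub(r"MOA***\1~", "MOA***N657*M15~")

if __name__ == "__main__":
    print(groups)
    print(replaced)
```

This is also why that expression matched every line of the test file: nothing in it requires the segments to be identical.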

+ can always be used instead of {1,}, regardless of what stands to the left of the multiplier. Both are greedy multipliers meaning one or more times.

? can always be used instead of {0,1}, meaning 0 or 1 times. ? is often called the optional multiplier because the string or expression to its left can optionally exist (once), but need not exist at all.
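The two equivalences can be checked quickly with Python's re module, used here only as a stand-in for a Perl-style engine. The helper same_matches and the sample strings are hypothetical and exist just for this check.

```python
import re

# A few arbitrary test strings, partly borrowed from the thread.
SAMPLES = ["MOA***M15~", "MOA***M15*M15~", "MOA***~", "abc", "*N30*N30"]


def same_matches(pattern_a: str, pattern_b: str, texts) -> bool:
    """True if both patterns find the same matched strings in every text."""
    for text in texts:
        found_a = [m.group(0) for m in re.finditer(pattern_a, text)]
        found_b = [m.group(0) for m in re.finditer(pattern_b, text)]
        if found_a != found_b:
            return False
    return True


# "+" versus "{1,}" and "?" versus "{0,1}" on the same sub-expressions.
plus_equivalent = same_matches(r"(?:\*[MN]\d+)+", r"(?:\*[MN]\d+){1,}", SAMPLES)
optional_equivalent = same_matches(r"\d+\*?", r"\d+\*{0,1}", SAMPLES)

if __name__ == "__main__":
    print(plus_equivalent, optional_equivalent)
```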

Note: ? directly after an unescaped ( changes the meaning of the group, depending on the next character(s), from a capturing group to a non-capturing group, a positive or negative lookahead or lookbehind, or a pattern modifier.
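The group forms named in the note can be compared side by side. The sketch below uses Python's re module, whose syntax for these constructs follows Perl's; the sample line is made up for illustration and the results shown apply to Python, not necessarily to every engine.

```python
import re

line = "MOA***M15*M15*N30~"

# (...)    capturing group: findall returns only the captured letter.
captured = re.findall(r"\*([MN])\d+", line)

# (?:...)  non-capturing group: findall returns the whole match.
whole = re.findall(r"(?:\*[MN])\d+", line)

# (?=...)  positive lookahead: a segment directly followed by "~".
before_tilde = re.findall(r"[MN]\d+(?=~)", line)

# (?!...)  negative lookahead: a segment not followed by "~" or a digit
#          (the digit is excluded so \d+ cannot backtrack to a shorter match).
not_last = re.findall(r"[MN]\d+(?![~\d])", line)

# (?<=...) positive lookbehind: a segment directly preceded by "*".
after_star = re.findall(r"(?<=\*)[MN]\d+", line)

# (?i)     pattern modifier: case-insensitive matching.
any_case = re.findall(r"(?i)moa", line)

if __name__ == "__main__":
    for result in (captured, whole, before_tilde, not_last, after_star, any_case):
        print(result)
```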

sitrucz · Dec 18, 2015 · #7 · 2015-12-18T20:40+00:00

Mofi wrote: Use as Perl regular expression search string (\*[MN]\d+)(?:\1)+~ and as replace string \1~

Thank you very much, that works. Your explanation will be beneficial to my future "attempts" at finding these complicated streams.

I did not know that back-referencing in the Find string does not work with the Unix engine. Should I just be using Perl regex engine for the most part? It seems to be the most flexible.

Sitrucz

Ovg · Dec 19, 2015 · #8 · 2015-12-19T05:52+00:00

Mofi, thanks for your suggestions!

Mofi · Dec 19, 2015 · #9 · 2015-12-19T11:09+00:00

sitrucz wrote: Should I just be using Perl regex engine for the most part? It seems to be the most flexible.

The Perl regular expression engine is the most powerful. But this also means Perl regex syntax is the most difficult to learn. What to learn first depends on how often you need complex regex finds/replaces and how often simple ones are enough for a task. However, with Perl you can do everything that can be done with the other engines, with one exception: including the selected text or the clipboard text in search/replace strings, which is supported only for non-regex and UltraEdit regex finds/replaces.