Mark lines with duplicates in first 60 columns...

Ghoulardi · Oct 30, 2009, 16:44 UTC · #1

I can achieve this with the new sort "remove duplicates" option up to column 60. But I don't want to sort; I need to leave the file order unchanged.

I have a file with 26,000 lines and an average line length of 180 columns.

I want to mark every line that has a duplicate within the first 60 columns.

Here is an example.

Imagine the letters in my example are within the first 60 columns, and the numbers are column 61 and up.

The duplicate lines in my example are the 3rd and 6th lines, both the same as the 1st line. I want them marked with <<>> or something similar.

before....

aaaa 1111 2222 3333 4444
bbbb 1111 2222 3333 4444
aaaa 1111 2222 3333 4444
cccc 1111 2222 3333 4444
dddd 1111 2222 3333 4444
aaaa 1111 2222 3333 4444

after...

aaaa 1111 2222 3333 4444
bbbb 1111 2222 3333 4444
<<>>aaaa 1111 2222 3333 4444
cccc 1111 2222 3333 4444
dddd 1111 2222 3333 4444
<<>>aaaa 1111 2222 3333 4444

I hope that makes sense.

Is there a way to do this? I've searched the forums for a bit and haven't found the answer yet so sorry if I missed it.

I'm using UE v15.20.0.1020.

Mofi · Oct 31, 2009, 13:54 UTC · #2

This is no problem using a macro. The macro is a simplified version of the one I posted in "How do I remove duplicate lines?"

The macro first inserts the string #MOFI_RULES# at the start of every non-blank line as a marker for the start of the line.

Then a loop is executed. Inside the loop, a Perl regular expression search finds the next line starting with the marker string #MOFI_RULES# followed by up to 60 characters other than line-terminating characters. If no such string can be found between the current cursor position and the bottom of the file, the loop exits.

If a string is found, it is copied to clipboard 9 and the cursor is moved to the start of the next line.

A simple, non-regular-expression Replace All command is then used to insert your marker string <<<>>> at the start of all lines that also start with the marker string #MOFI_RULES# followed by the same up to 60 characters as the line above the cursor. The marker string #MOFI_RULES# prevents this replace command from finding those characters anywhere other than at the start of a line. A regular expression replace can't be used here for two reasons: ^c (clipboard content) is supported only by the UltraEdit regular expression engine, and the clipboard content would itself be interpreted as an UltraEdit regular expression if it contained UltraEdit regular expression characters.

Your marker string <<<>>>, inserted at the start of lines whose first 60 characters duplicate another line, prevents those lines from being taken into account in further searches.

Finally, the marker string #MOFI_RULES# is removed from all lines and the Windows clipboard is activated again.

The macro property "Continue if search string not found" must be checked for this macro.

InsertMode
ColumnModeOff
HexOff
PerlReOn
Bottom
IfColNumGt 1
"
"
EndIf
Top
Find RegExp "^([^\r\n])"
Replace All "#MOFI_RULES#\1"
Clipboard 9
Loop
Find RegExp "^#MOFI_RULES#.{1,60}"
IfNotFound
ExitLoop
EndIf
Copy
Key DOWN ARROW
Key HOME
Find MatchCase "^c"
Replace All "<<<>>>^c"
IfFound
Find MatchCase "<<<>>>#MOFI_RULES#"
Replace All "<<<>>>"
EndIf
EndLoop
Top
Find MatchCase "#MOFI_RULES#"
Replace All ""
ClearClipboard
Clipboard 0

The red marked code (the IfFound ... EndIf block after the first Replace All inside the loop) is only needed if the file also contains lines with fewer than 60 characters and those lines start with the same characters as a longer line located above. Take, for example, a file with the following content:

aaaa 1111 2222 3333 4444
bbbb 1111 2222 3333 4444
aaaa 1111 2222
cccc 1111 2222 3333 4444
dddd 1111 2222 3333 4444
aaaa 1111 2222 3333 4444
aaaa 1111 2222

the macro above without the red marked code would produce

aaaa 1111 2222 3333 4444
bbbb 1111 2222 3333 4444
aaaa 1111 2222
cccc 1111 2222 3333 4444
dddd 1111 2222 3333 4444
<<<>>><<<>>>aaaa 1111 2222 3333 4444
<<<>>>aaaa 1111 2222

As you can see, the 6th line is marked twice as a duplicate, which is not 100% correct. The red marked code removes the marker string #MOFI_RULES# from lines already marked with <<<>>> so that such lines are not marked more than once. If your file does not contain lines with fewer than 60 characters, you can omit the red marked code, which makes the macro faster.
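For readers who want to see the marking logic outside the macro language, here is a minimal Python sketch of the same idea. It is not an UltraEdit macro or script and is not taken from the thread: the function name `mark_later_duplicates` and its parameters are made up for illustration. It also assumes an exact comparison of the first `width` characters, so unlike the macro it does not treat a short line as a duplicate of a longer line that merely starts with the same text.

```python
def mark_later_duplicates(lines, width=60, marker="<<<>>>"):
    """Prefix `marker` to every line whose first `width` characters
    already appeared on an earlier line. Line order is preserved."""
    seen = set()      # keys (first `width` characters) met so far
    result = []
    for line in lines:
        key = line[:width]
        if key and key in seen:
            # A later occurrence of a key seen before: mark it.
            result.append(marker + line)
        else:
            # First occurrence (or a blank line): keep it unchanged.
            seen.add(key)
            result.append(line)
    return result


# In the thread's example the letters stand in for columns 1-60,
# so the demo uses width=4 instead of 60.
sample = [
    "aaaa 1111 2222 3333 4444",
    "bbbb 1111 2222 3333 4444",
    "aaaa 1111 2222 3333 4444",
    "cccc 1111 2222 3333 4444",
    "dddd 1111 2222 3333 4444",
    "aaaa 1111 2222 3333 4444",
]
for line in mark_later_duplicates(sample, width=4):
    print(line)
```

With the sample above, only the 3rd and 6th lines come out prefixed with <<<>>>. Because the keys are kept in a set, each line is looked at once instead of being searched against the rest of the file.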

Ghoulardi · Nov 01, 2009, 00:55 UTC · #3

Mofi, thanks so much. It is exactly what I asked for.

But I made a mistake when asking. I must not have been thinking. I'm really sorry, but I actually need to mark the duplicates along with the original occurrence.

Would it be too much to ask if you could modify it so it will do that?

So, just like you did, but also marking the original...

before....

aaaa 1111 2222 3333 4444
bbbb 1111 2222 3333 4444
aaaa 1111 2222 3333 4444
cccc 1111 2222 3333 4444
dddd 1111 2222 3333 4444
aaaa 1111 2222 3333 4444

after...

<<>>aaaa 1111 2222 3333 4444
bbbb 1111 2222 3333 4444
<<>>aaaa 1111 2222 3333 4444
cccc 1111 2222 3333 4444
dddd 1111 2222 3333 4444
<<>>aaaa 1111 2222 3333 4444

Thanks again MOFI!!

Mofi · Nov 01, 2009, 11:51 UTC · #4

No problem. The 2 green marked lines (Find MatchCase Up "#MOFI_RULES#" and the Replace "<<<>>>" that follows it) must be inserted to also mark the first line for which duplicates are found. The IfFound ... EndIf block as a whole is no longer optional; only the first Find and Replace All pair inside that block still is.

InsertMode
ColumnModeOff
HexOff
PerlReOn
Bottom
IfColNumGt 1
"
"
EndIf
Top
Find RegExp "^([^\r\n])"
Replace All "#MOFI_RULES#\1"
Clipboard 9
Loop
Find RegExp "^#MOFI_RULES#.{1,60}"
IfNotFound
ExitLoop
EndIf
Copy
Key DOWN ARROW
Key HOME
Find MatchCase "^c"
Replace All "<<<>>>^c"
IfFound
Find MatchCase "<<<>>>#MOFI_RULES#"
Replace All "<<<>>>"
Find MatchCase Up "#MOFI_RULES#"
Replace "<<<>>>"
EndIf
EndLoop
Top
Find MatchCase "#MOFI_RULES#"
Replace All ""
ClearClipboard
Clipboard 0
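
The changed requirement (mark the original occurrence as well as its duplicates) can be pictured with a short Python sketch. This is an illustration only, not an UltraEdit macro or script; the name `mark_all_duplicates` and its parameters are hypothetical. It assumes an exact comparison of the first `width` characters and counts the keys in a first pass, then marks in a second pass.

```python
from collections import Counter


def mark_all_duplicates(lines, width=60, marker="<<<>>>"):
    """Prefix `marker` to every non-blank line whose first `width`
    characters occur on more than one line, including the first
    occurrence. Line order is preserved."""
    # Pass 1: count how often each key (first `width` characters) occurs.
    counts = Counter(line[:width] for line in lines if line)
    # Pass 2: mark every line that belongs to a group of two or more.
    return [
        marker + line if line and counts[line[:width]] > 1 else line
        for line in lines
    ]


# The letters stand in for columns 1-60, so the demo uses width=4.
sample = [
    "aaaa 1111 2222 3333 4444",
    "bbbb 1111 2222 3333 4444",
    "aaaa 1111 2222 3333 4444",
    "cccc 1111 2222 3333 4444",
    "dddd 1111 2222 3333 4444",
    "aaaa 1111 2222 3333 4444",
]
for line in mark_all_duplicates(sample, width=4):
    print(line)
```

With this sample the 1st, 3rd and 6th lines are marked and the others are left alone.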

Ghoulardi · Nov 02, 2009, 15:49 UTC · #5

Perfect!

Thanks so much for your time again Mofi.

You're one of the best parts of UltraEdit.

scoobman3 · Oct 18, 2010, 19:37 UTC · #6

This macro works exactly as advertised in UltraEdit 15 and 16, but I can't get it to work in UltraEdit 12.10. Is there something different that needs to be done in that version, or is it "user error" that's making it not run? I see that it adds the marker string and then deletes it, but it does not mark any duplicates.

TIA,
scoobman3

Mofi · Oct 19, 2010, 05:30 UTC · #7

The problem with this macro in UE v12.10 is a bug in the Perl regular expression engine, which was introduced with UE v12.00 and is used in this macro. Because of that bug, the macro as posted above requires at least UE v13.00. The difficulty with switching to another engine is that the expression {1,60} is not available in the UltraEdit or Unix regular expression engine, so every line must have at least 60 characters. However, that can be guaranteed with additional macro code, and then the UltraEdit regular expression engine can be used. Here is the macro code which worked with UE v12.10 for the example. The blue colored parts are the changes made in comparison to the macro above to get the same result using the UltraEdit regular expression engine: the UnixReOff line, the padding code between the first Top and the first Find, the search and replace strings rewritten in UltraEdit syntax, and the cleanup code from the last regular expression Find through TrimTrailingSpaces.

InsertMode
ColumnModeOff
HexOff
UnixReOff
Bottom
IfColNumGt 1
"
"
EndIf
Top
Key END
Loop
IfColNumGt 60
ExitLoop
Else
" "
EndIf
EndLoop
ColumnModeOn
ColumnInsert " "
ColumnModeOff
Top
Find RegExp "%^([~^p]^)"
Replace All "#MOFI_RULES#^1"
Clipboard 9
Loop
Find RegExp "%#MOFI_RULES#????????????????????????????????????????????????????????????"
IfNotFound
ExitLoop
EndIf
Copy
Key DOWN ARROW
Key HOME
Find MatchCase "^c"
Replace All "<<<>>>^c"
IfFound
Find MatchCase "<<<>>>#MOFI_RULES#"
Replace All "<<<>>>"
Find MatchCase Up "#MOFI_RULES#"
Replace "<<<>>>"
EndIf
EndLoop
Top
Find MatchCase "#MOFI_RULES#"
Replace All ""
Find MatchCase RegExp "%<<<>>>^(*^)$"
Replace All "^1#!#"
Key END
Key LEFT ARROW
IfCharIs "#"
Key LEFT ARROW
Key LEFT ARROW
Key LEFT ARROW
EndIf
ColumnModeOn
ColumnDelete 1
ColumnModeOff
Top
Find MatchCase RegExp "%^(*^)#!#$"
Replace All "<<<>>>^1"
TrimTrailingSpaces
ClearClipboard
Clipboard 0

WelshFargo · Jan 06, 2012, 23:27 UTC · #8

This is a useful macro, but a bit slow on big files. Has anyone written the same thing as a script?

Mofi · Jan 07, 2012, 10:30 UTC · #9

A script is much faster when everything can be done in RAM instead of in the active file. So if a file is only some 100 KB or a few MB in size, a script can do this faster in RAM without any problem. But for a large file of several 100 MB or even GB, a script would have a problem too, because loading all lines into RAM could fail. That said, a script that works like this macro, run while the maximized window of a new file is active, would also be faster than the macro, because UltraEdit would not need to refresh the display all the time during execution.

Marking lines with duplicates in the first 60 columns could be done much faster if the lines were sorted alphabetically. The macro as written here does not require sorted lines and therefore compares every line against all lines below it in the file. That can take a long time with hundreds of thousands of lines. If the lines are sorted, a single Perl regular expression Replace could do the job too, "comparing" the first 60 characters of a line only against the first 60 characters of the line below (and of the line after that, if two lines start with the same 60 characters). In other words, with a sorted file the number of "compares" could be reduced dramatically.
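
The sorting idea can be tried without reordering the file itself. The following Python sketch is only an illustration under stated assumptions: it is not an UltraEdit script, the name `mark_duplicates_via_sort` and its parameters are invented here, the comparison of the first `width` characters is exact, and it marks every member of a duplicate group including the first. It sorts line numbers by key, so equal keys end up next to each other and each key is only compared with its neighbours, while the lines are written back in their original order.

```python
from itertools import groupby


def mark_duplicates_via_sort(lines, width=60, marker="<<<>>>"):
    """Mark every non-blank line whose first `width` characters are
    shared with another line. Only the line indexes are sorted, so
    the returned list keeps the original line order."""
    def key_of(index):
        return lines[index][:width]

    # Sort the indexes by key; lines with equal keys become adjacent.
    order = sorted(range(len(lines)), key=key_of)
    marked = list(lines)
    # groupby only compares neighbouring entries of the sorted order.
    for key, group in groupby(order, key=key_of):
        indexes = list(group)
        if key and len(indexes) > 1:
            for i in indexes:
                marked[i] = marker + lines[i]
    return marked


# The letters stand in for columns 1-60, so the demo uses width=4.
sample = [
    "aaaa 1111 2222 3333 4444",
    "bbbb 1111 2222 3333 4444",
    "aaaa 1111 2222 3333 4444",
    "cccc 1111 2222 3333 4444",
]
for line in mark_duplicates_via_sort(sample, width=4):
    print(line)
```

Like any in-RAM approach, this still needs all lines in memory at once, so the caveat about files of several 100 MB applies to it as well.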