No regular expression find can do that on a large file in any application.
For this task I first wrote the following UltraEdit script and tested it with UltraEdit v16.30 on the small example:
if (UltraEdit.document.length > 0)  // Is any file opened?
{
   // Define environment for this script.
   UltraEdit.insertMode();
   if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
   else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
   // Move caret to bottom of the active file and make sure the last line
   // has also a line ending. Next get line number which is equal the total
   // number of lines in active file. Then move caret to top of the file.
   UltraEdit.activeDocument.bottom();
   if (UltraEdit.activeDocument.isColNumGt(1))
   {
      UltraEdit.activeDocument.insertLine();
      if (UltraEdit.activeDocument.isColNumGt(1))
      {
         UltraEdit.activeDocument.deleteToStartOfLine();
      }
   }
   var nTotalLines = UltraEdit.activeDocument.currentLineNum;
   UltraEdit.activeDocument.top();
   // Has the active file at least two lines?
   if (nTotalLines > 2)
   {
      // Select user clipboard 9 as active clipboard.
      UltraEdit.selectClipboard(9);
      // Get document index of active file.
      var nDocIndex = UltraEdit.activeDocumentIdx;
      // Get name of file with full path and append an opening parenthesis.
      var sFileName = UltraEdit.activeDocument.path + '(';
      // There must be executed lots of finds which would result in lots
      // of document window refreshes which would take a lot of time.
      // For that reason create a new file which becomes the active file
      // and which is hopefully displayed maximized making it not necessary
      // for UltraEdit running the finds in previously active file to
      // refresh document window area after each successful find.
      UltraEdit.newFile();
      // Create an empty array with size equal total number of lines.
      var abLineOutput = new Array(nTotalLines);
      // Define the parameters for the finds used below.
      UltraEdit.ueReOn();
      UltraEdit.document[nDocIndex].findReplace.mode=0;
      UltraEdit.document[nDocIndex].findReplace.matchCase=true;
      UltraEdit.document[nDocIndex].findReplace.matchWord=false;
      UltraEdit.document[nDocIndex].findReplace.regExp=false;
      UltraEdit.document[nDocIndex].findReplace.searchDown=true;
      UltraEdit.document[nDocIndex].findReplace.searchInColumn=false;
      // Prepare the active output window for the find results.
      UltraEdit.outputWindow.clear();
      UltraEdit.outputWindow.showStatus=false;
      UltraEdit.outputWindow.showWindow(false);
      // It is necessary to copy to user clipboard 9 each line and search
      // for that line in all lines below active line in the file to find
      // duplicates of this line. It is not possible to use a regular
      // expression find to make sure that a line is really 100% identical
      // from beginning to end of current line and does not contain by
      // chance the same string as the current line because of the current
      // line could contain also characters which could have a regular
      // expression meaning. Lines already found once and written to the
      // output window must be also ignored to avoid producing duplicates
      // in output window.
      var nDuplicateLines = 0;
      var nTotalDuplicates = 0;
      for (var nLineNumber = 1; nLineNumber < nTotalLines; nLineNumber++)
      {
         // Is this line already written to output window?
         if (abLineOutput[nLineNumber] != null) continue;
         // Go to next line to compare against all other lines below
         // in initial active file and move caret to end of this line.
         UltraEdit.document[nDocIndex].gotoLine(nLineNumber,1);
         UltraEdit.document[nDocIndex].key("END");
         // Get column number which is equal length of line.
         var nLineLength = UltraEdit.document[nDocIndex].currentColumnNum;
         // Is caret still at first column, the line is empty.
         if (UltraEdit.document[nDocIndex].isColNum(1)) continue;
         // Select the line and copy it to active user clipboard 9.
         UltraEdit.document[nDocIndex].selectLine();
         UltraEdit.document[nDocIndex].copy();
         var sSearchedLine = "";
         while (UltraEdit.document[nDocIndex].findReplace.find("^c"))
         {
            // Move caret to end of found line.
            UltraEdit.document[nDocIndex].key("UP ARROW");
            UltraEdit.document[nDocIndex].key("END");
            // Has the found line not the same length as the search line?
            if (UltraEdit.document[nDocIndex].currentColumnNum != nLineLength) continue;
            // A real duplicate line was found to write to output window.
            // Is the searched line with its line number already written
            // to the output window?
            if (!sSearchedLine.length)
            {
               // Insert an empty line before a new series of duplicate
               // lines except on first series of duplicate lines.
               if (nDuplicateLines) UltraEdit.outputWindow.write("");
               // Remove all line ending characters from searched line
               // in active user clipboard 9 on concatenating it with
               // the fixed string written into the string variable.
               sSearchedLine = "): " + UltraEdit.clipboardContent.replace(/[\r\n]+/,"");
               // Output file name with path, line number in round brackets
               // and after a colon and a space the search line itself.
               UltraEdit.outputWindow.write(sFileName + nLineNumber + sSearchedLine);
               nDuplicateLines++;
               nTotalDuplicates++;
            }
            // Output the found duplicate of searched line in same format
            // as the searched line to the output window and mark this
            // line in array of lines already output as output.
            UltraEdit.outputWindow.write(sFileName + UltraEdit.document[nDocIndex].currentLineNum + sSearchedLine);
            abLineOutput[UltraEdit.document[nDocIndex].currentLineNum] = true;
            nTotalDuplicates++;
         }
      }
      // Clear user clipboard 9 and select clipboard of operating system.
      UltraEdit.clearClipboard(9);
      UltraEdit.selectClipboard(0);
      // Output a summary information at bottom of output window.
      UltraEdit.outputWindow.showWindow(true);
      var sLinesPluralS = (nDuplicateLines != 1) ? "s" : "";
      UltraEdit.outputWindow.write("");
      UltraEdit.outputWindow.write("Found " + nDuplicateLines + " line" + sLinesPluralS +
                                   " existing more than once with in total " +
                                   nTotalDuplicates + " duplicate lines.");
      // Move caret to top of the file and close new file without saving it.
      UltraEdit.document[nDocIndex].top();
      UltraEdit.closeFile(UltraEdit.activeDocument.path,2);
   }
}
This script does not modify the active file, except when the last line of the file has no line ending; in that case a line ending is appended.
The output window displayed the following for this small test file:
C:\Temp\SmallTestFile.tmp(3): Ant
C:\Temp\SmallTestFile.tmp(12): Ant
C:\Temp\SmallTestFile.tmp(6): Mosquito
C:\Temp\SmallTestFile.tmp(13): Mosquito
C:\Temp\SmallTestFile.tmp(9): Butterfly
C:\Temp\SmallTestFile.tmp(14): Butterfly
Found 3 lines existing more than once with in total 6 duplicate lines.
It was now possible to use Ctrl+Shift+Down Arrow and Ctrl+Shift+Up Arrow from within the document window to move the caret to the next/previous line listed in the output window.
Next I created a large file containing the few lines from the upper block of the example at the top, many other lines without any duplicates (made unique by inserting an incrementing number at the beginning of each line), and the three duplicate lines at the bottom. The file had 194,911 lines and a size of 11,217,012 bytes. I started the script and canceled it after 10 minutes, on seeing in the status bar that only line 1238 had been reached.
It was clear to me from the beginning that searching for non-consecutive duplicate lines by taking each line and searching all lines below it for a real duplicate can take many minutes, because lots of bytes must be compared again and again. But the very slow progress was definitely caused by the many caret movements in the file in the background, which additionally cause lots of status bar refreshes. Window refreshes while processing a lot of data always make such processes extremely slow.
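To put rough numbers on "again and again", here is a back-of-the-envelope sketch in plain JavaScript, using the line count and file size of the large test file. It is only an estimate under assumptions: every line triggers one find that scans on average half of the file, and skipped empty lines as well as all refresh overhead are ignored. The variable names are made up for illustration.

```javascript
// Rough upper bound for the line-by-line approach (an estimate, not a measurement).
const totalLines = 194911;       // line count of the large test file
const fileSizeBytes = 11217012;  // size of the large test file in bytes

// Every line is compared against all lines below it.
const linePairs = totalLines * (totalLines - 1) / 2;

// Assumption: each find scans from the current line to the end of the file,
// which averages out to half of the file per line.
const bytesScanned = totalLines * fileSizeBytes / 2;

console.log("Line pairs to compare: " + linePairs);
console.log("Bytes scanned (approx.): " + bytesScanned);
```

That is about 19 billion line pairs and roughly a terabyte of scanned bytes, before counting any caret movement or status bar refresh.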
So I decided to change the approach a little by inserting a marker string at the beginning of each non-empty line. This makes it possible to search for entire lines without finding lines that just happen to end with the same string as an entire line, on the assumption that the marker string is not present anywhere in the file.
if (UltraEdit.document.length > 0)  // Is any file opened?
{
   // Define environment for this script.
   UltraEdit.insertMode();
   if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
   else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
   // Move caret to bottom of the active file and make sure the last line
   // has also a line ending. Next get line number which is equal the total
   // number of lines in active file. Then move caret to top of the file.
   UltraEdit.activeDocument.bottom();
   if (UltraEdit.activeDocument.isColNumGt(1))
   {
      UltraEdit.activeDocument.insertLine();
      if (UltraEdit.activeDocument.isColNumGt(1))
      {
         UltraEdit.activeDocument.deleteToStartOfLine();
      }
   }
   var nTotalLines = UltraEdit.activeDocument.currentLineNum;
   UltraEdit.activeDocument.top();
   // Has the active file at least two lines?
   if (nTotalLines > 2)
   {
      // Select user clipboard 9 as active clipboard.
      UltraEdit.selectClipboard(9);
      // Get document index of active file.
      var nDocIndex = UltraEdit.activeDocumentIdx;
      // Get name of file with full path and append an opening parenthesis.
      var sFileName = UltraEdit.activeDocument.path + '(';
      // There must be executed lots of finds which would result in lots
      // of document window refreshes which would take a lot of time.
      // For that reason create a new file which becomes the active file
      // and which is hopefully displayed maximized making it not necessary
      // for UltraEdit running the finds in previously active file to
      // refresh document window area after each successful find.
      UltraEdit.newFile();
      // Create an empty array with size equal total number of lines.
      var abLineOutput = new Array(nTotalLines);
      // Define the parameters for a replace inserting a marker string
      // at beginning of all non empty lines and run the replace all.
      UltraEdit.ueReOn();
      UltraEdit.document[nDocIndex].findReplace.mode=0;
      UltraEdit.document[nDocIndex].findReplace.matchCase=true;
      UltraEdit.document[nDocIndex].findReplace.matchWord=false;
      UltraEdit.document[nDocIndex].findReplace.regExp=true;
      UltraEdit.document[nDocIndex].findReplace.searchDown=true;
      UltraEdit.document[nDocIndex].findReplace.searchInColumn=false;
      UltraEdit.document[nDocIndex].findReplace.preserveCase=false;
      UltraEdit.document[nDocIndex].findReplace.replaceAll=true;
      UltraEdit.document[nDocIndex].findReplace.replaceInAllOpen=false;
      UltraEdit.document[nDocIndex].findReplace.replace("%^(?^)","#!#^1");
      // Define the parameters for the finds used below.
      UltraEdit.document[nDocIndex].findReplace.regExp=false;
      // Prepare the active output window for the find results.
      UltraEdit.outputWindow.clear();
      UltraEdit.outputWindow.showStatus=false;
      UltraEdit.outputWindow.showWindow(false);
      // It is necessary to copy to user clipboard 9 each line and search
      // for that line in all lines below active line in the file to find
      // duplicates of this line. It is not possible to use a regular
      // expression find to make sure that a line is really 100% identical
      // from beginning to end of current line and does not contain by
      // chance the same string as the current line because of the current
      // line could contain also characters which could have a regular
      // expression meaning. Lines already found once and written to the
      // output window must be also ignored to avoid producing duplicates
      // in output window. For that reason the marker string was inserted
      // at beginning of each non empty line and this marker string is
      // used as beginning of line anchor. It hopefully does not exist
      // anywhere else in a line.
      var nDuplicateLines = 0;
      var nTotalDuplicates = 0;
      for (var nLineNumber = 1; nLineNumber < nTotalLines; nLineNumber++)
      {
         // Is this line already written to output window?
         if (abLineOutput[nLineNumber] != null) continue;
         // Go to next line to compare against all other lines below
         // in initial active file.
         UltraEdit.document[nDocIndex].gotoLine(nLineNumber,1);
         // Is first character on this line not # from marker string #!#
         // then this is an empty line which must be ignored for searching.
         if (!UltraEdit.document[nDocIndex].isChar("#")) continue;
         // Select the line and copy it to active user clipboard 9.
         UltraEdit.document[nDocIndex].selectLine();
         UltraEdit.document[nDocIndex].copy();
         var sSearchedLine = "";
         while (UltraEdit.document[nDocIndex].findReplace.find("^c"))
         {
            // A real duplicate line was found to write to output window.
            // Is the searched line with its line number already written
            // to the output window?
            if (!sSearchedLine.length)
            {
               // Insert an empty line before a new series of duplicate
               // lines except on first series of duplicate lines.
               if (nDuplicateLines) UltraEdit.outputWindow.write("");
               // Remove all line ending characters from searched line
               // in active user clipboard 9 on concatenating it with
               // the fixed string written into the string variable.
               sSearchedLine = "): " + UltraEdit.clipboardContent.substr(3).replace(/[\r\n]+/,"");
               // Output file name with path, line number in round brackets
               // and after a colon and a space the search line itself.
               UltraEdit.outputWindow.write(sFileName + nLineNumber + sSearchedLine);
               nDuplicateLines++;
               nTotalDuplicates++;
            }
            // Output the found duplicate of searched line in same format
            // as the searched line to the output window and mark this
            // line in array of lines already output as output.
            UltraEdit.outputWindow.write(sFileName + UltraEdit.document[nDocIndex].currentLineNum + sSearchedLine);
            abLineOutput[UltraEdit.document[nDocIndex].currentLineNum] = true;
            nTotalDuplicates++;
         }
      }
      // Clear user clipboard 9 and select clipboard of operating system.
      UltraEdit.clearClipboard(9);
      UltraEdit.selectClipboard(0);
      // Output a summary information at bottom of output window.
      UltraEdit.outputWindow.showWindow(true);
      var sLinesPluralS = (nDuplicateLines != 1) ? "s" : "";
      UltraEdit.outputWindow.write("");
      UltraEdit.outputWindow.write("Found " + nDuplicateLines + " line" + sLinesPluralS +
                                   " existing more than once with in total " +
                                   nTotalDuplicates + " duplicate lines.");
      // Move caret to top of the file and remove the marker strings.
      UltraEdit.document[nDocIndex].top();
      UltraEdit.document[nDocIndex].findReplace.regExp=true;
      UltraEdit.document[nDocIndex].findReplace.replace("%#!#","");
      // Close new file without saving it.
      UltraEdit.closeFile(UltraEdit.activeDocument.path,2);
   }
}
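Why the marker string helps can be illustrated outside UltraEdit with a few lines of plain JavaScript. This is only a sketch under assumptions: the sample lines are made up and are not my test file, isLineStart is a hypothetical helper, and indexOf stands in for UltraEdit's case-sensitive non-regex find.

```javascript
// Hypothetical sample lines, not the real test file.
const sampleLines = ["Giant Ant", "", "Ant"];
const plainText = sampleLines.join("\r\n") + "\r\n";

// Counterpart of the replace "%^(?^)" -> "#!#^1": prefix only non-empty lines.
const markedText = sampleLines.map(function (line) {
  return line.length > 0 ? "#!#" + line : line;
}).join("\r\n") + "\r\n";

// Hypothetical helper: does the match begin at the start of a line?
function isLineStart(text, index) {
  return index === 0 || text.charAt(index - 1) === "\n";
}

// Searching for the complete line "Ant" including its line ending.
const plainHit = plainText.indexOf("Ant\r\n");
const markedHit = markedText.indexOf("#!#Ant\r\n");

console.log(isLineStart(plainText, plainHit));   // false: hit is the tail of "Giant Ant"
console.log(isLineStart(markedText, markedHit)); // true: hit is the real line "Ant"
```

Without the marker the first hit is a false positive that needs the extra length check of the first script; with the marker the first hit is already a whole line, as long as #!# does not occur anywhere else in the file.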
I first tested this second script on the small example file, and it produced the same output in the output window as the first script. So I ran it on the large file and watched the status bar to see the progress. It was faster, so I decided to let it run. But after nearly two hours I canceled the script again, as it had processed only the first 39,835 lines by then.
A text editor like UltraEdit even with scripting support is the wrong application for this task.
For me, working as a C/C++ programmer in my main job, it would be a trivial task to write a small C++ executable which opens the file and reads one line after another in a loop. For each line it would calculate a hash value and look it up in a hash table, which is a very fast way to find out whether the current line duplicates a previous line. A duplicate line would be output; for a line that is unique so far, its hash value would be added to the hash table before continuing with the next line, until all lines of the file have been processed. I am quite sure the code for such an executable, written in C++ using a library for hash calculation and hash table lookup, would be no longer than about 25-75 lines. And the executable would definitely process the large test file in just a few seconds, producing the same output as the scripts above. The output could be captured by UltraEdit into the output window for use in UltraEdit with Ctrl+Shift+Up/Down Arrow. But this is an UltraEdit user-to-user forum and not a C++ coding forum, so I won't write the C++ code for you and post it here. You would also need the compiler and the library I would have used to compile the code to an executable.
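Just to illustrate the hash table idea, here is a minimal sketch in JavaScript (the language of the scripts above) rather than C++. It is not the executable described above and it does not use the UltraEdit scripting API; it is meant to run as plain JavaScript, for example with Node.js. The names findDuplicateLines and formatReport, the sample text and the path C:\Temp\Sample.tmp are all made up. Instead of a separate hash library it uses the built-in Map keyed by the full line text, so hashing happens inside the JavaScript engine and lines are compared exactly.

```javascript
// Sketch only: findDuplicateLines and formatReport are hypothetical helpers,
// not part of the UltraEdit scripting API.

// Returns one entry per line text that occurs more than once, ordered by
// first occurrence. Empty lines are ignored, as in the scripts above.
function findDuplicateLines(text) {
  const lines = text.split(/\r\n|\r|\n/);
  const seen = new Map();  // line text -> array of 1-based line numbers
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    if (line.length === 0) continue;
    const numbers = seen.get(line);
    if (numbers) numbers.push(i + 1);   // seen before: remember this line too
    else seen.set(line, [i + 1]);       // first occurrence of this line
  }
  const groups = [];
  for (const [line, numbers] of seen) {
    if (numbers.length > 1) groups.push({ line: line, numbers: numbers });
  }
  return groups;
}

// Formats the groups like the output window lines: path(line number): text
function formatReport(fileName, groups) {
  const out = [];
  for (const group of groups) {
    for (const n of group.numbers) {
      out.push(fileName + "(" + n + "): " + group.line);
    }
  }
  return out;
}

// Made-up sample text with two duplicated lines.
const sampleText = ["Bee", "Wasp", "Ant", "Fly", "Ant", "Bee", ""].join("\r\n");
const report = formatReport("C:\\Temp\\Sample.tmp", findDuplicateLines(sampleText));
console.log(report.join("\n"));
```

Each line is looked up once instead of being searched for in the rest of the file, so the work grows roughly linearly with the number of lines. The price is that all distinct lines are held in memory at once.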