How does Content Search on finding duplicates work?

Sigrich · Post #1 · Apr 16, 2017, 01:40 UTC

I may have missed this in the documentation but I would like to understand how files are compared to each other when doing only a content search (no other options checked).
I assume you are doing some sort of checksum compare, but I guess it could also be binary compare.
Please let us know.

The reason I ask is that I want to be very clear on how duplicates are identified for things like photos, which often have exactly the same filename but different content.

Mofi · Post #2 · Jun 11, 2017, 17:10 UTC

This is a user-to-user forum and therefore not really the right place to ask such a question.

But I could find out with a black box test how finding duplicate files works with just Content checked and all other options unchecked.

I created two directories: C:\Temp\Test1 and C:\Temp\Test2.
I copied 5 files with different file contents (some text, some binary) from another directory into C:\Temp\Test1.
I created two copies of each of the first 3 files, also in C:\Temp\Test1, with different file names (appending an underscore and an incrementing number).
I copied all files from C:\Temp\Test1 to C:\Temp\Test2.
I deleted file 5 in C:\Temp\Test1 to make file 5 in C:\Temp\Test2 unique.
I deleted file 4 in C:\Temp\Test2 to make file 4 in C:\Temp\Test1 unique.
In C:\Temp\Test2 I modified the first file and its two copies by changing the same byte in all 3 files, then set the last modification date of the 3 modified files back to the original last modification date.
In C:\Temp\Test2 I modified the second file and its two copies by changing exactly 1 byte in each of those 3 files, but a different byte in each file, then set the last modification date of the 3 modified files back to the original last modification date.
In C:\Temp\Test2 I modified the third file, a text file, by converting it from DOS to UNIX line endings, i.e. removing all carriage returns, then set the last modification date of the modified file back to the original last modification date.
In C:\Temp\Test2 I changed the last modification date of the second copy of the third file.
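
For anyone who wants to repeat the "change one byte but keep the date" step, here is a minimal sketch in Python. It is not how the test above was actually prepared (the tools used are not stated); the function name flip_byte_keep_mtime and the choice to invert the byte are assumptions made only for illustration.

```python
import os


def flip_byte_keep_mtime(path: str, offset: int) -> None:
    """Invert one byte of a file in place, then restore its timestamps.

    Hypothetical helper: the file size stays the same, the content changes
    by exactly one byte, and the last modification date looks untouched.
    """
    before = os.stat(path)  # remember timestamps before touching the file
    with open(path, "r+b") as f:
        f.seek(offset)
        original = f.read(1)
        if not original:
            raise ValueError("offset is beyond the end of the file")
        f.seek(offset)
        # XOR with 0xFF always yields a byte different from the original.
        f.write(bytes([original[0] ^ 0xFF]))
    # Put access and modification times back (nanosecond precision).
    os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
```

A file changed this way cannot be told apart from the original by name, size or date, so only a real content comparison can detect the difference.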

What I expected on running UltraFinder to find duplicates based on content was a list of all files from both directories having the same file contents, independent of file name and file date.

From a programmer's point of view, this means UltraFinder has to search both directories for files, get their names and file sizes, and first build groups of files with the same file size.

After this step, C:\Temp\Test1\File4 and C:\Temp\Test2\File5, whose file sizes differ from those of all other files in both directories, must already have been filtered out, because their file sizes alone identify them as unique. And indeed those two files were not in the list displayed by UltraFinder.

C:\Temp\Test2\File3.txt was also filtered out because, after conversion of the line endings from DOS to UNIX, its file size differs from the file sizes of all other files in both directories.
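
The size-grouping step can be sketched roughly as follows in Python. This only illustrates the idea and says nothing about UltraFinder's real code; the function name group_by_size is an assumption.

```python
import os
from collections import defaultdict


def group_by_size(roots):
    """Return {file size: [paths]} for sizes shared by at least 2 files.

    Hypothetical sketch: files whose size is unique can never have a
    duplicate, so they are dropped before any content is read.
    """
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                by_size[os.path.getsize(path)].append(path)
    # Keep only the size groups that contain more than one file.
    return {size: sorted(paths)
            for size, paths in by_size.items() if len(paths) > 1}
```

Called as group_by_size([r"C:\Temp\Test1", r"C:\Temp\Test2"]), it would return only the candidate groups. Files with a unique size never appear in the result.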

UltraFinder now has to process 3 file size groups containing more than 1 file:

File size group with the file names C:\Temp\Test1\File1* and C:\Temp\Test2\File1*, 6 files in total.
File size group with the file names C:\Temp\Test1\File2* and C:\Temp\Test2\File2*, 6 files in total.
File size group with the file names C:\Temp\Test1\File3*.txt and C:\Temp\Test2\File3_*.txt, 5 files in total.
The file C:\Temp\Test2\File3.txt has a unique file size after conversion from DOS to UNIX and is therefore not in this file size group.

Next, a binary comparison of the file contents must be run for each file size group to find out whether all files in the group really are duplicates.

The files C:\Temp\Test1\File1* in file size group 1 are byte-for-byte identical and therefore form a duplicate group to display.

But the files C:\Temp\Test2\File1* in file size group 1 differ by 1 byte from the files C:\Temp\Test1\File1* and must therefore be moved into a new group 4. After comparing C:\Temp\Test2\File1_* with the first file C:\Temp\Test1\File1.* in file size group 1, and additionally with C:\Temp\Test2\File1.*, it is clear that all 3 C:\Temp\Test2\File1* files form another duplicate group 4 to display.

Next, the 3 C:\Temp\Test1\File2* and the 3 C:\Temp\Test2\File2* files in file size group 2 have to be binary compared. The result should be that the 3 files C:\Temp\Test1\File2* remain in duplicate group 2, while the other 3 files, which all have the same file size but each have a different byte changed and so differ from every other file, must be filtered out. According to the displayed result, exactly this happened.

Finally, the 3 C:\Temp\Test1\File3*.txt and the 2 C:\Temp\Test2\File3_*.txt files must also be binary compared. All 5 stay in duplicate group 3 because they are binary identical, regardless of the different last modification date of C:\Temp\Test2\File3_2.txt.

I could see the expected result with 4 groups with 3, 3, 5 and 3 duplicates.
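
The splitting of one file size group into duplicate groups, as described above, could look like this in Python. It is a sketch of the byte-by-byte variant only; same_bytes and split_into_duplicate_groups are made-up names, and whether UltraFinder works this way is unknown.

```python
def same_bytes(path_a, path_b, chunk_size=64 * 1024):
    """Compare two files chunk by chunk; True only if every byte matches."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files ended at the same point
                return True


def split_into_duplicate_groups(paths):
    """Partition same-size files into groups of byte-identical files.

    Each file is compared with the first file of every group found so far.
    Files that match nothing start a new group; groups that end up with a
    single file are dropped because they are unique.
    """
    groups = []
    for path in paths:
        for group in groups:
            if same_bytes(group[0], path):
                group.append(path)
                break
        else:
            groups.append([path])
    return [group for group in groups if len(group) > 1]
```

Applied to the 6 File1* files this gives two groups of 3, and applied to the 6 File2* files it gives one group of 3, which matches the result displayed by UltraFinder.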

I don't know how the binary file contents comparison is really done by UltraFinder.

It could be done by actually comparing 2 files at a time byte by byte.

But it could also be done by calculating an MD5 and/or SHA1 hash value for each file with the same file size and comparing those hash values.

The first method would guarantee 100% accurate duplicate file results, the second method nearly 100%. It is very, very unlikely but possible that two files with the same file size but different file contents have the same hash value (and even less likely when two hash algorithms are used at the same time instead of only one).
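
To illustrate the second method, here is a small Python sketch that calculates both MD5 and SHA1 in a single pass over a file. Again, this is only an assumption about how such a comparison could be built, not a description of UltraFinder; content_fingerprint is a made-up name.

```python
import hashlib


def content_fingerprint(path, chunk_size=64 * 1024):
    """Return (md5_hex, sha1_hex) of a file, reading it only once.

    Hypothetical helper: files of the same size whose fingerprints are
    equal would be treated as duplicates without a byte-by-byte compare.
    """
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so large files do not have to fit into memory.
        for block in iter(lambda: f.read(chunk_size), b""):
            md5.update(block)
            sha1.update(block)
    return md5.hexdigest(), sha1.hexdigest()
```

With hashes each file has to be read only once, no matter how many files are in a file size group, which is why a tool might prefer this method over pairwise comparison.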

So you need to ask IDM support by email if you want a definitive answer, because finding this out with a black box test would be very hard.
