I provide a Unicode / MinGW port of finddupe, the duplicate file detector and eliminator for Windows by Matthias Wandel, on GitHub.
I really like finddupe when I look for duplicate files among books or photos. It is fast and clever thanks to CRC file signatures. It can also find NTFS hard links, which is difficult otherwise. Please refer to Matthias' site for a full description. My favourites are
finddupe c:\MyBooks and
finddupe -listlink c:\MyBooks.
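The idea of CRC file signatures can be pictured with a small Python sketch. It is only an illustration, not finddupe's actual code: the 32 KB prefix size is taken from the -sigs help text, while the function names (`quick_signature`, `same_contents`, `find_duplicates`) and the final byte-for-byte comparison are my own assumptions.

```python
import os
import zlib
from collections import defaultdict

SIGNATURE_BYTES = 32 * 1024  # prefix length used for the cheap signature


def quick_signature(path):
    """Return (file size, CRC32 of the first 32 KB) as a cheap fingerprint."""
    with open(path, "rb") as f:
        head = f.read(SIGNATURE_BYTES)
    return os.path.getsize(path), zlib.crc32(head)


def same_contents(a, b, chunk=1 << 16):
    """Compare two files byte for byte (assumed confirmation step)."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ca, cb = fa.read(chunk), fb.read(chunk)
            if ca != cb:
                return False
            if not ca:  # both files ended at the same point
                return True


def find_duplicates(paths):
    """Return (original, duplicate) pairs among the given file paths."""
    by_signature = defaultdict(list)
    duplicates = []
    for path in paths:
        candidates = by_signature[quick_signature(path)]
        for original in candidates:
            if same_contents(original, path):
                duplicates.append((original, path))
                break
        else:
            # No earlier file has the same contents: remember this one.
            candidates.append(path)
    return duplicates
```

Most files differ in size or in their first 32 KB, so only a few candidates ever need to be read in full. That is where the speed comes from.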
The current version 1.23 of finddupe is ASCII-only and fails on non-ASCII filenames, which are common with books.
If you are looking for a GUI, there are AllDup and Duplicate Commander, but frankly, they involve a lot of clicking around. Under Linux there is
rdfind. Hard-linked files can be backed up with
rsync -H, which preserves hard links but has some issues.
I ported finddupe to Unicode (actually UTF-16) and MinGW as version 1.24. I used the
tchar.h wrapper, so an ASCII build is still possible. Then I added some more functionality (-depth, -ign) in version 1.25.
I hate this wchar_t stuff, but I really like finddupe. Why didn't Microsoft go for UTF-8? It did not exist at that time. utf8everywhere.org makes interesting reading.
MinGW-w64 is required for Unicode wmain as described here. I used Ruby DevKit-mingw64-32-4.7.2 from rubyinstaller.org.
I provide a Win32 binary, so you do not have to compile yourself. There is also a 64-bit build.
    c:\> finddupe -h
    finddupe v1.25 compiled May 11 2017
    Usage: finddupe [options] [-ref] <filepat> [filepat]...
    Options:
     -bat <file.bat> Create batch file with commands to do the hard
                     linking. run batch file afterwards to do it
     -hardlink       Create hardlinks. Works on NTFS file systems only.
                     Use with caution!
     -del            Delete duplicate files
     -v              Verbose
     -sigs           Show signatures calculated based on first 32k for each file
     -rdonly         Apply to readonly files also (as opposed to skipping them)
     -ref <filepat>  Following file pattern are files that are for reference, NOT
                     to be eliminated, only used to check duplicates against
     -z              Do not skip zero length files (zero length files are ignored
                     by default)
     -u              Do not print a warning for files that cannot be read
     -p              Hide progress indicator (useful when redirecting to a file)
     -j              Follow NTFS junctions and reparse points (off by default)
     -listlink       hardlink list mode. Not valid with -del, -bat, -hardlink,
                     or -rdonly, options
     -ign <substr>   Ignore file pattern, like .git, .svn or .bak (can be repeated)
     -depth <num>    Maximum recursion depth, default 0 = infinite
     <filepat>       Pattern for files. Examples:
                      c:\**        Match everything on drive C
                      c:\**\*.jpg  Match only .jpg files on drive C
                      **\foo\**    Match any path with component foo
                                   from current directory down
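To illustrate what the -ign and -depth options do, here is a rough Python sketch of a directory walk with the same two knobs. It is not how finddupe is implemented. The function name `walk_filtered` is hypothetical, and two details are assumptions on my part: the ignore substrings are matched against each file or directory name, and the start directory counts as depth level 1.

```python
import os


def walk_filtered(root, ignore=(), max_depth=0, _level=1):
    """Yield file paths below root.

    ignore    -- substrings; an entry whose name contains one is skipped
                 (assumption: matched against the name, not the full path)
    max_depth -- 0 means unlimited; otherwise the number of directory
                 levels to visit, counting root as level 1 (assumption)
    """
    for entry in sorted(os.listdir(root)):
        if any(substr in entry for substr in ignore):
            continue  # skipped like a name given with -ign
        path = os.path.join(root, entry)
        if os.path.isdir(path):
            if max_depth == 0 or _level < max_depth:
                yield from walk_filtered(path, ignore, max_depth, _level + 1)
        else:
            yield path
```

For example, `list(walk_filtered("my_scripts1", ignore=[".git", ".svn"], max_depth=1))` would list only the files directly inside `my_scripts1`, leaving out the version control directories.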
Photo upload from a digital camera
Your shiny and new digital camera connects to your home server via WiFi. Every time you do an upload, all of those mega-pixel photos appear in your incoming folder again. You want to see only new files, so you do
finddupe -del -ref pictures_dir incoming_dir
to get rid of incoming duplicates. The existing pictures in
pictures_dir will be taken as reference only, the duplicates in
incoming_dir will be removed.
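The reference idea can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not finddupe's algorithm: it uses a SHA-256 digest of the whole file instead of CRC signatures, it only looks for matches against the reference directory, and the names `remove_known_files` and `dry_run` are made up for the example. Zero length files are skipped, as finddupe does by default.

```python
import hashlib
import os


def file_digest(path):
    """SHA-256 of the whole file (stand-in for finddupe's CRC signatures)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.digest()


def walk_files(root):
    """Yield every non-empty file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) > 0:
                yield path


def remove_known_files(reference_dir, incoming_dir, dry_run=True):
    """Return incoming files whose contents already exist in reference_dir.

    Files in reference_dir are never touched. Matching incoming files are
    deleted only when dry_run is False.
    """
    known = {file_digest(p) for p in walk_files(reference_dir)}
    removed = []
    for path in walk_files(incoming_dir):
        if file_digest(path) in known:
            removed.append(path)
            if not dry_run:
                os.remove(path)
    return removed
```

Calling `remove_known_files("pictures_dir", "incoming_dir")` only reports what would go. Pass `dry_run=False` to actually delete.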
If your home server runs Linux, this should work too, albeit a bit slower:
finddupe -ref \\server\public\Foto\2017 \\server\incoming (TODO UNC paths do not work yet).
BTW, if you want to set the JPG file time to the time the picture was taken (stored in EXIF), I have an exif_date.rb script for you.
Books coming back
You share your digital books with your brother. Some time later you have a look into his books, and there are plenty of yours in there, and some more you do not have yet. To save your disk space and see new books you do
finddupe -hardlink my_book_dir his_book_dir (you can add more directories, but they all need to be on the same NTFS volume).
An NTFS hardlink (created with
mklink /H new_link file) gives the same contents several filenames. It is completely transparent to applications and to the user, so it is difficult to see that a file is a link: dangerous, like a very clean window. To inspect links there are
fsutil.exe hardlink list C:\Windows\System32\notepad.exe (needs admin rights) and SysInternals FindLinks (no admin, no globs, no sources :(). Both need to be told which file to inspect. With
finddupe -listlink you can see which files are linked.
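For the curious, a list of hard-linked files can also be approximated in a few lines of Python. This is a minimal sketch, not what finddupe -listlink does internally. It assumes that `os.lstat` reports a link count (`st_nlink`) and a file identifier (`st_dev`, `st_ino`), which recent Python 3 versions do on both Linux and Windows. The function name `list_hardlinks` is made up for the example.

```python
import os
from collections import defaultdict


def list_hardlinks(root):
    """Group the files below root that share the same underlying file.

    Returns a list of sorted path lists, one per hard-linked file.
    A group can hold a single path when the other names of that file
    live outside root.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if st.st_nlink > 1:  # more than one name points at this file
                groups[(st.st_dev, st.st_ino)].append(path)
    return [sorted(paths) for paths in groups.values()]
```

Unlike the two tools above, this walks a whole directory tree, so you do not need to know the file name in advance.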
I also recommend an excellent Link Shell Extension by Hermann Schinagl.
Check in files from other repositories
You want to use some scripts from your private Git repositories at work and check them into the company Subversion repository. You want to be DRY, so you think about symbolic links and
core.symlinks=true, but you realize that symlinks are no good on Windows until 10.something, as they need elevated privileges, because somebody at Microsoft decided so. Otherwise you just get a text file containing the link path instead.
Interestingly enough, hardlinks, which are much trickier, are easy to set up and need no admin rights. You just copy your files back and forth and then you hardlink them:
finddupe -hardlink -depth 1 -ign .git -ign .svn my_scripts1 my_scripts2
Git status doesn't see any difference, but if you edit a file in one repository, it also changes in the other one.
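The effect is easy to reproduce. The following Python sketch builds two throwaway directories, hardlinks one file between them with `os.link` and appends to it through the second name. The directory names mirror the command above, while `tool.rb` and the function name `demo_hardlink_edit` are made up for the demonstration.

```python
import os
import tempfile


def demo_hardlink_edit():
    """Edit a file through one hard link and read it back through the other."""
    with tempfile.TemporaryDirectory() as tmp:
        repo1 = os.path.join(tmp, "my_scripts1")
        repo2 = os.path.join(tmp, "my_scripts2")
        os.mkdir(repo1)
        os.mkdir(repo2)

        first = os.path.join(repo1, "tool.rb")
        second = os.path.join(repo2, "tool.rb")
        with open(first, "w") as f:
            f.write("puts 1\n")

        # Give the same contents a second name in the other directory.
        os.link(first, second)

        # Append through the second name ...
        with open(second, "a") as f:
            f.write("puts 2\n")

        # ... and the change shows up under the first name as well.
        with open(first) as f:
            return f.read()
```

`demo_hardlink_edit()` returns both lines, although the second one was only ever written to the copy in `my_scripts2`.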
Original license by Matthias Wandel: Finddupe is totally free. Do whatever you like with it. You can integrate it into GPL or BSD style licensed programs if you would like to.