I provide a Unicode / MinGW port of finddupe, the duplicate file detector and eliminator for Windows by Matthias Wandel, on GitHub.

Reasons

I really like finddupe when I look for duplicate files among books or photos. It is fast and clever thanks to CRC file signatures. It can also find NTFS hard links, which is difficult otherwise. Please refer to Matthias' site for the full description. My favourites are finddupe c:\MyBooks and finddupe -listlink c:\MyBooks.
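
How does it get fast? Roughly: instead of comparing every file against every other one in full, finddupe first computes a cheap signature from the beginning of each file and only looks closer when signatures collide. Here is a minimal sketch of that idea in C, assuming a CRC-32 over the first 32 KB plus the file size; it is my illustration, not finddupe's actual code.

/* Sketch only: a cheap file signature from the first 32 KB plus the size.
   Files with different signatures cannot be identical; files with equal
   signatures still deserve a full comparison. */
#include <stdio.h>

static unsigned int Crc32(const unsigned char *buf, size_t len)
{
    unsigned int crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1)));
    }
    return ~crc;
}

int FileSignature(const char *path, unsigned int *sig, long *size)
{
    unsigned char buf[32768];
    FILE *f = fopen(path, "rb");
    if (f == NULL) return 0;
    size_t n = fread(buf, 1, sizeof buf, f);  /* signature covers at most 32 KB */
    fseek(f, 0, SEEK_END);
    *size = ftell(f);                         /* long is enough for a sketch */
    fclose(f);
    *sig = Crc32(buf, n);
    return 1;
}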

The current version 1.23 of finddupe is ASCII-only and fails on non-ASCII filenames, which are common among books.

Alternatives

If you are looking for a GUI, there are AllDup and Duplicate Commander, but frankly, they involve a lot of clicking around. Under Linux there are fdupes, hardlink and rdfind. You can back up the hard links with rsync -H, but it has some issues.

Port

I ported finddupe to Unicode (actually UTF-16) and MinGW as version 1.24. I used the tchar.h wrapper, so an ASCII build is still possible. Then I added some more functionality (-depth, -ign) in version 1.25.
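
The tchar.h idea in a nutshell: the same source compiles to either char or wchar_t code, depending on whether _UNICODE is defined. A minimal sketch, not the real finddupe source:

#include <tchar.h>
#include <stdio.h>

/* With -D_UNICODE -DUNICODE the _T / _TCHAR / _t* names map to the wide
   variants (wchar_t, _wfopen, fputws, ...); without those defines they map
   to the plain char ones, so the same source still gives an ASCII build. */
void ShowFile(const _TCHAR *name)
{
    FILE *f = _tfopen(name, _T("rb"));   /* fopen() or _wfopen(), depending on build */
    if (f != NULL) {
        _fputts(name, stdout);           /* fputs() or fputws() */
        _fputts(_T("\n"), stdout);
        fclose(f);
    }
}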

I hate this wchar_t stuff, but I really like finddupe. Why didn't Microsoft go for UTF-8? It simply didn't exist back then. utf8everywhere.org makes for interesting reading.

MinGW-w64 is required for Unicode wmain as described here. I used Ruby DevKit-mingw64-32-4.7.2 from rubyinstaller.org.
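
The catch is the entry point: a Unicode build wants wmain(wchar_t *argv[]), and only MinGW-w64 ships the startup stub for it, enabled with -municode. A tiny sketch, assuming a MinGW-w64 gcc:

/* wide_args.c -- compile roughly like:
   gcc -municode -D_UNICODE -DUNICODE wide_args.c -o wide_args.exe
   (-municode links the wmainCRTStartup stub that calls wmain). */
#include <windows.h>
#include <wchar.h>

int wmain(int argc, wchar_t *argv[])
{
    for (int i = 1; i < argc; i++) {
        DWORD attr = GetFileAttributesW(argv[i]);   /* handles non-ASCII names */
        wprintf(L"%ls: %ls\n", argv[i],
                attr == INVALID_FILE_ATTRIBUTES ? L"not found" : L"found");
    }
    return 0;
}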

I provide a Win32 binary, so you do not have to compile it yourself. There is also a 64-bit build.

Usage

c:\> finddupe -h
finddupe v1.25 compiled May 11 2017
Usage: finddupe [options] [-ref] <filepat> [filepat]...
Options:
 -bat <file.bat> Create batch file with commands to do the hard
                 linking.  run batch file afterwards to do it
 -hardlink       Create hardlinks.  Works on NTFS file systems only.
                 Use with caution!
 -del            Delete duplicate files
 -v              Verbose
 -sigs           Show signatures calculated based on first 32k for each file
 -rdonly         Apply to readonly files also (as opposed to skipping them)
 -ref <filepat>  Following file pattern are files that are for reference, NOT
                 to be eliminated, only used to check duplicates against
 -z              Do not skip zero length files (zero length files are ignored
                 by default)
 -u              Do not print a warning for files that cannot be read
 -p              Hide progress indicator (useful when redirecting to a file)
 -j              Follow NTFS junctions and reparse points (off by default)
 -listlink       hardlink list mode.  Not valid with -del, -bat, -hardlink,
                 or -rdonly options
 -ign <substr>   Ignore file pattern, like .git, .svn or .bak (can be repeated)
 -depth <num>    Maximum recursion depth, default 0 = infinite
  <filepat>      Pattern for files.  Examples:
                  c:\**        Match everything on drive C
                  c:\**\*.jpg  Match only .jpg files on drive C
                  **\foo\**    Match any path with component foo
                               from current directory down

Examples

Photo upload from a digital camera

Your shiny new digital camera connects to your home server via WiFi. Every time you do an upload, all of those mega-pixel photos appear in your incoming folder again. You want to see only the new files, so you run finddupe -del -ref pictures_dir incoming_dir to get rid of the incoming duplicates. The existing pictures in pictures_dir are taken as reference only; the duplicates in incoming_dir are removed.

Even if your home server runs Linux, this should work too, just a bit slower: finddupe -ref \\server\public\Foto\2017 \\server\incoming (TODO: UNC paths do not work yet).

BTW, if you want to set the JPG file time to the time the picture was taken (stored in EXIF), I have an exif_date.rb script for you.

Books coming back

You share your digital books with your brother. Some time later you have a look at his books, and there are plenty of yours in there, plus some more you do not have yet. To save disk space and see the new books, you run finddupe -hardlink my_book_dir his_book_dir (you can add more directories, but they all need to be on the same NTFS disk).

An NTFS hardlink (mklink /H new_link file) means having many filenames for the same content. It is completely transparent to applications, and to the user too; it is difficult to see that it is a link at all. Dangerous like a very clean window. To list links there are fsutil.exe hardlink list C:\Windows\System32\notepad.exe (needs admin rights) and SysInternals FindLinks (no admin, no globs, no sources :(). Both need to know the file in advance.
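
Under the hood a hard link is just another directory entry pointing at the same file record, created with the CreateHardLink API. A small sketch of my own (not finddupe's code) that replaces a duplicate by a link to the original and prints the resulting link count:

#include <windows.h>
#include <wchar.h>

/* Sketch only: turn 'duplicate' into a hard link to 'original'. */
int LinkDuplicate(const wchar_t *original, const wchar_t *duplicate)
{
    if (!DeleteFileW(duplicate)) return 0;                /* drop the copy first */
    if (!CreateHardLinkW(duplicate, original, NULL)) return 0;

    HANDLE h = CreateFileW(original, FILE_READ_ATTRIBUTES,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL, OPEN_EXISTING, 0, NULL);
    if (h != INVALID_HANDLE_VALUE) {
        BY_HANDLE_FILE_INFORMATION info;
        if (GetFileInformationByHandle(h, &info))
            wprintf(L"%ls now has %lu links\n", original, info.nNumberOfLinks);
        CloseHandle(h);
    }
    return 1;
}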

With finddupe -listlink you can see which files are linked.
I also recommend the excellent Link Shell Extension by Hermann Schinagl.
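
If you want the same from your own code, the Win32 calls for enumerating the names of a hard-linked file are FindFirstFileNameW / FindNextFileNameW (Vista and later). A rough sketch, not taken from any of the tools above:

#define _WIN32_WINNT 0x0600   /* FindFirstFileNameW needs Vista+ declarations */
#include <windows.h>
#include <wchar.h>

/* Sketch only: print every directory entry that refers to the same file as
   'path'.  The names come back volume-relative, without the drive letter. */
void ListHardLinks(const wchar_t *path)
{
    wchar_t name[MAX_PATH];
    DWORD len = MAX_PATH;
    HANDLE h = FindFirstFileNameW(path, 0, &len, name);
    if (h == INVALID_HANDLE_VALUE) return;
    do {
        wprintf(L"%ls\n", name);
        len = MAX_PATH;                   /* reset buffer size for the next call */
    } while (FindNextFileNameW(h, &len, name));
    FindClose(h);
}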

Check in files from other repositories

You want to use some scripts from your private Git repositories at work and check them into the company Subversion repository. You want to be DRY, so you think about symbolic links and core.symlinks=true, but then you realize that symlinks are no good on Windows until 10.something, as they need elevated privileges, because somebody at Microsoft decided so. Or you just get a text file with the link path instead.
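
You can watch that happen in code: CreateSymbolicLink fails with ERROR_PRIVILEGE_NOT_HELD unless the process is elevated (or, since Windows 10 Creators Update, Developer Mode is on and the unprivileged-create flag is passed). A tiny sketch of my own:

#define _WIN32_WINNT 0x0600
#include <windows.h>
#include <wchar.h>

#ifndef SYMBOLIC_LINK_FLAG_ALLOW_UNPRIVILEGED_CREATE
#define SYMBOLIC_LINK_FLAG_ALLOW_UNPRIVILEGED_CREATE 0x2   /* Windows 10 1703+ */
#endif

/* Sketch only: without elevation or Developer Mode this typically fails
   with error 1314 (ERROR_PRIVILEGE_NOT_HELD). */
void TrySymlink(const wchar_t *link, const wchar_t *target)
{
    if (!CreateSymbolicLinkW(link, target,
                             SYMBOLIC_LINK_FLAG_ALLOW_UNPRIVILEGED_CREATE))
        wprintf(L"CreateSymbolicLink failed, error %lu\n", GetLastError());
}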

Interestingly enough, hardlinks, which are technically much trickier, are easy to set up and need no admin rights. You just copy your files back and forth and then hardlink them:
finddupe -hardlink -depth 1 -ign .git -ign .svn my_scripts1 my_scripts2
Git status doesn't see any difference, but if you edit a file in one repository, it also changes in the other one.

License

Original license by Matthias Wandel: Finddupe is totally free. Do whatever you like with it. You can integrate it into GPL or BSD style licensed programs if you would like to.
