The 8 TB drive I use for all the stuff that may or may not come in handy again someday is filling up faster than prices for bigger drives are dropping. In the past, I’ve always ‘upgraded’ once drive capacities had doubled. But currently, the price for a 16 TB external drive still hovers around €300. And since I always keep several backups, that price would multiply. So that’s not quite in the cards for the time being. Instead, I did the next best thing and went looking for duplicates. Now how hard can that be, I thought.
The easiest way to do this is to simply look for duplicate names. In my case, that would have produced quite a number of false positives: when creating backups of optical media, for example, the resulting audio and video files often end up with the same name. It would also miss duplicates whose filename was changed at some point.
Fortunately, someone had this problem before and was kind enough to write ‘rdfind‘ (redundant data find). Available in the Debian repository for easy installation with ‘apt install’, rdfind does not just compare file names: it also looks for identical file sizes, compares first and last byte sequences, and calculates a SHA-1 checksum over the remaining potential matches. Once the tool got to work on my 8 TB drive, it took a couple of hours to return a result. I was positively surprised by all the things it had found that I would have missed with a manual search or by just looking for identical file names. In the end, I could remove over 700 GB of duplicate files. One additional nice feature: one can give rdfind a minimum file size (in bytes), which makes it ignore all smaller files:
rdfind -minsize 50000000 ~/Documents
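By default, rdfind only writes a report (results.txt in the current directory) and does not touch anything; deleting or hard-linking the duplicates has to be requested explicitly. A rough sketch of that workflow, reusing the example path above (the flags are described in rdfind’s man page):

# First a dry run: report what would be treated as a duplicate, change nothing
rdfind -dryrun true -minsize 50000000 ~/Documents
# Review the report rdfind writes to the current directory
less results.txt
# Either replace duplicates with hard links (keeps every path working) ...
rdfind -makehardlinks true -minsize 50000000 ~/Documents
# ... or, once you trust the results, delete them outright
rdfind -deleteduplicates true -minsize 50000000 ~/Documents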
I used fslint for this task in the past.
It used to be in the Ubuntu repos, but right now I can’t find it in the Mint 20 repos (which, as you may know, are based on Ubuntu).
http://www.pixelbeat.org/fslint/
fslint needs python2, which is deprecated in most modern distros.
A more recent incarnation of fslint is https://github.com/qarmin/czkawka
fslint is horrible when it comes to dupes. It only compares file names. A good way to ruin things.
I usually just use an alias for this: ‘find . -type f -exec md5sum "{}" \; | sort | uniq -d -w 36’
But rdfind seems interesting, thanks, never came across it in Gentoo.
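For what it’s worth, a small variant of that one-liner (relying on GNU uniq’s --all-repeated option) prints every member of each duplicate group instead of just one representative per group, with groups separated by blank lines:

# Compare only the first 32 characters, i.e. the md5 hash itself
find . -type f -exec md5sum "{}" \; | sort | uniq -w 32 --all-repeated=separate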
That wait before results was killing me, so I wrote my own tool that deduplicates incrementally, returning results as soon as they’re found.
https://github.com/kornelski/dupe-krill
Hi Kornel, very interesting, thanks!
Why not use a tool that does deduplicated backups in the first place? E.g. Borg, restic, or bupstash, to name a few of the good ones.
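For anyone considering that route: these tools deduplicate at the chunk level automatically, so identical or repeated files cost almost no extra space in the repository. A minimal Borg sketch (the repository path and archive name are just placeholders):

# Create an encrypted repository once
borg init --encryption=repokey /mnt/backup/borg-repo
# Each backup run only stores chunks the repository hasn't seen yet
borg create --stats /mnt/backup/borg-repo::documents-{now} ~/Documents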