My 8 TB drive, which I use for all the stuff that may or may not be useful again in the future, is filling up faster than prices for bigger drives are dropping. In the past I’ve always ‘upgraded’ once drive capacities had doubled. But currently, the price for a 16 TB external drive still hovers around €300. And since I always keep several backups, that price would multiply. So that’s not quite in the cards for the time being. So I did the next best thing and had a look for duplicates. Now how hard can that be, I thought.
The easiest way to do this is to just look for duplicate names. In my case that would have given quite a number of false positives. When creating backups of optical media, for example, the resulting audio and video files often get the same name. Also, one would miss quite a number of duplicates whose filename was changed at some point in time.
Fortunately someone had this problem before and was kind enough to write ‘rdfind’ (redundant data find). It is available in the Debian repository for easy installation with ‘apt install rdfind’. Rather than relying on filenames, rdfind compares content: it looks for files of identical size, compares their first and last byte sequences and then calculates a SHA-1 checksum over the remaining potential matches. Once the tool got to work on my 8 TB drive, it took a couple of hours before it returned with a result. I was positively surprised by all the things it had found that I would have missed if I had done a manual search or just looked for identical file names. In the end I could remove over 700 GB of duplicated files. One additional nice feature: one can give rdfind a minimum file size (in bytes), which makes it ignore all smaller files:
rdfind -minsize 50000000 ~/Documents
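
To make the idea of narrowing down candidates in stages a bit more tangible, here is a minimal Python sketch of the same kind of filtering: same size first, then first and last bytes, then a SHA-1 checksum. This is not rdfind’s actual code. The function name find_duplicates, the min_size parameter and the 64-byte EDGE_BYTES window are my own assumptions for illustration, and the sketch only reports duplicates, it never deletes anything.

```python
import hashlib
import os
from collections import defaultdict

EDGE_BYTES = 64  # assumption: how many leading/trailing bytes to compare


def _split(groups, key):
    """Regroup candidates by key(path) and drop groups with a single member."""
    result = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        result.extend(g for g in buckets.values() if len(g) > 1)
    return result


def _first_bytes(path):
    with open(path, "rb") as f:
        return f.read(EDGE_BYTES)


def _last_bytes(path):
    with open(path, "rb") as f:
        f.seek(max(0, os.path.getsize(path) - EDGE_BYTES))
        return f.read(EDGE_BYTES)


def _sha1(path):
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, min_size=0):
    """Return lists of paths under root whose contents appear identical."""
    files = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything smaller than min_size bytes.
            if (os.path.isfile(path) and not os.path.islink(path)
                    and os.path.getsize(path) >= min_size):
                files.append(path)
    groups = [files]
    # Cheap checks run first, so the expensive checksum only sees real candidates.
    for key in (os.path.getsize, _first_bytes, _last_bytes, _sha1):
        groups = _split(groups, key)
    return [sorted(group) for group in groups]


if __name__ == "__main__":
    for group in find_duplicates(os.path.expanduser("~/Documents"), 50_000_000):
        print(group)
```

The ordering is the whole point: comparing sizes costs almost nothing, reading a few bytes at either end is still cheap, and only the files that survive both steps have to be read in full for the checksum.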