Chasing Bit Rot with MD5 Hashes

When I recently copied a couple of hundred gigabytes from one backup drive to another, things did not go quite according to plan: after a few hours, the Linux kernel reported read errors on the source drive and remounted it in read-only mode. Rsync borked and aborted. I eventually fixed this by using a different cable and by connecting the drive to a USB port directly on my notebook. But this shouldn’t have happened in the first place, so I started to wonder how healthy the backup on that drive really was. So I decided to have a closer look by verifying the integrity of the data on the drive. The results were interesting.

The drive in question is an 8 TB hard disk drive with about 5 TB of data on it. In other words, at 150 to 200 MB/s, reading everything on that drive takes roughly 7 to 9 hours. The approach I took to validate the data was to generate md5 hashes of the directory trees I frequently back up to the drive and then compare these to the md5 hashes of the files on the hard disk, as suggested by Michael Simons.

The following command recursively generates md5 hashes of all files of a given directory tree:

find "./Documents" -type f -print0 | xargs -0 md5sum > /home/martin/md5-hashes.md5

Note that a directory relative to the current path is given to the find command, so the output file also contains relative paths next to the md5 hashes. Applied to the SSD in my notebook, the md5 hashes are generated at a rate of around 500 MB/s. The SSD could deliver data much faster, but the hashing in this pipeline runs on a single CPU core. Still, quite a good speed. The output file (md5-hashes.md5) is then used with the following command, run from the corresponding directory on the backup disk drive so that the relative paths resolve, to compare the hashes with the files there:

md5sum -c /home/martin/md5-hashes.md5 > /home/martin/checksum-results.txt
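
The two steps can also be wrapped into one small function. This is only a sketch, assuming GNU coreutils and findutils; the function name hash_and_verify and every path passed to it are made up for illustration. The hash file has to be given as an absolute path because the function changes directories.

```shell
#!/usr/bin/env bash
# Sketch: hash a directory tree on the source, then verify the copy on the backup.
# Arguments: source parent dir, backup parent dir, tree name, hash file (absolute path).
hash_and_verify() {
    local src_parent=$1 dst_parent=$2 tree=$3 hashfile=$4

    # Step 1: generate hashes with paths relative to the source parent directory.
    ( cd "$src_parent" &&
      find "./$tree" -type f -print0 | xargs -0 -r md5sum ) > "$hashfile" || return 2

    # Step 2: verify from the matching directory on the backup so that the
    # relative paths resolve. --quiet drops the per-file OK lines, so only
    # problems are printed. The exit status is non-zero on any mismatch.
    ( cd "$dst_parent" && md5sum -c --quiet "$hashfile" )
}
```

Called with the source parent, the mount point of the backup, the tree name and the hash file, it prints nothing when every file matches and one FAILED line per mismatch otherwise.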

One can monitor progress in additional shells by ‘tailing’ the output file:

cd /home/martin
tail -F -n 50000 checksum-results.txt
tail -F -n 50000 checksum-results.txt | grep ": FAILED"

The first tail command is nice to monitor progress, while the second command that pipes the result into grep only shows files with hashes that are not equal.
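
For a rough idea of how far the check has come, the number of lines in the result file can be compared with the number of lines in the hash file, since md5sum writes one line per file in both. A small sketch; the function name checksum_progress is hypothetical:

```shell
# Sketch: summarise a (possibly still growing) md5sum -c result file.
# Arguments: hash list, result file.
checksum_progress() {
    local hashfile=$1 results=$2
    local total checked failed
    total=$(wc -l < "$hashfile")       # one line per file to verify
    checked=$(wc -l < "$results")      # one line per file verified so far
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    failed=$(grep -c ': FAILED' "$results" || true)
    echo "checked $checked of $total files, $failed failed"
}
```

Note that this only works when the result file contains the OK lines as well, as it does with the plain md5sum -c call above.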

So I let the commands run for a couple of hours and, to my surprise and dismay, I actually found three files (out of around 100,000) that had a different md5 hash. The three files were all image files and close together in the directory tree. I’m not sure if that has any meaning, but it made it easy to literally see if the files on the backup were really broken. And indeed they were.

From a content point of view, two of the image files contained only garbage, while the third was only partially displayed by an image viewer. On the original SSD, all images were OK. Using a hex editor and the ‘cmp’ command, I could further see that the problems in the third image started exactly at a 4096-byte block boundary.
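
The block boundary check can be sketched with cmp alone. This is an illustration rather than the exact steps taken above; the function name first_diff_offset is made up, and the block size defaults to 4096 bytes as an assumption.

```shell
# Sketch: report where two copies of a file first differ and whether that
# offset is aligned to a block size (third argument, default 4096).
first_diff_offset() {
    local good=$1 bad=$2 block=${3:-4096}
    local byte offset
    # cmp -l prints one line per differing byte: the 1-based position,
    # followed by the two byte values in octal. Only the first line is needed.
    byte=$(cmp -l "$good" "$bad" 2>/dev/null | awk 'NR == 1 { print $1; exit }')
    if [ -z "$byte" ]; then
        echo "no differing byte found"
        return 0
    fi
    offset=$((byte - 1))    # convert to a 0-based offset
    echo "first difference at offset $offset (block $((offset / block)), byte $((offset % block)) within block)"
}
```

A result of "byte 0 within block" means the damage starts exactly at a block boundary.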

From a file and storage perspective, the three images were taken and initially stored in 2009 and 2010. The file date and length were identical on both the SSD and the HDD, and the files’ modification dates on both sides also pointed to 2009 and 2010. This is why rsync did not see a difference.
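
By default, rsync decides whether a file needs to be transferred based on its size and modification time only. It can be told to compare file contents instead with the --checksum option, at the price of reading every file on both sides. A minimal sketch with a hypothetical wrapper name, using a dry run so that nothing is copied:

```shell
# Sketch: list files whose content differs even when size and mtime match.
# -r recurse, -c compare by checksum, -n dry run (copies nothing),
# --itemize-changes prints one line per file that would be updated.
rsync_content_diff() {
    local src=$1 dst=$2
    rsync -rcn --itemize-changes "$src/" "$dst/"
}
```

Files that exist only on the source side are listed as well, so the output is not limited to corrupted files.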

As I have a backup of this backup drive (yes, really…), I had a look at that drive as well. And to my surprise, the three files were correct on that drive. As that second-level backup was initially created around one and a half years ago, it means that the files were still OK at the time and only became corrupted later on. I then ran an ext4 file system check on the disk drive, which came back clean. So ext4 did not notice that the files were corrupted. That in itself is expected, as the check only looks at file system structures and ext4 keeps no checksums of file contents. Still, all very strange.

I then ran a check on my other backup disk drives, but those were all clean: no further md5 mismatches were reported over many terabytes of data and hundreds of thousands of files. Also, it’s good to know that this kind of ‘bit rot’ does not propagate forward, unless, of course, the backup is actually used.

So I’m really puzzled why those three files were corrupted and when this happened. Obviously, it did not happen when the files were first written to the drive about two years ago; it must have happened afterwards. I guess I will have to watch that drive carefully and decommission it should further errors appear.