I recently got a bit of a scare when I noticed that some large files of several gigabytes that I had copied from an SD card to a hard drive (and vice versa) were corrupted afterwards. Corrupted as in: the last gigabyte of the file was missing on the target device. To this day I have no idea how this happened; I'm usually diligent enough to eject media and wait for the pop-up notifying me that it is safe to remove the drive. What was even worse was that I started wondering whether my regular backups were affected as well. It's a strange feeling when you suddenly don't trust your backups anymore, so I invested some time to find out whether things were all right or not. It turns out that is not so difficult.
Comparing a Few Files
For only a few files, say virtual machine snapshot files of several gigabytes each in a single directory, master and backup drive can be compared quite easily by calculating a checksum of each file in the directory, once on the master drive and once on the backup:
md5sum *
The command above reads all files in the directory and generates the md5sum of each. Even a single differing bit results in a completely different checksum, so the comparison between master and backup is simple.
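md5sum can also verify files against a previously saved checksum list, which saves you from eyeballing the hashes yourself. A minimal sketch, assuming the snapshot directories are mounted under the hypothetical paths /media/master/vms and /media/backup/vms:

# generate checksums on the master drive
cd /media/master/vms && md5sum * > /tmp/master.md5

# verify the backup copies against that list
cd /media/backup/vms && md5sum -c /tmp/master.md5

md5sum -c prints OK or FAILED for each file, so a single corrupted copy stands out immediately.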
Comparing Thousands of Files Needs a Different Approach
While this works well for a few files, it's not practical to compare thousands of files this way. One solution I came up with at first was to run md5sum recursively through the directory structure, pipe the output generated on the master and the backup drive to text files and then compare them with a program such as meld. Unfortunately that didn't work, as the files were not stored in the same order in the directories on my master and backup drives. Hence, the checksums ended up in a different order in the two output files and a file comparison with meld just showed chaos.
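For reference, that first attempt looked roughly like the following, with placeholder paths; sorting each list by file path is one possible workaround for the ordering problem, although I went the rsync route described below instead:

# generate a sorted checksum list on each drive; relative paths keep the two lists comparable
cd /media/master && find . -type f -exec md5sum {} + | sort -k 2 > /tmp/master-sums.txt
cd /media/backup && find . -type f -exec md5sum {} + | sort -k 2 > /tmp/backup-sums.txt

# compare the two lists side by side
meld /tmp/master-sums.txt /tmp/backup-sums.txt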
As I make backups with LuckyBackup, a frontend for rsync, I had a look at whether rsync could be made to compare a source and a destination not based on file size and modification time but on checksums. It turned out that rsync has just such an option! By default checksums are not used because they require every file to be read from beginning to end to compute a checksum, which takes far longer than just comparing file size and date from the directory tree. Here's an example of an rsync command that does what I needed:
rsync --dry-run --checksum -h --progress --stats -r -tgo -p -l -D --delete-after source-path destination-path
The two important parameters are --dry-run and --checksum, as I didn't want to synchronize the two directories but just wanted to find out whether any files were different. I'll leave it to you to find out what the other options do.
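As a concrete sketch with hypothetical mount points, and with -i (--itemize-changes) added to make rsync spell out why each listed file would be transferred:

rsync --dry-run --checksum -i -h --progress --stats -r -tgo -p -l -D --delete-after /media/master/ /media/backup/

Because of --dry-run nothing is actually written; every file that shows up in the output has a different checksum on the two drives. The trailing slashes tell rsync to compare the contents of the two directories rather than creating a master subdirectory inside the backup.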
Rsync --checksum vs. File Date/Size
When rsync is run with the --checksum option it completely ignores file dates. I verified this by changing just a single character in a text file and then setting the access and modification times to the same value on both sides with the following command:
touch -a -m -t 201702142059 test.txt
When both files have the same modification time and size, rsync in standard mode does not detect that a character was changed in the text file. When run with the --checksum parameter, however, the file is marked as changed. That's what I wanted!
After this test I changed the character back to its original value and let the operating system update the file date. As expected, rsync in standard operating mode would have updated the file, because the modification time was different. In --checksum mode, however, the differing file dates were ignored, as the checksums of the two versions were identical.
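The experiment is easy to reproduce with a pair of throwaway directories; the paths and file contents below are of course just placeholders:

mkdir -p /tmp/src /tmp/dst
echo "hello world" > /tmp/src/test.txt
cp /tmp/src/test.txt /tmp/dst/test.txt

# flip one character in the copy (same file size), then give both files identical timestamps
sed -i 's/hello/jello/' /tmp/dst/test.txt
touch -a -m -t 201702142059 /tmp/src/test.txt /tmp/dst/test.txt

# size and date match, so a standard dry run reports nothing to transfer
rsync --dry-run -av /tmp/src/ /tmp/dst/

# with --checksum the changed file is detected despite identical metadata
rsync --dry-run -av --checksum /tmp/src/ /tmp/dst/

Here -a is simply shorthand for the same set of preserve options (-rlptgoD) used in the longer command above.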
Fortunately for me, all tests showed that my backups are o.k. That still leaves me with the question of why the files I copied manually were broken, but that is far less of a problem.