OK, one more topic is still missing in my story about running my own cloud: How do I back it all up? Before answering the question, it is of course important to define which scenarios the backups should protect against. And, not surprisingly, I have made good use of my backups over the years to restore accidentally deleted data. Read on for the details.
What Kinds of Scenarios to Protect Against?
All right, so I want backups (plural) to protect against a number of things: First of all, I need backups in case of a server hardware failure or a prolonged connectivity outage. I would also like to have a backup that is as recent as possible in case a virtual machine or container setup fails due to a software update or a successful attack against my infrastructure. In addition, my backup strategy includes warm standby installations that periodically synchronize with the actively running services, so they can take over within a few minutes if the main services are lost for whatever reason. And finally, I have a strategy to get up and running again with current data within a few minutes after the theft or loss of my notebook or those of my family members.
First Line of Defense: Virtual Machine Snapshots
The first line of defense in my overall backup strategy is to take snapshots of virtual machines before I attempt to significantly modify the running software. Think operating system version upgrades. In case something goes wrong, I can just shut down the VM, restore the snapshot and be up and running again, including all data, in a minute. It has saved me several times over the years when ‘what could possibly go wrong’ software configuration changes or upgrades went wrong, and rescue attempts just made things worse. Such VM snapshots also give great peace of mind during such operations: when things start to go wrong, I know I’m only a minute away from having a fully working system again with data that is at most a few minutes old.
Second Line of Defense: ZFS Snapshots
A relatively new addition to my overall backup strategy is to create ZFS snapshots of the partitions on my physical servers that are home to the VM system and data images. These snapshots are created automatically when I reboot a physical server. As ZFS snapshots are easily accessible, I can quickly extract a single file, e.g. a VM disk image, from a snapshot. I can recall at least one event in which this has helped tremendously: After a Docker update, my personal MediaWiki would no longer save changes to pages and gave me obscure error messages. After trying for an hour to fix things, I decided to roll back to the last ZFS snapshot of the virtual machine disk file, do the Docker update for my MediaWiki by hand and debug from there. Again, it’s really good to have the peace of mind that you can get back to a working state within a few minutes of something failing.
In addition to the local restoration of VM disk images, I use the ZFS snapshots to synchronize these files to a central backup server over the Internet. The advantage: I don’t have to wait for the backup to complete before I can start the virtual machines again, as only the read-only ZFS snapshots of the disk images are accessed. In case of a hardware failure, this gives me the option to use VM files that are relatively up to date and then apply the differential data backups to them, which I run much more frequently.
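To make the snapshot-then-sync idea more concrete, here is a minimal sketch in Python that only assembles the commands involved. It is not my actual script. The dataset `tank/vm`, the mountpoint `/tank/vm`, the image name `wiki.qcow2` and the host `backup.example.org` are hypothetical, and the use of rsync for the transfer is an assumption. The one ZFS-specific point it relies on is that snapshots are exposed read-only in the hidden `.zfs/snapshot` directory of a mounted dataset.

```python
import shlex
from datetime import datetime


def snapshot_name(dataset, when=None):
    """Build a timestamped snapshot name such as tank/vm@boot-20240102-030405."""
    when = when or datetime.now()
    return f"{dataset}@boot-{when:%Y%m%d-%H%M%S}"


def snapshot_command(dataset, when=None):
    """Command that creates the snapshot (e.g. run from a boot-time unit)."""
    return ["zfs", "snapshot", snapshot_name(dataset, when)]


def snapshot_file_path(mountpoint, snapshot, relative_file):
    """Path of a file inside the read-only snapshot view of a mounted dataset."""
    short_name = snapshot.split("@", 1)[1]
    return f"{mountpoint.rstrip('/')}/.zfs/snapshot/{short_name}/{relative_file}"


def sync_command(source_file, remote_target):
    """Assumed transfer step: copy the frozen image with rsync."""
    return ["rsync", "--archive", "--inplace", source_file, remote_target]


if __name__ == "__main__":
    # All names below are placeholders for illustration only.
    snap = snapshot_name("tank/vm")
    image = snapshot_file_path("/tank/vm", snap, "wiki.qcow2")
    print(shlex.join(snapshot_command("tank/vm")))
    print(shlex.join(sync_command(image, "backup@backup.example.org:/backup/vm/")))
```

Because the transfer reads from the snapshot and not from the live image, the VM can be started again right after the snapshot has been taken.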
Third Line of Defense: Central Backup Server and Warm Standby
In addition to synchronizing ZFS snapshots of VM images across physical sites, I use BorgBackup to automatically back up critical user data, e.g. from my NextCloud systems, once a day. Should I have a fatal failure at one site, I can apply this data to a warm standby system via a differential restore. This is important, as we are talking about hundreds of gigabytes that can’t be pushed from the backup server to a new site instantly.
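For readers who have not used BorgBackup, a daily job could look roughly like the following sketch. It assumes the Borg 1.x command line syntax (`repo::archive`). The repository URL, the data path and the retention values are placeholders, not my real configuration. `{hostname}` and `{now:...}` are placeholders that Borg itself expands when it creates the archive.

```python
import shlex


def borg_create_command(repo, paths, compression="lz4"):
    """Command for one deduplicated archive named after host and date."""
    # Borg expands {hostname} and {now:...}; Python passes them through as-is.
    archive = repo + "::{hostname}-{now:%Y-%m-%d}"
    return ["borg", "create", "--stats", "--compression", compression,
            archive, *paths]


def borg_prune_command(repo, keep_daily=7, keep_weekly=4):
    """Command that thins out old archives (retention values are assumptions)."""
    return ["borg", "prune",
            "--keep-daily", str(keep_daily),
            "--keep-weekly", str(keep_weekly),
            repo]


if __name__ == "__main__":
    # Hypothetical repository and data directory.
    repo = "ssh://backup@backup.example.org/./nextcloud"
    print(shlex.join(borg_create_command(repo, ["/srv/nextcloud/data"])))
    print(shlex.join(borg_prune_command(repo)))
```

Since Borg deduplicates, each daily run only transfers the chunks that have changed since the previous archive, which is what keeps a daily backup of hundreds of gigabytes practical.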
Fourth Line of Defense: Offline Backups
‘Central backup’ and ‘online’ sound great, but I can imagine scenarios in which the backup server is affected by an outage or attack at the same time as my running services, and restoration from the backup server is not possible. For such scenarios, I periodically create manual offline backups of pretty much all of my data to 16 TB disk drives with scripts, while keeping a close eye on the process. These drives are then stored off-site and not connected to any kind of server. I would perhaps lose a week’s worth of data if I ever needed such backups, but the scenarios that would make a fallback to these backups necessary are rather dire and hence make the loss of a week’s data seem a rather small issue in comparison.
Fifth Line of Defense: Document File Snapshots
I’m sure you know the following scenario: You’ve been working on an important document and at some point something has gone really wrong and you need an earlier version. For such a scenario I have a script on my notebooks that runs every 30 minutes and searches for text, spreadsheet and presentation documents that have changed within the last 30 minutes. A copy of these files is then put into a ZIP archive that has the current time and date in its filename. As such documents are usually rather small, the ZIP files can be stored for a very long time. This way, it is also possible to quickly answer questions like: Can I get back a text document in the state it was in two months ago? You can even find out on which days you have worked on the file and compare what has changed between two copies. I don’t use this often, but getting these kinds of details and older versions of files has helped me numerous times in the past decade.
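The logic of such a script is simple enough to sketch in a few lines of Python. This is an illustration of the approach described above, not my actual script. The list of file extensions, the function names and the archive naming scheme are assumptions.

```python
import time
import zipfile
from datetime import datetime
from pathlib import Path

# Assumed set of document types; adjust to what you actually work with.
DOC_EXTENSIONS = {".odt", ".ods", ".odp", ".docx", ".xlsx", ".pptx", ".txt"}


def find_recent_documents(root, max_age_seconds=30 * 60, now=None):
    """Return documents below root that were modified within the time window."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file()
        and p.suffix.lower() in DOC_EXTENSIONS
        and p.stat().st_mtime >= cutoff
    )


def snapshot_documents(root, archive_dir, max_age_seconds=30 * 60, now=None):
    """Copy recently changed documents into a ZIP named after date and time.

    Returns the path of the new archive, or None if nothing has changed.
    """
    now = time.time() if now is None else now
    changed = find_recent_documents(root, max_age_seconds, now)
    if not changed:
        return None
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.fromtimestamp(now).strftime("%Y-%m-%d_%H-%M-%S")
    archive = archive_dir / f"docs_{stamp}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in changed:
            # Store paths relative to root so archives stay comparable.
            zf.write(path, arcname=path.relative_to(root).as_posix())
    return archive
```

Called every 30 minutes from cron or a systemd timer, for example as `snapshot_documents(Path.home() / "Documents", Path.home() / "doc-snapshots")`, this produces one small archive per interval in which something has changed. Listing the archives that contain a given file then shows on which days it was worked on.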
Sixth Line of Defense: Cold Standby NVMes
Going further down the beanstalk, I also have protection against the physical loss or failure of a notebook while traveling. I usually carry a spare notebook in my suitcase and an NVMe drive in an external casing that holds a 1:1 copy of my complete system. All encrypted, of course. The data on the ‘notebook on the NVMe’ is brought up to date before I travel, so losing a notebook becomes a non-issue again. And believe it or not, I once forgot my backpack on a train! Panic set in when I noticed that the notebook was missing, but by then, the train was gone. While I didn’t quite fully relax, I just took the backup notebook out of the suitcase, inserted the drive with the data I had updated just the day before, and I was up and running again. In another instance, the notebook of a family member was stolen in a café. You can imagine the panic. But the last automatic differential backup of the data on that notebook over the Internet was just 2 hours old, and that’s the only data that was lost. I packed a spare notebook and the latest backup, jumped on a train, and service was restored within 5 hours.
Summary
So here we go, lots of approaches to protect against many scenarios for data loss, from hardware faults and software issues to theft and loss. I wish there were one simple way of doing it all, but if you want to protect yourself against different kinds of scenarios, a single strategy will not be enough.