History in 75 Racks – The Internet Archive Gives Insights

As the saying goes, ‘the cloud is just someone else’s computer’. Unfortunately, one usually does not get to see how big ‘that computer’ behind a cloud service actually is. The Internet Archive has recently posted a video in which Jonah Edwards gives an interesting insight into the size of ‘their cloud’.

The Internet Archive wants to be independent and hence decided from the start not to put their data with a hyperscaler but to run the required infrastructure themselves. They are in good company with that choice, and as a result they know exactly what their cloud looks like.

As they show in the video, they run four data centers in the San Francisco Bay Area with a total of 75 racks, mostly for disk space. Those racks hold 750 servers that host 1,300 VMs, which in turn manage 30,000 storage devices, of which 20,000 are spinning disks in paired storage. That’s almost 200 PB of raw storage capacity. But things are of course not static: the Internet is growing and so is the archive. Currently, they are growing at a rate of 10-12 PB of raw capacity per quarter, which translates into 5-6 PB of real data due to mirroring.
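To make the raw versus real numbers a bit more tangible, here is a small back-of-the-envelope sketch. It only uses the figures quoted above; the mirroring factor of 2 is my assumption, inferred from the ‘paired storage’ remark and the halving of 10-12 PB to 5-6 PB, and the function name is made up for illustration.

```python
# Figures quoted in the post.
RACKS = 75
SERVERS = 750
RAW_CAPACITY_PB = 200          # "almost 200 PB", rounded up here
RAW_GROWTH_PB_PER_QUARTER = (10, 12)

# Assumption: every byte is stored twice (simple mirroring).
MIRROR_COPIES = 2


def usable_pb(raw_pb, copies=MIRROR_COPIES):
    """Return the amount of real data that fits into raw_pb of mirrored storage."""
    return raw_pb / copies


servers_per_rack = SERVERS / RACKS
usable_total_pb = usable_pb(RAW_CAPACITY_PB)
usable_growth_pb = tuple(usable_pb(g) for g in RAW_GROWTH_PB_PER_QUARTER)

print(f"Servers per rack:        {servers_per_rack:.0f}")
print(f"Usable capacity:         {usable_total_pb:.0f} PB")
print(f"Real growth per quarter: {usable_growth_pb[0]:.0f}-{usable_growth_pb[1]:.0f} PB")
```

So under that assumption, the ‘almost 200 PB’ of raw space hold roughly 100 PB of actual content, on an average of 10 servers per rack.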

Growing disk space meets growing usage of the archive, and they are currently serving content to the world at an outgoing data rate of 60 Gbit/s and rising. I did a quick traceroute to them and indeed, the video linked to above comes directly to my notebook in Germany from San Francisco via 11 hops and with a round-trip time of 160 ms.
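For a feeling of what 60 Gbit/s means in terms of volume, here is a quick conversion. It assumes, purely for illustration, that the quoted rate is sustained around the clock, which the video does not claim; real traffic will fluctuate over the day.

```python
# Outgoing data rate quoted in the post.
RATE_GBIT_PER_S = 60

SECONDS_PER_DAY = 24 * 60 * 60


def tb_per_day(rate_gbit_per_s):
    """Convert a constant rate in Gbit/s into decimal terabytes per day."""
    gbit_per_day = rate_gbit_per_s * SECONDS_PER_DAY
    gbyte_per_day = gbit_per_day / 8      # 8 bits per byte
    return gbyte_per_day / 1000           # 1000 GB per TB


daily_tb = tb_per_day(RATE_GBIT_PER_S)
print(f"{RATE_GBIT_PER_S} Gbit/s sustained is about {daily_tb:.0f} TB per day")
```

In other words, if that rate were constant, well over half a petabyte would leave their data centers every day.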

Back to the hardware: The challenge for the Internet Archive is, as for any data center, that disks have a limited lifetime that is not measured in decades, so they need to be replaced every few years. This of course comes on top of the additional data that ends up in the archive every day. Increasing disk sizes do help them, but it’s hard to stay ahead of the curve. With current 16 TB drives, a single copy of the Internet Archive would fit into 15 racks, i.e. 30 racks with redundancy. That’s far less than the 75 racks that are in use today, but they are running their servers for up to 9 years, so they have lots of disks that are far smaller than 16 TB.
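The rack numbers can be sanity-checked with a little arithmetic. The sketch below assumes a single copy of the archive is around 100 PB (half of the roughly 200 PB raw capacity, assuming two mirrored copies); the resulting drives-per-rack figure is therefore only what the quoted numbers imply, not something stated in the video.

```python
import math

# Figures quoted in the post.
DRIVE_SIZE_TB = 16
RACKS_SINGLE_COPY = 15
RACKS_IN_USE = 75

# Assumptions: one copy is about 100 PB, and redundancy means two copies.
SINGLE_COPY_PB = 100
COPIES = 2


def drives_needed(data_pb, drive_tb):
    """Return the number of drives needed to hold data_pb (decimal units)."""
    return math.ceil(data_pb * 1000 / drive_tb)


drives_single_copy = drives_needed(SINGLE_COPY_PB, DRIVE_SIZE_TB)
implied_drives_per_rack = drives_single_copy / RACKS_SINGLE_COPY
racks_with_redundancy = RACKS_SINGLE_COPY * COPIES

print(f"Drives for one copy:         {drives_single_copy}")
print(f"Implied drives per rack:     {implied_drives_per_rack:.0f}")
print(f"Racks with redundancy:       {racks_with_redundancy} of {RACKS_IN_USE} in use today")
```

Under these assumptions, one copy needs a bit over 6,000 drives of 16 TB, i.e. around 400 drives per rack, and the redundant setup would occupy 30 of today’s 75 racks.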

There’s tons more info in the video, so go have a look and enjoy the stats! I am very happy and also a bit proud that I’m regularly donating to this great project. After all, this cloud, like any other, does not run on thin air, and, in this particular case, not on advertising either!