Backbone Bottlenecks

After recently upgrading to VDSL vectoring in Cologne, with a 100 Mbit/s downlink and a 40 Mbit/s uplink, I soon figured out what to do with the additional uplink speed: advanced synchronization with my network behind a fiber line in Paris. Equally soon I noticed that my access link is no longer the bottleneck; under-dimensioned backbone links are. So I had to find a way to creatively route my traffic around the problem.

While most of my servers synchronize automatically overnight, usually at the full 40 Mbit/s my lines have to offer, I also have some machines that I occasionally interact with during the day and in the evening, and data transfers of several gigabytes at a time are no exception. In other words, if there is a throughput problem, it becomes apparent immediately. And I soon noticed that, especially during evening hours, the backbone links can’t support such data rates by far and there is significant packet loss. To investigate further, I started running a script that transferred data once an hour to see what the throughput was. Quite to my surprise, it showed that throughput is usually o.k. until around 8 pm, at which point it drops sharply from 40 Mbit/s to 2-4 Mbit/s or even less, accompanied by a massive increase in packet loss. Things get back to normal at around midnight.
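For illustration, an hourly measurement of this kind could look roughly like the following Python sketch. It is not the script I actually ran: the names `mbit_per_s`, `measure` and `run_hourly` are made up for this example, and the `transfer` callable is a placeholder for whatever actually moves the data (scp, rsync, ...) and reports the number of bytes moved.

```python
import time
from typing import Callable


def mbit_per_s(num_bytes: int, seconds: float) -> float:
    """Convert a byte count and a duration into megabits per second."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return num_bytes * 8 / seconds / 1_000_000


def measure(transfer: Callable[[], int]) -> float:
    """Time one transfer; `transfer` must return the number of bytes moved."""
    start = time.monotonic()
    num_bytes = transfer()
    # Guard against a zero reading on clocks with coarse resolution.
    elapsed = max(time.monotonic() - start, 1e-9)
    return mbit_per_s(num_bytes, elapsed)


def run_hourly(transfer: Callable[[], int], samples: int,
               interval_s: float = 3600.0) -> None:
    """Take `samples` measurements, one per interval, and log each result."""
    for _ in range(samples):
        rate = measure(transfer)
        print(f"{time.strftime('%Y-%m-%d %H:%M')}  {rate:.1f} Mbit/s")
        time.sleep(interval_s)
```

Moving 5,000,000 bytes in one second corresponds to the 40 Mbit/s my uplink delivers on a good day, so the evening drop shows up as values of 4 or below in a log like this.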

Obviously this is quite unsatisfactory, so I started experimenting. Instead of communicating directly between my Cologne and Paris sites, I established SSH tunnels from each end to a server located in the cloud. When using this detour I get full speed at any time of day. A traceroute from each end shows that different backbone links are used, and these do not seem to suffer from congestion. I’ve been observing this for at least two months now and things have not really improved. Somebody really doesn’t care that their transit or peering is hopelessly under-dimensioned for evening traffic.
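One way such a detour can be put together is sketched below, again in Python. This is only an assumption about how two tunnels to a relay might be combined, not a description of my exact setup: the relay name `user@relay.example.net` and port 2222 are hypothetical. One end publishes its local SSH server on the relay with `ssh -R`, and the other end pulls that relay port to its own machine with `ssh -L`. The helpers only build the command lines; they do not start anything.

```python
import shlex
from typing import List


def reverse_tunnel_cmd(relay: str, relay_port: int,
                       local_port: int = 22) -> List[str]:
    """Expose this machine's local_port on the relay's loopback relay_port."""
    return ["ssh", "-N", "-R", f"{relay_port}:localhost:{local_port}", relay]


def forward_tunnel_cmd(relay: str, relay_port: int,
                       local_port: int) -> List[str]:
    """Make the relay's loopback relay_port reachable on our local_port."""
    return ["ssh", "-N", "-L", f"{local_port}:localhost:{relay_port}", relay]


if __name__ == "__main__":
    relay = "user@relay.example.net"  # hypothetical cloud server
    # Run on one end (e.g. Paris): publish its SSH server on the relay.
    print(shlex.join(reverse_tunnel_cmd(relay, 2222)))
    # Run on the other end (e.g. Cologne): reach it via localhost:2222.
    print(shlex.join(forward_tunnel_cmd(relay, 2222, 2222)))
```

With both tunnels up, a transfer from the second end to `localhost:2222` travels over the two paths to the relay rather than over the congested direct route.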

I could start complaining now, but what are the chances that things would change? Probably not very high, so I won’t even bother. Instead, I’ve made sure my semi-automatic man-in-the-middle SSH setup works without too much hassle, and I’d rather work around the problem creatively.