Wireless and Mission Critical

I am on the road quite often and, as most of you have figured out by now, a heavy user of 3G networks for Internet access. While I generally like the experience, network outages like the recent two-and-a-half-day nationwide Internet access blackout in the Vodafone Germany network send shivers down my spine. After all, we are not talking about a third-class operator but one that claims to be a technology leader in the sector. As I use Vodafone Internet access a lot, I was glad I was only affected for half a day, having been in a DSL safe haven for the rest of the time. If I had been on the road, however, this would have been a major disaster for me.

I wonder whether the company that delivered the equipment that paralyzed the Vodafone network for two and a half days has to pay for the damage caused. If it’s a small company, such a prolonged outage with millions of euros in lost revenue could easily put it out of business. And that doesn’t even consider the damage to the image of all parties involved and the financial loss of companies relying on Vodafone to provide Internet access. The name of the culprit was not released to the press, but those working in the industry know very well what happened. Hard times on the horizon for certain marketing people…

Vodafone is certainly not alone in facing such issues, as I can observe occasional connection issues with other network operators as well. These, however, are usually short and range from a couple of minutes to an hour or so. Bad enough.

To me, this shows several things:

There is not a lot of redundancy built into the network.
Disaster recovery and upgrade procedures are not very well thought through, as otherwise such prolonged outages would not happen.
Short outages might be caused by software bugs and resetting devices.
I think we might have reached a point where the capacity of core network nodes has grown so large that the failure of a single device triggers a nationwide outage.

So maybe operators should start thinking in earnest about reversing the trend a bit and consider decentralization again to reduce the effect of fatal equipment failures. And maybe price should not be the only criterion in the future. Higher reliability and credible disaster recovery mechanisms that work not only on paper might be worth something as well. An opportunity for network vendors to distinguish themselves?
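
To put a rough number on the decentralization argument, here is a minimal sketch. It assumes subscribers are spread evenly across the core nodes and that there is no failover at all, which is a deliberate simplification; the function name and the node counts are hypothetical and do not describe any real operator's network.

```python
def affected_share(num_core_nodes: int, failed_nodes: int = 1) -> float:
    """Share of subscribers without service when some core nodes fail.

    Assumption (hypothetical model): subscribers are spread evenly over
    the core nodes and no other node takes over for a failed one.
    """
    if num_core_nodes < 1:
        raise ValueError("need at least one core node")
    if not 0 <= failed_nodes <= num_core_nodes:
        raise ValueError("failed_nodes must be between 0 and num_core_nodes")
    return failed_nodes / num_core_nodes


if __name__ == "__main__":
    # Compare a fully centralized core with two more distributed layouts.
    for nodes in (1, 4, 16):
        share = affected_share(nodes)
        print(f"{nodes:2d} core node(s): {share:.0%} of subscribers affected")
```

With a single central node, one failure takes out everybody; with more, smaller nodes the same failure only hits a fraction of the subscribers. That is the trade-off against the lower price of a few very large nodes.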

8 thoughts on “Wireless and Mission Critical”

swordfishBob says:

June 4, 2008 at 9:35 am

As a business customer, we subscribe to our telco’s notifications of planned outages. Most of these (a list every couple of months) are described as “BTS reparenting” with estimated duration of 5 minutes within a 6 hour window at night. I assume that means software upgrades are performed on offline equipment, so a failed cutover can be immediately reversed.
It’d be interesting to know how much redundancy there is among the core equipment.
Dan Iordanescu says:

June 4, 2008 at 12:14 pm

So, Martin, what happened? Please, tell us what you know. I heard nothing about it so far.

2.5 days of no Internet on a national scale doesn’t happen even in third world countries. It must’ve been something very bad in the GGSN IP core, and redundancy didn’t kick in.

I guess it was only for data, otherwise Mr. Arun Sarin would’ve been there on the next plane himself.

Thanks,
Dan.
Reda says:

June 4, 2008 at 6:45 pm

I don’t mean to be disrespectful and I apologise in advance if I come across that way, but which experience do you base your consideration on? You didn’t specify if it’s based on your experience of something in particular, or it’s just your perception as a user? Also, did you mean all operators and vendors are affected or just some?
Chris Vail says:

June 4, 2008 at 7:42 pm

I have had to explain to coworkers from Russia why California, USA has electrical power outages from time to time; apparently the Soviet Union “spared no expense” in creating their electrical power grid. Given where the Soviet Union is today (nonexistent), you have to wonder what the balance is.
Martin says:

June 5, 2008 at 12:53 pm

Hi Reda,

Thanks for your comment. I am not quite sure I understand your question. The angle doesn’t really seem relevant to me: a 2.5 day outage is a 2.5 day outage. In my opinion it is the “oh, it’s only the data side” attitude mentioned by Dan above that makes some network operators accept higher failure risks. I think this attitude has to be revised. Internet access has become mission critical and is no longer just a nice-to-have feature (at least for some…)

Cheers,
Martin
Reda says:

June 6, 2008 at 3:37 pm

Hi Martin,
Sorry, in my comment I forgot to mention that I was referring to your bullet points. Just to clarify my disagreement with your post:
* There is not a lot of redundancy built into the network.
My experience is that there is redundancy built into the network for almost everything.
* Disaster recovery and upgrade procedures are not very well thought through, as otherwise such prolonged outages would not happen.
Could be true. However, I just want to add two points:
1) Different operators buy different response times from the vendors, who most of the time provide the support for their networks.
2) The recovery depends on the experience of the engineers, not only on procedures, and cost-cutting exercises are ongoing in all companies these days, which affects quality.
* Short outages might be caused by software bugs and resetting devices.
Not necessarily true.
* I think we might have reached a point where the capacity of core network nodes has grown so large that the failure of a single device triggers a nationwide outage.
My 2 cents are that we have reached a point of high volume on 3G networks, which requires more attention from the network operators. It’s like in life: the higher the risk, the more attention and mitigating action are required.

Having said this, I might be biased in my answers because I work for a vendor 😉
Chris Vail says:

June 6, 2008 at 10:03 pm

>* Short outages might be caused by software bugs and resetting devices.
>Not necessarily true

Once upon a time I fixed a bug in an inverse multiplexer used by MCI to carry telephone traffic. The bug caused a memory leak when SNMP polling was enabled; when memory was all used up (after a couple of days), the device reset (dropping all current calls). Since MCI found the problem in their testing, I got a bonus for finding and fixing the bug.

This was back in the day when memory was expensive, and an embedded device would not have GBs of free memory. Today such a problem might pass QA, but manifest after a year or so of use.
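
A back-of-the-envelope sketch of that point, assuming a constant leak per SNMP poll. All figures (free memory, leak size, polling rate) are hypothetical and only chosen to illustrate the scale; they are not the values of the actual device.

```python
def days_until_exhaustion(free_bytes: int,
                          leak_bytes_per_poll: int,
                          polls_per_minute: float) -> float:
    """Days until free memory is used up by a constant leak per poll."""
    if leak_bytes_per_poll <= 0 or polls_per_minute <= 0:
        raise ValueError("leak size and polling rate must be positive")
    leak_per_day = leak_bytes_per_poll * polls_per_minute * 60 * 24
    return free_bytes / leak_per_day


MIB = 1024 ** 2

if __name__ == "__main__":
    # Hypothetical leak: 1 KiB lost per poll, one poll per minute.
    old_device = days_until_exhaustion(4 * MIB, 1024, 1)      # small embedded box
    new_device = days_until_exhaustion(512 * MIB, 1024, 1)    # roomier modern box
    print(f"4 MiB free:   reset after about {old_device:.1f} days")
    print(f"512 MiB free: reset after about {new_device:.0f} days")
```

Under these assumptions the same leak shows up within a few days on the small device but only after roughly a year on the larger one, well beyond the length of a typical test run.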
komatineni says:

June 9, 2008 at 3:24 am

IMO, most network operators have
1. Highly redundant networks
2. Lots of BCP/DR stuff
3. Expert engineers/tech staff
but what they miss is
1. The end-to-end portion (we have highly experienced engineers for mobile CS networks and IP experts, but who is going to ‘translate’ the stuff between them?)
2. Verification or a dry run of the BCP
3. Testing in a test bed, skipped to drive the cost down. 🙁
