Wednesday, November 04, 2009

Twice in One Week - Disaster Recovery beyond our Control

It has been a busy week for disaster recovery so far - we had an outage on Monday and we had another one today.


Monday's Email Outage
On Monday, we lost our e-mail services. No, it had nothing to do with the Lotus Domino server. That was fine.

At first, it looked as if we'd forgotten to renew our domain name. We followed this up with a back-and-forth check of various DNS services out there on the Web, our domain registrar and our Internet service providers. All seemed OK with our domain name, but there was definitely something weird happening.

Eventually we discovered that our Internet service provider had their DNS running off a domain which they had forgotten to renew. Since they were providing our primary DNS, all of our inbound mail was getting confused when it went to resolve our domain.

After we told them (yes, they were unaware of the situation, even though it had started during the night and wasn't noticed immediately in the morning), they quickly got to work renewing their domain. Of course, given the delays associated with DNS propagation, our problems persisted in one form or another for several hours.

Years ago, such a problem would've been effectively "over" within a matter of hours but unfortunately, more and more companies are outsourcing their services overseas, and it takes more than fixing the local domain services to resolve the problem.
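
Looking back, the weak link was a dependency we weren't watching: the nameservers for our domain lived under somebody else's domain, and when that stopped resolving, so did we. Here is a minimal sketch of the kind of check that would have flagged it sooner. It simply tries to resolve each nameserver hostname and reports the ones that fail. The function names are made up for illustration, and the nameserver list has to be supplied by hand (for example, copied from your registrar's records); this is not what we actually ran.

```python
import socket
from typing import Callable, Dict, Iterable, List, Optional


def default_resolver(hostname: str) -> Optional[str]:
    """Return an IPv4 address for hostname, or None if it cannot be resolved."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def check_nameservers(
    nameservers: Iterable[str],
    resolve: Callable[[str], Optional[str]] = default_resolver,
) -> Dict[str, Optional[str]]:
    """Map each nameserver hostname to its address (None means unresolvable)."""
    return {ns: resolve(ns) for ns in nameservers}


def unresolvable(
    nameservers: Iterable[str],
    resolve: Callable[[str], Optional[str]] = default_resolver,
) -> List[str]:
    """List the nameserver hostnames that could not be resolved."""
    results = check_nameservers(nameservers, resolve)
    return [ns for ns, address in results.items() if address is None]
```

Calling `unresolvable([...])` with your own nameserver hostnames from a scheduled job, and raising an alert whenever the list is non-empty, would be one way to use it. The resolver is passed in as a parameter so that the check can be tried out against a fake lookup table without touching the network.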


Today's Building Outage
Our Wednesday problem occurred while I was at lunch. I hurried back and arrived at a darkened building. Luckily, since we moved offices, we aren't as high up as we used to be and I only had to run up six storeys' worth of stairs (although immediately after lunch, it felt like more than that).

It turned out that the entire local grid had lost power. Our Domino server and our main file server were happily running off UPS, but unfortunately the UPS handling our communications gear was not up to scratch. Not that it mattered, because there was no way that the UPS would be able to power our systems for more than 30 minutes. Even if that had been possible, the temperature inside the computer room was climbing rapidly now that the air-conditioning was off.

We had no choice but to start shutting down the servers. It took us a little while to make that decision because we knew we had a little time and we were hoping for the power to come back on. Of course, as soon as we got halfway through the shutdown procedure, the power came back on. This was after a 45-minute outage in the centre of Sydney's CBD.

Once again, the problem was "environmental" and out of our control. We could have switched to our offsite systems but it is a hard call to make because although our systems are clustered, we have a few special requirements which make a partially manual cut-over desirable. When the cut-over is not automatic, it becomes very difficult to make a decision as to when to flip the switch.


Out of Control Problems
The thing that these stories bring to mind is the fact that I keep reading anti-cloud computing "horror" stories from various vendors. In particular, they talk about Google's Gmail outages. I don't personally understand how people can think that cloud computing is any more or less unsafe than normal computing. As I said before, the problems had nothing to do with the Domino server. In fact, I can't remember a time when we've had an outage due to the Domino server.

I can remember plenty of times when we had hardware failures, ISP failures, power and air-conditioning failures, gas leaks and DNS failures. We've had problems with Anti-Virus and Anti-Spam services running on the Domino server - and when we moved them off the server, they still caused us the occasional problem at the gateway. We've even had problems because of Windows itself and device driver updates. It's never Domino though. The server product is entirely stable.

In some respects, our Domino mail is in exactly the same position as Gmail. It's not the product that is at fault; it's the underlying infrastructure - and it's out of our hands.

2 comments:

Graham Dodge said...

Gavin, my concern is not around the relative stability of local v. outsourced facilities but rather that when the Effluent finally does hit the Air Movement Device with outsourced services ...

1. You have no ability to know what went wrong or how long it will take to fix it and are forced to ring a support person who also probably doesn't know what is happening, and ...

2. When services are finally resumed you will not be first in the queue to have your devices restored. All of the Big Boys on the Premium Support will get their stuff on-line first and you will only be notified of progress when the Hosting Org phone jockeys reach 'G' for Gavin on their list.

At least with local hardware you can quickly diagnose the problem and (hopefully) predict how long it should take to resume normal operations.

Gavin Bollard said...

Graham,

I'd say that was a fair call except that our general services aren't outsourced to the cloud but...

1. We only discovered what went wrong through a bit of investigative work ourselves. The ISP had no clue.

2. Notifications? What are they? When it comes on, it's fixed. Our ISP contacted me several hours after it was resolved.

I don't see that what I have at the moment provides a huge advantage over cloud services.