
Twice in One Week - Disaster Recovery beyond our Control

It has been a busy week for disaster recovery so far - we had an outage on Monday and we had another one today.


Monday's Email Outage
On Monday, we lost our e-mail services. No, it had nothing to do with the Lotus Domino server. That was fine.

At first, it looked as if we'd forgotten to renew our domain name. This was quickly followed by a back-and-forth check of various DNS services out there on the Web, our domain registrar and our Internet service providers. All seemed OK with our domain name, but there was definitely something weird happening.

Eventually we discovered that our Internet service provider had their DNS running off a domain which they had forgotten to renew. Since they were providing our primary DNS, all of our inbound mail was getting confused when it went to resolve our domain.
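For anyone wanting to rule out this kind of failure quickly, here is a minimal sketch of the check, not the procedure we actually followed. It assumes you already know the hostnames of the name servers that serve your domain, and simply tests whether each of those hostnames still resolves. The function name, the `resolve` parameter and the hostnames in the usage comment are all hypothetical.

```python
import socket


def unresolvable_hosts(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that the resolver cannot turn into an address.

    `resolve` is injectable (an assumption made for this sketch) so the
    check can be exercised without touching the network.
    """
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed


# Usage (hypothetical hostnames; substitute your domain's real NS hosts):
#   unresolvable_hosts(["ns1.example-isp.invalid", "ns2.example-isp.invalid"])
```

If the name server hostnames themselves fail to resolve while your own registration is current, the fault most likely sits with whoever owns the name servers' domain, which is what happened to us.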

After being told by us (yes, they were unaware of the situation even though it had occurred during the night and it wasn't noticed immediately in the morning), they quickly got to work renewing their domain. Of course, given the delays associated with DNS propagation, our problems persisted in one form or another for several hours.

Years ago, such a problem would've been effectively "over" within a matter of hours but unfortunately, more and more companies are outsourcing their services overseas, and it takes more than fixing the local domain services to resolve the problem.


Today's Building Outage
Our Wednesday problem occurred while I was at lunch. I hurried back and arrived at a darkened building. Luckily, since we moved offices, we aren't as high up as we used to be and I only had to run up six storeys' worth of stairs (although immediately after lunch, it felt like more than that).

It turned out that the entire local grid had lost power. Our Domino server and our main file server were happily running off UPS, but unfortunately the UPS handling our communications gear was not up to scratch. Not that it mattered, because there was no way that the UPS would be able to power our systems for more than 30 minutes. Even if that had been possible, the temperature inside the computer room was rapidly climbing now that the air-conditioning was off.

We had no choice but to start shutting down the servers. It took us a little while to make that decision because we knew we had a little time and we were hoping for the power to come back on. Of course, as soon as we got halfway through the shutdown procedure, the power came back on. This was after a 45-minute outage in the centre of Sydney's CBD.
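The "how long can we wait?" question can be reduced to simple arithmetic. The sketch below is only an illustration of that reasoning, not a description of our actual procedure: it takes the UPS runtime (about 30 minutes in our case) and subtracts the time a clean shutdown needs plus a safety margin. The shutdown duration, the margin and the timestamps in the usage comment are hypothetical values.

```python
from datetime import datetime, timedelta


def latest_shutdown_start(outage_start, ups_runtime, shutdown_duration, safety_margin):
    """Return the latest moment at which a clean shutdown can still begin.

    All durations are timedeltas. The figures passed in are assumptions you
    would need to measure for your own UPS and servers.
    """
    return outage_start + ups_runtime - shutdown_duration - safety_margin


# Usage with hypothetical figures: power lost at 13:00, 30 minutes of UPS,
# 10 minutes to shut everything down cleanly, 5 minutes held in reserve.
deadline = latest_shutdown_start(
    outage_start=datetime(2000, 1, 1, 13, 0),
    ups_runtime=timedelta(minutes=30),
    shutdown_duration=timedelta(minutes=10),
    safety_margin=timedelta(minutes=5),
)
print(deadline.strftime("%H:%M"))
```

Having the deadline worked out in advance takes some of the agonising out of the decision: you wait until that time and no longer.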

Once again, the problem was "environmental" and out of our control. We could have switched to our offsite systems but it is a hard call to make because although our systems are clustered, we have a few special requirements which make a partially manual cut-over desirable. When the cut-over is not automatic, it becomes very difficult to make a decision as to when to flip the switch.


Out of Control Problems
The thing that these stories bring to mind is that I keep reading anti-cloud computing "horror" stories from various vendors. In particular, they talk about Google's Gmail outages. I don't personally understand how people can think that cloud computing is any more or less unsafe than normal computing. As I said before, the problems had nothing to do with the Domino server. In fact, I can't remember a time when we've had an outage due to the Domino server.

I can remember plenty of times when we had hardware failures, ISP failures, power, air-conditioning, gas leaks and DNS failures. We've had problems with Anti-Virus and Anti-Spam services running on the Domino server - and when we moved them off the server, they still caused us the occasional problem at the gateway. We've even had problems because of Windows itself and device driver updates. It's never Domino though. The server product is entirely stable.

In some respects, our Domino mail is in exactly the same position as Gmail. It's not the product that is at fault, it's the underlying infrastructure - and it's out of our hands.

Comments

Graham Dodge said…
Gavin, my concern is not around the relative stability of local v. outsourced facilities but rather that when the Effluent finally does hit the Air Movement Device with outsourced services ...

1. You have no ability to know what went wrong or how long it will take to fix it and are forced to ring a support person who also probably doesn't know what is happening, and ...

2. When services are finally resumed you will not be first in the queue to have your devices restored. All of the Big Boys on the Premium Support will get their stuff on-line first and you will only be notified of progress when the Hosting Org phone jockeys reach 'G' for Gavin on their list.

At least with local hardware you can quickly diagnose the problem and (hopefully) predict how long it should take to resume normal operations.
Gavin Bollard said…
Graham,

I'd say that was a fair call except that our general services aren't outsourced to the cloud but...

1. We only discovered what went wrong through a bit of investigative work ourselves. The ISP had no clue.

2. Notifications? What are they? When it comes on, it's fixed. Our ISP contacted me several hours after it was resolved.

I don't see that what I have at the moment provides a huge advantage over cloud services.
