Tuesday, April 15, 2008

Firewalls and Other Single Points of Failure

Last Friday night, we installed a new firewall to replace our "unsupported" Symantec model.

A Bizarre Sense of Timing
Curiously, we'd been complaining to Symantec for years about several of their systems and had been slowly ripping and replacing. Our complaints over their lack of interest in the firewall they sold us had been fairly strong over the last six months but had met with limited responses.

I got a call out of the blue on Friday afternoon from Symantec's "upper" management telling us that they were now committed to sorting out the firewall for us.

I had to tell them that their firewall only had three hours of life left. I didn't feel guilty though. I don't think that anyone should be willing to tolerate a lack of service.

A Simple Swap-over
We cleverly timed the firewall swap-over project to coincide with a weekend where there was a scheduled power outage. This meant that instead of having one firewall (point of failure) on site, we had two.

We spent a while reviewing the rules and it should have been a simple swap-over of cables. It should have been....

but it wasn't.

At first the firewall wouldn't connect to the internet. Fine, it's a reasonable assumption that you'd need to cycle power on the router. The problem... well, since we had backup for the firewall, it follows that the router becomes the single point of failure.

The router didn't come up properly - all lights remained solid. We tried a number of power cycles but it steadfastly refused to come up.

Luckily it was a managed router and we were still well within our time limit for 24x7 service.

The Helpless Helpdesk
We rang our ISP's helpdesk but went through to another state - Victoria. Apparently they couldn't raise anyone in their Sydney office and the technician suggested that they may have left their 24x7 helpdesk to go for "Friday drinks". (sigh). They promised to get in touch as soon as possible.

We got a call back over 90 minutes later and received the incredibly technical description that "your router has fried".

Great! It's a managed router, so kindly deliver a new one... Right???

Wrong! Sorry, we're fresh out of routers. (It felt like we were in Monty Python's Cheese Shop sketch.) We were told that it was unlikely that they'd be able to come up with a router before Monday.

"OK," we said, "not what we were hoping for... what time do your techs start on Monday?" I think we were all stunned by the response of 9am. (I'm a 6.30am starter myself.)

We then started trying to negotiate, suggesting that if we could buy a replacement router over the weekend, we could install it ourselves if they'd give us the relevant passwords and settings. "Sure", they replied, "but it will cost $350 to change you from a managed to unmanaged router". We figured that this was an acceptable cost and told the technicians. They quickly killed the idea by telling us that they wouldn't be able to give us any connection info.

We were stuck.

Luckily, we had one of those "super-techs" onsite at the time, you know, the type who likes to open all their equipment just to see where the motherboard was made... He decided to open our "fried" router and prod around for bloated capacitors. There were none.

Of course, one of the big rules about electronic equipment is that sometimes, when it's hot, it just wants a bit of nudity. With the cover off and the circuit board exposed, we plugged the router back in and it immediately hummed into life.

The Power Outage
So now we had a working firewall and connection but an impending power shutdown lasting up to 6 hours... would the equipment come back online? Would the motherboards cool down and pop all their solder? I had to wait until Sunday to find out.

On Sunday morning, I arrived at work and started switching all the equipment back on. Desktops, printers, etc... it all came back... all except the computer room, that is.

It turned out that the building's power had flipped a fuse, so I had to call the electricians back. Once fixed, everything, even our very sick router, came back.

I was going to suggest that the moral of the story is that you need to have more than one piece of hardware that matches the spec of devices which are single points of failure but I think the moral is really... whenever there is an opportunity for the single point of failure to make its presence known... it will.
