Tuesday, May 01, 2007

How Video Killed the NT Domain

Background
Our company will shortly be moving office and as part of our pre-move testing, we need to give all of our servers a cold start. This doesn't guarantee that they will start at the new location but at least it proves that they have recently been able to cold boot.

All of our servers, except one, run Microsoft Windows Server 2003. The one exception is a Microsoft Windows NT 4.0 server which acts as our Primary Domain Controller. We used to have a backup domain controller but when we were asked to move everything to Windows Server 2003, we lost the device. Windows Server 2003 cannot provide Windows NT Domain services. If you need login services in Windows Server 2003, you are required to run Microsoft Active Directory.

It was always our intention to put Microsoft Active Directory on our servers at the earliest possible opportunity and to remove the Windows NT server entirely. Unfortunately, we never got around to the job due to other work commitments and the amount of planning required to implement an Active Directory infrastructure.

The result was that we were in a precarious situation where we only had a single domain controller on some very old hardware. Management were aware of the problem, but I don't think they fully understood the implications, as I had been told on several occasions that if we did not need to go to Active Directory immediately, we should leave it for later.

Our biggest mistake was not making sure that we had a backup domain controller available at all times. In the IT world, temporary solutions have a way of becoming permanent and, as we'd been discussing the move to Active Directory for two years, it should have become obvious that we were going nowhere fast and needed to re-implement the NT domain in a more permanent fashion.

The Cold Boot
We aren't a 24-hour shop and it is much easier to do system maintenance early in the morning than it is to do it late at night. Nobody gets in early but lots of people stay back. In addition, we service the whole of Australia, so people in Perth are always working later than people in Sydney. The side effect of early morning work is that you only get a short window of opportunity, and if something goes wrong there isn't much time to fix it before business commences.

Our servers don't get rebooted very often. The domain controller hadn't had a reboot in six months. It's most likely that the last few reboots were warm boots and it probably hadn't had a cold boot in more than two years. I wasn't really expecting a problem - if I had thought it likely, I wouldn't have done the reboot that morning.

So, at 6:30 a.m., I powered the server down. I waited a full 10 seconds after the power was off and then pushed the power button again. The power came on, the lights flickered for about five seconds, and then the server shut down. I tried again with the same result. From the third try onwards, I wasn't even getting any lights flickering; the server was dead.

I quickly removed the cover of the server to determine whether or not there was any kind of burning smell - this would confirm my fears of a power failure. I couldn't detect any smell. With the lid off, I checked the disk drives - they were SCSI. This meant that they could not easily be transferred to another PC.

[An Aside: While I am quite happy to sing the praises of SCSI drives on RAID 5 for data-critical servers, I'm really not convinced that RAID is a great way to go for an operating system. It seriously limits your options for transferring drives to other computers and I've heard it said that there is a considerable performance impact.]

Problem Solving
The next step was to remove the server from the computer room. After all, it is just too cramped in there to work. I got the server to my desk and, after a bit of a scrounge, found some PS/2 cables (everything else is on USB). I tried powering the server on and I got five seconds of lights - better than I was getting in the computer room. After a few more goes, though, the lights stopped appearing.

There wasn't much I could do, so I started working on a PC in the hope of loading NT Server on it. All the time, I was thinking: how am I going to replicate the domain given that the domain controller is dead? I considered ghosting from one server to another but without power that wasn't really an option. I also thought about our "Full" nightly backup, but how are you supposed to get an operating system from tape on a server to a PC when the tape drive is internal? I am sure that there is a way of doing this but I think I would need additional resources. In reality, I think that backups are for data, not for operating systems.

Once I had hit some stumbling blocks on the PC (more on them later), I returned my attention to the server. I powered the server on and it started running. Grateful for at least one piece of good luck, I went back to the PC in the hope that I could get it to operate as a backup domain controller before the primary domain controller failed again. Unfortunately, the primary domain controller only lasted 5 minutes.

I started to make a few phone calls but I really couldn't think of anyone who would be able to bring some hardware with them and install it. The server in question was well and truly out of warranty and was a no-name server. I don't approve of unbranded servers but this one existed in the company long before I started.

Staff started to appear and ask questions about why they couldn't log in. Our server automatically logs everyone out at night, so nobody had a working login. They did, however, have access to Notes mail and the web, which supports the theory that systems segregation is probably more important than integration. Our problem wasn't visible to the outside world as all the external systems were functioning normally.

It was about this time that my boss arrived and, in the most time-honoured of traditions, the server worked almost as soon as he touched the power button. More than that, it made a liar out of me by staying up for hours. It was still making the most dreadful high-pitched noise though, which led to complaints from some staff (who didn't realize how lucky they were to be able to log in at all).

At this time, a technician whom I had contacted earlier arrived. Embarrassingly, the server was operational. The technician listened to the noise and advised that we do something as early as possible. He said he would give us some time to try to set up a backup domain controller and to allow our staff to access their documents.

NT Server Issues
We started trying to build a new backup domain controller on a PC. We installed Windows NT 4.0 Server Service Pack 1, started going through setup and selected backup domain controller when prompted. Of course, we were unable to log in to the network as most network cards need at least Service Pack 4 under Windows NT.

The problem was that you can't apply service packs until you have completed installation, and you can't complete installation on an NT Domain Controller until you have replicated the domain settings from a primary domain controller, which needs a working network card. It was a classic catch-22 and we would not be able to install the server this way.
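For anyone who prefers to see the loop written down, here is a minimal sketch of the catch-22 as a dependency cycle. The step names are my own labels for illustration, not anything from the NT setup program, and the function simply follows prerequisites until it arrives back where it started.

```python
# Hypothetical step names: each step maps to the steps it must wait for.
REQUIRES = {
    "finish_bdc_setup": ["replicate_domain_from_pdc"],
    "replicate_domain_from_pdc": ["working_network_card"],
    "working_network_card": ["apply_service_pack"],
    "apply_service_pack": ["finish_bdc_setup"],
}


def find_cycle(requires, start):
    """Follow prerequisites from start; return the loop as a list, or None."""
    path = []

    def visit(step):
        if step in path:
            # We have come back to a step we are already waiting on.
            return path[path.index(step):] + [step]
        path.append(step)
        for dep in requires.get(step, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        return None

    return visit(start)


cycle = find_cycle(REQUIRES, "finish_bdc_setup")
print(" -> ".join(cycle))
```

The printed chain ends on the same step it began with, which is the whole problem: no step in the loop can go first.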

We looked at other options. Ghosting was the obvious one, but it would only give us a single primary domain controller. The best solution would be to set up a backup domain controller on trusted hardware and then later promote it to primary. This would give us a fallback for the future.

The next thing that we thought about was that we could install Windows NT Server as a standalone server and then upgrade it to a domain controller. Yes, I know that it isn't actually possible under Microsoft but there is a very good product out there called upromote which we intended to use. We got to the point of installing Service Pack 6a but still couldn't get a network card running as it was too new. By this stage, we had exhausted the hardware available at the office and the technician had come back ready to perform some server maintenance.

Server Maintenance
The technician first replaced the power supply and we waited with bated breath for the server to restart. It did restart but continued to make the awful noise. The noise was coming from the drives but it was obvious that there was not a problem with them. There was, however, still a problem with the power. Eventually, the technician discovered that the video RAM had suddenly decided to become faulty. We looked around for a video card on-site but, as most PCs now have video on board, it was not easy to find one. Eventually, we replaced it with a $600 High Definition TV/Video card from our presentation PC. The server is now running happily.

[One note of interest: Users were still able to access shared drives when the server was down the second time. This was because they were still logged in. All the more reason to do server maintenance at night - BEFORE the server logs people out]

The End?
The story is not complete as we still need to build ourselves a backup domain controller. I'll document this as it happens. There are also a lot of issues coming out of the reboot. In particular, I guess we need to treat a reboot as part of change management even though there isn't really a change involved.

Finally, there will be the move to consider and the migration to Active Directory. I will post stories about these as and when they happen.
