Friday, November 10, 2006

A Disastrous Upgrade - and a Warning about Nomad.

What a disastrous week we have had! Last weekend, we upgraded one of our most critical systems on the Domino server. The system in question is primarily transaction-based, and as a result records are being added and deleted constantly.

The upgrade went well and all our post-implementation testing was great. We all went home on Saturday feeling much relieved that the upgrade had gone as planned.

On Monday, the first day on which the system was going to be used by our clients, we received a phone call to say that they were able to create a record with the same primary key as another record in the system. We commenced an investigation to see what the extent of the problem was, not expecting a great deal of trouble after all of the testing we had done. We were surprised. There were quite a few duplicate records, many of which were very similar but not exactly the same. The dates on these records varied considerably, making it even harder to identify a pattern.

In desperation, we reverted to the previous version of the system. A short while later, records with the current date began to appear despite the fact that we had locked users out. Classic replication syndrome - but where was the replica?

We had been in the process of getting an offsite server set up, but we didn't have the connection in place. The people responsible for installing the server had obviously used an Internet connection to get the initial set of replicas. I hadn't given any thought to that server as it wasn't "active", but as a Notes person I quickly realised the problem and asked them to remove their replica of the database. All went well and we managed to revert to the old version of the database.

Then the investigation began. How could something that had been tested so well be faulty? To cut a long story short, it turned out to be a replication issue. Our offsite server had been connected at some point in the past, via the Internet. This would have enabled them to get replicas of our databases. The server was then disconnected because we were going to have a more appropriate direct connection installed. The cable company, despite repeated requests to install the connection, took several months to connect the cable. It was only connected yesterday (well after the problem).

During the long wait for the cable connection, our offsite server had obviously been reconnected to the Internet, hence the replication issues we had after the restore. More importantly, the offsite server had been disconnected long enough for the deletion stubs to be purged from the database. This made our deleted records look like they should be available again. When the server was reconnected and replicated, it performed some resurrections. This was where our problem began.
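
For anyone who finds the resurrection effect hard to picture, here is a rough toy model of it in Python. It is only a sketch and not how Domino actually implements replication: the Replica class, the replicate function, the "ORDER-1" document and the day numbers are all made up for illustration, and the 90-day stub life is just an assumed setting.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy replica: live documents plus deletion stubs (hypothetical model)."""
    name: str
    docs: dict = field(default_factory=dict)   # doc_id -> day last modified
    stubs: dict = field(default_factory=dict)  # doc_id -> day deleted

    def delete(self, doc_id, day):
        # Deleting a document leaves a stub behind so the deletion can replicate.
        self.docs.pop(doc_id, None)
        self.stubs[doc_id] = day

    def purge_stubs(self, today, stub_life_days):
        # Stubs older than the retention period are thrown away.
        self.stubs = {d: t for d, t in self.stubs.items()
                      if today - t <= stub_life_days}


def replicate(a, b):
    """Very simplified two-way replication between two replicas."""
    for src, dst in ((a, b), (b, a)):
        # A stub on the source removes the matching older document on the target.
        for doc_id, deleted_on in src.stubs.items():
            if doc_id in dst.docs and deleted_on >= dst.docs[doc_id]:
                dst.delete(doc_id, deleted_on)
        # Documents the target has never heard of (no copy, no stub) are copied.
        for doc_id, modified in src.docs.items():
            if doc_id in dst.stubs and dst.stubs[doc_id] >= modified:
                continue
            if doc_id not in dst.docs or dst.docs[doc_id] < modified:
                dst.docs[doc_id] = modified


def simulate(reconnect_day, stub_life_days):
    """Return True if the deleted document comes back on the main server."""
    main = Replica("main", docs={"ORDER-1": 0})
    offsite = Replica("offsite", docs={"ORDER-1": 0})  # initial replica, day 0
    main.delete("ORDER-1", day=10)                     # offsite is disconnected
    main.purge_stubs(reconnect_day, stub_life_days)    # time passes on main
    replicate(main, offsite)                           # offsite reconnects
    return "ORDER-1" in main.docs


print("Reconnect on day 20, 90-day stubs:  ", simulate(20, 90))
print("Reconnect on day 200, 90-day stubs: ", simulate(200, 90))
print("Reconnect on day 200, 365-day stubs:", simulate(200, 365))
```

In this model the document only comes back in the middle case, where the stub has been purged before the old replica reconnects. Reconnecting early, or keeping stubs for longer, lets the deletion reach the offsite copy instead.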

We would have been okay if the server had not been connected at all - or if the server had been connected and remained connected - or if we had a longer retention period for deletion stubs. This one will take some explaining to management.

So why is this very rare, very unlikely condition important?

I was trying to think of all the ways it could recur. Apart from the obvious things such as having a local replica on your PC but turning your PC off when going on a very long holiday, I couldn't think of anything immediately... but now, a new thought springs to mind.

There are obviously a lot of people who are really excited about Nomad (Notes on a USB memory stick). It just occurred to me that if you only use your stick when on holidays or when travelling, any database replicas you have locally on it will grow old and you could find yourself with the same problem.

This isn't a problem with Nomad or with Notes and it will only affect certain types of databases. Nevertheless, you should be prepared for it.
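
One simple way to be prepared is to compare how long a replica has gone without replicating against the deletion stub retention period of the database. The snippet below is only a sketch of that check in Python. The function name is hypothetical, and the 90-day default is just a placeholder, so check the replication settings of your own database for the real figure.

```python
from datetime import date


def replica_is_stale(last_replicated: date, today: date,
                     stub_life_days: int = 90) -> bool:
    """True if the replica has been offline at least as long as stubs are kept.

    stub_life_days is an assumed value; use your database's actual setting.
    """
    return (today - last_replicated).days >= stub_life_days


# A memory stick last replicated in June and plugged back in during November.
print(replica_is_stale(date(2006, 6, 1), date(2006, 11, 10)))
```

If the check comes back true, the safer course is to discard the old replica and create a fresh one rather than letting it replicate.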