Monday, August 16, 2010

Our Lotus Domino Cluster Failover Problem

In certain computing circles, "clustering" is a dirty word. I've heard of situations where, far from providing business continuity peace of mind, clustering creates more work and greater risk than having no cluster at all.

This is not the case with Domino clustering. Done properly, it is extremely reliable.

Our Problem
Recently, our cluster seems to have "picked up a slight flutter". I suspect that the rules behind it may have changed sometime around our 8.0 or 8.5 migration.

So, first I want to cover what our cluster looks like:

Ok, this is quite a simplistic view and there are servers missing. I'm concentrating on the problem area only.

We have an onsite and offsite clustered Lotus Domino server, both running Lotus Domino 8.5 HF 1021. We'll call them "Onsite" and "Offsite" for ease of reference. The servers are quite a distance apart because we're clustering for business continuity purposes.

The theory is that our onsite staff members should access the onsite server unless it is down. The majority of our agents also run on this server, as does an intranet, extranet and several web sites. It's a busy and powerful box.

We discovered recently that many of our clients have been using the offsite server, but we don't know exactly why.

It seems that if you open a database for which you don't already have a desktop icon, then the Notes client will default to opening it from the offsite server. What has exacerbated this problem is that we upgraded our clients to 8.5.1 and blew away their desktops. Now, suddenly all the computers are trying to access everything off the offsite server.

The reasons?
We don't know, but we suspect it is one of the following:

  • Alphabetic: Because "Offsite" sorts before "Onsite" alphabetically.
  • Task Related: Because the Onsite server is much busier than the Offsite one.
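
The alphabetic theory can be illustrated with a small sketch. It is not a description of how the Notes client actually chooses a server, which is exactly what we don't know; it only shows that a plain case-insensitive sort of our two placeholder names puts "Offsite" first. The function name is made up for illustration.

```python
# Illustration only: the real Notes client selection logic is unknown here.
# "Onsite" and "Offsite" are the placeholder server names used in this post.
servers = ["Onsite", "Offsite"]


def first_alphabetically(names):
    """Return the name that sorts first, ignoring case."""
    return min(names, key=str.lower)


print(first_alphabetically(servers))  # Offsite ("f" sorts before "n")
```

If the client does something like this when there is no desktop icon, the offsite server would win every time, regardless of load.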

Does anyone have any ideas as to how we could go about finding out?


Craig said...

We have a similar setup, but our servers are the other way around alphabetically. We'll call the offsite one "Notes" and the onsite one "Domino" (that'll hint at the age of the installation).

Folks' clients go to the onsite one first, so based on that I'd vote for the alphabetical explanation.

But it's most likely because computers hate us and enjoy making life hard. ;-)

Alex said...

A couple of thoughts:
1) Do you enforce the mail server by desktop policy?

a) There was an SPR addressed in 8.0.2 FP5: AJAS7PDKER
SPR# AJAS7PDKER - After restart, the top of the workspace icon stack does not honor the Mail file location set defined in the location document. This regression was introduced from 8.0.2

2) Have you tried server_restricted on the outside server, to make sure the users only access the inside server?

Lotus Evangelist said...

If the threshold is set too low on usage they will end up at the offsite.
If you use policies to push out the applications and such to be only on the onsite server then they should find it first.
Domino does lookup by alpha so it is possible you have multiple ways to do this.
You could also disable the offsite server from anyone using it by setting its threshold to maximum and stop people from accessing it, if you so desired.

Gavin Bollard said...

Thanks for your comments. We do enforce the mail server by desktop policy but we have a lot of other databases and these are our biggest failover problem.

I looked at server_restricted and although it looks interesting, it seems to be suggesting that replication with a restricted server will fail. Since we need the other server in case of DR, I need things replicated on it.

I'll have a look at the Server_Availability_Threshold notes.ini option.

Paul Mooney said...

First thing... alphabet comes to mind.

Second thing. Restrict user access on the server to stop users hitting the box; that will still let replication do its thing.

Last - check SAI (server availability index) and expansion factor / trans info range.

By all means email me if I can help.

Mark said...

We had a similar situation. It turned out to be a feature introduced with the R7 client/server called Replication Triangulation. It works in a cluster, or whenever replicas exist on another server. If the primary server is down, the user either fails over to the cluster server or, by way of the replication history, finds a replica on any server in the domain that the user can access. Since we are not a cluster shop (currently), we had to deny users access at the server level to the backup server in our DR site, where all the replicas exist. There are a few articles posted about this, along with the policy and notes.ini settings you can use to control it. Lotus calls it a feature enhancement that avoids forcing full replication from scratch on every server, but as you can see it has some negative effects in shops where we need to control server usage more strictly.

Chad Scott said...

Based on the description, my guess is that you have a Domain Catalog and that users have a catalog server defined in the Location document. In that scenario, and absent a workspace icon, the first database found in the catalog (alphabetical search) will be the one opened.

While you could use SERVER_RESTRICTED=2 to prevent users from accessing the DR server, this isn't ideal because access won't be seamless in the event the primary server is down. Instead, the better solution is to set SERVER_AVAILABILITY_THRESHOLD=100 on the DR server, which means it is always in a busy state and will only take user connections if no other server is available.
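
The behaviour described above can be sketched as a simplified model. This is not Domino's actual failover code: the `Server` class, `is_busy` and `pick_server` are hypothetical names, the availability numbers are invented, and the rule that a threshold of 100 always means busy is an assumption taken from the description in this comment.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    up: bool
    availability_index: int  # 0-100, higher means more spare capacity (invented values)
    threshold: int           # stands in for SERVER_AVAILABILITY_THRESHOLD


def is_busy(server):
    # Assumption: a threshold of 100 means always busy; otherwise the server
    # is busy when its availability index drops below the threshold.
    return server.threshold >= 100 or server.availability_index < server.threshold


def pick_server(servers):
    """Prefer a reachable server that is not busy; otherwise take any reachable one."""
    reachable = [s for s in servers if s.up]
    if not reachable:
        return None
    available = [s for s in reachable if not is_busy(s)]
    return (available or reachable)[0].name


onsite = Server("Onsite", up=True, availability_index=60, threshold=0)
offsite = Server("Offsite", up=True, availability_index=95, threshold=100)

print(pick_server([offsite, onsite]))  # Onsite: Offsite is always busy
onsite.up = False
print(pick_server([offsite, onsite]))  # Offsite: nothing else is reachable
```

In this model the DR server still accepts users when the primary is down, which is the difference from restricting the server outright.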

Randy Bye said...

I'll vote for alphabetical too. See this technote

Anonymous said...


I agree with the previous post; the best way to solve your problem is to use SERVER_AVAILABILITY_THRESHOLD=100.

But you need to know that this parameter is actually bugged and doesn't work, as you can see in the following email I got from IBM:

"I have managed to reproduce the issue using 8.5 and 8.5.1 versions however Development is already aware of the situation and SPR # JSMN825TC8 is opened with the issue. We are expecting the issue to be resolved in 8.5.2 but the status of the SPR is still open and the only available workaround is to use the server_restricted=1 notes.ini value instead.

At the moment we have to wait for the specific SPRs resolution any progress can be checked from the Fixlist database or by directly calling HelpDesk. Please inform me If you require any further assistance from my side or should l conclude the PMR from now on.

Thanks in advance for your understanding "