Monday, July 18, 2011

Domino Resource Issues under XPages

A couple of weeks ago, we launched a new XPages app. Hopefully the first of many. It was very impressive and we got a lot of hits....

...until the server crashed.

Not a big deal. Our server is set to restart automatically, and it was up and running in no time. Then about 1.5 days later we had another crash.

We've decided to tackle this on a few fronts. First of all, we're rewriting some parts of the app to be a bit less intense and to take better advantage of the recycler. That's all cutting-edge development stuff, so it's really not "me".

On the admin side, I wanted to see if we could release resources a bit. I checked the server shortly before a crash, but there was no indication of trouble on the Windows 2003 side of things. Of course, the Notes logs tell a different story.

HTTP JVM: CLFAD0211E: Exception thrown. For more detailed information, please consult error-log-0.xml located in e:/Lotus/Domino/data/domino/workspace/logs
HTTP JVM: >>>>
HTTP JVM: The XPages runtime engine faced an OutOfMemoryError
HTTP JVM: You can fix this by increasing the value of the HTTPJVMMaxHeapSize variable in notes.ini
HTTP JVM: >>>>
HTTP JVM: Out of memory exception occurred servicing request for: /publicsite/OurNewXpagesDB.xsp - HTTP Code: 500. For more detailed information, please consult error-log-0.xml located in e:/Lotus/Domino/data/domino/workspace/logs
HTTP Web Server: Command Not Handled Exception [/publicsite/OurNewXpagesDB.xsp] Anonymous
HTTP JVM: CLFAD0211E: Exception thrown. For more detailed information, please consult error-log-0.xml located in e:/Lotus/Domino/data/domino/workspace/logs

So, we decided to try an HTTP restart the next time these error messages started building up.

It worked!
Doing a TELL HTTP RESTART bought us two more days of uptime.


To Restart or not to Restart?
We looked around and found that our problems weren't as unique as we'd imagined. There are a few people on 8.5 and above (we're currently 8.5.2) who have this problem.

Apparently, though, a TELL HTTP RESTART flushes some memory but doesn't do much for the JVM.

To flush the JVM, we need to think about:

TELL HTTP QUIT
then after a few minutes
LOAD HTTP

or, as the guy in The IT Crowd says, "Have you tried turning it off and on again?"

We'll be doing this until we get our application sorted, but the question is: should this be part of our normal nightly routine? Maybe it's good practice to restart your web server's service nightly - especially if you have a cluster which could take the load while the restart occurs.

Does anyone think that this is "best practice"?

6 comments:

Tim Tripcony said...

You mention taking advantage of the recycler... I assume you're referring to calling the recycle() method on variables pointing to Domino objects (database, view, document, etc.) before assigning them to something else (such as when iterating through documents in a view). You should always, always do this. Forgetting to do this causes memory leaks, and is a Very Bad Thing.
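As a rough illustration of the pattern described above (not code from the post or the commenter), here is a minimal Java sketch of the "fetch the next document, then recycle the current one" loop. So that it runs without a Domino server, `Doc` and `DocView` are hypothetical stand-ins for `lotus.domino.Document` and `lotus.domino.View`, and `InMemoryView` and `FakeDoc` are invented purely for the demo. In a real agent or XPage you would call the same three methods (`getFirstDocument`, `getNextDocument`, `recycle`) on the real classes.

```java
import java.util.ArrayList;
import java.util.List;

class RecycleSketch {

    // Hypothetical stand-in for lotus.domino.Document (only what this sketch needs).
    interface Doc {
        String getTitle();
        void recycle();
    }

    // Hypothetical stand-in for lotus.domino.View.
    interface DocView {
        Doc getFirstDocument();
        Doc getNextDocument(Doc current);
    }

    // Walk the view, recycling each document as soon as we are done with it.
    static List<String> readTitles(DocView view) {
        List<String> titles = new ArrayList<>();
        Doc doc = view.getFirstDocument();
        while (doc != null) {
            titles.add(doc.getTitle());
            Doc next = view.getNextDocument(doc); // fetch the next one BEFORE recycling
            doc.recycle();                        // release the current handle
            doc = next;
        }
        return titles;
    }

    // In-memory fake so the sketch runs without a Domino server.
    static class InMemoryView implements DocView {
        private final List<FakeDoc> docs = new ArrayList<>();

        InMemoryView(String... titles) {
            for (String t : titles) {
                docs.add(new FakeDoc(t));
            }
        }

        public Doc getFirstDocument() {
            return docs.isEmpty() ? null : docs.get(0);
        }

        public Doc getNextDocument(Doc current) {
            int i = docs.indexOf(current);
            return (i >= 0 && i + 1 < docs.size()) ? docs.get(i + 1) : null;
        }

        // How many of the fake documents have had recycle() called on them.
        int recycledCount() {
            int n = 0;
            for (FakeDoc d : docs) {
                if (d.recycled) {
                    n++;
                }
            }
            return n;
        }
    }

    static class FakeDoc implements Doc {
        private final String title;
        boolean recycled = false;

        FakeDoc(String title) {
            this.title = title;
        }

        public String getTitle() {
            // Using a recycled object is an error, as it is with real Domino objects.
            if (recycled) {
                throw new IllegalStateException("document was recycled");
            }
            return title;
        }

        public void recycle() {
            recycled = true;
        }
    }

    public static void main(String[] args) {
        InMemoryView view = new InMemoryView("first", "second", "third");
        System.out.println(readTitles(view));
        System.out.println(view.recycledCount() + " documents recycled");
    }
}
```

The ordering inside the loop is the point: asking the view for the next document needs the current one to still be valid, so the recycle call comes last. Against the real API you would typically also wrap the loop in try/finally so that the view and database handles are recycled even when an exception is thrown.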

If the JVM is running out of memory, a failure to do the above is the most likely culprit. You can delay this occurrence by increasing the amount of memory allotted to the JVM, or by (as you suggest) periodically restarting the HTTP task. But neither option cures the disease; each just treats the symptoms, so to speak. Unless you fix the memory leak to begin with, your users are likely to experience gradually increasing performance degradation as the server approaches the point at which it would run out of memory entirely.

Assuming you're running 8.5.2 (if not, upgrade), one additional option available to you that certainly is best practice is to update the application properties to tell the application to serialize all XPages to disk. This setting was specifically added by IBM to allow XPage applications to scale to more users; it causes all operations to consume the bare minimum of memory by saving all information about the page structure to the hard drive as soon as each request has returned a response to the browser, instead of storing that structure in memory. If the user triggers any events against the page, the page structure is then loaded back into memory before the event is handled. The tradeoff is that, by not holding all of this information in memory for the duration of the user session, there is a slight performance hit for events... but the response delta is typically in the sub-second range, so this is generally acceptable when scaling to thousands of concurrent users.
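For reference, the serialize-to-disk option described above is, as far as I know, the "Server page persistence" choice on the XPages tab of the application properties in Domino Designer, which is stored in the application's xsp.properties file. A minimal fragment, assuming the standard 8.5.2 property name and value (check your own Designer release before relying on it):

```
# WEB-INF/xsp.properties inside the application's NSF
# "file" = keep pages on disk; the other documented values are
# "basic" (keep pages in memory) and "fileex" (keep only the current page in memory)
xsp.persistence.mode=file
```

The change normally only applies to new sessions, so expect to rebuild the application or restart the HTTP task before measuring any difference.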

Tim Tripcony said...

P.S. Instead of issuing separate "TELL HTTP QUIT" and "LOAD HTTP" commands, the single "RESTART TASK HTTP" command does both. This is not the same as "TELL HTTP RESTART", which, as you mentioned, flushes some memory (and reloads some configuration settings) but does not actually restart the JVM. "RESTART TASK HTTP" actually shuts down the entire HTTP task, then loads it again. The only time I've seen this command fail to do that is when there are hung threads (like web agents that did not complete successfully).

Simon O'Doherty said...

Try doing:

tell http xsp heapdump
tell http xsp javadump

Then have a look at the output. May give some hints.

Paul Hannan said...

Hi,
Do you also have the 8.5.2 Fix Packs installed on the servers?
p.

Graham Dodge said...

Gavin,

Is the app an upgraded R7.x app with extensive LotusScript code, possibly calling new Java agents? It may pay to revise old code and make sure that all agents are properly disposing of their artifacts when they end.

Chris Whisonant said...

Out of curiosity, what value do you get if you do a SHOW CONFIG HTTPJVMMaxHeapSize? It could be that you need to increase this value. IBM has tinkered with what they think the default settings should be. See, for instance, my blog post here. I would recommend that you increase this value to 128M if it's not already. And if it is, then add another 128M to it.
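As a sketch of what that change could look like in the server's notes.ini: the 256M figure is only an illustration of "128M plus another 128M", not a recommendation from the post. The second line is an assumption on my part. On 8.5.2 it is, to my knowledge, needed to stop the server from resetting the heap size to its own default at startup, so verify it against IBM's documentation for your release.

```
HTTPJVMMaxHeapSize=256M
HTTPJVMMaxHeapSizeSet=1
```

The HTTP task has to be fully restarted for a new heap size to take effect, since the JVM is sized when the task loads.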