A couple of weeks ago, we launched a new XPages app. Hopefully the first of many. It was very impressive and we got a lot of hits....
...until the server crashed.
Not a big deal. Our server is set to restart automatically, and it was up and running in no time. Then about 1.5 days later we had another crash.
We've decided to tackle this on a few fronts. First of all, we're rewriting some parts of the app to be a bit less intense and to take better advantage of the recycler. That's all cutting-edge development stuff, so it's really not "me".
On the admin side, I wanted to see if we could release resources a bit. I checked the server close to a crash, but there was no indication of trouble on the Windows 2003 side of things. Of course, the Notes logs tell a different story.
HTTP JVM: CLFAD0211E: Exception thrown. For more detailed information, please consult error-log-0.xml located in e:/Lotus/Domino/data/domino/workspace/logs
HTTP JVM: >>>>
HTTP JVM: The XPages runtime engine faced an OutOfMemoryError
HTTP JVM: You can fix this by increasing the value of the HTTPJVMMaxHeapSize variable in notes.ini
HTTP JVM: >>>>
HTTP JVM: Out of memory exception occurred servicing request for: /publicsite/OurNewXpagesDB.xsp - HTTP Code: 500. For more detailed information, please consult error-log-0.xml located in e:/Lotus/Domino/data/domino/workspace/logs
HTTP Web Server: Command Not Handled Exception [/publicsite/OurNewXpagesDB.xsp] Anonymous
HTTP JVM: CLFAD0211E: Exception thrown. For more detailed information, please consult error-log-0.xml located in e:/Lotus/Domino/data/domino/workspace/logs
So, we decided to try an HTTP restart the next time these error messages started building up.
It worked!
Doing a TELL HTTP RESTART bought us two more days of uptime.
To Restart or not to Restart?
We looked around and found that our problems weren't as unique as we'd imagined. There are a few people on 8.5 and above (we're currently 8.5.2) who have this problem.
Apparently, a TELL HTTP RESTART also flushes memory but doesn't do much for the JVM.
To flush the JVM, we need to think about:
TELL HTTP QUIT
then after a few minutes
LOAD HTTP
or, as the guy in The IT Crowd says, "have you tried turning it off and on again?"
We'll be doing this until we get our application sorted, but the question is: should this be part of our normal nightly routine? Maybe it's good practice to restart your web server's service nightly, especially if you have a cluster which could take the load while the restart occurs.
Does anyone think that this is "best practice"?
Comments
If the JVM is running out of memory, a failure to do the above is the most likely culprit. You can delay this occurrence by increasing the amount of memory allotted to the JVM, or by (as you suggest) periodically restarting the HTTP task. But neither option cures the disease; each just treats the symptoms, so to speak. Unless you fix the memory leak to begin with, your users are likely to experience gradually increasing performance degradation as the server approaches the point at which it would run out of memory entirely.
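For reference, the heap increase mentioned above (and suggested by the server's own error message) is a one-line change in notes.ini. This is only a sketch: the 512M value is an illustrative assumption, not a recommendation, and on a 32-bit Windows server a heap that is too large may cause problems of its own.

```ini
; notes.ini on the Domino server
; HTTPJVMMaxHeapSize is the variable named in the OutOfMemoryError log message.
; 512M is an assumed example value; pick a size that fits the server's available RAM.
HTTPJVMMaxHeapSize=512M
```

The new value would normally be picked up only once the HTTP task has been stopped and loaded again.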
Assuming you're running 8.5.2 (if not, upgrade), one additional option available to you that certainly is best practice is to update the application properties to tell the application to serialize all XPages to disk. This setting was specifically added by IBM to allow XPage applications to scale to more users; it causes all operations to consume the bare minimum of memory by saving all information about the page structure to the hard drive as soon as each request has returned a response to the browser, instead of storing that structure in memory. If the user triggers any events against the page, the page structure is then loaded back into memory before the event is handled. The tradeoff is that, by not holding all of this information in memory for the duration of the user session, there is a slight performance hit for events... but the response delta is typically in the sub-second range, so this is generally acceptable when scaling to thousands of concurrent users.
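As a rough sketch of what that setting looks like: the persistence option is normally chosen in the application properties in Designer, and to the best of my knowledge it ends up in the application's xsp.properties file as shown below. Treat the exact key and value as an assumption to verify against your own Designer version.

```properties
# xsp.properties (written by the application properties editor)
# Assumed values: basic = keep pages in memory,
#                 file = keep pages on disk,
#                 fileex = keep only the current page in memory
xsp.persistence.mode=file
```

The `fileex` middle ground may be worth trying first if the sub-second delay on events turns out to be noticeable.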
tell http xsp heapdump
tell http xsp javadump
Then have a look at the output. May give some hints.
Do you also have the 8.5.2 Fix Packs installed on the servers?
p.
Is the app an upgraded R7.x app with extensive LotusScript code, possibly calling new Java agents? It may pay to revise old code and make sure that all agents are properly disposing of their artifacts when they end.