Software performance after a few hours with large sets of data

* * * * *

As a corollary to yesterday's entry about testing [1]—make sure you test for
several hours.

I found a bug in the latest version of the greylist daemon [2] that only
manifests itself after about six hours of running. For some as yet unknown
reason, the program just stops responding. It doesn't segfault (if it did, it
would automatically restart). It doesn't quit either (if it did, I wouldn't
see it running in the process list). It just gets into a weird state. When I
attach gdb [3] to the running instance the stack frame is somewhere in the
weeds (that's a technical term) so it's hard to isolate the problem.

This type of bug is very difficult to diagnose.

I do have an idea of what it might be, though. The latest feature (requested
by Smirk) is to checkpoint the program every hour or so: it dumps its
internal state so it can pick up again when it restarts. When I checked the
logs, the last two times it hung (after running for about six hours) it
was just as it was checkpointing itself (which is logged).
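
For illustration only, here is a minimal sketch of what "dump the internal
state so it can pick up again" can look like. It is written in Python and is
not the daemon's actual code: the function names, the JSON format, and the
idea that the state is a simple dictionary are all assumptions. The state is
written to a temporary file and renamed into place, so a restart never reads
a half-written checkpoint.

```python
import json
import os
import tempfile


def write_checkpoint(state, path):
    """Dump `state` (a JSON-serializable dict, hypothetical) to `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".checkpoint-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # The rename replaces any previous checkpoint in one step.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def read_checkpoint(path):
    """Load the last checkpoint, or start empty if there is none."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}
```

A long-running program would call write_checkpoint() on a timer and
read_checkpoint() once at startup.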

I removed the checkpoint feature from the “production” version, and
hopefully I won't get another influx of spam in six hours. (The Postfix [4]
module accepts the incoming email if it doesn't get a response from the
greylist daemon within five seconds. I figure a) it's better to receive spam
than lose email and b) getting a ton of spam is a clear indication something
is wrong.)
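
The "accept if there is no answer in five seconds" behavior is a fail-open
timeout. The sketch below shows the idea in Python; it is not the actual
Postfix module. The use of UDP, the ask_greylist() name, the request bytes,
and the reply strings are all assumptions made for the example.

```python
import socket


def ask_greylist(address, request, timeout=5.0):
    """Ask the daemon at `address` about `request` (bytes).

    Returns the daemon's reply as text, or "accept" if nothing comes
    back within `timeout` seconds (fail open).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(request, address)
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Covers the timeout as well as other socket errors:
            # better to receive spam than to lose email.
            return "accept"
    return reply.decode("ascii", "replace").strip()
```

With a daemon that has stopped responding, every call returns "accept" after
the timeout, which is exactly why a hung daemon shows up as a flood of spam.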

Meanwhile, I'm running the grueling test slowly (one tuple per second), in
the hope of triggering (or at least reproducing) the problem in six hours.
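
A paced replay like that can be sketched as follows. This is an illustrative
Python helper, not the real test harness: replay(), its parameters, and the
`send` callback are hypothetical names.

```python
import time


def replay(tuples, send, interval=1.0, sleep=time.sleep):
    """Feed each tuple to `send`, pausing `interval` seconds after each.

    `sleep` is injectable so the pacing can be checked without waiting.
    Returns the number of tuples sent.
    """
    count = 0
    for item in tuples:
        send(item)
        count += 1
        sleep(interval)
    return count
```

At one tuple per second, a six-hour run works through 21,600 tuples.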

[1] gopher://gopher.conman.org/0Phlog:2007/09/21.2
[2] gopher://gopher.conman.org/0Phlog:2007/08/16.1
[3] http://sourceware.org/gdb/
[4] http://www.postfix.org/

Email author at [email protected]