* * * * *

             Everything always works the first time on Star Trek

They always made it look so easy on Star Trek: The Next Generation [1].

The Enterprise (NCC-1701-D) is facing certain doom by the imminent implosion
of a nearby nebula because the Thalaxians stole the warp core because they
need it for a new face-lift technology that's a little less than
enviromentally sound.

Wesley [2] barges onto the bridge. “Hey guys! What if—”

“Get that kid off the bridge,” says Captain Jean-Luc “I'm a Shakespearian
actor, not a ham” Picard, as he does his patented Picard Maneuver [3].

“But guys,” says Wesley, as he flips Worf onto his back and steps on his
throat to prevent being forcibly removed from the bridge, “what if we were to
reverse the tachyon field and reroute it through the shield generators to
generate an anti-inertial damping field while simultaneously firing a photon
torpedo loaded with an anti-matter implosion detonation device?”

“Why is that kid still—”

“And because we've generated an anti-inertial damping field outside the
ship,” says Chief Engineer Geordi “Kunta ‘Reading Rainbow’ Kinte” La Forge,
interrupting the [DELETED-French-DELETED] [DELETED-British-DELETED] [DELETED-
French-DELETED] [DELETED-British-DELETED] (aw, the heck with it) stuffy shirt
Captain.

“The force generated by the anti-matter implosion detonation device will push
us,” says Lt. “Fully functional, if you know what I mean, and I think you do,
wink wink nudge nudge say no more say no more” Data, as he's rapidly
calculating something on the computer console in front of him, “into Warp two
point seven eight.”

“Which is fast enough to get us out of range of the imminent implosion of a
nearby nebula!” says Wesley, still holding down a struggling Worf. Captain
[DELETED-stuffy shirt-DELETED] Picard looks momentarily confused as he
quickly glances at all three expository spewing characters.

“And if we can place the photon torpedo loaded with the anti-matter implosion
detonation device 22,453 kilometers at zero one five mark seven two, then
we'll end up right next to the Thalaxian ship.”

“Make is so,” says Picard. “And hurry—we only have one more commercial break
before the end of the show.”

And so, in twenty minutes, Wesley, Data and Geordi have managed to execute
this complicated technobabblish plan flawlessly. On the first attempt. Based
only on a hunch. Of a teenaged geek.

Ah, if only real life were as simple.

But I got the heisenbug nailed. [4]

Table: Latest statistics from the greylist daemon
Start   Wed Oct 17 23:48:32 2007
End     Thu Oct 18 17:01:55 2007
Running time    17h 13m 23s
IPs     117
Requests        3381
Requests-Cu     3
Tuples  1031
Graylisted      3240
Whitelisted     18
Graylist-Expired        2216
Whitelist-Expired       0

So, what was the fix?

It was a comment from Mark [5] about signal handlers under Unix that lead the
way.

> Now, on to your bug. I would say 99% it is timing. Furthermore if I were a
> betting man I would say your bug rests within a signal handler. The only
> thing you can ever do inside a signal handler is set the value of a
> volatile sig_atomic_t variable really. Nothing else is safe.
>

That's pretty much how I wrote the signal handlers:

> static volatile sig_atomic_t mf_sigint;
> static volatile sig_atomic_t mf_sigquit;
> static volatile sig_atomic_t mf_sigterm;
> static volatile sig_atomic_t mf_sigpipe;
> static volatile sig_atomic_t mf_sigalrm;
> static volatile sig_atomic_t mf_sigusr1;
> static volatile sig_atomic_t mf_sigusr2;
> static volatile sig_atomic_t mf_sighup;
>
> /**********************************************/
>
> void sighandler_sigs(int sig)
> {
>   switch(sig)
>   {
>     case SIGINT   : mf_sigint   = 1; break;
>     case SIGQUIT  : mf_sigquit  = 1; break;
>     case SIGTERM  : mf_sigterm  = 1; break;
>     case SIGPIPE  : mf_sigpipe  = 1; break;
>     case SIGALRM  : mf_sigalrm  = 1; break;
>     case SIGUSR1  : mf_sigusr1  = 1; break;
>     case SIGUSR2  : mf_sigusr2  = 1; break;
>     case SIGHUP   : mf_sighup   = 1; break;
>     default: _exit(EXIT_FAILURE); /* because I'm stupid */
>   }
> }
>
> /*************************************************/
>
> void check_signals(void)
> {
>   if (mf_sigalrm)  handle_sigalrm();
>   if (mf_sigint)   handle_sigint();
>   if (mf_sigquit)  handle_sigquit();
>   if (mf_sigterm)  handle_sigterm();
>   if (mf_sigpipe)  handle_sigpipe();
>   if (mf_sigusr1)  handle_sigusr1();
>   if (mf_sigusr2)  handle_sigusr2();
>   if (mf_sighup)   handle_sighup();
> }
>

check_signals() is called during the main loop of the program; it's only
sighandlers_sigs() that's called in a signal context. The only two signals
that get (or rather, got) special treatment are SIGSEGV and SIGCHLD. The
handler for SIGSEGV logs that it was called, unblocks any signals and re-
executes itself [6].

The other signal is SIGCHLD. SIGCHLD is sent when a child process ends. It's
then up to the parent process to call wait() (or one of its variants) to
obtain the return code of the child process. If the parent doesn't do that,
the child process becomes a zombie [7]. The SIGCHLD signal handler called
waitpid() (which is supposedly “safe” to call from a signal handler) as well
as some logging routines [8]. What may be happening (or was happening) is
some very subtle race condition. The logging routine does allocate some
memory (temporarily). The problem comes in if we're in the middle of
allocating some memory when the program receives a SIGCHLD, at which point,
we immediately enter the signal handler, which attempts (indirectly, through
the logging routines) to allocate memory. But we were already in the memory
allocation code when it got interrupted, thus leaving things in an undefined
(or inconsistent) state.

I'll expound on what Mark said, and say that once you start dealing with
signals, you're dealing with a multi-threaded application, and all the hurt
that implies.

I rewrote the SIGCHLD handler to work like the other handlers (signal handler
sets a flag, which is checked later from the main execution loop) and well,
17 hours later it's still running, on the virtual server, checkpointing
itself every hour.

This only took, what? About a month [9] to find?

But then again, I wasn't facing down the Thalaxians.

[1] http://en.wikipedia.org/wiki/Star_Trek:_The_Next_Generation
[2] http://wilwheaton.typepad.com/
[3] http://memory-alpha.org/en/wiki/Picard_Maneuver
[4] gopher://gopher.conman.org/0Phlog:2007/10/16.1
[5] http://gladesoft.com/
[6] gopher://gopher.conman.org/0Phlog:2007/03/09.1
[7] http://en.wikipedia.org/wiki/Zombie_process
[8] gopher://gopher.conman.org/0Phlog:2007/10/17.2
[9] gopher://gopher.conman.org/0Phlog:2007/09/22.1

Email author at [email protected]