* * * * *
Millions of moving parts
> In a system of a million parts, if each part malfunctions only one time out
> of a million, a breakdown is certain.
>
> Stanislaw Lem
>
In between paying work, I'm getting syslogintr [1] ready for release—cleaning
up the Lua [2] scripts, adding licensing information, making sure everything
I have actually works, that type of thing. I have quite a few scripts that
isolate some aspect of a working script, for instance checking for ssh
attempts and blocking the offending IP (Internet Protocol) address, but
weren't fully tested. A few were tested (as I'm using them at home), but not
all.
I updated the code on my private server and rewrote its script to use the new
modules (as I'm calling them), only to watch the server seize up tight. After
a few hours of debugging, I fixed the issue.
Only it wasn't with my code.
But first, the scenario I'm working with. Every hour, syslogintr will check
to see if the webserver and nameserver are still running (why here? Because I
can, that's why) and log some stats gathered from those processes. The checks
are fairly easy—for the webserver I query mod_status [3] and log the results;
for the nameserver, I pull the PID (Process ID) from /var/run/named.pid and
from that, check to see if the process exists. If they're both running,
everything is fine. It was when both were not running that syslogintr froze.
Now, when a check determines that its process isn't running, it not only logs
that fact but also sends me an email alert. If only one of the two processes
was down, syslogintr would work fine. It was only when both were down that it
froze up solid.
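A rough sketch of the nameserver check described above, in Python rather than the Lua that syslogintr actually runs. The function names are made up for illustration, and this is only an approximation of the idea (read the PID file, then probe the process with signal 0), not the real script. It assumes a POSIX system.

```python
import os


def read_pid(pidfile):
    """Return the PID stored in pidfile, or None if it is missing or malformed."""
    try:
        with open(pidfile) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def process_exists(pid):
    """Probe pid with signal 0, which tests for existence without sending anything."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False          # no such process
    except PermissionError:
        return True           # it exists, but belongs to another user
    return True


def nameserver_running(pidfile="/var/run/named.pid"):
    """True if the PID recorded in pidfile refers to a live process."""
    return process_exists(read_pid(pidfile))


if __name__ == "__main__":
    print("named running:", nameserver_running())
```

One caveat with this approach: a stale PID file can point at an unrelated process that happens to have reused the PID, so a "running" result is a hint rather than proof.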
I thought it was another type of syslog deadlock [4]: Postfix [5] spews forth
multiple log entries for each email going through the system, and it could be
that too much data gets logged before syslogintr can read it. Postfix would
then block, causing syslogintr to block, and thus, deadlock.
Sure, I could maybe increase the socket buffer size, but that only pushes the
problem out a bit; it doesn't fix the issue once and for all. Any real fix
would probably have to involve threads: one to just read data continuously
from the sockets and queue it up, and another to pull the queued data and
process it. That would require a major restructure of the whole program (and
I can't stand the pthreads API (Application Programming Interface)). Faced
with that, I decided to see what Stevens [6] has to say about socket buffers:
> With UDP (User Datagram Protocol), however, when a datagram arrives that
> will not fit in the socket receive buffer, that datagram is discarded.
> Recall that UDP has no flow control: It is easy for a fast sender to
> overwhelm a slower receiver, causing datagrams to be discarded by the
> receiver's UDP …
>
Hmm … okay, according to this, I shouldn't get deadlocks because nothing
should block. And when I checked the socket receive buffer size, it was way
larger than I expected it to be (around 99K if you can believe it), so even if
a process could be blocked sending a UDP packet, Postfix (and certainly
syslogintr) wasn't sending that much data.
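For anyone wanting to check the receive buffer size on their own system, here is a minimal sketch in Python. It uses a fresh UDP socket rather than syslogintr's actual sockets, so the number it reports is the system default and may well differ from the figure above. The helper name is made up for illustration.

```python
import socket


def receive_buffer_size(sock):
    """Return the kernel's current receive buffer size for sock, in bytes."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def demo():
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        print("default receive buffer:", receive_buffer_size(sock), "bytes")
        # Ask for a bigger buffer. The kernel is free to clamp or adjust
        # the request, so read the value back instead of trusting it.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 256 * 1024)
        print("after requesting 256 KiB:", receive_buffer_size(sock), "bytes")


if __name__ == "__main__":
    demo()
```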
And on my side, there wasn't much code to check (around 2300 lines of code
for everything). And when a process list showed that sendmail was hanging, I
decided to start looking there.
Now, I use Postfix, but Postfix comes with a “sendmail” executable that's
compatible (command line wise) with the venerable sendmail [7]. Imagine my
surprise then:
> [spc]brevard:~>ls -l /usr/sbin/sendmail
> lrwxrwxrwx 1 root root 21 Feb 2 2007 /usr/sbin/sendmail -> /etc/alternatives/mta
> [spc]brevard:~>ls -l /etc/alternatives/mta
> lrwxrwxrwx 1 root root 26 May 5 16:30 /etc/alternatives/mta -> /usr/sbin/sendmail.sendmail
>
Um … what the … ?
> [spc]brevard:~>ls -l /usr/sbin/sendmail*
> lrwxrwxrwx 1 root root 21 Feb 2 2007 /usr/sbin/sendmail -> /etc/alternatives/mta
> -rwxr-xr-x 1 root root 157424 Aug 12 2006 /usr/sbin/sendmail.postfix
> -rwxr-sr-x 1 root smmsp 733912 Jun 14 2006 /usr/sbin/sendmail.sendmail
>
Oh.
I was using sendmail's sendmail instead of Postfix's sendmail all this time.
Yikes!
When I used Postfix's sendmail everything worked perfectly.
Sigh.
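The symlink chase I did by hand with ls can also be scripted. This is a small Python sketch (the function name is my own invention) that follows a path one link at a time and reports every hop, so a surprise like /etc/alternatives/mta shows up right away. It only examines the final path component at each step, which is enough for a case like this one.

```python
import os


def symlink_chain(path):
    """Return path plus each successive symlink target, ending at the first non-link."""
    chain = [path]
    seen = {path}
    while os.path.islink(path):
        target = os.readlink(path)
        # A relative target is resolved against the directory holding the link;
        # an absolute target replaces the path entirely.
        path = os.path.normpath(os.path.join(os.path.dirname(path), target))
        if path in seen:
            break             # symlink loop, stop here
        seen.add(path)
        chain.append(path)
    return chain


if __name__ == "__main__":
    print(" -> ".join(symlink_chain("/usr/sbin/sendmail")))
```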
[1]
gopher://gopher.conman.org/0Phlog:2010/02/09.1
[2]
http://www.lua.org/
[3]
http://httpd.apache.org/docs/trunk/mod/mod_status.html
[4]
gopher://gopher.conman.org/0Phlog:2010/04/18.1
[5]
http://www.postfix.org/
[6]
http://www.amazon.com/exec/obidos/ASIN/0131411551/conmanlaborat-20
[7]
http://www.sendmail.org/