* * * * *

                   Real-time LaBrea data processing program

I finished writing the real-time LaBrea data processing program [1] (most of
it last night at The Hospital) and the final few bugs were real doozies.

The program works by getting data from LaBrea [2], then it looks up the
connection and updates the information accordingly. The code pretty much
looks like:

> void start_tarpit(time_t stamp,char *line)
> {
>   struct tprecord *exists;
>   struct tprecord  rec;
>   size_t           index;
>
>   read_record(&rec,line);
>   exists = tpr_search(&rec,&index);
>   if (exists == NULL)
>     exists = pull_free_record();
>
>   exists->src = rec.src;
>   exists->sport = rec.sport;
>   /* ... */
>
>   record_add(index,exists);
> }
>
> /*******************************/
>
> struct tprecord *pull_free_record(void)
> {
>   if (g_poolnum == g_poolmax)
>     do_forced_garbage_collection();
>
>   return(&g_pool[g_poolnum++]);
> }
>
> /*****************************/
>
> void record_add(size_t index,struct tprecord *rec)
> {
>   if (g_recnum == g_recmax)
>     do_forced_garbage_collection();
>
>   memmove(
>         &g_rec[index + 1],
>         &g_rec[index],
>         (g_recnum - index) * sizeof(struct tprecord *)
>   );
>   g_rec[index] = rec;
>   g_recnum++;
> }
>

tpr_search() is the binary search routine I wrote the other day [3], and as
you can see, it returns the index to where the record is in the array, or to
where it should be. pull_free_record() just returns the next free slot in the
structure array, and if there are no slots available, it removes some older
records according to some criteria. And record_add() will add the record to
the pointer array, also removing older records if there is no space left.
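
For readers who haven't seen the earlier entry, here is a minimal sketch of a
binary search that reports a slot either way. It is not the actual
tpr_search(): find_slot() is a hypothetical name, and it works on a sorted
array of plain ints rather than an array of record pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for tpr_search(), over a sorted int array.
 * Returns a pointer to the matching element, or NULL if there is none.
 * Either way, *pindex receives the slot where the key is, or the slot
 * where it would have to be inserted to keep the array sorted. */
const int *find_slot(const int *arr, size_t n, int key, size_t *pindex)
{
  size_t lo = 0;
  size_t hi = n;

  assert(pindex != NULL);

  /* Narrow [lo,hi) down to the first element that is >= key. */
  while (lo < hi)
  {
    size_t mid = lo + (hi - lo) / 2;

    if (arr[mid] < key)
      lo = mid + 1;
    else
      hi = mid;
  }

  *pindex = lo;
  return (lo < n && arr[lo] == key) ? &arr[lo] : NULL;
}
```

The caller gets NULL plus an insertion point when the key is missing, which
is the shape start_tarpit() relies on.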

But consider what a forced garbage collection does: some records are deleted,
all remaining records move about, and the pointer array is resorted. So
between the time

> exists = tpr_search(&rec,&index);
>

and

> record_add(index,exists);
>

index may not be a valid index anymore!

Oops.

(Never mind the fact that one of the two calls to
do_forced_garbage_collection() is redundant.)

Simple enough to fix once I knew what was going on.
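
One possible shape for such a fix (not necessarily the one that went into
ltpstat) is to search again whenever a collection has run, so the index
always describes the array as it is now. The sketch below assumes a tiny
sorted int array in place of the record pointers; CAP, g_vals, collect(),
lower_bound() and insert_sorted() are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CAP 4                 /* hypothetical capacity, small on purpose */

static int    g_vals[CAP];    /* sorted array standing in for g_rec */
static size_t g_num;          /* number of slots in use */

/* Stand-in for do_forced_garbage_collection(): drops the smaller half
 * and slides the survivors down, so every old index becomes stale. */
static void collect(void)
{
  size_t drop = g_num / 2;

  memmove(&g_vals[0], &g_vals[drop], (g_num - drop) * sizeof g_vals[0]);
  g_num -= drop;
}

/* Index of the first element >= key (linear scan, for brevity). */
static size_t lower_bound(int key)
{
  size_t i = 0;

  while (i < g_num && g_vals[i] < key)
    i++;
  return i;
}

void insert_sorted(int key)
{
  size_t index = lower_bound(key);

  if (index < g_num && g_vals[index] == key)
    return;                       /* already tracked */

  if (g_num == CAP)
  {
    collect();
    index = lower_bound(key);     /* the old index is stale: search again */
  }

  assert(index <= g_num && g_num < CAP);
  memmove(&g_vals[index + 1], &g_vals[index],
          (g_num - index) * sizeof g_vals[0]);
  g_vals[index] = key;
  g_num++;
}
```

With the array holding 10, 20, 30 and 40, inserting 35 first finds index 3,
but after the collection leaves only 30 and 40 the right index is 1.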

Another bug dealt with named pipes. Instead of directly piping the data from
LaBrea to ltpstat (what I call the real-time LaBrea data processing program),
I decided to go through a named pipe, which would allow me to start and stop
either one independently of the other.

Now, I'm testing my program, running cat sample > labrea.pipe in one window,
and ltpstat --input labrea.pipe in another. It's working, until the data in
sample runs out and cat closes its side of the named pipe.

Now, the code that reads in the data is in a library I wrote, and it just
assumes when a read() returns 0, that it's the end of the file, and marks it
as being closed. ltpstat ignores the “end of file” status, and keeps trying
to read a now-closed file. We get into a busy loop and the system load shoots
up. Also, if I now try to pump more data through the named pipe, ltpstat
ignores the data.

Modifying the library code to not mark “end of file” when there's nothing to
read doesn't help, because from that point on read() just keeps returning
nothing anyway. This seems to be a “feature” of Linux (or Unix; I didn't have
_Advanced Programming in the Unix® Environment_ [4] with me to look this up),
so I restructure the main loop:

> while(1)
> {
>   in = openinput();
>   while(!StreamEOF(in))
>   {
>     process_labrea_output();
>   }
>   StreamFree(in);
> }
>
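
Underneath the library calls, the same idea can be sketched with plain POSIX
calls. A read() of 0 on a pipe means no process has it open for writing, and
it keeps returning 0 without blocking until a writer shows up, hence the busy
loop. Closing and reopening makes open() block until the next writer arrives.
drain_until_eof() and serve_fifo() are hypothetical names, and the byte count
stands in for the real processing.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Read fd until read() returns 0 (no writers left) or fails.
 * Returns the number of bytes consumed, or -1 on a read error. */
long drain_until_eof(int fd)
{
  char buf[512];
  long total = 0;

  assert(fd >= 0);

  for (;;)
  {
    ssize_t n = read(fd, buf, sizeof buf);

    if (n == 0)
      return total;       /* writer closed its end */
    if (n < 0)
      return -1;
    total += n;           /* a real program would parse the data here */
  }
}

/* Open the named pipe, drain it, close it, and repeat for the given
 * number of writer sessions. Returns total bytes read, or -1 on error. */
long serve_fifo(const char *path, int sessions)
{
  long total = 0;

  while (sessions-- > 0)
  {
    int  fd = open(path, O_RDONLY);   /* blocks until a writer opens it */
    long n;

    if (fd < 0)
      return -1;
    n = drain_until_eof(fd);
    close(fd);
    if (n < 0)
      return -1;
    total += n;
  }
  return total;
}
```

Something like serve_fifo("labrea.pipe", 2) would sit through two runs of
cat sample > labrea.pipe without spinning in between.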

That was all fine and good, until I threw signals into the mix.

I use signals to tell ltpstat to dump various information: SIGHUP to print
the number of connections and unique IPs being tracked, SIGUSR1 to do a raw
dump of all the data accumulated so far, and SIGUSR2 to generate a more or
less human readable dump.

But signals are basically interrupts. And operating system theory states that
one process should never find another process with its pants down (PCLSRing:
Keeping Process State Modular) [5] (so to speak—and I should warn you—the
paper I linked to is way more technical than I've gotten here). Signals and
system calls interact (that is, if a process is signaled while making a
system call into the kernel) in one of two ways—the system call will simply
fail, or it will be restarted automatically. And one can select which method
to use.

If I elected to have the system call fail, then ltpstat would fail, either in
opening (a system call) the named pipe, or in reading (another system call)
the named pipe after I signal the program.

If I elected to have the system call restarted, then the signal handlers I
set up would never get called (due to the way I handle the signals).

I ended up reimplementing two routines from my library (which is used in more
than just this program) for just this program. I select the “system call
fail” method, check to see if the system call failed due to a signal, and if
so, check for the signals, and try again.

Again, this took a few hours to track down.

But now the program works, and I can finally get real time statistics from
LaBrea.

[1] gopher://gopher.conman.org/0Phlog:2006/01/14.2
[2] http://sourceforge.net/projects/labrea
[3] gopher://gopher.conman.org/0Phlog:2006/01/15.2
[4] http://www.amazon.com/exec/obidos/ASIN/0201563177/conmanlaborat-20
[5] http://fare.tunes.org/tmp/emergent/pclsr.htm
