* * * * *

                     Processing realtime data from LaBrea

Since I'm stuck at The Hospital [1], and The Hospital doesn't have wireless,
I might as well work on the real-time LaBrea data processing program. Loaded
up the laptop with program fragments and libraries I may need and brought
along my copy of _Advanced Programming in the Unix® Environment_ [2] by W.
Richard Stevens, and am coding up a storm.

And yes, I'm coding this up in C. Not like I'm going anywhere in the next
couple of days.

* * * * *

LaBrea spits out lines like:

> 1136832897 Initial Connect - tarpitting: 82.240.204.251 3334 -> XXX.XXX.XXX.XXX *
> 1136832897 Initial Connect - tarpitting: 82.240.204.251 3339 -> XXX.XXX.XXX.XXX 139
> 1136832897 Persist Trapping: 82.240.204.251 3334 -> XXX.XXX.XXX.XXX 445 *
> 1136832898 Persist Activity: 216.248.36.242 45285 -> XXX.XXX.XXX.XXX 135
> 1136832898 Persist Activity: 216.248.36.242 34589 -> XXX.XXX.XXX.XXX 135 *
> 1136832898 Persist Activity: 216.211.61.158 3862 -> XXX.XXX.XXX.XXX 135
> 1136832898 Persist Activity: 211.236.205.138 4459 -> XXX.XXX.XXX.XXX 139 *
>

There's some other stuff that's logged, but I'm primarily looking for lines
like the above. Which means the information I can store will look something
like:

> struct tprecord
> {
>   IP     src;     /* source address */
>   Port   sport;   /* source port */
>   IP     dest;    /* destination address */
>   Port   dport;   /* destination port */
>   time_t start;   /* first time we see src:sport-dest:dport */
>   time_t last;    /* time of last activity of src:sport-dest:dport */
>   size_t packets; /* number of packets seen for this connection */
> };
>
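
Pulling the fields out of one of those lines is easy enough with sscanf().
A rough sketch (the names here are mine, and the destination port has to be
read as a string since it can be a “*”):

> #include <stdio.h>
>
> /* sketch: pull the fields out of one LaBrea log line */
>
> int parse_line(const char *line)
> {
>   unsigned long timestamp;
>   char          type[64];    /* "Initial Connect - tarpitting", etc. */
>   char          srcip[16];
>   char          dstip[16];
>   int           sport;
>   char          dport[8];    /* a port number, or "*" */
>
>   if (sscanf(line,"%lu %63[^:]: %15s %d -> %15s %7s",
>              &timestamp,type,srcip,&sport,dstip,dport) != 6)
>     return 0;
>
>   /* convert srcip/dstip (inet_addr(), say) and fill in a tprecord */
>   return 1;
> }
>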

Now, since we'll be handling a lot of connections, we need to store these
structures in a way that is easy to search through. I could use a hash
table, keying off the src:sport→dest:dport tuple; that's 12 bytes of pretty
much random binary data, which is good for this type of thing. But the
table would be open ended, since there's no knowing how many entries we'll
be handling. There are several methods to handle overflows in a hash table,
but that still means doing a linear search at some point, and then there's
the overhead of the hash table itself.

There's also storing the data in a tree. In fact, I even have code in C to
manage a balanced tree (taken from Knuth). But the downside is that I only
have code to add to the tree, not to delete from it. And it was hard enough
to write as it is.

And there's still the overhead of maintaining a tree structure.

So I'm using arrays. Less overhead, and they're straightforward to manage.
The only trick is keeping the array sorted in some order, but as long as
you compare the structures in a consistent manner, that's not really an
issue (and by the way, I'm sorting by source address, source port,
destination address and destination port, in that order).
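
The comparison routine for that is simple enough; something like this (a
sketch, and it assumes the IP and Port types above are plain unsigned
integers, which may or may not be how they end up):

> /* compare two records by source address, source port, destination
>    address and destination port, in that order */
>
> int tprecord_cmp(const struct tprecord *a,const struct tprecord *b)
> {
>   if (a->src   != b->src)   return (a->src   < b->src)   ? -1 : 1;
>   if (a->sport != b->sport) return (a->sport < b->sport) ? -1 : 1;
>   if (a->dest  != b->dest)  return (a->dest  < b->dest)  ? -1 : 1;
>   if (a->dport != b->dport) return (a->dport < b->dport) ? -1 : 1;
>   return 0;
> }
>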

To make it even easier, I have an array of structures (see above), and an
array of pointers to said structures, and it's the pointer based array that
is sorted (since it's faster to swap pointers than swap whole structures).
And once sorted, you can use a binary search to find the record.
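
Sorting the pointer array with qsort() just takes a small wrapper, since
qsort() hands the comparison function pointers to the array elements, and
the elements here are themselves pointers. Again, a sketch (the variable
names are mine):

> #include <stdlib.h>
>
> static struct tprecord  *pool;       /* the records themselves          */
> static struct tprecord **sorted;     /* pointers into pool, kept sorted */
> static size_t            nrecords;   /* how many records are in use     */
> static size_t            maxrecords; /* how many we have room for       */
>
> static int sorted_cmp(const void *l,const void *r)
> {
>   const struct tprecord *const *a = l;
>   const struct tprecord *const *b = r;
>   return tprecord_cmp(*a,*b);
> }
>
>   /* ... after records have been added ... */
>   qsort(sorted,nrecords,sizeof(struct tprecord *),sorted_cmp);
>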

The standard binary search in C, bsearch(), returns either a pointer to the
found record, or NULL if it isn't found. Good for most uses, but not quite
what I want. In addition to finding the record or not, I'd also like to
know where in the array the record was found, and if it wasn't found, where
it would appear if it were in the array. And C's bsearch() does not return
the index.

Why do I want that?

My thinking is that if the search (regardless of what type of search it is)
doesn't find the record in a sorted array, it still has the information as
to where the missing record should go. And with that information, it would
be “trivial” to add the record at the right point in the array.
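
Given that index, opening up a slot is just a memmove() (a sketch, assuming
the sorted pointer array from above and that there's room for one more):

> #include <string.h>
>
> /* open a slot at position 'where' in the sorted pointer array and
>    drop the new record in; the caller makes sure there's room */
>
> void insert_at(size_t where,struct tprecord *rec)
> {
>   memmove(&sorted[where + 1],&sorted[where],
>           (nrecords - where) * sizeof(struct tprecord *));
>   sorted[where] = rec;
>   nrecords++;
> }
>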

And that beats adding the new record at the end and resorting (C's qsort()
is typically Quicksort, which runs in O(n lg n) time on average, but the
worst case is O(n^2), and the worst case is when the array is mostly sorted
to begin with, which is why I'd rather not resort after each addition).

And try as I might, I could not get a binary search to work that would also
return the index. For now, I replaced the search with a linear search, until
I can comb through some reference materials at home.
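
A linear version of the search I want is easy enough; something like this
sketch:

> /* find 'key' in the sorted pointer array; returns 1 if found, 0 if
>    not, and either way sets *pidx to the index where it is (or where
>    it would go) */
>
> int tprecord_find(const struct tprecord *key,size_t *pidx)
> {
>   size_t i;
>
>   for (i = 0 ; i < nrecords ; i++)
>   {
>     int rc = tprecord_cmp(key,sorted[i]);
>     if (rc == 0) { *pidx = i; return 1; }
>     if (rc <  0) break;   /* we're past where it would be */
>   }
>
>   *pidx = i;
>   return 0;
> }
>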

The other problem I'm having is that the array manipulation code doesn't
quite work. When the program starts, it has space to store X records. If it
runs out of space, it increases the size of the various arrays until it
hits some large number of records, at which point it goes through and
purges records fitting some criteria.
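
The intent is nothing fancy; something along these lines (a sketch of the
idea, not the code that's actually misbehaving; HARD_LIMIT stands in for
that large number, and doubling is just one way to grow):

> #include <stdlib.h>
>
> /* grow both arrays when we run out of room; past HARD_LIMIT the
>    caller purges old records instead of growing further */
>
> int grow_arrays(void)
> {
>   size_t            newmax = maxrecords * 2;
>   size_t            i;
>   struct tprecord  *npool;
>   struct tprecord **nsorted;
>
>   if (newmax > HARD_LIMIT)
>     return 0;                 /* time to purge instead */
>
>   npool = realloc(pool,newmax * sizeof(struct tprecord));
>   if (npool == NULL)
>     return 0;
>
>   /* the records may have moved, so repoint the sorted entries */
>   for (i = 0 ; i < nrecords ; i++)
>     sorted[i] = npool + (sorted[i] - pool);
>   pool = npool;
>
>   nsorted = realloc(sorted,newmax * sizeof(struct tprecord *));
>   if (nsorted == NULL)
>     return 0;
>
>   sorted     = nsorted;
>   maxrecords = newmax;
>   return 1;
> }
>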

Only it's not working properly.

Got a lot of work done on the program though, and it was nice to get into The
Zone (even with a headache). At this point, I'm headed home for sleep and
what looks to be yet another day at The Hospital.

[1] gopher://gopher.conman.org/0Phlog:2006/01/14.1
[2] http://www.amazon.com/exec/obidos/ASIN/0201563177/conmanlaborat-20

Email author at [email protected]