* * * * *

                   Thoughts on optimizing a greylist daemon

Speaking of optimization [1], I thought it might be fun to profile the
greylist daemon [2]. It's not difficult: just recompile the program with the
appropriate compiler option (-pg under gcc), run it, and then let gprof
digest the gmon.out file the program leaves behind.

Now, when I wrote the greylist daemon, I did give thought to how I was going
to search through the vast amount of information, and wrote the code with
that in mind (oooh, premature optimization? Or was that a mature
optimization [3]?). I was also curious about any hotspots in the code, as I
tend to max out at 130 requests per second on my development server.

The results were a bit surprising:

Table: Each sample counts as 0.01 seconds.
% time  cumulative seconds      self seconds    calls   self Ts/call    total Ts/call   name
------------------------------
50.00   0.01    0.01    661477  0.00    0.00    tuple_cmp_ift
50.00   0.02    0.01    1       10.00   10.00   whitelist_dump_stream
0.00    0.02    0.00    140892  0.00    0.00    edomain_cmp
0.00    0.02    0.00    108648  0.00    0.00    crc32
0.00    0.02    0.00    95853   0.00    0.00    edomain_search
0.00    0.02    0.00    28270   0.00    0.00    StreamEOF
0.00    0.02    0.00    27439   0.00    0.00    report_stderr
0.00    0.02    0.00    27273   0.00    0.00    ipv4
0.00    0.02    0.00    27165   0.00    0.00    check_signals
0.00    0.02    0.00    27162   0.00    0.00    send_packet
0.00    0.02    0.00    27155   0.00    0.00    ip_match
0.00    0.02    0.00    27155   0.00    0.00    send_reply
0.00    0.02    0.00    27155   0.00    0.00    type_graylist
0.00    0.02    0.00    25458   0.00    0.00    tuple_search
0.00    0.02    0.00    24692   0.00    0.00    tuple_add
0.00    0.02    0.00    24692   0.00    0.00    tuple_allocate

------------------------------

The big surprise was that the execution time was split evenly between two
functions: tuple_cmp_ift() (which compares two tuples) and
whitelist_dump_stream(). The amusing bit is the disparity in the number of
calls between the two: 661,477 calls to the former vs. a single call to the
latter.

But in this case, the one call to whitelist_dump_stream() was made when the
program ended (all it does is write out the tuples that have been
whitelisted). Remove that from the program, and we pretty much have our
hotspot—tuple_cmp_ift().

The other surprise is that there weren't any real surprises. The test data
consists of 27,155 tuples with some duplicates, and you can see the various
checks were called that many times. edomain_search() was called as often as
it was because there are four lists, each checked roughly 24,000 times,
which accounts for its 95,853 calls. crc32 is called twice for each packet
(yes, that's intentional), and each request involves two packets (the
request and the response), so dividing its 108,648 calls by four gives
roughly 27,162, right in line with the number of requests.

The code is pretty much even except for tuple_cmp_ift(), which is our obvious
hotspot.

> int tuple_cmp_ift(const void *left,const void *right)
> {
>   const struct tuple *l = left;
>   const struct tuple *r = right;
>   int                 rc;
>
>   ddt(left   != NULL);          /* similar to assert() */
>   ddt(right  != NULL);          /* but logs to syslogd */
>   ddt(l->pad == 0xDECAFBAD); /* compiled out of profiled */
>   ddt(r->pad == 0xDECAFBAD); /* code */
>
>   /*-------------------------------
>   ; sizeof(l->ip) is 16 bytes, enough
>   ; space to hold an IPv6 address; not
>   ; that we use IPv6 currently, but planning
>   ; for the future.
>   ;------------------------------------------*/
>
>   if ((rc = memcmp(l->ip,r->ip,sizeof(l->ip))) != 0) return(rc);
>
>   if (l->fromsize < r->fromsize)
>     return(-1);
>   else if (l->fromsize > r->fromsize)
>     return(1);
>
>   if ((rc = memcmp(l->from,r->from,l->fromsize)) != 0) return(rc);
>
>   if (l->tosize < r->tosize)
>     return(-1);
>   else if (l->tosize > r->tosize)
>     return(1);
>
>   rc = memcmp(l->to,r->to,l->tosize);
>   return(rc);
> }
>
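
The tuple structure itself isn't shown above, but it can be mostly
reconstructed from the fields tuple_cmp_ift() touches. Here's a minimal
sketch (the field sizes, the ordering, and the find_tuple() wrapper are my
guesses for illustration, not the daemon's actual definitions), which also
shows why the comparison function takes const void * arguments: that's the
signature the C library's qsort() and bsearch() expect.

> #include <stdlib.h>
>
> struct tuple                /* hypothetical layout */
> {
>   unsigned long pad;        /* 0xDECAFBAD sentinel checked by ddt() */
>   unsigned char ip[16];     /* room for an IPv6 address */
>   size_t        fromsize;
>   char          from[128];  /* sizes are guesses */
>   size_t        tosize;
>   char          to[128];
> };
>
> int tuple_cmp_ift(const void *,const void *); /* the function above */
>
> /*-------------------------------------------
> ; given an array of tuples kept sorted by
> ; tuple_cmp_ift(), a search is little more
> ; than a call to bsearch()
> ;--------------------------------------------*/
>
> struct tuple *find_tuple(
>         const struct tuple *tuples,
>         size_t              ntuples,
>         const struct tuple *key
> )
> {
>   return bsearch(key,tuples,ntuples,sizeof(struct tuple),tuple_cmp_ift);
> }
>

Whether the daemon really keeps its tuples in one flat sorted array I can't
tell from the profile alone, but the comparison function's signature
strongly suggests the qsort()/bsearch() family.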

Now, before I spend any time trying to optimize this bit of code, I thought
I'd rerun the program without it. I changed tuple_search() (which calls
tuple_cmp_ift()) to simply return “not found,” and I removed the call to
whitelist_dump_stream().
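
tuple_search() itself isn't shown in this post, but assuming it hands back
a pointer to the matching tuple (or NULL when there's no match), the
stubbed-out version amounts to nothing more than this (the signature is my
guess):

> struct tuple *tuple_search(const struct tuple *key) /* hypothetical */
> {
>   (void)key;     /* ignore the key entirely */
>   return NULL;   /* nothing is ever "found" */
> }

The results were even more amusing: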

Table: Each sample counts as 0.01 seconds; no time accumulated.
% time  cumulative seconds      self seconds    calls   self Ts/call    total Ts/call   name
------------------------------
0.00    0.00    0.00    140892  0.00    0.00    edomain_cmp
0.00    0.00    0.00    108668  0.00    0.00    crc32
0.00    0.00    0.00    95853   0.00    0.00    edomain_search
0.00    0.00    0.00    27438   0.00    0.00    report_stderr
0.00    0.00    0.00    27273   0.00    0.00    ipv4
0.00    0.00    0.00    27169   0.00    0.00    check_signals
0.00    0.00    0.00    27167   0.00    0.00    send_packet
0.00    0.00    0.00    27155   0.00    0.00    ip_match
0.00    0.00    0.00    27155   0.00    0.00    send_reply
0.00    0.00    0.00    27155   0.00    0.00    type_graylist
0.00    0.00    0.00    25458   0.00    0.00    tuple_add
0.00    0.00    0.00    25458   0.00    0.00    tuple_allocate
0.00    0.00    0.00    25458   0.00    0.00    tuple_search
0.00    0.00    0.00    3577    0.00    0.00    StreamEOF

------------------------------

Yeah, a lot of function calls, but not enough accumulated time to even
survive rounding.

I'm beginning to think that the 130 requests per second limit I'm seeing
isn't a function of the code, but of the network stack. Given this result,
I doubt I'll bother optimizing tuple_cmp_ift() any time soon. That would
seem to reinforce the whole “premature optimization is eeeeevil” thought,
but in reality, I wrote the code around a few optimizations (namely,
well-considered data structures) from the beginning and didn't need to
optimize it later.

[1] gopher://gopher.conman.org/0Phlog:2007/10/23.1
[2] gopher://gopher.conman.org/0Phlog:2007/08/16.1
[3] http://cowboyprogramming.com/2007/01/04/mature-optimization-2/

Email author at [email protected]