Programs are buggy because error checking is tedious and error prone. Ironic,

* * * * *

Programs are buggy because error checking is tedious and error prone. Ironic,
don't you think?

> “There are no easy answers when it comes to reporting and handling errors.”
>
> “P. J. Plauger”
>

I release a new version of the greylist daemon [1] and what happens? Mark [2]
finds a bug.

Or rather, the new version didn't fix his current problem.

I take a look into the issue and it's not terribly surprising that I botched
handing a particular error.

Sigh.

Error handling was taught by osmosis at college (Computer Science and
Engineering at Florida Atlantic University) [3]. It was expected that we just
magically pick up on handling errors but really, as long as the program
didn't outright crash and produced something vaguely like the output, all was
fine.

In fact, when I expressly asked one my instructors what to do when handling a
particular thorny error, I was told, point blank: “if you don't know how to
handle the error, then don't check for it.” The instructor, sadly, had a
point—you could go mad from trying to handle every possible error condition.

It's not hard to test for errors. In C, just about every function you can
think of can return an error (with a few exceptions, like getpid() under Unix
if a process can't get its own process ID, then you have other things to
worry about). But checking the return of every function call gets tedious,
fast. What was once a small concise function:

> int create_socket(struct sockaddr *paddr,socklen_t saddr)
> {
> ListenNode listen;
> struct epoll_event ev;
> int reuse = 1;
>
> assert(paddr != NULL);
> assert(saddr > 0);
>
> listen = malloc(sizeof(struct listen_node));
>
> memset(listen,0,sizeof(struct listen_node));
> memcpy(&listen->local,paddr,saddr);
>
> listen->fn = event_read;
> listen->sock = socket(paddr->sa_family,SOCK_DGRAM,0);
>
> setsockopt(listen->sock,SOL_SOCKET,SO_REUSEADDR,&reuse,sizeof(reuse));
> fcntl(listen->sock,F_GETFL,0);
> fcntl(listen->sock,F_SETFL,rc | O_NONBLOCK);
> bind(listen->sock,paddr,saddr);
> memset(&ev,0,sizeof(ev));
>
> ev.events = EPOLLIN;
> ev.data.ptr = listen;
> epoll_ctl(g_queue,EPOLL_CTL_ADD,listen->sock,&ev);
>
> return listen->sock;
> }
>
>

becomes twice as big as each function call is wrapped up in an if statement:

> int create_socket(struct sockaddr *paddr,socklen_t saddr)
> {
> ListenNode listen;
> struct epoll_event ev;
> int reuse = 1;
>
> assert(paddr != NULL);
> assert(saddr > 0);
>
> listen = malloc(sizeof(struct listen_node));
> if (listen == NULL)
> return -1;
>
> memset(listen,0,sizeof(struct listen_node));
> memcpy(&listen->local,paddr,saddr);
>
> listen->fn = event_read;
> listen->sock = socket(paddr->sa_family,SOCK_DGRAM,0);
>
> if (listen->sock == -1)
> {
> perror("socket()");
> return -1;
> }
>
> if (setsockopt(listen->sock,SOL_SOCKET,SO_REUSEADDR,&reuse,sizeof(reuse)) == -1)
> {
> perror("setsockopt()");
> return -1;
> }
>
> if fcntl(listen->sock,F_GETFL,0) == -1)
> {
> perror("fcntl(GETFL)");
> return -1;
> }
>
> if (fcntl(listen->sock,F_SETFL,rc | O_NONBLOCK) == -1)
> {
> perror("fcntl(SETFL)");
> return -1;
> }
>
> if (bind(listen->sock,paddr,saddr) == -1)
> {
> perror("bind()");
> return -1;
> }
>
> memset(&ev,0,sizeof(ev));
> ev.events = EPOLLIN;
> ev.data.ptr = listen;
>
> if (epoll_ctl(g_queue,EPOLL_CTL_ADD,listen->sock,&ev) == -1)
> {
> perror("epoll_ctl(ADD)");
> return -1;
> }
>
> return listen->sock;
> }
>
>

And even this isn't all that great. The function prints what happened (via
the perror() call) but that presupposes that stderr (the standard error
reporting file) is open! If it's not open, well then … if you're lucky,
perror() (or whatever code it eventually calls) checks to see if stderr is
open and if not, just … fail … gracefully? I guess? Hopefully? Maybe?

But even if we could return what failed to the caller, then that means the
implementation details get pushed out of the create_socket() function (which
is typically considered a “bad idea”). Even setting that aside, what can the
caller do? Well, not much really.

The socket() call could fail because there's not enough kernel memory, or
there's too many files already open, we don't have permission to create the
socket, or the protocol isn't supported. Don't have the right privileges? If
the process isn't root, there's not much that can be done (and if the process
was root, there wouldn't be an issue). Not enough kernel memory? I wouldn't
even know what to do in that case except kill off a few processes or reboot
the box. But in each case, there isn't much the program can do except give up
(and maybe attempt to log the error somewhere).

And all the other calls are pretty much the same—no memory, no privileges,
can't do it, etc. We can break the errors down into a few categories:

1. programming errors, like EBADF (not an open file, or the operation
couldn't be done given how the file was open originally) or EINVAL
(invalid parameter) that need to be fixed, but once fixed, never happens
again;
2. it can be fixed, like EACCESS (bad privileges) or ELOOP (too many
symbolic links when trying to resolve a filename) but that the fix has
to happen outside the scope of the program, but once fixed, tends not
happen again unless someone made a mistake;
3. better exit the program as quickly and cleanly as possible because
something bad, like ENOMEM (insuffient kernel memory) just happened and
things are going bad quickly. Depending upon the circumstances, a fast,
hard crash might be the best thing to do;
4. and finally, the small category of errors that a program might be able
to handle, like ENOENT (file doesn't exist) depending upon the context
(it could then create the file, or ask the user for a different file,
etc.).

The problem being: a program can run for ages before you see an error (Mark
ran my greylist daemon [4] for a few years before the error manifested
itself, and only then because the server operating system was upgraded—the
bug he hit was of the last category—something it could have handled, but I
didn't handle it properly) so it's not uncommon for error paths in programs
to have, well, errors (ironically enough).

In fact, there's really only three ways to handle errors:

1. every subroutine returns an indication of success or failure (C uses
this—the “calls filter downward, errors bubble upwards” model) and every
call site needs a check;
2. subroutines can cause an exception, which immediately transfers program
flow control to some earlier caller up the stack frame, which caller
gets the exception depends upon which caller is expecting which
exception (C++ uses this—the “dynamic spaghettiesque come-from” model);
3. ignore errors entirely and assume everything will always work (you can
do this in any programming language—it's from the Alfred E. Neuman [5]
“What? Me, worry?” school of programming).

Each method has its pros and cons and nobody is really happy with any of the
methods, but really, that's all there is when you get down to it. The first
is tedious, but doesn't require any special langauge features; the second
requires support in the language, and even so, is really spaghetti code [6]
in hiding and the third … well, again, no special language support is needed,
isn't tedious and it tends to make the code fast but, well … let's just say
that the error recovery can make a programmer go postal [7].

I just wish there was a better way …

Update on Wednesday, Debtember 2^nd, 2009

I could claim I left finding the errors in the code above as an exercise for
the reader, but really, I blew it [8].

[1] gopher://gopher.conman.org/0Phlog:2009/12/01.1
[2] http://gladesoft.com/
[3] http://www.cse.fau.edu/
[4] http://www.x-grey.com/
[5] http://www.leconcombre.com/concpost/us/postcard4/alfred_e_neuman.html
[6] http://en.wikipedia.org/wiki/Spaghetti_code
[7] http://en.wikipedia.org/wiki/Going_postal
[8] gopher://gopher.conman.org/0Phlog:2009/12/02.1

Email author at [email protected]