* * * * *

            It works, but mysteriously crashes after a day or so …

A program I'm trying to run (for a small side project) keeps crashing. Well,
“crashing” isn't the right term—it technically doesn't crash, but calls
exit() when certain errors occur. The error in question happens with the
following code:

> x = fcntl(fd, F_GETFL, &fl);
> if (x < 0)
> {
>   syslog(LOG_ERR, "fcntl F_GETFL: FD %d: %s", fd, strerror(errno));
>   exit(1);
> }
>

and the error in question is:

> fcntl F_GETFL: FD -1: Bad file descriptor
>

It's in a function called set_nonblock(), which takes a file descriptor (a
reference to an open file) as a parameter and makes two calls to fcntl().
It's failing with an invalid file descriptor on the first call. So I check
the code that calls set_nonblock(); there are only two locations where
set_nonblock() is called, and in both cases the file descriptor is checked
before the call, which means that the file descriptor is being clobbered
between the initial test and the call.

Not good.
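
For reference, a function like set_nonblock() typically reads the descriptor's current flags and then writes them back with O_NONBLOCK added. The sketch below is not the code from the program in question: the name set_nonblock_checked() and the choice to return -1 instead of calling exit() are assumptions made for illustration.

```c
#include <errno.h>
#include <fcntl.h>

/* Hypothetical variant of set_nonblock(): returns 0 on success, or -1
   with errno set by fcntl(), instead of exiting the program. */
int set_nonblock_checked(int fd)
{
  int flags;

  /* First call: F_GETFL takes no third argument; the flags come back
     as the return value. A descriptor of -1 fails here with EBADF. */
  flags = fcntl(fd, F_GETFL);
  if (flags < 0)
    return -1;

  /* Second call: store the same flags with O_NONBLOCK switched on. */
  if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
    return -1;

  return 0;
}
```

Passing -1 to this sketch fails on the first fcntl() call with "Bad file descriptor", the same error text that shows up in the log line above.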

So I add more logging, and run again (mind you, this is over the course of
several days).

I finally get a location:

> stp.c:233: failed assertion newsock >= 0
>

Okay, check the code:

> int wait_for_connection(int s)
> {
>   int                newsock;
>   int                len;
>   struct sockaddr_in peer;
>
>   ddt(s > -1);
>
>     len = sizeof(struct sockaddr_in);
>     newsock = accept(s, (struct sockaddr *) &peer, &len);
>     /* dump_sockaddr (peer, len); */
>     if (newsock < 0) {
>         if (errno != EINTR)
>             perror("accept");
>     }
>     get_hinfo_from_sockaddr(peer, len, client_hostname);
>     **ddt(newsock >= 0);**
>     set_nonblock(newsock);
>     return (newsock);
> }
>

Line 233 is highlighted, and ddt() (which is a function I wrote) basically
checks the condition and if false, logs it (via syslog()) and exits the
program. And I see the error. It's subtle, but it's there. The fragment:

> newsock = accept(s, (struct sockaddr *) &peer, &len);
>
> if (newsock < 0) {
>   if (errno != EINTR)
>     perror("accept");
> }
>

is the culprit.

Under Unix, a system call (like accept()) can be interrupted, and if so, the
call fails with an error code of EINTR. Why could a system call be
interrupted? Well, say a program creates a child process (which this one
does), and that child does its job and exits. The parent process (which
created the child process) is then “interrupted” with a signal, SIGCHLD,
that says “your child process has finished.” Normally, if a system call is
interrupted, you want to try the system call again, only this code doesn't do
that (although it looks like the author intended to call accept() again but
forgot to write that code). Instead, it falls through with newsock still set
to -1 and hands that to set_nonblock().

Patch the code:

> int wait_for_connection(int s)
> {
>   int                newsock;
>   socklen_t          len;
>   struct sockaddr_in peer;
>
>   ddt(s > -1);
>
>   do
>   {
>     /* accept() overwrites len, so reset it before every attempt */
>     len     = sizeof(struct sockaddr_in);
>     newsock = accept(s,(struct sockaddr *) &peer,&len);
>     if (newsock < 0)
>     {
>       /* EINTR just means "try again"; anything else is a real error */
>       if (errno != EINTR)
>       {
>         perror("accept");
>         return(-1);
>       }
>     }
>   } while (newsock < 0);
>
>   get_hinfo_from_sockaddr(peer,sizeof(struct sockaddr_in),client_hostname);
>   set_nonblock(newsock);
>   return(newsock);
> }
>

and try again. Hopefully, this (and some other minor cleanup) will fix the
problem.

