Bug hunt

* * * * *

Bug hunt

Today T, a fellow cow-order here at The Ft. Lauderdale Office of The
Corporation, found a particularly nasty bug in “Project: Sippy-Cup [1]”—two
identical SIP (Session Initiation Protocol) packets caused the request to
fail. It uses UDP (User Datagram Protocol) for the network transport, so
duplicate packets are a possibility (so are dropped packets, but that's
already being handled properly), something that the code didn't handle
properly.

Now, “Project: Sippy-Cup” is written in Lua [2]. I only say this because of
how I handle requests. The main loop of the program waits for network packets
using poll() [3] (or epoll_wait() [4] if running under Linux; they're like
calls to select() [5] (which is all you ever see in Unix networking tutorials
because everybody is basing the code off examples written in the early 90s
but I digress) only less CPU (Central Processing Unit) intensive). When
poll() returns, I either create a coroutine or resume a coroutine (depending
on where we are in the request) to handle the request.

I did it this way, even though I have to manually switch coroutines, to make
the processing of requests easier to follow:

> function handle(request)
> if not validate(request)
> socket:send(request.sender,"400 Bad client! No cookie for you!")
> return
> end
>
> socket:send(request.sender,"200 Good boy! You're such a good boy!")
>
> local info = lookup_details(request) -- blocking call
> local contact = dns.lookup(request.contact) -- blocking call
>
> socket:send(contact,info)
> local reply = socket:recv() -- blocking call
>
> if not reply.okay then
> log("I messed up")
> end
> end
>

rather than having all this logic spread across multiple functions where each
function represents a step in the processing:

> -- not shown, the code that receives packets and calls the appropriate
> -- routine.
>
> function handle_incoming(state)
> if not validate(state.request)
> socket:send(state.request.sender,"400 Bad client! No cookie for you!")
> return
> end
>
> socket:send(state.request.sender,"200 Good boy! You're such a good boy!")
> return lookup_details -- next function to call
> end
>
> function lookup_details(state)
> state.info = -- code to send request
> return contact_dns
> end
>
> function contact_dns(state)
> state.contact = state.request.contact
> -- code to send a DNS request
> return got_dns_response
> end
>
> function got_dns_response
> socket:send(state.contact,state.info)
> return process_reply
> end
>
> function process_reply(state)
> if not state.reply.okay then
> log("I messed up")
> end
> end
>

It's fairly straightforward—once I get a SIP packet, I check a list to see if
I've already created a coroutine based on the unique ID included in the
packet. If one does exist, I then resume that coroutine; otherwise, I create
one. The coroutine will run until it either finishes, or a function will
block (say, a DNS (Domain Name Service) query), and the code that could block
will add the coroutine to a list it maintains and yield back to the main
code.

The main coroutine, at 50,000 feet:

1. poll() returns.
2. It's a SIP packet: process it:
2. 1. Get the unique ID from the SIP packet.
2. 2. Use the ID to check the list of SIP coroutines. If a coroutine does
not exist, create one and add it to the list.
2. 3. Resume the coroutine.
2. 4. If the coroutine has ended, remove the reference from the list of
SIP coroutines; otherwise
2. 5. The coroutine has called coroutine.yield(), so keep the reference
arond.

3. It's another type of packet (see below)
4. Go back to step 1.

Any routine that might block (and they're all network related) will put the
coroutine on a list, and yield. So, when call dns.lookup(), that function
does:

1. Do have have the data cached? If so, return it.
2. Otherwise, create the DNS request.
3. Send the DNS request to a DNS server.
4. Place the coroutine on a DNS list, indexed with the DNS request ID.
5. Call coroutine.yield(), which returns us to the main loop.
6. We resume here, so remove the coroutine from the DNS list.
7. Cache the data.
8. Return the data.

So, step 3 of the main coroutine contains these sub-steps:

1. Get the unique ID from the DNS packet.
2. Get the coroutine from the DNS list, indexed by the ID.
3. Resume that coroutine.

The lookup_detail() function (the handling of which would be step 4, which
isn't shown) is similar to the DNS lookup. This all works and while the main
driver code is a bit complex (and the complexity has to live somewhere), the
actual processing code is nice and linear, making it easier to change the
“business logic” in one place rather than have it scattered throughout the
codebase.

Now we get to the bug—mishandling of duplicate packets. The first SIP packet
comes in, has a unique ID, so a coroutine is created, we get to the call to
lookup_details() and we yield, waiting for a reponse from our database engine
(not a real database engine, but close enough). The duplicate SIP packet
comes in. We already have a coroutine associated with this SIP ID, so re
blindly resume it. The coroutine, freshly resumed, assumes the reply from the
database has arrived and things go down hill from there (but only for that
request; the program is still running and can handle new requests).

The solution is to realize we have a duplicate request and not resume the
coroutine in that case.

And like most bugs, the fix is trivial—it's the hunt that's difficult.

[1] gopher://gopher.conman.org/0Phlog:2014/03/05.1
[2] http://lua.org/
[3] http://pubs.opengroup.org/onlinepubs/009695399/functions/poll.html
[4] http://linux.die.net/man/4/epoll
[5] http://pubs.opengroup.org/onlinepubs/007908799/xsh/select.html

Email author at [email protected]