* * * * *

                 Sixteen hours to change three lines of code

The bug report was simple enough: “Project: Sippy-Cup [1]” would, when it had
to retransmit the data (remember, we're running over UDP), it would send an
additional packet.

Table: Data flow for “Project: Sippy-Cup”
Sippy Cup       direction       Client
------------------------------
DATA    →     (packet dropped)
DATA    →     (packet received)
(packet recieved)       ←     ACK
DATA    →      

So far, the bugs found in “Project: Sippy-Cup” have been easy to reproduce
(thankfully!) and easy to fix once identified. This one was easy to reproduce
… and took two days of intense effort to fix. The bug is not a show-stopper,
and the client should (there's that word) handle the duplicate packet. My
manager, S, was subtly hinting that I should put this on the backburner, but
this is one of those problems that just irks me.

I added logging statements. I instrumented the code. I traced the execution
of the program line-by-line (easy enough to do within Lua [2] as there's a
pre-defined method for hooking into the execution engine, and even though
this didn't reveal the root cause of the bug, it did reveal a few places
where some simple changes optimized the code (but that's for another post
[3]). For example, here's all the code that's executed [4] when you run the
luarocks [5] command, but fair warning—it's nearly 7,000 lines long) and it
became clear that something was wrong with the timeout handling (what I
didn't mention the other day about scheduling coroutines [6] is how I handle
timeouts, but alas, that's for another post [7] as well). The code is
basically:

> request = receive_packet_from_client()
> send_ack()
> send_request_to_db(request)
>
> -- ------------------------------------------------------------------
> -- This sets things up to force the coroutine to resume in one second,
> -- returning a result of nil.
> -- ------------------------------------------------------------------
>
> timeout(1)
>
> -- ----------------------------------------------------------------
> -- the code that resumes this coroutine will pass in the result in
> -- coroutine.resume(), which is assigned to dbresult
> -- ----------------------------------------------------------------
>
> dbresult = coroutine.yield()
>
> -- ------------------------------------------------------------------------
> -- cancel a pending timeout, if any.  If we did timeout, this is a NOP.  If
> -- we haven't timed out, this prevents the timeout from triggering.
> -- ------------------------------------------------------------------------
>
> timeout(0)
>
> if not dbresult then
>   info = failed()
> else
>   info = extract(dbresult)
> end
>
> for retries = 1 , 3 do
>   send_information_to_client()
>   timeout(1)
>   ack = coroutine.yield()
>   timeout(0)
>   if ack then
>     return
>   end
> end
>

I was able to determine that the initial “timeout” was the request to the
database engine timeout trigger (after hacking in a way to track the timeout
requests) which meant the timeout code wasn't working properly. Once that was
clear, I was able to locate the problem code:

> -- ------------------------------------------------------------------------
> -- TQUEUE contains references to coroutines that want a timeout.  If we set
> -- the timeout to 0, this just sets the awake field to 0, which informs the
> -- coroutine scheduler not to run the coroutine when it "times out".
> -- ------------------------------------------------------------------------
>
> function timeout(when)
>   local f = coroutine.running()
>
>   if when == 0 then
>     if TQUEUE[co] then
>       TQUEUE[co].awake = 0 -- oops, bug.  See this post [8] for details
>     end
>   else
>     TQUEUE:insert(when,f)
>   end
> end
>

Here we have an instance where a static compiler would have flagged this
immediately [9]. I either need to rename f to co, or change co to f. A simple
two line fix (and I really need to find a code checker for Lua).

The other issue, now quickly found, was this bit of code:

> local function handler(info,err,remote,message,startime)
>   ...
>   local ninfo,err,nremote,nmessage = coroutine.yield()
>   timeout(0)
>
>   if err == 0 then
>     if not message.request.status then
>       -- error
>     else
>       return nmessage
>     end
>   else
>   ...
>

Even a static compiler would not have found this issue—I was checking the
wrong variable! The code should read:

> if not nmessage.request.status then
>

Bad me for using a bad name for a variable. No cookie for me!

Okay, this is a simple one line change. Total lines changed: 3.

[1] gopher://gopher.conman.org/0Phlog:2014/03/05.1
[2] http://www.lua.org/
[3] gopher://gopher.conman.org/0Phlog:2015/02/13.1
[4] gopher://gopher.conman.org/0Phlog:2015/02/12/luarocks-run.txt
[5] https://rocks.moonscript.org/
[6] gopher://gopher.conman.org/0Phlog:2015/02/05.1
[7] gopher://gopher.conman.org/0Phlog:2015/02/16.1
[8] gopher://gopher.conman.org/0Phlog:2015/02/16.1
[9] gopher://gopher.conman.org/0Phlog:2015/02/09.1

Email author at [email protected]