* * * * *
Sixteen hours to change three lines of code
The bug report was simple enough: “Project: Sippy-Cup [1]” would, when it had
to retransmit the data (remember, we're running over UDP), it would send an
additional packet.
Table: Data flow for “Project: Sippy-Cup”
Sippy Cup direction Client
------------------------------
DATA → (packet dropped)
DATA → (packet received)
(packet recieved) ← ACK
DATA →
So far, the bugs found in “Project: Sippy-Cup” have been easy to reproduce
(thankfully!) and easy to fix once identified. This one was easy to reproduce
… and took two days of intense effort to fix. The bug is not a show-stopper,
and the client should (there's that word) handle the duplicate packet. My
manager, S, was subtly hinting that I should put this on the backburner, but
this is one of those problems that just irks me.
I added logging statements. I instrumented the code. I traced the execution
of the program line-by-line (easy enough to do within Lua [2] as there's a
pre-defined method for hooking into the execution engine, and even though
this didn't reveal the root cause of the bug, it did reveal a few places
where some simple changes optimized the code (but that's for another post
[3]). For example, here's all the code that's executed [4] when you run the
luarocks [5] command, but fair warning—it's nearly 7,000 lines long) and it
became clear that something was wrong with the timeout handling (what I
didn't mention the other day about scheduling coroutines [6] is how I handle
timeouts, but alas, that's for another post [7] as well). The code is
basically:
> request = receive_packet_from_client()
> send_ack()
> send_request_to_db(request)
>
> -- ------------------------------------------------------------------
> -- This sets things up to force the coroutine to resume in one second,
> -- returning a result of nil.
> -- ------------------------------------------------------------------
>
> timeout(1)
>
> -- ----------------------------------------------------------------
> -- the code that resumes this coroutine will pass in the result in
> -- coroutine.resume(), which is assigned to dbresult
> -- ----------------------------------------------------------------
>
> dbresult = coroutine.yield()
>
> -- ------------------------------------------------------------------------
> -- cancel a pending timeout, if any. If we did timeout, this is a NOP. If
> -- we haven't timed out, this prevents the timeout from triggering.
> -- ------------------------------------------------------------------------
>
> timeout(0)
>
> if not dbresult then
> info = failed()
> else
> info = extract(dbresult)
> end
>
> for retries = 1 , 3 do
> send_information_to_client()
> timeout(1)
> ack = coroutine.yield()
> timeout(0)
> if ack then
> return
> end
> end
>
I was able to determine that the initial “timeout” was the request to the
database engine timeout trigger (after hacking in a way to track the timeout
requests) which meant the timeout code wasn't working properly. Once that was
clear, I was able to locate the problem code:
> -- ------------------------------------------------------------------------
> -- TQUEUE contains references to coroutines that want a timeout. If we set
> -- the timeout to 0, this just sets the awake field to 0, which informs the
> -- coroutine scheduler not to run the coroutine when it "times out".
> -- ------------------------------------------------------------------------
>
> function timeout(when)
> local f = coroutine.running()
>
> if when == 0 then
> if TQUEUE[co] then
> TQUEUE[co].awake = 0 -- oops, bug. See this post [8] for details
> end
> else
> TQUEUE:insert(when,f)
> end
> end
>
Here we have an instance where a static compiler would have flagged this
immediately [9]. I either need to rename f to co, or change co to f. A simple
two line fix (and I really need to find a code checker for Lua).
The other issue, now quickly found, was this bit of code:
> local function handler(info,err,remote,message,startime)
> ...
> local ninfo,err,nremote,nmessage = coroutine.yield()
> timeout(0)
>
> if err == 0 then
> if not message.request.status then
> -- error
> else
> return nmessage
> end
> else
> ...
>
Even a static compiler would not have found this issue—I was checking the
wrong variable! The code should read:
> if not nmessage.request.status then
>
Bad me for using a bad name for a variable. No cookie for me!
Okay, this is a simple one line change. Total lines changed: 3.
[1]
gopher://gopher.conman.org/0Phlog:2014/03/05.1
[2]
http://www.lua.org/
[3]
gopher://gopher.conman.org/0Phlog:2015/02/13.1
[4]
gopher://gopher.conman.org/0Phlog:2015/02/12/luarocks-run.txt
[5]
https://rocks.moonscript.org/
[6]
gopher://gopher.conman.org/0Phlog:2015/02/05.1
[7]
gopher://gopher.conman.org/0Phlog:2015/02/16.1
[8]
gopher://gopher.conman.org/0Phlog:2015/02/16.1
[9]
gopher://gopher.conman.org/0Phlog:2015/02/09.1
Email author at
[email protected]