* * * * *

                            Concurrency is tricky

As I was writing the previous entry [1] I got the nagging feeling that
something wasn't quite right with the code. I got distracted yesterday
helping a friend bounce programming issues off me, but after that, I was
able to take a good look at the code and figure out what I did wrong.

Well, not “wrong” per se; the code worked. It's just that it could fail
catastrophically under the right conditions (or maybe the wrong conditions,
depending upon your view).

But first, a bit about how my network server framework works. The core bit of
code is this:

-----[ Lua ]-----
local function eventloop(done_f)
 if done_f() then return end

 -- calculate and handle timeouts
 -- each coroutine that timed out is
 -- scheduled to run on the RUNQUEUE,
 -- with nil and ETIMEDOUT.

 SOCKETS:wait(timeout)
 for event in SOCKETS:events() do
   event.obj(event)
 end

 while #RUNQUEUE > 0 do
   -- run each coroutine in the run queue until
 -- it either yields or returns (meaning
   -- it's finished running).
 end

 return eventloop(done_f)
end
-----[ END OF LINE ]-----

The full gory details are here [2], but in general, the event loop calls a
passed-in function to check if we need to shut down, then calculates a
timeout value while checking for coroutines that registered a timeout. If
any did, we add the coroutine to the run queue with nil and ETIMEDOUT to
inform the resuming coroutine that it timed out. Then we scan a set of
network sockets for activity with SOCKETS:wait() (on Linux, this ends up
calling epoll_wait(); on the BSDs (Berkeley Software Distribution),
kqueue(); and on most other Unix systems, poll()). We then call the
handling function for each event.
These can end up creating new coroutines and scheduling coroutines to run
(these will be added to the run queue). And then for each coroutine in the
run queue, we run it. Lather, rinse, repeat. Simple enough.
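
The timeout handling elided in the comments works something along these
lines (a sketch only; TIMEOUTS and the run-queue entry shape here are my
own stand-ins, not nfl's actual internals):

-----[ Lua ]-----
-- Sketch of the elided timeout step: find the soonest deadline for
-- SOCKETS:wait(), and move every expired coroutine onto the run
-- queue so it resumes with nil and ETIMEDOUT.
local timeout = -1 -- block indefinitely if no timeouts registered
local now     = os.time()

for co,deadline in pairs(TIMEOUTS) do
 if deadline <= now then
   TIMEOUTS[co] = nil -- clearing an existing key mid-traversal is legal
   table.insert(RUNQUEUE,{ co = co, err = ETIMEDOUT }) -- resume(co,nil,err)
 elseif timeout < 0 or deadline - now < timeout then
   timeout = deadline - now
 end
end
-----[ END OF LINE ]-----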

Now, on to the code I presented. This code registers a function to run when
the given UDP (User Datagram Protocol) socket receives a packet of data, and
schedules a number of coroutines waiting for data to run. This happens in the
eventloop() function.

-----[ Lua ]-----
nfl.SOCKETS:insert(lsock,'r',function()
 local _,data,err = lsock:recv()
 if data then
   for co in pairs(clients) do
     nfl.schedule(co,data) -- problem
   end
 else
   syslog('error',"recv()=%s",errno[err])
 end
end)
-----[ END OF LINE ]-----

I've noted a problematic line of code here.

And now the core of the routine to handle a TLS (Transport Layer Security)
connection. This code yields to receive data, then writes the data to the TLS
socket.

-----[ Lua ]-----
 while true do
   local data = coroutine.yield()
   if not data then break end
   local okay,errmsg = ios:write(data,'\n') -- <<< HERE
   if not okay then
     syslog('error',"tls:write() = %s",errmsg)
     break
   end
 end
-----[ END OF LINE ]-----

I've marked where the root cause lies, and it's pretty subtle, I think. The
core issue is that ios:write() here could block, because the kernel output
buffer is full and we need to wait for the kernel to drain it. But the code
that handles the UDP socket just assumes the TLS coroutine is ready for
more data. If ios:write() blocks and more UDP data comes in, the coroutine
is prematurely resumed with the data, which the TLS coroutine takes to mean
the write succeeded; it yields again, and things get … weird, as the UDP
side and the TLS side are now out of sync with each other. This,
fortunately, hasn't triggered on me.

Yet.
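
To make the failure mode concrete, here's roughly what a coroutine-blocking
write looks like from the inside. This is a sketch of the idea only, not
nfl's actual code; __sock, send(), and the 'w' registration are my
stand-ins:

-----[ Lua ]-----
-- Sketch: a write that blocks by yielding. When the kernel buffer
-- is full, register interest in writability and yield. The yield
-- trusts that whatever resumes it is the 'w' event it asked for,
-- so a resume from nfl.schedule(co,data) is misread as "writable."
function ios:write(data)
 local okay = self.__sock:send(data)
 if not okay then -- kernel buffer full (EAGAIN)
   nfl.SOCKETS:update(self.__sock,'w') -- wake me when writable
   coroutine.yield() -- but ANY nfl.schedule() call resumes us here
   return self:write(data) -- retry, now out of sync with the UDP side
 end
 return true
end
-----[ END OF LINE ]-----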

It could trigger, though, if too much was being logged to syslog at once. I
wrote the following code to test it out:

-----[ C ]-----
#include <syslog.h>

#define MSG " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"

int main(void)
{
 int i;

 /* flood syslog with long messages to fill any reader's buffers */
 for (i = 0 ; i < 500 ; i++)
   syslog(LOG_DEBUG,"%3d " MSG MSG MSG,i);
 return 0;
}
-----[ END OF LINE ]-----

And sure enough, the [DELETED-spice-DELETED] data stopped flowing.

What I needed to do was queue up the log messages for a given client, and
only schedule it to run when it's waiting for more data. A few failed
attempts followed; they were all based on scheduling the TLS coroutine when
X number of messages were queued up (I tried one, then zero; neither
worked). What worked much better was a flag to indicate whether the TLS
coroutine wanted to be scheduled or not.

The UDP socket code is now:

-----[ Lua ]-----
nfl.SOCKETS:insert(lsock,'r',function()
 local _,data,err = lsock:recv()
 if data then
   for co,queue in pairs(clients) do
     table.insert(queue,data)
     if queue.ready then
       nfl.schedule(co,true)
     end
   end
 else
   syslog('error',"recv()=%s",errno[err])
 end
end)
-----[ END OF LINE ]-----

The client list now contains a list of logs to send, along with a flag the
TLS coroutine sets to indicate whether it needs running or not. This takes
advantage of Lua's tables, which can have both a hash part (named indices)
and an array part, so we can keep the flag right in the queue.
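
For example (hypothetical data, but stock Lua behavior), a client's queue
might look like this:

-----[ Lua ]-----
local queue = { ready = false }    -- hash part: the scheduling flag
table.insert(queue,"log line one") -- array part: queued messages
table.insert(queue,"log line two")
print(#queue,queue.ready)          --> 2  false
-----[ END OF LINE ]-----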

And now the updated TLS coroutine:

-----[ Lua ]-----
local function client_main(ios)
 local function main()
   -- drain any queued log lines to the client
   while #clients[ios.__co] > 0 do
     local data     = table.remove(clients[ios.__co],1)
     local okay,err = ios:write(data,'\n')
     if not okay then
       syslog('error',"tls:write()=%s",err)
       return
     end
   end

   -- queue drained: flag that we want scheduling, then wait
   clients[ios.__co].ready = true
   if not coroutine.yield() then
     return
   end
   clients[ios.__co].ready = false
   return main() -- proper tail call, so no stack growth
 end

 ios:_handshake()

 -- only accept clients presenting a certificate from our issuer
 if ios.__ctx:peer_cert_issuer() ~= ISSUER then
   ios:close()
   return
 end

 syslog('info',"remote=%s",ios.__remote.addr)
 clients[ios.__co] = { ready = false }
 main()
 clients[ios.__co] = nil
 syslog('info',"remote=%s disconnecting",ios.__remote.addr)
 ios:close()
end
-----[ END OF LINE ]-----

The core of the routine, the nested function main(), does the real work
here. When main() starts, the queue-readiness flag is false. It then runs
through its input queue, sending data to the client. Once that is done, it
sets the flag to true and yields. When it resumes, it sets the flag back to
false and (through a tail call) starts over again.
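
The pattern itself doesn't need nfl at all. Here's a toy, self-contained
version (the names are mine, not from the actual code) you can run with
stock Lua to watch the handshake between producer and consumer:

-----[ Lua ]-----
-- Toy version of the ready-flag handshake, runnable with stock Lua.
-- The producer only resumes the consumer when the consumer has
-- declared itself ready, so a resume can never be misinterpreted.
local queue = { ready = false }

local consumer = coroutine.create(function()
 while true do
   while #queue > 0 do
     print("consumed",table.remove(queue,1))
   end
   queue.ready = true -- safe to schedule me now
   if not coroutine.yield() then return end
   queue.ready = false -- busy; queue data, don't schedule
 end
end)

local function produce(item)
 table.insert(queue,item)
 if queue.ready then
   coroutine.resume(consumer,true)
 end
end

coroutine.resume(consumer) -- run up to the first yield
produce "one"
produce "two"
-----[ END OF LINE ]-----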

Back in the real code, this ensures that logs are queued properly for
delivery, and running the C test program again showed that it works.

[1] gopher://gopher.conman.org/0Phlog:2025/02/02.1
[2] https://github.com/spc476/lua-conmanorg/blob/43bf22dd7931b17dc86ee44b49b6c7fe6c2cc133/lua/nfl.lua#L160

Email author at [email protected]