* * * * *

  On the plus side, it happens the same way each time; on the down side, it
                             doesn't happen often

I'm still trying to find the cause of the crash [1]. I made a small change (a
switch from using lua_touserdata(), which can return NULL, to using
luaL_checkudata(), which if it returns, will never return NULL; else it
throws an error) which didn't help resolve the issue—the program still dumped
core.

Both last week and today, it crashed in the same place in Ltls_write():

-----[ C ]-----
static int Ltls_write(lua_State *L)
{
 struct tls **tls  = luaL_checkudata(L,1,TYPE_TLS);
 size_t       len;
 char const  *data = luaL_checklstring(L,2,&len);

 lua_pushinteger(L,tls_write(*tls,data,len));
 return 1;
}
-----[ END OF LINE ]-----

The variable tls was itself not NULL, but the value it pointed to was!
There's only one place in the code where it's set to NULL—in the __gc
metamethod (the bits with the Lua registry is to deal with callbacks from
libtls which knows nothing about Lua):

-----[ C ]-----
static int Ltls___gc(lua_State *L)
{
 struct tls **tls = luaL_checkudata(L,1,TYPE_TLS);
 if (*tls != NULL)
 {
   lua_pushlightuserdata(L,*tls);
   lua_pushnil(L);
   lua_settable(L,LUA_REGISTRYINDEX);
   tls_free(*tls);
   *tls = NULL;
 }
 return 0;
}
-----[ END OF LINE ]-----

This is only called when the object no longer has any references to it. I'm
not aware of any place where this value gets overwritten (but I'm not
discounting it outright quite yet). The crashes started after I fixed the
leak, which turned out to be this function (this is the pre-fix version):

-----[ C ]-----
static int Ltls_close(lua_State *L)
{
 lua_pushinteger(L,tls_close(*(struct tls **)lua_touserdata(L,1)));
 return 1;
}
-----[ END OF LINE ]-----

The leak happened because I left the reference to the object in the Lua
registry, so the garbage collector would never reclaim the resource. The fix
itself was easy:

-----[ C ]-----
static int Ltls_close(lua_State *L)
{
 int rc = tls_close(*(struct tls **)luaL_checkudata(L,1,TYPE_TLS));
 if ((rc != TLS_WANT_POLLIN) && (rc != TLS_WANT_POLLOUT))
   Ltls___gc(L);
 lua_pushinteger(L,rc);
 return 1;
}
-----[ END OF LINE ]-----

Yes, it's slightly complicated by the fact that closing a TLS (Transport
Layer Security) connection might require a few more packets of data between
the two endpoints to close out the protocol, but that's taken into account by
the framework I'm using. At least, I hope it is.

So my theory right now is that the __gc metamethod is being called before the
connection is fully closed, but to confirm that, I somehow need to capture
that scenario. It's consistent in that it seems to be the same client making
the same requests causing the same crash. On the down side, I have no idea
who runs the client, or where it is (other than an IP (Internet Protocol)
address). And if I make the same requests with my client, it doesn't crash,
so I can't reproduce it.

And there's no telling when it will happen again.

I know it's just a small detail I'm missing on my side that's causing the
issue, but I just can't seem to locate the detail. To that end, I modified
Ltls___gc() and Ltls_write() to attach some traceback information to see if
my theory that Ltls___gc() is being called too early. Lua makes this pretty
easy to do. I modified Ltls___gc():

-----[ C ]-----
static int Ltls___gc(lua_State *L)
{
 struct tls **tls = luaL_checkudata(L,1,TYPE_TLS);
 if (*tls != NULL)
 {
   luaL_traceback(L,L,"__gc",0);
   lua_setuservalue(L,1);
   lua_pushlightuserdata(L,*tls);
   lua_pushnil(L);
   lua_settable(L,LUA_REGISTRYINDEX);
   tls_free(*tls);
   *tls = NULL;
 }
 return 0;
}
-----[ END OF LINE ]-----

I get a traceback message and assign it to the object about to be cleaned. In
Lua, a userdata goes through the garbage collection twice—once to clean the
resources that Lua isn't aware of (that's what this function does) and the
second time (in the next garbage collection cycle) to reclaim the memory Lua
uses to store the userdata. Then in Ltls_write() I have:

-----[ C ]-----
static int Ltls_write(lua_State *L)
{
 struct tls **tls  = luaL_checkudata(L,1,TYPE_TLS);
 size_t       len;
 char const  *data = luaL_checklstring(L,2,&len);

 if (*tls == NULL)
 {
   lua_getuservalue(L,1);
   if (!lua_isnil(L,-1))
     syslog(LOG_NOTICE,"%s",lua_tostring(L,-1));
   luaL_traceback(L,L,"write",0);
   syslog(LOG_NOTICE,"%s",lua_tostring(L,-1));
   lua_pop(L,2);
   lua_pushinteger(L,-1);
 }
 else
   lua_pushinteger(L,tls_write(*tls,data,len));
 return 1;
}
-----[ END OF LINE ]-----

This will (hopefully, knock on formica) give me two tracebacks—one when the
garbage collection is called (or Ltls_close() is called) and then a write is
attempted, and that will point me to the detail I'm missing.

The other theory is that I'm overwriting memory somewhere and that's a harder
issue to track down.

[1] gopher://gopher.conman.org/0Phlog:2020/04/30.1

Email author at [email protected]