* * * * *
“This is an invalid protocol because I can't open a file”
TS2 comes to my desk at the Ft. Lauderdale Office of the Corporation. “I'm
running a load test of ‘Project: Cleese [1]’ and it's not functioning. If I
run a normal test, it runs fine.”
“Hmm … let me take a look.” I head back to TS2's desk. Sure enough, “Project:
Cleese” is crashing under load. Well, not a hard crash—it is written in Lua
[2] and what's crashing are individual coroutines due to an uncaught error
(the Lua equivalent of exceptions) where the only information being reported
is “invalid protocol.” I have TS2 send me copies of the data files and script
he's using to load test, and I'm able to reproduce the issue. It's an odd
problem, because it appears to be crashing on this line of code:
-----[ Lua ]-----
local sock,err = net.socket(addr.family,'tcp')
-----[ END OF LINE ]-----
I dive in, and isolate the issue to this bit of C code that's part of the
net.socket() function:
-----[ C ]-----
if (getprotobyname_r(proto,&result,tmp,sizeof(tmp),&presult) != 0)
  return luaL_error(L,"invalid protocol");
-----[ END OF LINE ]-----
Odd. “tcp” is a valid protocol, so I shouldn't be getting ENOENT, and the
buffer used to store the result is large enough (normally it works fine), so
I don't think I'm getting ERANGE. Those are the only two errors that
getprotobyname_r() [3] is documented to return, so I add some logging to see
what error I'm actually getting.
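The logging can be as simple as reporting the return value, since
getprotobyname_r() hands back the error number directly instead of setting
errno. A minimal sketch, assuming glibc; the helper name lookup_proto() and
the buffer size are mine for illustration, not taken from the actual
net.socket() code:

```c
#define _DEFAULT_SOURCE /* exposes getprotobyname_r() on glibc */
#include <assert.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: translate a protocol name to its number, logging
   the reason for any failure. Returns the protocol number, or -1. */
int lookup_proto(const char *proto)
{
  struct protoent  result;
  struct protoent *presult = NULL;
  char             tmp[BUFSIZ];
  int              rc;

  assert(proto != NULL);

  rc = getprotobyname_r(proto,&result,tmp,sizeof(tmp),&presult);
  if (rc != 0)
  {
    /* rc is an errno-style value; report it rather than guessing */
    fprintf(stderr,"getprotobyname_r(\"%s\"): %s (%d)\n",proto,strerror(rc),rc);
    return -1;
  }

  if (presult == NULL)
  {
    /* glibc can report "not found" as success with a NULL result */
    fprintf(stderr,"getprotobyname_r(\"%s\"): no such protocol\n",proto);
    return -1;
  }

  return presult->p_proto;
}
```

Reporting strerror(rc) instead of a fixed “invalid protocol” string is what
makes an undocumented error visible.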
I'm getting “Too many open files” (EMFILE) and it suddenly all makes sense.
getprotobyname_r() is using some data file (probably /etc/protocols) to
translate “tcp” to the actual protocol value, but it can't open the file
because the process is out of available file descriptors. “Project: Cleese”
is out of file descriptors because each network connection uses one, and the
test systems (Linux in this case) only allow 1,024 descriptors per process
by default. It's easy enough to raise that limit (I set it to 65,536) and
sure enough, the “Too many open files” error starts showing up where I expect
it to.
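The limit can be raised from the shell (ulimit -n) or from inside the process
itself. The following is only a sketch of the in-process approach, using
getrlimit() and setrlimit() with RLIMIT_NOFILE; the helper name
raise_nofile_limit() is hypothetical, and this is not necessarily how the
limit was raised on the test systems:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/* Hypothetical helper: raise the soft limit on open file descriptors to
   `wanted`, capped at the hard limit. Returns the soft limit now in
   effect, or 0 on error. */
rlim_t raise_nofile_limit(rlim_t wanted)
{
  struct rlimit lim;

  assert(wanted > 0);

  if (getrlimit(RLIMIT_NOFILE,&lim) != 0)
  {
    fprintf(stderr,"getrlimit(): %s\n",strerror(errno));
    return 0;
  }

  /* an unprivileged process can't go past the hard limit */
  if ((lim.rlim_max != RLIM_INFINITY) && (wanted > lim.rlim_max))
    wanted = lim.rlim_max;

  /* only ever raise the soft limit, never lower it */
  if (wanted > lim.rlim_cur)
  {
    lim.rlim_cur = wanted;
    if (setrlimit(RLIMIT_NOFILE,&lim) != 0)
    {
      fprintf(stderr,"setrlimit(): %s\n",strerror(errno));
      return 0;
    }
  }

  return lim.rlim_cur;
}
```

Calling raise_nofile_limit(65536) early in startup, before any sockets are
opened, would be the natural place for it. If the hard limit is lower than
the requested value, the result is quietly capped, so it is worth logging
what comes back.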
On the plus side, it's not my code. On the minus side, you have to love those
leaky abstractions [4] (and perhaps relying upon documentation a bit too
much).
[1] gopher://gopher.conman.org/0Phlog:2018/09/11.2
[2] https://www.lua.org/
[3] https://linux.die.net/man/3/getprotobyname_r
[4] https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/