* * * * *
Four lines of code … it was four lines of code
It was bugging me [1].
I checked last night, and found the following:
Table: CPU utilization after a week
        CPU    requests
------------------------------
gopher  28:37   8360
gemini   6:02  24345
How can a TCP (Transmission Control Protocol) service take more CPU time
than a TLS (Transport Layer Security) service? Especially when TLS runs on
top of TCP? And when TLS generates six times as many packets as TCP?
There's some difference in how I handle TCP and TLS that's causing this.
Profiling the code didn't reveal much, as my attempts kept showing that
gopher [2] should be using less CPU than Gemini [3], so it was just a matter
of staring at code and thinking hard.
It wasn't easy. Yes, there are some differences between the TCP interface [4]
and the TLS interface [5], so it was a matter of going through the code,
bringing them closer in implementation and seeing if that fixed things. I
would fix the code, run a test, see it not make a difference, and try again.
Eventually, I came across this bit of code [6]:
-----[ Lua ]-----
-- ---------------------------------------------------------------------
-- It might be a bug *somewhere*, but on Linux, this is required to get
-- Unix domain sockets to work with the NFL driver.  There's a race
-- condition where writing data then calling close() may cause the
-- other side to receive no data.  This does NOT appear to happen with
-- TCP sockets, but this doesn't hurt the TCP side in any case.
-- ---------------------------------------------------------------------
while ios.__socket.sendqueue and ios.__socket.sendqueue > 0 do
  nfl.SOCKETS:update(ios.__socket,'w')
  coroutine.yield()
end
-----[ END OF LINE ]-----
This bit of code wasn't in the TLS implementation, and as the comment says,
it “shouldn't” hurt the TCP path, but hey, it was only needed for local (or
Unix) domain sockets to begin with, so let me try removing that bit of code
and run a test.
Table: Test before removing the code, 275 requests; time in seconds
        CPU  real     user    sys
------------------------------
gopher   18   46.726  13.971  4.059
gemini    4  107.928   3.951  0.322
Table: Test after removing the code, 275 requests; time in seconds
        CPU  real     user    sys
------------------------------
gopher    0   52.403   0.635  0.185
gemini    4  103.290   3.957  0.285
**THAT'S** more like it!
Now, about that code … this is my best guess as to what was going on.
The sendqueue field is a Linux-only field, so I'm checking to see if it
exists, and then checking the value (if only to avoid the “attempt to compare
number with nil” Lua error on non-Linux systems). This returns the number of
bytes the kernel has yet to send, so we switch the socket to trigger an event
when we can write to it, and yield. Even worse, the check isn't just a simple
variable fetch, but ends up being an ioctl() system call. Twice!
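On Linux, that sendqueue check boils down to the SIOCOUTQ ioctl, which
reports how many bytes are still sitting in the kernel's send buffer. A
minimal sketch of that system call, in Python rather than the post's Lua,
assuming a Linux system (where SIOCOUTQ shares its value with the terminal
ioctl TIOCOUTQ):

```python
import fcntl
import socket
import struct
import termios

# On Linux, SIOCOUTQ has the same numeric value as TIOCOUTQ (0x5411);
# it returns the number of bytes the kernel has not yet sent.
SIOCOUTQ = termios.TIOCOUTQ

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
a.send(b"hello")

# Every check of the send queue is a full ioctl() system call, which is
# why polling it in a loop costs far more than a plain variable fetch.
raw = fcntl.ioctl(a.fileno(), SIOCOUTQ, struct.pack("i", 0))
pending = struct.unpack("i", raw)[0]
print("unsent bytes:", pending)

a.close()
b.close()
```

This also makes the "Twice!" concrete: the Lua while-condition reads the
sendqueue field once for the nil check and once for the comparison, so each
trip around the loop pays for two of these system calls.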
Now the “write” event code:
-----[ Lua ]-----
if event.write then
  if #ios.__output > 0 then
    local bytes,err = ios.__socket:send(nil,ios.__output)
    if err == 0 then
      ios.__wbytes = ios.__wbytes + bytes
      ios.__output = ios.__output:sub(bytes + 1,-1)
    else
      syslog('error',"socket:send() = %s",errno[err])
      nfl.SOCKETS:remove(ios.__socket)
      nfl.schedule(ios.__co,false,errno[err],err)
    end
  end

  if #ios.__output == 0 then
    nfl.SOCKETS:update(ios.__socket,'r')
    nfl.schedule(ios.__co,true)
  end
end
-----[ END OF LINE ]-----
There's probably no data left to send (ios.__output is empty), so we
immediately switch the socket back to trigger on readability, then resume
the coroutine handling the socket. That coroutine goes back to the while
loop, sets the socket back to trigger on writability, and yields again, and
the two keep bouncing like that until the kernel buffer has drained.
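That bounce can be modeled with a toy scheduler, using Python generators in
place of Lua coroutines. This is a sketch, not the actual NFL driver, and
all names here are made up for illustration; the point is that the design
costs one full scheduler pass (plus the ioctl checks above) per increment
of kernel drain:

```python
# Toy model: a coroutine polls the kernel send queue and yields; the
# write-event handler finds the userland buffer empty, flips the socket
# back to 'read', and resumes the coroutine, which flips it to 'write'
# again -- one scheduler pass per unit the kernel drains.

def closer(sock):
    # Mirrors the Lua while-loop: wait for the kernel queue to empty.
    while sock["sendqueue"] > 0:
        sock["interest"] = "w"   # nfl.SOCKETS:update(ios.__socket,'w')
        yield                    # coroutine.yield()

def run(sock):
    passes = 0
    co = closer(sock)
    next(co)                     # start the coroutine; it yields on 'w'
    while True:
        passes += 1
        sock["sendqueue"] -= 1   # the kernel drains a little each pass
        # Write-event handler: nothing in the userland buffer, so switch
        # the socket back to 'read' and resume the coroutine...
        sock["interest"] = "r"
        try:
            next(co)             # ...which flips it to 'w' again
        except StopIteration:    # queue finally empty; coroutine returns
            break
    return passes

sock = {"sendqueue": 1000, "interest": "r"}
print(run(sock), "scheduler passes to drain 1000 queued bytes")
```

Every one of those passes re-runs the while-condition, so the two ioctl
calls per check multiply across the whole drain.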
As I wrote, this fixed an issue I had with local domain sockets—it wasn't an
issue with an IP (Internet Protocol) socket, and in fact, was the wrong thing
to do for an IP socket.
And as it is with these types of bugs, finding the root cause is not trivial,
but the fix almost always is.
[1]
gopher://gopher.conman.org/0Phlog:2024/06/23.2
[2]
https://en.wikipedia.org/wiki/Gopher_(protocol)
[3]
https://geminiprotocol.net/
[4]
https://github.com/spc476/lua-conmanorg/blob/628992e0a63d062aeabfeb90fc6f524df4f6fd17/lua/nfl/tcp.lua
[5]
https://github.com/spc476/lua-conmanorg/blob/628992e0a63d062aeabfeb90fc6f524df4f6fd17/lua/nfl/tls.lua
[6]
https://github.com/spc476/lua-conmanorg/blob/628992e0a63d062aeabfeb90fc6f524df4f6fd17/lua/nfl/tcp.lua#L72