* * * * *
No product survives first contact with production
Many months ago, my manager S asked about a “health check service” for
“Project: Sippy-Cup [1],” something that operations could query to see if my
component was still up and running. I rejected the idea of embedding a web
server in the component as complete overkill (and really, any embedded
webserver would swamp the amount of code that actually does the useful work
in my component, which just processes one SIP (Session Initiation Protocol)
message).
So I did the simplest thing that could possibly work [2]: a simple UDP (User
Datagram Protocol) service. It accepts a packet with the string “STATUS” and
replies with “OKAY.” It was only a few lines of code, and with netcat [3] I
figured it would be a simple matter for operations to do a health check.
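A minimal sketch of such a UDP responder, in Python rather than whatever the
real component is written in. The port number, the function names, and the
choice to ignore a trailing newline and to stay silent on anything other than
“STATUS” are assumptions for illustration, not details from the actual
service.

```python
import socket
from typing import Optional


def make_reply(payload: bytes) -> Optional[bytes]:
    """Return the reply for one datagram, or None to stay silent."""
    # Assumption: tolerate a trailing newline, since "echo" adds one.
    if payload.strip() == b"STATUS":
        return b"OKAY"
    return None


def serve_once(sock: socket.socket) -> None:
    """Handle exactly one datagram on an already-bound UDP socket."""
    payload, peer = sock.recvfrom(512)
    reply = make_reply(payload)
    if reply is not None:
        sock.sendto(reply, peer)


def main(host: str = "127.0.0.1", port: int = 9999) -> None:
    """Answer health checks forever (port 9999 is a made-up example)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            serve_once(sock)
```

With main() running, a check from the shell would be something along the
lines of "echo STATUS | nc -u -w 1 127.0.0.1 9999", though the exact flags
differ between netcat variants.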
It seems that UDP is too confusing for operations to deal with, so I changed
the underlying protocol to TCP (Transmission Control Protocol). It's a bit
more complicated to support as I now have to listen and accept connections,
but then it should be even easier for operations to handle it with netcat.
The protocol still accepts a string of “STATUS” and replies with “OKAY”.
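The extra work for TCP is the listen/accept step. A rough Python sketch of
the same exchange follows; the names, the port, the five-second read timeout,
and the single recv() call (which assumes the whole request arrives in one
segment) are all assumptions made for the example, not the real component's
behavior.

```python
import socket


def handle_connection(conn: socket.socket) -> None:
    """Read one request from an accepted connection and answer it."""
    with conn:
        conn.settimeout(5.0)  # assumption: give up on silent clients
        try:
            request = conn.recv(512)
        except socket.timeout:
            return
        if request.strip() == b"STATUS":
            conn.sendall(b"OKAY")
        # Leaving the "with" block closes the connection.


def serve_forever(host: str = "127.0.0.1", port: int = 9999) -> None:
    """Accept connections one at a time (port 9999 is a made-up example)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        listener.bind((host, port))
        listener.listen()
        while True:
            conn, _peer = listener.accept()
            handle_connection(conn)
```

Handling one connection at a time keeps the sketch short; a client that
connects and never sends anything holds up the next one until the timeout
expires.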
And it's still apparently too much for operations to deal with. Operations
actually asked if they could send a SIP message, and I was like, Wow! If it's
easier for you guys to send a SIP message for a health check, more power to
you! But my manager nixed that idea and we stuck with the current TCP
version, which he feels is the simplest thing that could work.
I'm not sure what operations is actually doing. My manager mentioned that my
component was failing the health check, yet when I checked, it was fine (using
netcat of course). Yet the logs were filled with errors (“recvfrom: Bad file
number” and “poll: Invalid argument”), probably from all the failed attempts
by operations to do a health check.
I did ask operations what is sent and how often. What they're sending is
right, but they're asking
“Areyoustillup?Whyhaven'tyouansweredme?Areyouup?Areyouup?McFly!McFly!Answerme!”
before my component has a chance to even answer.
I think they're a bit too aggressive. They don't.
Sigh.
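For contrast, a less impatient check sends “STATUS” once, waits for the
answer or a timeout, and only then decides. The Python sketch below is one
way to do that; the function names, the timeout, and the interval between
checks are illustrative assumptions, not what operations actually runs.

```python
import socket
import time
from typing import List


def check_health(host: str, port: int, timeout: float = 5.0) -> bool:
    """Send STATUS once and wait up to `timeout` seconds for OKAY."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"STATUS")
            return sock.recv(512).strip() == b"OKAY"
    except OSError:
        # Refused, timed out, reset: all count as a failed check.
        return False


def poll(host: str, port: int, interval: float = 30.0,
         attempts: int = 3) -> List[bool]:
    """Run a fixed number of checks, pausing between them."""
    results = []
    for i in range(attempts):
        results.append(check_health(host, port))
        if i < attempts - 1:
            time.sleep(interval)
    return results
```

Each check finishes (or times out) before the next one starts, so the
component is never asked again while it is still answering.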
[1] gopher://gopher.conman.org/0Phlog:2014/03/05.1
[2] http://c2.com/cgi/wiki?DoTheSimplestThingThatCouldPossiblyWork
[3] https://en.wikipedia.org/wiki/Netcat