* * * * *

             The case of the regression test regression, part III

When I last left off [1], I had identified the slow code in the regression
test, and it left me puzzled: it was a single function call, made during
startup and not while the tests were running. Faced with the fact that, for
a second time, I had identified unchanged code as the cause of a second-long
delay, I decided to change my approach. Instead of starting with the
working, fast version of the code and going forward, I would start with the
current, slow version of the code and work backward. The first order of
business: see if I could reproduce the fast code/slow code change with the
code I had identified, as a sanity check. And of course that sanity check
failed, as both code versions were now slow.

What was that Sherlock Holmes quote? Ah yes, “When you have eliminated the
impossible, whatever remains, however improbable, must be the truth.” Must
be the computer hates me.

Okay, so if the slowdown was not in the regression test script, it had to be
in one of the mocked services. So I did to that script what I did to the
regression test script: record how long it takes to handle a query [2]. And
of course it always responded in a timely fashion. No second-long delay
found at all!

So the delay is not in the regression test. The delay is not in the mocked
service. I am slowly running out of ideas.

Okay, time to take stock: what do I know? Some queries are taking up to a
second to run. They appear to be the same tests from run to run. Okay, let's
run with that idea.

I modify the regression test to report which test cases take over 900,000
microseconds. I then run just those tests and yes, it's consistent. If I
remove those slow tests, the regression test just flows; if I run only the
“slow” tests, each one takes about a second to run.
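
The timing check itself is nothing fancy. Here's a minimal sketch of the
idea, not the actual harness: it assumes LuaSocket for sub-second wall-clock
time, and run_query() and the tests table are hypothetical stand-ins for the
real test driver.

-----[ Lua ]-----
local socket = require "socket" -- LuaSocket, for sub-second wall-clock time

local SLOW = 0.9 -- 900,000 microseconds, expressed in seconds

for _,test in ipairs(tests) do
  local start   = socket.gettime()
  run_query(test)               -- hypothetical: run one test case
  local elapsed = socket.gettime() - start
  if elapsed > SLOW then
    io.stderr:write(string.format("SLOW: %s took %.6fs\n",test.name,elapsed))
  end
end
-----[ END OF LINE ]-----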

So now I know it's consistent. What's consistent about them?

Now I start digging into the logs and … yeah … something odd is going on with
those tests. It's a path that “Project: Lumbergh [3]” can take, and it's
hitting a mocked service that I keep forgetting about. I mean, the regression
test starts it up, but otherwise, it's pretty silent (and pretty easy to
forget about). I add some logging to this particular mocked service, run the
regression test and see that it responds once, then nothing ever again.

I add logging to the “fast” version and yes, it gets hit, and keeps
responding.
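
For the record, the logging was dirt simple. A sketch of it, where reply()
stands in for the mock's actual response logic:

-----[ Lua ]-----
local count = 0 -- how many queries this mock has seen so far

local function handle_request(addr,data)
  count = count + 1
  io.stderr:write(string.format("query %d from %s\n",count,tostring(addr)))
  reply(addr,data) -- hypothetical: the mock's actual response logic
end
-----[ END OF LINE ]-----

In the slow run, the counter stops at 1; in the fast run, it keeps climbing.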

I check the differences between the two versions:

-----[ shell ]-----
[spc]saltmine-2:~>diff path/to/slow/version path/to/fast/version
96c93
< nfl.SOCKETS:insert(sock,'r',function()
---
> nfl.SOCKETS:insert(sock:fd(),'r',function()
106c103
< nfl.server_eventloop()
---
> nfl.eventloop()
[spc]saltmine-2:~>
-----[ END OF LINE ]-----

Yeah … just some API updates, like yesterday. But if I run the new
regression test while running the older version of this one mock, the new
regression test runs as I expect. The delay is actually coming from
“Project: Lumbergh”: the new version of this mocked service doesn't respond
properly, causing “Project: Lumbergh” to time out, and that timeout is the
delay I'm seeing.
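
For context, here's roughly how those two changed lines sit in the mock.
This is a sketch pieced together from the diff; the module path, the read
pattern, and the handler body are all assumed:

-----[ Lua ]-----
local nfl = require "org.conman.nfl" -- module path assumed; the diff only
                                     -- shows it loaded as nfl

-- fast (older) version: register the socket's file descriptor, then run
-- the event loop under its old name
nfl.SOCKETS:insert(sock:fd(),'r',function()
  local addr,data = sock:recv() -- assumed read pattern; details elided
  reply(addr,data)              -- hypothetical response logic
end)
nfl.eventloop()

-- slow (newer) version: pass the socket object itself, then run the
-- event loop under its new name
nfl.SOCKETS:insert(sock,'r',function()
  local addr,data = sock:recv()
  reply(addr,data)
end)
nfl.server_eventloop()
-----[ END OF LINE ]-----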

I think I'm finally getting somewhere, but why is it so hot? And what am I
doing in this basket?

[1] gopher://gopher.conman.org/0Phlog:2021/08/24.2
[2] gopher://gopher.conman.org/0Phlog:2021/08/19.1
[3] gopher://gopher.conman.org/0Phlog:2018/09/11.2

Email author at [email protected]