The case of the regression test regression

* * * * *

The case of the regression test regression

So the SVN (Subversion) [1]-git [2] conversion is back on the table. When I
worked on this [3] last year, I stopped when it became clear that a critical
build server didn't support git. I duly reported the issue, and it's only
been fixed in the past few months. Last year, it was delegated solely to me.
This year, it's more of a team effort.

For our team, we currently have two main SVN repositories with some duplicate
third party packages checked in. We went through the lists to remove the
redundencies (a Good Thing™) and I felt that it might be a good time to
ensure they're up to date. I spent a week doing so and ran into an odd issue—
the old regression test now runs slower! Not thirteen (or nineteen) hours
slow [4] but slow enough to be seriously annoying. I was able to isolate the
slowdown to some Lua modules [5] (that I wrote cough cough—I didn't think the
changes were that bad).

We do development primarily on Macs. My first thought was to either confirm
or deny that the Mac was the issue. I ran tests on a Linux laptop I have for
work and nope—the slowdown exists on that. I then tried to profile the code
(to no conclusive result). I compared the code between the two versions (only
five lines changed in the regression test, and all of those were small API
changes like fsys.redirect(stdin,fsys.STDIN) becoming
fsys.redirect(stdin,io.stdin)). I just could not figure out why the new code
was slower.

So then I decided to just time things. I added code to time how long it took
to run each test (and to limit the number of tests run—enough to show the
issue, but not the entire 15,852 tests) and plot the results.

[Graph of the time to run each test from the old regression test and the new
regression test. The old one is faster.] [6]

The X-axis is the test case number, the Y-axis is time in microseconds (and
plotted on a logarithmic scale, otherwise, you wouldn't be able to make out
any detail along the lower portion). The new regression test is in green, the
old one in red. It's clear that there are a few dozen tests that take a full
second to run in the new program, whereas the average (of both, aside from
the outliers) seems to be around 5ms (milliseconds). So good, the problem
clearly shows up.

Then I timed each section of running a test to see which section is sucking
up the time, and eventually, I get this graph:

[Graph of the small segment of the test that is taking all the time.] [7]

[The sharper eyed of you might have noticed the X-axis change—each test is
run twice (for reasons). This means the second graph has twice as many
entries. Yes, that means the full 15,852 tests are each run twice, for 31,704
requests. —Sean]

This represents the timing of this bit of code:

-----[ Lua ]-----
-- --------------------------------
-- Receive and process NOTIFY
-- --------------------------------

local remote,data = sipsock:recv(10)
if not data then
return false,"no NOTIFY"
end
-----[ END OF LINE ]-----

The only program that's different is the regression test, not the programs
being tested. And it's not like I updated the version of Lua either—that
didn't change. The third party code I'm using, yes, to some degree. But the
underlying code to sipsock:recv() did not change. Yet it's this bit of code
that is consistently slow for a consistent set of tests.

To say I'm a bit perplexed is a bit of an understatement.

[1] https://subversion.apache.org/
[2] https://git-scm.com/
[3] gopher://gopher.conman.org/0Phlog:2020/05/14.1
[4] gopher://gopher.conman.org/0Phlog:2021/06/08.1
[5] https://github.com/spc476/lua-conmanorg
[6] gopher://gopher.conman.org/IPhlog:2021/08/19/overall.png
[7] gopher://gopher.conman.org/IPhlog:2021/08/19/receive.png

Email author at [email protected]