Of course it's slower, but I didn't expect it to be quite that bad

* * * * *

Time for another useless µbenchmark! This time, the overhead of trapping
integer overflow!

So, inspired by this post about trapping integer overflow [1], I thought it
might be interesting to see how bad the overhead is of using the x86 [2]
instruction INTO [3] to catch integer overflow. To do this, I'm using DynASM
[4] to generate code from an expression that uses INTO after every operation.
There are other ways of doing this, but the simplest way is to use INTO. I'm
also using 16-bit operations, as the numbers involved (between -32,768 and
32,767) are reasonable (for a human) to deal with (unlike the 32-bit range of
-2,147,483,648 to 2,147,483,647 or the insane 64-bit range of
-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807).
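
For readers who don't want to read x86, here is a rough C sketch of what "check for overflow after every operation" means for a 16-bit add. This is not the DynASM-generated code: the function name add16_checked is made up for illustration, and the check is done by widening to 32 bits rather than by testing the CPU's overflow flag the way INTO does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: add two 16-bit values and report overflow.
   The sum is formed in 32 bits, so the check itself cannot overflow.
   Returns false (leaving *out untouched) where INTO would have trapped. */
bool add16_checked(int16_t a, int16_t b, int16_t *out)
{
    int32_t wide = (int32_t)a + (int32_t)b;

    if (wide < INT16_MIN || wide > INT16_MAX)
        return false;           /* result does not fit in 16 bits */

    *out = (int16_t)wide;
    return true;
}
```

With this, 32,766 + 1 succeeds and stores 32,767, while 32,767 + 1 is reported as an overflow.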

The one surprising result was that Linux treats the INTO trap as a segfault!
Even requesting additional information (passing the SA_SIGINFO flag with
sigaction()) doesn't tell you anything. But that in itself tells you it's not
a real segfault, as a real segfault will report a memory mapping error.
Personally, I would have expected a floating point fault, even though it's
not a floating point operation, because on Linux, integer division by 0
results in a floating point fault (and oddly enough, a floating point
division by 0 results in ∞ but no fault)!
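
The floating point half of that aside is easy to check. The following is a minimal C sketch, assuming the default floating point environment (exceptions masked) and IEEE 754 doubles; the names on_fpe and float_div_by_zero_is_quiet are invented for this example. It installs a SIGFPE handler, divides 1.0 by 0.0 at run time, and reports whether the result was +∞ with no signal delivered. It deliberately does not attempt the integer division, since that is undefined behaviour in C.

```c
#define _POSIX_C_SOURCE 200809L
#include <math.h>
#include <signal.h>
#include <string.h>

/* Set by the handler if SIGFPE is ever delivered. */
static volatile sig_atomic_t fpe_seen = 0;

static void on_fpe(int sig)
{
    (void)sig;
    fpe_seen = 1;
}

/* Returns 1 if 1.0 / 0.0 produced +infinity and no SIGFPE arrived,
   0 otherwise (including failure to install the handler). */
int float_div_by_zero_is_quiet(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fpe;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGFPE, &sa, NULL) != 0)
        return 0;

    volatile double zero = 0.0;   /* volatile forces a run-time divide */
    double q = 1.0 / zero;

    return isinf(q) && q > 0.0 && !fpe_seen;
}
```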

But, aside from that, some results. I basically run the expression one
million times and simply record how long it takes. The first is just setting
a variable to a fixed value (and the “- 0” bit is there just to ensure an
overflow check is included):

Table: x = 1 - 0
overflow  time         expression result
----------------------------------------
true      0.009080000  1
false     0.006820000  1
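
The measurement itself (run the expression a million times, record how long it takes) could be sketched in C roughly like this. The actual harness isn't shown in this post, so everything here is an assumption: the use of clock_gettime() with CLOCK_MONOTONIC, the result in seconds, and the names time_runs, set_x and x, where set_x is a stand-in for the generated code.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Call fn() `runs` times and return the elapsed time in seconds,
   as measured by the monotonic clock. */
double time_runs(void (*fn)(void), long runs)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < runs; i++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Hypothetical stand-in for the generated code: x = 1 - 0.
   volatile keeps the compiler from dropping the store. */
volatile short x;

void set_x(void)
{
    x = (short)(1 - 0);
}
```

Calling time_runs(set_x, 1000000) gives one timing. Note that, like the figures in the tables, it includes the cost of the function call on every iteration.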

Okay, not terribly bad. But how about a longer expression? (And remember, the
expression isn't optimized, and it's evaluated strictly left to right.)

Table: x = 1 + 1 + 1 + 1 + 1 + 1 * 100 / 13
overflow  time         expression result
----------------------------------------
true      0.079528000  46
false     0.030125000  46

Yikes! (But this also includes the function call overhead.) For the
curious, the last example compiled down to:

> xor eax,eax
> mov ax,1
> add ax,1
> into
> add ax,1
> into
> add ax,1
> into
> add ax,1
> into
> add ax,1
> into
> imul 100
> into
> mov bx,13
> cwd
> idiv bx
> into
> mov [$0804f50E],ax
> ret

The non-overflow version just had the INTO instructions missing—otherwise it
was the same code.
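
The listing also explains the result of 46: the operations are applied strictly left to right, so it computes ((1 + 1 + 1 + 1 + 1 + 1) * 100) / 13 = 600 / 13 = 46, where conventional precedence would give 12. Here is a small C sketch of the same sequence with a 16-bit range check after each step. The names store16 and eval_listing are made up, and the range check stands in for INTO rather than reproducing it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: store v into *acc if it fits in 16 bits.
   Returns false where the checked version would have trapped. */
static bool store16(int32_t v, int16_t *acc)
{
    if (v < INT16_MIN || v > INT16_MAX)
        return false;
    *acc = (int16_t)v;
    return true;
}

/* Mirrors the listing step by step: start at 1, add 1 five times,
   multiply by 100, then divide by 13. */
bool eval_listing(int16_t *result)
{
    int16_t acc = 1;

    for (int i = 0; i < 5; i++)
        if (!store16((int32_t)acc + 1, &acc))
            return false;
    if (!store16((int32_t)acc * 100, &acc))
        return false;
    if (!store16((int32_t)acc / 13, &acc))
        return false;

    *result = acc;
    return true;
}
```

None of the intermediate values (the largest is 600) comes anywhere near 32,767, so the checks never fire for this expression. The timing difference is purely the cost of making them.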

What surprises me most is the size of the overhead, since the INTO
instruction just checks the overflow flag and traps only if it is set. The
timings I have (and I'll admit, the figures I have are old and for the 80486)
show that INTO only has a three-cycle overhead if not taken. I'm guessing
things are worse with the newer multipipelined multiscalar multiprocessor
monstrosities we use these days.

Next I'll have to try using the JO instruction [5] and see how well that
fares.

[1] http://blog.regehr.org/archives/1154
[2] https://en.wikipedia.org/wiki/X86
[3] http://x86.renejeschke.de/html/file_module_x86_id_142.html
[4] gopher://gopher.conman.org/0Phlog:2015/09/05.1
[5] gopher://gopher.conman.org/0Phlog:2015/09/07.1
