Obsessing over stupid benchmarks (and you thought my obsessing over a
greylist daemon was bad … )
I'm still obsessing over stupid benchmarks. This time, I was curious about
the overhead of calling functions and system calls, to simulate a typical
locking scenario, like:
> int gv;
>
> void t(void)
> {
>   while(gv < 1000000000)
>   {
>     lock();
>     gv++;
>     unlock();
>   }
> }
>
So, I wrote some test code:
> #include <sys/types.h>
> #include <unistd.h>
> #include <signal.h>
>
> int gv;
>
> void null(void)
> {
> }
>
> void t1(void)
> {
>   while(gv < 1000000000)
>   {
>     null();
>     gv++;
>     null();
>   }
> }
>
> void t2(void)
> {
>   pid_t me;
>
>   me = getpid();
>   while(gv < 1000000000)
>   {
>     kill(me,0);
>     gv++;
>     kill(me,0);
>   }
> }
>
The first test is pretty simple—it just tests the function call overhead, and
that particular piece of code takes 9 seconds (give or take a few tenths of a
second) to run. But let me explain about the second test, and therein lies
the interesting bit of this tale.
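(The timing harness itself isn't terribly interesting; a minimal sketch,
with time(2) wrapped around each run, would go something like this:)
> /* a minimal timing sketch (not the exact harness behind the numbers
>    quoted here): time one run of t1(); swap in t2() et al. as needed */
> #include <stdio.h>
> #include <time.h>
>
> extern void t1(void);
>
> int main(void)
> {
>   time_t start;
>
>   start = time(NULL);
>   t1();
>   printf("%ld seconds\n",(long)(time(NULL) - start));
>   return 0;
> }
>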
I wanted to test the system call overhead, and thus, I needed a system call
that did very little. I found this bit in the kill() manpage:
> If sig is 0, then no signal is sent, but error checking is still performed.
>
That's about as close to a null system call as I could find. No real bad side
effects, and I don't particularly care about the return code (since this is
just measuring overhead). So I ran that bit of code. Fifteen minutes later it
finished.
Fifteen minutes.
Either system calls are slow, or that particular system call does a vast
amount of nothing. Curious, I delved into that part of the kernel source (I'm
running all this under Linux, by the way) when I came across this bit of
kernel code:
> asmlinkage long sys_getpid(void)
> {
>         return current->tgid;
> }
>
Now, I knew that getpid() was documented as a system call, but I didn't
realize it actually was a system call. What I mean by that is, even though
something is documented in section 2 of the manual (usually reserved for
system calls), it may not actually call into the kernel. For instance,
alarm() is documented in section 2, but it's a
library call that eventually calls other system calls. And I thought that
getpid() was a library call that just returned a value (the current process
ID) stored in userspace (which makes sense to me—it's not like it changes or
anything as the process runs). I didn't expect it to be an actual system
call.
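(In other words, my mental model of getpid() was something like the
following; a hypothetical sketch, not actual C library code:)
> /* hypothetical sketch of what I imagined getpid() to be: a plain
>    library call returning a value cached in userspace */
> #include <sys/types.h>
>
> static pid_t cached_pid;   /* presumably filled in at startup */
>
> pid_t getpid(void)
> {
>   return cached_pid;
> }
>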
Well then.
And it's about as low-overhead a system call as you can get.
So I rewrote t2():
> void t2(void)
> {
>   while(gv < 1000000000)
>   {
>     getpid();
>     gv++;
>     getpid();
>   }
> }
>
And the runtime dropped to 17 seconds.
Okay, much better. Not quite twice the function call overhead. But what the
heck was kill() doing that warranted fifteen minutes? I checked the Linux
source code for kill() and it didn't appear to be doing all that much. What
the heck was going on?
So what does the C library's getpid() actually do? Was it perhaps making the
system call once and storing the result? The easiest way to check was to
statically compile the test program and disassemble it (using objdump):
> 0804e380 <__getpid>:
> 804e380: b8 14 00 00 00 mov $0x14,%eax
> 804e385: cd 80 int $0x80
> 804e387: c3 ret
>
Okay. Not much to it. What if I make the system call directly then? And here,
I resorted to assembly (since the inline assembly directives for gcc make no
sense to me whatsoever):
> t41:    mov eax,20              ; 20 is the getpid() system call number
>         int 0x80
>         mov eax,[gv]
>         inc eax
>         mov [gv],eax
>         mov eax,20
>         int 0x80
>         mov eax,[gv]
>         cmp eax,1000000000
>         jl t41
>         ret
>
(oh, the reason the code looks a bit different? objdump uses AT&T (American
Telephone and Telegraph) syntax, while nasm uses Intel syntax)
It was a bit surprising to find myself waiting for 11 minutes for this bit of
code to run.
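(For those fluent in gcc's inline-assembly dialect, the equivalent of that
loop is presumably something along these lines; treat it as a sketch, not
the code that was timed:)
> /* hedged sketch of the same loop using gcc extended asm (i386 only):
>    eax = 20 selects getpid(), int 0x80 traps into the kernel.  The name
>    t41_inline is made up for illustration. */
> extern int gv;
>
> void t41_inline(void)
> {
>   long r;
>
>   while(gv < 1000000000)
>   {
>     __asm__ __volatile__ ("int $0x80" : "=a" (r) : "0" (20));
>     gv++;
>     __asm__ __volatile__ ("int $0x80" : "=a" (r) : "0" (20));
>   }
> }
>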
Okay, that's more on par with using the system call kill(), but that doesn't
explain why my version took 66 times longer. Perhaps it's a CPU (Central
Processing Unit) pipeline issue? Perhaps it is faster to call a function that
invokes a system call than it is to inline the actual system call. So I
called the C library function __getpid():
> extern __getpid
>
> t42:    call __getpid
>         inc dword [gv]
>         call __getpid
>         cmp dword [gv],1000000000
>         jl t42
>         ret
>
Back to 17 seconds. Okay, what if I supply my own wrapper function around the
getpid() system call?
> t43:    call getpid
>         inc dword [gv]
>         call getpid
>         cmp dword [gv],1000000000
>         jl t43
>         ret
>
> getpid: mov eax,20
>         int 0x80
>         ret
>
And just to make sure, check with objdump:
> 080484b0 <getpid>:
> 80484b0: b8 14 00 00 00 mov $0x14,%eax
> 80484b5: cd 80 int $0x80
> 80484b7: c3 ret
>
> ...
>
> 0804e420 <__getpid>:
> 804e420: b8 14 00 00 00 mov $0x14,%eax
> 804e425: cd 80 int $0x80
> 804e427: c3 ret
>
So, other than the memory location of the routine (and the name, but the
name doesn't affect the code that actually runs) they're the same. So why
does my version of the
getpid() function take 11 minutes to run? Something odd is going on
somewhere.
I then wrote a version of getpid() that caches the return value from the
first call (and this time, I made it callable by C functions):
> global my_getpid
>
> my_getpid: mov eax,[pid]        ; return the cached pid if we have one
>            or eax,eax
>            jz .10
>            ret
> .10:       mov eax,20           ; otherwise make the getpid() system call
>            int 0x80
>            mov [pid],eax        ; and cache the result
>            ret
>
> section .bss
> pid:       resd 1               ; storage for the cached pid
>
Two tests, one calling this routine from C, the other calling it from
another assembly routine. And both take around 10 seconds to run, which
is just a shade slower than calling a null routine.
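(The C side of that test is nothing special; essentially t1() with
my_getpid() swapped in, along the lines of this sketch:)
> /* sketch of the C-side test calling the assembly routine above; the
>    name t5 is made up for illustration */
> #include <sys/types.h>
>
> extern pid_t my_getpid(void);
> extern int gv;
>
> void t5(void)
> {
>   while(gv < 1000000000)
>   {
>     my_getpid();
>     gv++;
>     my_getpid();
>   }
> }
>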
But calling __getpid() (the system supplied C library routine that calls the
system call) is faster than calling getpid() (the function I wrote to call
the system call), even though they're identical functions.
Is there something else going on that I'm not aware of?
Weird …