* * * * *
Is profiling even viable now?
Mark brought up (in email) an interesting optimization technique using GCC 3:
> I came across an interesting optimization that is GCC specific but quite
> clever.
>
> In lots of places in the Linux kernel you will see something like:
>
> >     p = get_some_object();
> >     if (unlikely(p == NULL))
> >     {
> >       kill_random_process();
> >       return (ESOMETHING);
> >     }
> >
> >     do_stuff(p);
> >
>
> The conditional is clearly an error path and as such means it is rarely
> taken. This is actually a macro defined like this:
>
> > #define unlikely(b) __builtin_expect(b, 0)
> >
>
> On newer versions of GCC this tells the compiler to expect the condition
> not to be taken. You could also tell the compiler that the branch is likely
> to be taken:
>
> > #define likely(b) __builtin_expect(b, 1)
> >
>
> So how does this help GCC anyhow? Well, on some architectures (PowerPC)
> there is actually a bit in the branch instruction to tell the CPU's
> speculative execution unit if the branch is likely to be taken. On other
> architectures it avoids conditional branches to make the “fast path” branch
> free (with -freorder-blocks).
>
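For reference, here's a minimal sketch of the two macros, with a fallback
guard of my own so the code still compiles where the builtin doesn't exist
(the kernel's actual definitions also wrap the condition in !!() to
normalize it to 0 or 1, since __builtin_expect() compares against the exact
value given):

    /* sketch only; the guard and fallbacks are my addition */
    #if defined(__GNUC__) && (__GNUC__ >= 3)
    #  define likely(b)   __builtin_expect(!!(b), 1)
    #  define unlikely(b) __builtin_expect(!!(b), 0)
    #else
    #  define likely(b)   (b)  /* no builtin; just the plain condition */
    #  define unlikely(b) (b)
    #endif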
I was curious to see if this would actually help any, so I found a machine
with GCC 3 installed (swift), compiled a version of mod_blog [1] with
profiling information, ran it, found a function that looked worth speeding
up, added some calls to __builtin_expect(), reran the code and got a rather
encouraging result.
I then reran the code, and got a completely different result.
In fact, each time I run the code, the profiling information I get is nearly
useless, to a degree anyway. For instance, one run:
Table: Each sample counts as 0.01 seconds.

      %   cumulative   self              self     total
     time   seconds   seconds    calls  ms/call  ms/call  name
    100.00      0.01     0.01   119529     0.00     0.00  line_ioreq
      0.00      0.01     0.00   141779     0.00     0.00  BufferIOCtl
      0.00      0.01     0.00    60991     0.00     0.00  line_readchar
      0.00      0.01     0.00    59747     0.00     0.00  ht_readchar
Then another run:
Table: Each sample counts as 0.01 seconds.

      %   cumulative   self              self     total
     time   seconds   seconds    calls  ms/call  ms/call  name
     33.33      0.01     0.01   119529     0.00     0.00  line_ioreq
     33.33      0.02     0.01    60991     0.00     0.00  line_readchar
     33.33      0.03     0.01    21200     0.00     0.00  ufh_write
      0.00      0.03     0.00   141779     0.00     0.00  BufferIOCtl
Yet another run:
Table: Each sample counts as 0.01 seconds (no time accumulated).

      %   cumulative   self              self     total
     time   seconds   seconds    calls  ms/call  ms/call  name
      0.00      0.00     0.00   141779     0.00     0.00  BufferIOCtl
      0.00      0.00     0.00   119529     0.00     0.00  line_ioreq
      0.00      0.00     0.00    60991     0.00     0.00  line_readchar
      0.00      0.00     0.00    59747     0.00     0.00  ht_readchar
And still another one:
Table: Each sample counts as 0.01 seconds.

      %   cumulative   self              self     total
     time   seconds   seconds    calls  ms/call  ms/call  name
     50.00      0.01     0.01    60991     0.00     0.00  line_readchar
     50.00      0.02     0.01     1990     0.01     0.01  HtmlParseNext
      0.00      0.02     0.00   141779     0.00     0.00  BufferIOCtl
      0.00      0.02     0.00   119529     0.00     0.00  line_ioreq
Like I said, nearly useless. Sure, the usual suspects like BufferIOCtl() and
line_ioreq() show up, but it's impossible to say what improvement, if any,
I'm getting by doing this. Part of the problem is that a whole run
accumulates only 0.01 to 0.03 seconds of sampled time, which at 0.01 seconds
per sample means the profiler caught the program in the act maybe one to
three times; no wonder the percentages swing wildly from run to run. And by
today's standards, swift isn't even a fast machine, being (only!) a 1.3GHz
(gigaHertz) Pentium III with half a gig of RAM (Random Access Memory). I can
only imagine how hopeless profiling would be on a faster machine.
I have to wonder what the Linux guys are smoking to think that, in the grand
scheme of things, __builtin_expect() will improve things all that much.
Unless they have access to better profiling mechanics than I do.
Looks like I might have to find a slower machine to get a better feel for how
to improve the speed of the program.
[1]
https://boston.conman.org/mod_blog.tar.gz