* * * * *

                        Is profiling even viable now?

Mark brought up (in email) an interesting optimization technique using GCC 3:

> I came across an interesting optimization that is GCC specific but quite
> clever.
>
> In lots of places in the Linux kernel you will see something like:
>
> > p = get_some_object();
> > if (unlikely(p == NULL))
> > {
> >   kill_random_process();
> >   return (ESOMETHING);
> > }
> >
> > do_stuff(p);
> >
>
> The conditional is clearly an error path and as such means it is rarely
> taken. This is actually a macro defined like this:
>
> > #define unlikely(b)   __builtin_expect(b, 0)
> >
>
> On newer versions of GCC this tells the compiler to expect the condition
> not to be taken. You could also tell the compiler that the branch is likely
> to be taken:
>
> > #define likely(b)     __builtin_expect(b, 1)
> >
>
> So how does this help GCC anyhow? Well, on some architectures (PowerPC)
> there is actually a bit in the branch instruction to tell the CPU's
> speculative execution unit if the branch is likely to be taken. On other
> architectures it avoids conditional branches to make the “fast path” branch
> free (with -freorder-blocks).
>
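
A minimal sketch of how these macros might be used outside the kernel follows. Two things in it are not in Mark's email and are assumptions: the !!(b) normalization (so any non-zero value counts as true, which matters for likely() on values other than 0 or 1), and the fallback for compilers without __builtin_expect(). The function sum_checked() is hypothetical, made up purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: GCC 3 or later (or a compatible compiler such as Clang)
   provides __builtin_expect(); anything else gets a plain no-op. */
#if defined(__GNUC__) && __GNUC__ >= 3
#  define likely(b)   __builtin_expect(!!(b), 1)
#  define unlikely(b) __builtin_expect(!!(b), 0)
#else
#  define likely(b)   (b)
#  define unlikely(b) (b)
#endif

/* Hypothetical example: sum an array, treating a NULL array as the
   rarely taken error path. */
long sum_checked(const int *v, size_t n, int *err)
{
  long   total = 0;
  size_t i;

  assert(err != NULL);

  if (unlikely(v == NULL))
  {
    *err = 1;           /* error path, hinted as rare */
    return 0;
  }

  for (i = 0; i < n; i++)
    total += v[i];      /* the "fast path" */

  *err = 0;
  return total;
}
```

The hint only influences how the compiler lays out the generated code, not what the code computes, so sum_checked() returns the same values with or without it.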

I was curious to see if this would actually help any, so I found a machine
that had GCC 3 installed (swift), compiled a version of mod_blog [1] with
profiling information, ran it, found a function that looked good to speed up,
added some calls to __builtin_expect(), reran the code and got a rather
encouraging, or at least interesting, result.

I then reran the code, and got a completely different result.

In fact, each time I run the code, the profiling information I get is nearly
useless—well, to a degree. For instance one run:

Table: Each sample counts as 0.01 seconds.
% time  cumulative seconds  self seconds   calls  self ms/call  total ms/call  name
------  ------------------  ------------  ------  ------------  -------------  -------------
100.00                0.01          0.01  119529          0.00           0.00  line_ioreq
  0.00                0.01          0.00  141779          0.00           0.00  BufferIOCtl
  0.00                0.01          0.00   60991          0.00           0.00  line_readchar
  0.00                0.01          0.00   59747          0.00           0.00  ht_readchar

Then another run:

Table: Each sample counts as 0.01 seconds.
% time  cumulative seconds  self seconds   calls  self ms/call  total ms/call  name
------  ------------------  ------------  ------  ------------  -------------  -------------
 33.33                0.01          0.01  119529          0.00           0.00  line_ioreq
 33.33                0.02          0.01   60991          0.00           0.00  line_readchar
 33.33                0.03          0.01   21200          0.00           0.00  ufh_write
  0.00                0.03          0.00  141779          0.00           0.00  BufferIOCtl

Yet another run:

Table: Each sample counts as 0.01 seconds; no time accumulated.
% time  cumulative seconds  self seconds   calls  self ms/call  total ms/call  name
------  ------------------  ------------  ------  ------------  -------------  -------------
  0.00                0.00          0.00  141779          0.00           0.00  BufferIOCtl
  0.00                0.00          0.00  119529          0.00           0.00  line_ioreq
  0.00                0.00          0.00   60991          0.00           0.00  line_readchar
  0.00                0.00          0.00   59747          0.00           0.00  ht_readchar

And still another one:

Table: Each sample counts as 0.01 seconds.
% time  cumulative seconds  self seconds   calls  self ms/call  total ms/call  name
------  ------------------  ------------  ------  ------------  -------------  -------------
 50.00                0.01          0.01   60991          0.00           0.00  line_readchar
 50.00                0.02          0.01    1990          0.01           0.01  HtmlParseNext
  0.00                0.02          0.00  141779          0.00           0.00  BufferIOCtl
  0.00                0.02          0.00  119529          0.00           0.00  line_ioreq

Like I said, nearly useless. Sure, there are the usual suspects, like
BufferIOCtl() and line_ioreq(), but with only zero to three samples per run
it's impossible to say what improvement, if any, I'm getting by doing this.
And by today's standards, swift isn't a fast machine, being only (only!) a
1.3GHz Pentium III with half a gig of RAM. I can only imagine how impossible
it would be to profile this on a faster machine, or what could even be
profiled on one.
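
At 0.01 seconds per sample, the four runs above collected one, three, zero and two samples respectively, which is why the percentages jump around so much. One way to get a steadier number without a slower machine, sketched here under assumptions, is to repeat the code of interest enough times that the total dwarfs the timer resolution and time the whole batch with the POSIX clock_gettime() call. Both time_repeated() and demo_workload() are hypothetical names; nothing like them appears in mod_blog.

```c
/* Request clock_gettime() even under strict -std= modes. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif

#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the code being measured; it only bumps a
   counter so the number of calls can be checked afterwards. */
void demo_workload(void *arg)
{
  unsigned long *count = arg;
  (*count)++;
}

/* Call fn(arg) reps times and return the elapsed wall-clock seconds
   for the whole batch, measured with the monotonic clock. */
double time_repeated(void (*fn)(void *), void *arg, unsigned long reps)
{
  struct timespec start;
  struct timespec stop;
  unsigned long   i;

  assert(fn != NULL);

  clock_gettime(CLOCK_MONOTONIC, &start);
  for (i = 0; i < reps; i++)
    fn(arg);
  clock_gettime(CLOCK_MONOTONIC, &stop);

  return (double)(stop.tv_sec - start.tv_sec)
       + (double)(stop.tv_nsec - start.tv_nsec) / 1e9;
}
```

Dividing the result of time_repeated(demo_workload, &count, reps) by reps gives an average per-call time. Comparing that figure before and after adding __builtin_expect() is only meaningful if reps is large enough for the total to run well past the clock's resolution. On older versions of glibc, clock_gettime() also needs -lrt at link time.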

I have to wonder what the Linux guys are smoking to think that, in the grand
scheme of things, __builtin_expect() will improve things all that much.

Unless they have access to better profiling mechanics than I do.

Looks like I might have to find a slower machine to get a better feel for how
to improve the speed of the program.

[1] https://boston.conman.org/mod_blog.tar.gz
