Profiling is still viable, if you run the program long enough

* * * * *

Profiling is still viable, if you run the program long enough

So last night [1] I said, “I might have to find a slower machine to get a
better feel for how to improve the speed of [mod_blog [2]].” Well, I found
one other way—increase the amount of work the program does.

mod_blog, or at least the program I was working with last night, bp (for
“build page”) is responsible for building the HTML (HyperText Markup
Language) pages served up the webserver. In its default mode (which is what I
was working with last night) it generates the main index page, and the RSS
file, which only (only!) generates about 100,000 bytes of output and happened
too quickly to get any meaningful data out of profiling the program. But it
can also generate pages with hundreds of entries [3] given the right options,
so by having it generate a page with every entry in 2000 through 2003
inclusive (which generates a page that is 3,352,028 bytes in size) I was able
to get meaningful profiling information:

Table: Each sample counts as 0.01 seconds.
% time cumulative seconds self seconds calls self ms/call total ms/call name
------------------------------
29.73 0.22 0.22 5817769 0.00 0.00 line_ioreq
14.86 0.33 0.11 6650539 0.00 0.00 BufferIOCtl
10.81 0.41 0.08 2908866 0.00 0.00 ht_readchar
9.46 0.48 0.07 790760 0.00 0.00 ufh_write
5.41 0.52 0.04 2908398 0.00 0.00 line_readchar

Yes, with a run time of nearly 3 seconds (2.992) I was able to generate
consistant, meaningful profiling information (meaning, we went from calling
BufferIOCtl() 141,779 times to 6,650,539 times) which was good enough to see
if the GCC optimzation of using -freorder-blocks with __builtin_expect()
would help any, at least on the Intel x86 line.

Well, after running a dozen tests, I can say that it does help on the Intel
platform, but at best, using __builtin_expect() with -freorder-blocks will
only give you a few percent boost in speed. As in, single digit percentage
boost.

In certain cases.

Certainly, when I did tests generating the output to /dev/null you could see
the boost, but in running tests generating an actual file (as opposed to just
tossing the data in the bit bucket), it's not quite so clear cut (and for
each test, I ran it five times, taking the timings from the fifth run—this to
help smooth out any system caching effects). The best improvement came when
using -O3 -march=pentium3 -fomit-frame-pointer to compile the program, but it
was still hardly noticable from a user perspective (maybe about a tenth or
two tenths of a second).

Mark is expecting __builtin_expect() to have better impact on systems where
the CPU (Central Processing Unit) will use the hint.

[1] gopher://gopher.conman.org/0Phlog:2004/07/10.1
[2] https://boston.conman.org/mod_blog.tar.gz
[3] https://boston.conman.org/about/technical.html

Email author at [email protected]