* * * * *

                    Some musings on parallelizing programs

> Can you, in your favourite language, show me the easiest way to parallelise
> the following program (written in C)? I am just genuinely curious.
>
> > int gimme(int x)
> > {
> >   return x * x;
> > }
> >
> > int main(void)
> > {
> >   int i = 2, j = 4;
> >   int a = gimme(i),
> >       b = gimme(j);
> >   return a * b;
> > }
> >
>

“Write Me A Small Program [1]”

In my ideal world, any compiler worth its salt would optimize (mostly through
variable lifetime analysis and constant propagation) that program down to the
following:

> int main(void)
> {
>   return(64);
> }
>

but the author's main point is this: what languages today can easily
parallelize that code? On a massively parallel computer, the two calls to
gimme() are independent of each other and can therefore be computed at the
same time (assume, for the sake of argument, that gimme() takes a bit more
time to execute than a single function call and a multiply).
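
For what it's worth, here is roughly what the “easy” parallel version looks
like in C with POSIX threads. This is a sketch (error checking omitted), and
it shows just how much ceremony two independent function calls currently
require:

> #include <pthread.h>
>
> static int gimme(int x)
> {
>   return x * x;
> }
>
> /* adapter with the signature pthread_create() expects; the result
>    travels back through the same pointer the argument came in on */
> static void *gimme_thread(void *arg)
> {
>   int *p = arg;
>   *p = gimme(*p);
>   return NULL;
> }
>
> int main(void)
> {
>   pthread_t t;
>   int i = 2, j = 4;
>
>   pthread_create(&t,NULL,gimme_thread,&i); /* gimme(i) in another thread */
>   int b = gimme(j);                        /* gimme(j) in this one */
>   pthread_join(t,NULL);                    /* wait for the other thread */
>
>   return i * b;
> }
>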

And in an ideal world, a compiler should be able to figure out that gimme()
is, for lack of a better term, a “pure” function (one that does not rely upon
data outside of its input parameters and local variables). This also implies
that thread creation is cheaper than it is today (or easier, which is another
of the author's points), and that the compiler will take care of the messy
thread creation details for me.
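
As a half-step, GCC already lets the programmer assert purity by hand, though
it only uses the hint for optimizations like common subexpression
elimination, not for spawning threads:

> /* GCC-specific: "const" promises the result depends only on the
>    arguments (no globals, no side effects), so the compiler is free
>    to cache, merge, or reorder calls to it */
> int gimme(int x) __attribute__((const));
>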

But even an ideal compiler won't take us very far.

I was thinking about this as I was running tests on the greylist daemon [2]
last night. The program maxed out at around 130 requests per second, and this
on a dual-core 2.6GHz (Gigahertz) Intel Pentium system that hasn't even
touched swap space, so speed and memory aren't the issue (and remember, the
greylist daemon [3] keeps everything in memory).

But as written, the program only uses a single thread of execution. Could I
not create two threads (one for each CPU (Central Processing Unit) in the
machine) and double the speed?

Maybe, but the compiler won't help me much.

When a tuple [IP (Internet Protocol) address, sender address, recipient
address] comes in, the program scans a list of IP addresses which can accept
or reject that tuple. If the tuple isn't outright accepted or rejected based
on the IP address, then the sender address is checked, then the sender
domain, then the recipient address, then the recipient domain. Now, those
lists are, for the most part, read-only, so no coordination is required
between multiple threads for read access.
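
Here's a hypothetical sketch of that sequence (the types and function names
are mine, not the daemon's); since everything consulted is read-only, any
number of threads could run it at once without stepping on each other:

> enum { ACCEPT, REJECT, GREYLIST };
>
> struct tuple { char ip[16]; char from[256]; char to[256]; };
>
> extern int check_ip         (const char *);
> extern int check_from       (const char *);
> extern int check_from_domain(const char *);
> extern int check_to         (const char *);
> extern int check_to_domain  (const char *);
>
> /* fall through to the next check only while the answer is "keep
>    greylisting"; an outright accept or reject ends the scan */
> int check_tuple(struct tuple *t)
> {
>   int rc;
>
>   if ((rc = check_ip(t->ip))            != GREYLIST) return rc;
>   if ((rc = check_from(t->from))        != GREYLIST) return rc;
>   if ((rc = check_from_domain(t->from)) != GREYLIST) return rc;
>   if ((rc = check_to(t->to))            != GREYLIST) return rc;
>   return check_to_domain(t->to);
> }
>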

The actual greylist itself? The program checks to see if the tuple [IP
address, sender address, recipient address] already exists, and if not, adds
it to the existing list of tuples. And that part, adding the tuple to the
list, happens on just about every tuple that comes in. In effect, the tuple
list is “write-mostly.” And to ensure a consistent list of tuples, you need
to coordinate multiple threads when trying to update the list. And now you
get into semaphores, locking, and dining philosophers [4] …
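
A minimal sketch of that coordination, assuming a flat array and a single
global mutex (the actual daemon's data structures differ); the point is that
the scan and the insert have to happen as one atomic unit, or two threads can
both miss the tuple and both add it:

> #include <string.h>
> #include <pthread.h>
>
> #define MAX_TUPLES 65536
>
> struct tuple { char ip[16]; char from[256]; char to[256]; };
>
> static struct tuple    g_list[MAX_TUPLES];
> static size_t          g_count;
> static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
>
> /* return 1 if the tuple was already in the list, 0 if we added it */
> int find_or_add(const struct tuple *t)
> {
>   int found = 0;
>
>   pthread_mutex_lock(&g_lock);
>   for (size_t i = 0; i < g_count; i++)
>     if (   (strcmp(g_list[i].ip,   t->ip)   == 0)
>         && (strcmp(g_list[i].from, t->from) == 0)
>         && (strcmp(g_list[i].to,   t->to)   == 0))
>     {
>       found = 1;
>       break;
>     }
>   if (!found && (g_count < MAX_TUPLES))
>     g_list[g_count++] = *t;   /* insert under the same lock */
>   pthread_mutex_unlock(&g_lock);
>   return found;
> }
>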

Messy stuff.

I suppose each thread could get its own tuple list to update, but then you
have to ensure that any given tuple X always goes to the same thread. Easy
enough I suppose, but it's not something that you can leave up to the
compiler.
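
Something like hashing the tuple to pick a worker would do it. Again, a
hypothetical sketch (djb2 hash, two workers), not anything the daemon
actually does:

> #define NUM_THREADS 2
>
> struct tuple { char ip[16]; char from[256]; char to[256]; };
>
> /* hash all three fields so a given tuple always lands on the same
>    thread; each thread then updates its own list, no locks needed */
> static unsigned long tuple_hash(const struct tuple *t)
> {
>   const char    *field[3] = { t->ip , t->from , t->to };
>   unsigned long  h        = 5381;   /* djb2 */
>
>   for (int f = 0; f < 3; f++)
>     for (const char *p = field[f]; *p; p++)
>       h = (h * 33) + (unsigned char)*p;
>   return h;
> }
>
> static unsigned long pick_thread(const struct tuple *t)
> {
>   return tuple_hash(t) % NUM_THREADS;
> }
>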

In fact, in an ideal world, an ideal compiler would be hard-pressed to
automagically parallelize this program. Certain parts, maybe. But overall, it
would require thinking about how best to write the program to be parallelized
in the first place.

[1] http://freeshells.ch/~revence/par.txt
[2] gopher://gopher.conman.org/0Phlog:2007/10/12.1
[3] gopher://gopher.conman.org/0Phlog:2007/08/16.1
[4] http://en.wikipedia.org/wiki/Dining_philosophers_problem

Email author at [email protected]