Small forays into multiprocessing programming II

* * * * *

Small forays into multiprocessing programming II

So, my attempt at parallelizing the Pythagoran Triples program [1]. I rewrote
it as a function to make it easier to parallelize:

> struct params
> {
> double start;
> double end;
> int count;
> }
>
> int pythagorean_triple(void *data)
> {
> struct params *p = data;
> double a;
> double b;
> double c;
>
> for (p->count = 0 , c = p->start ; c <= p->end ; c++)
> {
> for (b = 1 ; b < c ; b++)
> {
> a = sqrt(c * c - b * b);
> if (trunc(a) == a)
> p->count++;
> }
> }
> return(0);
> }
>

That way, I could then do:

> /*--------------------------------------
> ; the reason for these structures
> ; is that threads created with clone()
> ; only get a single pointer parameter.
> ; This way, I can pass in more data
> ; using that one pointer.
> ;
> ; And yes, I'm using C. Nyah.
> ;-------------------------------------*/
>
> struct param t1;
> struct param t2;
>
> t1.start = 2;
> t1.end = 16383;
> t2.start = 16384;
> t2.end = 32768;
>
> clone(pythagorean_triple,threadstack1,CLONE_VM,&t1);
> clone(pythagorean_triple,threadstack2,CLONE_VM,&t2);
>

And I was surprised that it only took one second less when the processing was
split using this method than when it ran not split at all.

I ran it several times, and yes, both threads were being created, but one was
finishing way sooner than I expected. It was then that I realized the reason—
it takes less time to count to a thousand than it does to thirty thousand.

D'uh!

Obvious in hindsight.

Basically, thread one was doing:

> for (c = 2 ; c <= 16383 ; c++)
> for (b = 1 ; b < c ; b++)
> ... ;
>

but thread two:

> for (c = 16384 ; c <= 32768 ; c++)
> for (b = 1 ; b < c ; b++)
> ... ;
>

Now … how to split the load evenly between two (or more) processors?
Obviously splitting the range in half wasn't cutting it, but is that the only
way to split the workload? What if I were to have one processor do even
numbers, and the other one odd numbers? Something like:

> for (c = 2 ; c <= 32768 ; c += 2)
> for (b = 1 ; b < c ; b++)
> ... ;
>
> for (c = 3 ; c <= 32768 ; c += 2)
> for (b = 1 ; b < c ; b++)
> ... ;
>

Yeah, that might work.

And it did—the workload was evenly distributed across the processors. The
code slightly changed though:

> struct params
> {
> double start;
> double end;
> double delta;
> int count;
> }
>
> int pythagorean_triple(void *data)
> {
> struct params *p = data;
> double a;
> double b;
> double c;
>
> for (
> p->count = 0 , c = p->start ;
> c <= p->end ;
> c += p->delta
> )
> {
> for (b = 1 ; b < c ; b++)
> {
> a = sqrt(c * c - b * b);
> if (trunc(a) == a)
> p->count++;
> }
> }
> return(0);
> }
>
> {
> struct params t1;
> struct params t2;
>
> t1.end = t2.end = 32768;
> t1.delta = t2.delta = 2;
> t1.start = 2;
> t2.start = 3;
>
> clone(pythagorean_triples,threadstack1,CLONE_VM,&t1);
> clone(pythagorean_triples,threadstack1,CLONE_VM,&t2);
> }
>

This concurrent programming certainly requires a different way of thinking
about things. And obvious ways of splitting up workloads don't always work
out as expected.

[1] gopher://gopher.conman.org/0Phlog:2007/05/21.1

Email author at [email protected]