An implementation of coroutines for C

* * * * *

An implementation of coroutines for C

Obligatory Sidebar Links

* Faster Fibers/Coroutines - 1024cores [1]
* Coroutines in less than 20 lines of standard C - Tony Finch's blog [2]
* stevedekorte/coroutine: C multiplatform coroutine implementation via
ucontext, fibers or setjmp. [3]
* Libtask: a Coroutine Library for C and Unix [4]
* Coroutines in one page of C [5]
* Coroutines in C [6]
* mpu/gthreads at code0 [7]

I use coroutines [8] in Lua [9] and I do miss having them in C. Sure, there
are a number of coroutine implementations for C, but they all generally fall
into two camps:

1. they rely upon setjmp()/longjmp(), which is problematic because it is
seriously abusing those two functions (which exist to give C a form of
exception) and
2. they rely upon ucontext.h, which has been deprecated by POSIX (and the
API is a bit clunky for what I want to do)

But really, I just wanted to write my own implementation [10], because.

So the idea is to write some code that works like:

-----[ C ]-----
extern int coroutine_create(coroutine__s **pcp,coroutine_function__f fun);
extern uintptr_t coroutine_yield (coroutine__s *co,uintptr_t value);
extern int coroutine_free (coroutine__s *co);

uintptr_t sub_task(coroutine__s *self,uintptr_t value) /* [1] */
{
value = coroutine_yield(self,value); /* [2] */
/* [3] */
value = coroutine_yield(self,value); /* [4] */
/* [5] */
value = coroutine_yield(self,value); /* [6] */
/* [7] */
return value;
}

void main_task(void)
{
coroutine__s *co;
uintptr_t v;

coroutine_create(&co,sub_task); /* [8] magic here! */
/* [ 9] */
v = coroutine_yield(co,v); /* [10] */
/* [11] */
v = coroutine_yield(co,v); /* [12] */
/* [13] */
v = coroutine_yield(co,v); /* [14] */
/* [15] */
v = coroutine_yield(co,v); /* [16] */
/* [17] */
v = coroutine_yield(co,v); /* [18] */
/* [19] */
coroutine_free(co);
}
-----[ END OF LINE ]-----

It's a contrived example that one would not use coroutines for, but it does
serve to illustrate the issue that popped up while developing the code for
this. And I"m going to start coroutine_yield(), as that does the actual
switching of the stack to another “unit of execution” (note: this code is for
the Intel 32-bit x86 architecture):

-----[ Assembly ]-----
%assign P_param 8 + 16
%assign P_co 4 + 16

coroutine_yield32:
push ebp ; save callee saved registers
push ebx
push esi
push edi

mov eax,[esp + P_param] ; return parameter
mov edx,[esp + P_co] ; get stack to yield to
xchg esp,[edx] ; YIELD!

pop edi ; retore registers
pop esi
pop ebx
pop ebp
ret
-----[ END OF LINE ]-----

Since this is interfacing with C, I have to use the x86 32-bit calling
convention [11] (and for the record, I'm using the Intel syntax, not the AT&T
syntax). Parameters are passed on the stack, and the callee (in this case,
coroutine_yield32()) needs to save certain registers.

Normally, when switching a “unit of execution” such as a thread or process,
one needs to save the entire CPU state. But I can cheat here—I'm calling a
function, so I can skip saving registers the callee can use (read: trash),
which saves a bit of time in the switching. So that's what's going on here. I
have the registers that the C calling convention require saving, putting
P_param into EAX to return it, get the pointer to the stack we're switching
to and at the line that states “YIELD!” we switch the “units of execution.”
The final five instructions are running under the coroutine, where we pull
the registers saved and return into our now running coroutine.

But now here's the problem—this assumes the stack for the coroutine is
properly initialized. Refering back to the C code, line 12 will yield back to
line 3 and it works there because everything has been set up. But line 10 is
problematic—that's the first switching of execution, and we haven't actually
started sub_task(), which is expecting arguments already existing on the
stack. Furthermore, for the C calling convention to work, we need to actually
call sub_task(). I really don't want to mess up coroutine_yield() with
special code to handle that case (that's just … ugly). I want to handle this
cleanly.

So the first coroutine_yield() needs to call into (as per our example)
sub_task(). The code for that looks like:

-----[ Assembly ]-----
push eax ; return from coroutine_yield
push <coroutine self parameter>
call <our function>
-----[ END OF LINE ]-----

Setting aside where we'll get the coroutine self paramter and the address for
the function, we just need to ensure that our first call to coroutine_yield()
resumes to this code fragment. And we can do that in the coroutine_create()—
initialize the stack of the coroutine properly such that that happens. So
let's name our fragment:

-----[ Assembly ]-----
start_it_up: push eax ; return from coroutine_yield
push <coroutine self parameter>
call <our function>
-----[ END OF LINE ]-----

and we can initialize the coroutine stack:

-----[ Assembly ]-----
mov dword [ecx + 16],start_it_up
xor eax,eax
mov [ecx + 8],eax ; "saved" EBX
mov [ecx + 4],eax ; "saved" ESI
mov [ecx + 0],eax ; "saved" EDI
mov [edx],ecx
-----[ END OF LINE ]-----

For now, just accept that we have the new coroutine stack pointer in ECX (the
final version uses ECX but I don't want to spoil things too much at this
point). This populates the stack with the values needed for coroutine_yield()
to fall into our code fragment, which is techincally a thunk [12]. Now we
turn our attention to saving the data required for our new thunk to call our
function.

Now, on the 32-bit x86, a classical stack frame will look something like
this:

Table: Typical stack frame
offset from EBP contents
------------------------------
12 parameter 2
8 parameter 1
4 return address
0 previous stack frame address (previous EBP)
-4 local variable 1
-8 local variable 2

The thunk doesn't need paramaters, nor does it need the return address or
even a previous stack frame. We just need some local variables. So set up the
stack like:

Table: Coroutine stack contents
EBP of coroutine
coroutine pointer
address of sub_task()
address of start_it_up
stack frame for start_it_up
“saved” EBX
“saved” ESI
ESP of coroutine “saved” EDI

We can fix start_it_up:

-----[ Assembly ]-----
%assign L_co -4
%assign L_fun -8

start_it_up: push eax
push dword [ebp + L_co]
call [ebp + L_fun]
-----[ END OF LINE ]-----

And with our C example, this will get us to through line 15. At line 16 we
have an issue, where we resume at line 7 and our coroutine now returns. Well,
we did call it, so we get its return value back to our thunk. Well, the easy
thing here is to just yield it back. And since we have the stack set for a
call, we can save some instructions:

-----[ Assembly ]-----
%assign L_co -4
%assign L_fun -8
%assign C_param -12

start_it_up: push eax
push dword [ebp + L_co]
call [ebp + L_fun]

mov ebp + C_param],eax
call coroutine_yield32
-----[ END OF LINE ]-----

And that will get us to line 18. But now we no longer have a running
coroutine and we've run off the bottom of our thunk. There are two options
here:

1. call (or JMP) to abort();
2. just yield the value back.

Both are valid responses, but I like the second one better as you might not
know if a coroutine has finished or not. And that just requires one more
instruction to start_it_up:

-----[ Assembly ]-----
%assign L_co -4
%assign L_fun -8
%assign C_param -12

start_it_up: push eax
push dword [ebp + L_co]
call [ebp + L_fun]

do_it_again: mov ebp + C_param],eax
call coroutine_yield32
jmp do_it_again
-----[ END OF LINE ]-----

And there you go—coroutines for C [13].

The 64-bit version [14] is pretty much the same—just that the registers
needed to be saves are different, and the parameters are passed in registers
instead of the stack, but overall, it's the same approach.

Should this code be used in production? I don't know. It works for Linux
(both 32 and 64 bit versions) and for Mac OS-X (64 bit version). And while
you can use setjmp()/longjmp(), you **CANNOT** do so across coroutine stacks
(within the same coroutine—fine). And this has only been tested for C,
**NOT** for C++. I don't know enough about C++ (or its calling conventions or
exception handling) to recommend this for that.

But really that's all there is to it for coroutines in C.

And the final question—what are coroutines good for? That's for another post.

[1] http://www.1024cores.net/home/lock-free-algorithms/tricks/fibers
[2] http://fanf.livejournal.com/105413.html
[3] https://github.com/stevedekorte/coroutine
[4] https://swtch.com/libtask/
[5] http://yosefk.com/blog/coroutines-in-one-page-of-c.html
[6] http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html
[7] https://github.com/mpu/gthreads/tree/code0
[8] gopher://gopher.conman.org/0Phlog:2007/01/14.1
[9] http://www.lua.org/
[10] https://github.com/spc476/C-Coroutines
[11] https://idea.popcount.org/2013-07-16-baby-steps-in-x86-assembly/
[12] https://en.wikipedia.org/wiki/Thunk
[13] https://github.com/spc476/C-Coroutines/blob/master/coroutine_yield-x86-32.asm
[14] https://github.com/spc476/C-Coroutines/blob/master/coroutine_yield-x86-64.asm

Email author at [email protected]