What are coroutines and why should I care?
In The Art of Computer Programming, Donald Knuth presents coroutines as an alternative to the usual caller/callee idiom: instead of one routine being subordinate to the other, two pieces of code are treated as cooperating equals.
Coroutines can be thought of as language-level constructs that generalize subroutines by providing multiple entry and exit points. A normal subroutine has a single entry point and one or more exit (return) points. A coroutine can leave its control flow at one spot and later be re-entered at that same spot, which allows for more expressive code, preservation of automatic (local) state across calls, and nonpreemptive multitasking.
Different programming languages provide different levels of support for coroutines, e.g.
- Languages supporting the yield keyword
- Languages providing full support for async, await, yield
In this article we’ll focus on the former.
Thoughtful use of coroutines can lead to cleaner and more maintainable code in a variety of situations. As a motivating example, consider the following pseudocode:

    function start_accepting() {
      socket.async_accept(accept_handler);
    }

    function accept_handler() {
      socket.async_read(read_handler);
      start_accepting(); // re-arm the acceptor for the next connection
    }

    function read_handler(data) {
      request = parse_data(data);
      switch(request) {
        case SEND_DATA: {
          data_to_send = prepare_data_for(request);
          socket.async_write(data_to_send, write_handler);
        } break;
      };
    }

    function write_handler() {
      ... // continue execution
    }
Asynchronous programming is often the preferred way of performing potentially blocking operations without stalling the calling thread. In the pseudocode above we’re assuming (and omitting for clarity’s sake) that all operations are queued and handled by an event loop manager, a common and powerful idiom in asynchronous application programming (cf. boost::asio).
Coroutines allow modeling the same behavior with more readable code:

    coroutine acceptor() {
      while(true) {
        socket.async_accept_connection(yield); // To event manager
        start_coroutine(serve_request);
      }
    }

    coroutine serve_request() {
      socket.async_read(data, yield);
      request = parse_data(data);
      switch(request) {
        case SEND_DATA: {
          data_to_send = prepare_data_for(request);
          socket.async_write(data_to_send, yield);
          ... // Continue execution
        } break;
      };
    }
The code in serve_request() uses a sequential-looking paradigm.
Coroutines let you create a structure that mirrors the actual program logic. Asynchronous operations don’t split functions, because there are no handlers to define what should happen when an asynchronous operation completes. Instead of having handlers call each other, the program can use a sequential structure.
Standard support
At the time of writing this article, coroutines did not make it into the next standard version (C++17), although recent MSVC versions already ship with an /await option to test experimental support for coroutines.
Coroutines internals
It is important to understand the role of coroutines in providing cooperative, non-preemptive multitasking: spawning a coroutine does not spawn a new thread of execution. Instead, a coroutine gives up execution explicitly, by yielding either to its caller (asymmetric coroutines) or to another coroutine (symmetric coroutines).
Since coroutines have been known for a relatively long time, many implementation techniques, both at the language level and at the system level, have been devised (an interesting suggested reading: coroutines based on Duff’s device).
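To get a feeling for the Duff’s device flavor of the idea, here is a minimal sketch (not taken from any library) of a stackless generator. A switch statement jumps back into the middle of a loop, and everything that has to survive between calls lives in the object rather than on the stack. The `countdown` type and the `collect` helper are hypothetical names made up for this illustration.

```cpp
#include <vector>

// Hypothetical example: a stackless generator in the spirit of the
// Duff's device trick. State that must survive a "yield" is kept in
// members, because the real stack frame is gone after each return.
struct countdown {
    int state = 0;  // which resume point to jump to on the next call
    int i = 0;      // loop variable, preserved across calls

    // Writes the next value to `out` and returns true,
    // or returns false once the sequence is exhausted.
    bool resume(int from, int& out) {
        switch (state) {
        case 0:  // first entry: start the loop
            for (i = from; i > 0; --i) {
                out = i;
                state = 1;
                return true;  // "yield"
        case 1:;              // re-entry point, right after the yield
            }
            state = 2;
            break;
        default:  // already finished
            break;
        }
        return false;
    }
};

// Drains the generator into a vector.
std::vector<int> collect(int from) {
    countdown c;
    std::vector<int> values;
    int v = 0;
    while (c.resume(from, v)) values.push_back(v);
    return values;
}
```

Calling `collect(3)` produces 3, 2, 1. The price of this approach is that only the outermost function can yield, which is the limitation that stackful coroutines remove.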
Implementing basic support for asymmetric stackful coroutines can be a rewarding experience in terms of understanding how coroutines interact with their callers and with the underlying memory. Most of the code that will be presented is a pruned-down version of the coroutine implementation by Oliver Kowalke (cf. boost::coroutine2) available with the boost libraries.
Abstracting away the execution context
In order to implement a coroutine context switch (in its simplest form from the callee to the caller) we need a mechanism to abstract the execution state of our routines and to save/restore stack, registers, CPU flags, etc.
A context switch between threads on x86_64 can be quite costly in terms of performance since it also involves a trip through the kernel. Coroutine context switches within the same thread are far more lightweight and require no kernel interaction. The building blocks for userland context switches are contained in the boost/context library. It offers a solid abstraction that can provide ready-to-use stacks (either basic malloc’d stack buffers or split stacks on supported architectures) to store our context data, or complement system-level constructs (e.g. ucontext on Unix systems, Fibers on Windows). Note that boost::context used to support Windows fibers because of undocumented TEB-swapping related issues; after fixes were deployed, that support was dropped with the introduction of execution context v2.
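For comparison with the system-level constructs mentioned above, the following sketch performs a userland switch with the POSIX ucontext functions (`getcontext`, `makecontext`, `swapcontext`). It is only an illustration and assumes a Linux system with glibc; `run_ping_pong`, `worker` and `trace` are names invented for this example. Keep in mind that `swapcontext` typically also saves and restores the signal mask, which costs a system call, whereas the assembly routine developed in this article avoids system calls entirely.

```cpp
#include <ucontext.h>
#include <string>
#include <vector>

namespace {
ucontext_t caller_ctx;           // context of the code that starts the worker
ucontext_t worker_ctx;           // context running on its own stack
std::vector<std::string> trace;  // records the order of execution

void worker() {
    trace.push_back("worker: first half");
    swapcontext(&worker_ctx, &caller_ctx);  // hand control back to the caller
    trace.push_back("worker: second half");
    // Returning from here resumes uc_link, i.e. caller_ctx.
}
}  // namespace

std::vector<std::string> run_ping_pong() {
    trace.clear();
    std::vector<char> stack(64 * 1024);  // fixed-size stack for the worker

    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp = stack.data();
    worker_ctx.uc_stack.ss_size = stack.size();
    worker_ctx.uc_link = &caller_ctx;  // where to go when worker() returns
    makecontext(&worker_ctx, worker, 0);

    trace.push_back("caller: start");
    swapcontext(&caller_ctx, &worker_ctx);  // run worker until it switches back
    trace.push_back("caller: resumed");
    swapcontext(&caller_ctx, &worker_ctx);  // let worker finish
    trace.push_back("caller: done");
    return trace;
}
```

No thread is created at any point: the caller and the worker simply take turns on the same thread, each on its own stack.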
In this article we’ll go for an fcontext implementation in assembler on an x86_64 Unix system (no system calls involved).
Saving the state
When a coroutine yields, an fcontext switch should occur, i.e. we should save whatever state the routine was in at that point in time and JMP to another context. On a recent Unix system, calling conventions, object and executable file formats and other low-level ABI issues are defined by the System V ABI. On an x86_64 architecture the stack grows downwards and the first six integer or pointer parameters are passed in registers rdi, rsi, rdx, rcx, r8, r9, with additional stack space used if needed. The stack is always 16-byte aligned before a call instruction is issued. Registers rbx, rsp, rbp, r12, r13, r14, and r15 are preserved across function calls, while rax, rdi, rsi, rdx, rcx, r8, r9, r10, r11 are scratch registers:
Return value | Parameter Registers | Scratch Registers | Preserved Registers |
---|---|---|---|
rax, rdx | rdi, rsi, rdx, rcx, r8, r9 + additional_stack | rax, rdi, rsi, rdx, rcx, r8, r9, r10, r11 | rbx, rsp, rbp, r12, r13, r14, r15 |
Therefore, following in boost::context’s footsteps, a reasonable memory layout is the following:

    /****************************************************************************************
     *                                                                                      *
     *  ----------------------------------------------------------------------------------  *
     *  |    0    |    1    |    2    |    3    |    4     |    5    |    6    |    7    |  *
     *  ----------------------------------------------------------------------------------  *
     *  |   0x0   |   0x4   |   0x8   |   0xc   |   0x10   |   0x14  |   0x18  |   0x1c  |  *
     *  ----------------------------------------------------------------------------------  *
     *  |        R12        |         R13       |         R14        |        R15        |  *
     *  ----------------------------------------------------------------------------------  *
     *  ----------------------------------------------------------------------------------  *
     *  |    8    |    9    |   10    |   11    |    12    |    13   |    14   |    15   |  *
     *  ----------------------------------------------------------------------------------  *
     *  |   0x20  |   0x24  |   0x28  |  0x2c   |   0x30   |   0x34  |   0x38  |   0x3c  |  *
     *  ----------------------------------------------------------------------------------  *
     *  |        RBX        |         RBP       |         RIP        |       EXIT        |  *
     *  ----------------------------------------------------------------------------------  *
     *                                                                                      *
     ****************************************************************************************/
The EXIT field is going to be left unused for our purposes but it will be left in place anyway.
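One way to double-check the offsets in the layout above is to mirror it with a plain struct and let the compiler confirm the numbers. This is only an illustrative sketch: `context_data`, `context_data_size` and `context_rip_offset` are hypothetical names, and the values they yield are what constants such as UNIX_CONTEXT_DATA_SIZE and UNIX_CONTEXT_DATA_RIP_OFFSET would presumably hold for this layout.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the context-data layout: eight 64-bit slots,
// in the order in which the assembly routine pops them.
struct context_data {
    std::uint64_t r12;      // 0x00
    std::uint64_t r13;      // 0x08
    std::uint64_t r14;      // 0x10
    std::uint64_t r15;      // 0x18
    std::uint64_t rbx;      // 0x20
    std::uint64_t rbp;      // 0x28
    std::uint64_t rip;      // 0x30, the address execution resumes from
    std::uint64_t exit_fn;  // 0x38, the EXIT slot (unused in this article)
};

constexpr std::size_t context_data_size = sizeof(context_data);
constexpr std::size_t context_rip_offset = offsetof(context_data, rip);

static_assert(context_data_size == 0x40, "eight 8-byte slots");
static_assert(context_rip_offset == 0x30, "RIP follows six saved registers");
```

The six callee-saved registers take up the first 48 bytes, so the resume address sits at offset 0x30.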
The first thing we need is to allocate space to store the context data and make sure it has a valid alignment for the architecture we’re dealing with:

    // Allocate context-stack space
    context_stack = (void*)malloc(64_Kb);

    std::size_t space = UNIX_CONTEXT_DATA_SIZE + 64;
    sp = static_cast<char*>(context_stack) + 64_Kb - space;

    sp = std::align(64, UNIX_CONTEXT_DATA_SIZE, sp, space);
    assert(sp != nullptr && space >= UNIX_CONTEXT_DATA_SIZE);
boost::context offers both memory-efficient on-demand growing stacks and fixed stack allocations (cf. boost docs). In this example code we’ll go for a fixed stack allocation.
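The pointer arithmetic used to place the context data can be isolated in a small helper to make its guarantees explicit. The sketch below is an assumption-laden illustration (`place_context` is a made-up name, and the 64-byte alignment simply follows the snippet above): it reserves the context area near the top of a fixed buffer, since the stack grows downwards, and relies on the fact that `std::align` only ever moves the pointer upwards.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical helper: returns a 64-byte aligned pointer near the top of
// [base, base + stack_bytes) with at least `context_bytes` bytes available
// above it, or nullptr if the buffer is too small.
void* place_context(void* base, std::size_t stack_bytes,
                    std::size_t context_bytes) {
    const std::size_t alignment = 64;
    // Reserve the context area plus some slack, because std::align
    // may need to move the pointer up by as much as alignment - 1 bytes.
    std::size_t space = context_bytes + alignment;
    if (stack_bytes < space) return nullptr;

    void* sp = static_cast<char*>(base) + stack_bytes - space;
    // On success std::align adjusts sp and space and returns the aligned sp.
    return std::align(alignment, context_bytes, sp, space);
}
```

Because of the extra slack, the aligned context area always ends at or below the top of the buffer.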
Since we can’t deal with registers directly in C++ we’ll have to fall back on a pure assembly routine. The GNU assembler (GAS) seems the logical tool of choice for this work. We therefore define an external function with C linkage to link against our executable:
extern "C" fcontext_t jump_to_context(fcontext_t context);
What is an fcontext_t? In an x86_64 world it is just a register’s content:
using fcontext_t = void*;
Luckily for us RIP will have already been set by the fact we’re invoking jump_to_context with a CALL instruction so we get an instruction pointer on the stack for free in our assembly code:
    .text
    .globl jump_to_context
    .type jump_to_context,@function
    .align 16
    jump_to_context:
        pushq  %rbp  /* save RBP */
        pushq  %rbx  /* save RBX */
        pushq  %r15  /* save R15 */
        pushq  %r14  /* save R14 */
        pushq  %r13  /* save R13 */
        pushq  %r12  /* save R12 */

        /* omitted */

        /* restore RSP (pointing to context-data) from RDI */
        movq  %rdi, %rsp

        popq  %r12  /* restore R12 */
        popq  %r13  /* restore R13 */
        popq  %r14  /* restore R14 */
        popq  %r15  /* restore R15 */
        popq  %rbx  /* restore RBX */
        popq  %rbp  /* restore RBP */

        /* continue... */

    .size jump_to_context,.-jump_to_context

    /* Mark that we don't need executable stack. */
    .section .note.GNU-stack,"",%progbits
Using CMake, putting everything together becomes quite easy:

    cmake_minimum_required(VERSION 2.8.12)
    project(simple_crts CXX ASM)

    set (CMAKE_CXX_STANDARD 14)
    set_source_files_properties(jump_to_context_x86_64_elf_gas.S
                                PROPERTIES COMPILE_FLAGS "-x assembler-with-cpp")

    add_executable(simple_crts simple_crts.cpp jump_to_context_x86_64_elf_gas.S)
    target_link_libraries(simple_crts ${CONAN_LIBS} pthread)
Trampolines to coroutines
Something is missing at this point: we need a valid RIP pointer to the coroutine to jump to. We could enter the coroutine and have another function store this information for us, but there’s a better way which avoids cluttering the coroutine code entirely: using a trampoline function.
Just as in boost::context, we define a trampoline function ourselves which, when jumped to, re-jumps to the caller and saves its context as a pre-stage for the coroutine:
    void trampoline(fcontext_t ctx) {
      yield_ctx = jump_to_context(ctx);
      wannabe_coroutine();
    }
What we have to do now is a simplified version of the make_context routine to set the first RIP towards the trampoline’s prologue:
    // Do make_context's work (simplified)
    // Do *NOT* try this at home (or even worse in the office)
    void** addr = reinterpret_cast<void**>(static_cast<char*>(sp) +
                                           UNIX_CONTEXT_DATA_RIP_OFFSET);
    *addr = reinterpret_cast<void*>(&trampoline);

    // In a more complex case there might be additional initialization and
    // frame adjustments going on
    coroutine_ctx = jump_to_context(sp);
So right now we have a valid trampoline RIP set in place:
    /****************************************************************************************
     *                                                                                      *
     *  ----------------------------------------------------------------------------------  *
     *  |    0    |    1    |    2    |    3    |    4     |    5    |    6    |    7    |  *
     *  ----------------------------------------------------------------------------------  *
     *  |   0x0   |   0x4   |   0x8   |   0xc   |   0x10   |   0x14  |   0x18  |   0x1c  |  *
     *  ----------------------------------------------------------------------------------  *
     *  |        R12        |         R13       |         R14        |        R15        |  *
     *  ----------------------------------------------------------------------------------  *
     *  ----------------------------------------------------------------------------------  *
     *  |    8    |    9    |   10    |   11    |    12    |    13   |    14   |    15   |  *
     *  ----------------------------------------------------------------------------------  *
     *  |   0x20  |   0x24  |   0x28  |  0x2c   |   0x30   |   0x34  |   0x38  |   0x3c  |  *
     *  ----------------------------------------------------------------------------------  *
     *  |        RBX        |         RBP       |         RIP        |       EXIT        |  *
     *  ----------------------------------------------------------------------------------  *
     *                                           ^^^^^^^^^^^^^^^^^^^^                       *
     ****************************************************************************************/
This kickstarts the bouncing to/from the trampoline:
    .text
    .globl jump_to_context
    .type jump_to_context,@function
    .align 16
    jump_to_context:
        pushq  %rbp
        pushq  %rbx
        pushq  %r15
        pushq  %r14
        pushq  %r13
        pushq  %r12

        /* store RSP (pointing to context-data) in RAX */
        movq  %rsp, %rax

        movq  %rdi, %rsp

        popq  %r12
        popq  %r13
        popq  %r14
        popq  %r15
        popq  %rbx
        popq  %rbp

        /* restore return-address (must have been put on the new stack) */
        popq  %r8

        /* pass the old context as first parameter (if we're headed towards a landing function) */
        movq  %rax, %rdi

        /* indirect jump to new context */
        jmp  *%r8

    .size jump_to_context,.-jump_to_context
    .section .note.GNU-stack,"",%progbits
It is important to note that we’re keeping the stack aligned during this entire process (recall that the stack has to be 16-byte aligned before a call instruction is issued).
The process roughly goes on like this:
Note that the trampoline function might reserve stack space for its parameters as well. In the code above we allocated 64 KB of heap space to be used as stack space for context operations, so after the first jump the sp automatic variable is no longer reliable and coroutine_ctx should be used instead.
Resuming fcontext
Resuming the trampoline’s fcontext requires another call (which saves RIP again) and a stack pointer adjustment to coroutine_ctx. The trampoline’s old RIP will be available for free after we’ve restored the first 48 bytes of the fcontext.
Execution can afterwards continue to the designated coroutine. At this point the coroutine should be somehow encapsulated to be able to use the yield_ctx context pointer: that is the gateway to our (in an asymmetric view) caller context.
Each time we want to yield execution back to the caller we’ll have to jump_to_context to the yield_ctx:
    void yield() {
      yield_ctx = jump_to_context(yield_ctx);
    }

    void wannabe_coroutine() {
      std::cout << "I wanna be a coroutine when I grow my stack space up\n";
      yield();
      std::cout << "Hooray!\n";
      yield();
    }
Notice that we’re also reassigning the variable with the return value provided by jump_to_context. This assignment is not executed until the control flow comes back to the yield() function:
    .. save
    /* store RSP (pointing to context-data) in RAX */
    movq  %rsp, %rax
    .. restore
This is a cooperative behavior example: each jump_to_context() invocation from this point onward actually returns fcontext data for the previous invocation.
The rest of the code bounces back and forth through the contexts resulting in the following sequence:
At the end of the day the stack is freed (sounds weird to say) and the program terminates.
Exercise: wrapping up
As a didactic exercise (i.e. do not EVER use this code in a production environment) we can use some metaprogramming to wrap our coroutines and avoid polluting our code with stack adjustments and cleanup boilerplate. Eventually we’d like to end up with code like this:

    int g_fib = -1;

    void fibonacci_gen() {
      int first = 1;
      g_fib = first;
      yield();

      int second = 1;
      g_fib = second;
      yield();

      for (int i = 0; i < 9; ++i) {
        int third = first + second;
        first = second;
        second = third;
        g_fib = third;
        yield();
      }
    }

    int main() {
      // Outputs the first 10 Fibonacci numbers
      coroutine<void(void)> coro(fibonacci_gen);

      for (int i = 0; i <= 10; ++i) {
        coro.resume();
        if(i)
          std::cout << g_fib << " ";
      }
    }
To do this we create a templated wrapper class:
    template <typename Signature> class coroutine;
that will handle the stack and trampoline setup for us. One difference from the first example is the need for a wrapper that will handle the trampoline invocation (shields us from implementation-dependent issues):
    template <typename Signature>
    void call_member_trampoline(coroutine<Signature> *instance, fcontext_t ctx) {
      instance->trampoline(ctx);
    }
The trampoline is therefore modified as follows:
    void trampoline(fcontext_t ctx) {
      size_t index = yield_ctx.size() - 1;
      yield_ctx[index] = jump_to_context(this, ctx);
      this->m_fn();
    }
The only difference in the jump_to_context() function is in handling its new arity:
    /* restore RSP (pointing to context-data) from RSI */
    movq  %rsi, %rsp
and the promotion of %rdi from scratch-register to parameter-register (since we’re directly jumping to a destination context’s RIP).
The rest of the code remains largely unchanged.
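For readers who want something to compile without the assembly routine, here is a hedged sketch of a similar wrapper built on POSIX ucontext instead of jump_to_context. It is not the class described in this article: `simple_coroutine`, `first_fibonacci` and the static `active_` pointer are illustrative assumptions, the code assumes Linux with glibc, and it is single-threaded only. Unlike the version above, the first resume() runs the body straight up to its first yield, so no `if(i)` check is needed.

```cpp
#include <ucontext.h>
#include <functional>
#include <utility>
#include <vector>

// Illustrative wrapper: owns the stack and hides the context juggling.
class simple_coroutine {
public:
    explicit simple_coroutine(std::function<void(simple_coroutine&)> fn)
        : fn_(std::move(fn)), stack_(64 * 1024) {
        getcontext(&callee_);
        callee_.uc_stack.ss_sp = stack_.data();
        callee_.uc_stack.ss_size = stack_.size();
        callee_.uc_link = &caller_;  // resumed when the body returns
        makecontext(&callee_, &simple_coroutine::trampoline, 0);
    }
    // The saved contexts refer to this object, so it must not be copied.
    simple_coroutine(const simple_coroutine&) = delete;
    simple_coroutine& operator=(const simple_coroutine&) = delete;

    // Caller side: run the body until its next yield() or until it returns.
    void resume() {
        if (done_) return;
        active_ = this;
        swapcontext(&caller_, &callee_);
    }
    // Coroutine side: give control back to whoever called resume().
    void yield() { swapcontext(&callee_, &caller_); }
    bool done() const { return done_; }

private:
    // makecontext only accepts a plain function, so the instance is
    // handed over through a static pointer set by resume().
    static void trampoline() {
        simple_coroutine* self = active_;
        self->fn_(*self);
        self->done_ = true;
    }
    static simple_coroutine* active_;
    std::function<void(simple_coroutine&)> fn_;
    std::vector<char> stack_;
    ucontext_t caller_;
    ucontext_t callee_;
    bool done_ = false;
};
simple_coroutine* simple_coroutine::active_ = nullptr;

// Collects the first `count` Fibonacci numbers from a generator coroutine.
std::vector<int> first_fibonacci(int count) {
    int current = 0;  // plays the role of g_fib
    simple_coroutine coro([&current](simple_coroutine& self) {
        int first = 1, second = 1;
        for (;;) {
            current = first;
            self.yield();
            int third = first + second;
            first = second;
            second = third;
        }
    });
    std::vector<int> out;
    for (int i = 0; i < count; ++i) {
        coro.resume();
        out.push_back(current);
    }
    return out;
}
```

The generator never returns, so it is simply abandoned when `coro` goes out of scope. That is harmless here because its frame only holds integers, but a real implementation would have to unwind the coroutine stack.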
Back to boost::context
If you’ve followed through the entire article and you made it here, you should by now know what the following boost::context program does:
    #include <boost/context/all.hpp>
    #include <iostream>
    #include <array>

    namespace ctx = boost::context::detail;

    class Coroutine {
    public:
      Coroutine() {
        my_context = ctx::make_fcontext(
          stack.data() + stack.size(),
          stack.size(),
          &Coroutine::dispatch
        );
      }
      virtual ~Coroutine() {}

      void operator()() {
        auto transfer_ctx = ctx::jump_fcontext(my_context, this);
        my_context = transfer_ctx.fctx;
      }

    protected:
      void yield() {
        auto transfer_ctx = ctx::jump_fcontext(yield_context, 0);
        my_context = transfer_ctx.fctx;
      }

      virtual void call() = 0;

    private:
      static void dispatch(ctx::transfer_t coroutine_ptr) {
        Coroutine *coroutine = reinterpret_cast<Coroutine *>(coroutine_ptr.data);
        coroutine->yield_context = coroutine_ptr.fctx;
        coroutine->call();
        while(true)
          coroutine->yield();
      }

    private:
      ctx::fcontext_t my_context;
      ctx::fcontext_t yield_context;
      std::array<intptr_t, 66 * 1024> stack;
    };

    struct A : public Coroutine {
      void call() {
        std::cout << " __________________________________ " << std::endl;
        yield();
        std::cout << "| _ _ |_| _| |_ |" << std::endl;
        yield();
        std::cout << "| |_| |_| | | | |_ |" << std::endl;
      }
    };

    struct B : public Coroutine {
      void call() {
        std::cout << "| |" << std::endl;
        yield();
        std::cout << "| _| |_ _| |_ _ |_ _| |" << std::endl;
        yield();
        std::cout << "| |_| |___| |" << std::endl;
        yield();
        std::cout << "| |" << std::endl;
        yield();
        std::cout << "|__________________________________|" << std::endl;
      }
    };

    struct C : public Coroutine {
      void call() {
        std::cout << "| _ _ |" << std::endl;
        yield();
        std::cout << "| |_ _| |_ _| | | | | |" << std::endl;
      }

      void operator++(int) {
        std::cout << "| ++It - The Italian C++ Community |" << std::endl;
        std::cout << "|__________________________________|" << std::endl;
      }
    };

    int main() {
      A a;
      B b;
      C c;
      for (size_t i = 0; i<10; ++i) {
        a();
        b();
        c();
      }
      c++; // An entire operator overloading to write 'c++'? Worth it!
    }
Final words
Coroutines provide a powerful abstraction: they offer the same level of concurrency one would get with asynchronous callbacks while making it possible to write more maintainable code. At the time of writing this article boost offers the context, coroutine and fiber libraries. As we’ve seen, boost::context provides the foundation for userland context switches; boost::coroutine2 offers coroutine support (coroutines conceptually need no synchronization whatsoever, since they implement nonpreemptive cooperative multitasking); and boost::fiber builds on boost::context to add a scheduling mechanism: each time a fiber yields, control is given back to a scheduler manager which decides the next execution strategy (cf. N4024).
As usual it is up to the programmer to carefully choose which abstractions are to be used in a specific context.
References and credits
- The boost library
- Coroutine implementation
- Coroutines in C
- N3985
- N4024
- Library Foundations for Asynchronous Operations
Special thanks to the ++it community and Oliver Kowalke for providing insights and reviews of parts of this article.