Coroutines Internals

Article's code is available on Github

What are coroutines and why should I care?

In The Art of Computer Programming, Donald Knuth introduced coroutines as an alternative to the usual caller/callee function idiom: instead of one routine being subordinate to the other, two pieces of code are treated as cooperating equals.
Coroutines can be thought of as language-level constructs that generalize subroutines by providing multiple entry/exit points. A normal subroutine has a single starting point and one or more exit (return) points. A coroutine can suspend and re-enter its control flow at different spots, allowing for greater code expressiveness, preservation of automatic state across invocations and nonpreemptive multitasking.

It has to be noted that different programming languages can provide various levels of support for coroutines, e.g.

  • Languages supporting the yield keyword
  • Languages providing full support for async, await, yield

In this article we’ll focus on the former.

Thoughtful use of coroutines can lead to cleaner and more maintainable code in a variety of situations. As a motivating example, consider the following pseudocode

function start_accepting() {
  socket.async_accept(accept_handler);
}

function accept_handler() {
  start_accepting(); // keep accepting further connections
  socket.async_read(read_handler);
}

function read_handler(data) {
  request = parse_data(data);
  switch(request) {
    case SEND_DATA: {
      data_to_send = prepare_data_for(request);
      socket.async_write(data_to_send, write_handler);
    } break;
  };
}

function write_handler() {
  ... // continue execution
}

Asynchronous programming is often the preferred way of performing potentially blocking operations without stalling a thread on blocking calls. In the pseudocode above we're assuming (and omitting for clarity's sake) that all operations are queued and handled by an event loop manager, a common and powerful idiom in asynchronous application programming (cf. boost::asio).

Coroutines allow modeling the same behavior with more readable code

coroutine acceptor() {
  while(true) {
    socket.async_accept_connection(yield); // To event manager
    start_coroutine(serve_request);
  }
}

coroutine serve_request() {
  socket.async_read(data, yield);
  request = parse_data(data);
  switch(request) {
    case SEND_DATA: {
      data_to_send = prepare_data_for(request);
      socket.async_write(data_to_send, yield);
      ... // Continue execution
    } break;
  };
}

 

The code in serve_request() uses a sequential-looking paradigm

Coroutines let you create a structure that mirrors the actual program logic. Asynchronous operations don’t split functions, because there are no handlers to define what should happen when an asynchronous operation completes. Instead of having handlers call each other, the program can use a sequential structure.

(boost.asio-coroutines)
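For the curious, here is a rough sketch of how the pseudocode above could look with Boost.Asio's stackful coroutines (boost::asio::spawn plus a yield_context). This is not the article's code: the port, buffer size and echo-like behavior are illustrative only, error handling is omitted, and it requires linking against Boost.System and Boost.Coroutine:

#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <array>
#include <memory>

using boost::asio::ip::tcp;

int main() {
  boost::asio::io_service io;

  // Acceptor coroutine: keeps accepting and spawns one coroutine per connection
  boost::asio::spawn(io, [&io](boost::asio::yield_context yield) {
    tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), 8080));
    for (;;) {
      auto socket = std::make_shared<tcp::socket>(io);
      acceptor.async_accept(*socket, yield); // suspends until a client connects

      // Per-connection coroutine: sequential-looking read/write logic
      boost::asio::spawn(io, [socket](boost::asio::yield_context yield) {
        std::array<char, 1024> data;
        std::size_t n = socket->async_read_some(boost::asio::buffer(data), yield);
        boost::asio::async_write(*socket, boost::asio::buffer(data, n), yield);
      });
    }
  });

  io.run();
}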

 

Standard support

At the time of writing, coroutines didn't make it into the next standard version (C++17), although recent MSVC versions already ship with an /await switch to test experimental support for coroutines.

Coroutines internals

It is important to understand the role of coroutines in providing cooperative, non-preemptive multitasking: spawning a coroutine does not spawn a new thread of execution; rather, a coroutine explicitly relinquishes execution by yielding to its caller (asymmetric coroutines) or to another coroutine (symmetric coroutines).

Since coroutines have been known for a relatively long time, many different implementation techniques (both at language level and at system level) have been devised (an interesting suggested read: Duff's-device-based coroutines).

Implementing basic support for asymmetric stackful coroutines can be a rewarding experience for understanding how these program flow constructs interact with their callers and with the underlying memory. Most of the code that will be presented is a pared-down version of the coroutine implementation by Oliver Kowalke (cf. boost::coroutine2) available with the boost libraries.

Abstracting away the execution context

In order to implement a coroutine context switch (in its simplest form from the callee to the caller) we need a mechanism to abstract the execution state of our routines and to save/restore stack, registers, CPU flags, etc.

A context switch between threads on x86_64 can be quite costly in terms of performance since it also involves a trip through kernel syscalls. Coroutine context switches within the same thread are far more lightweight and require no kernel interaction. The foundation bricks for userland context switches are contained in the boost::context library. Its facilities provide a solid abstraction offering ready-to-use stacks (either plain malloc'd stack buffers or even split stacks on supported architectures) to store our context data, and they can complement system-level constructs (e.g. ucontext on Unix systems, Fibers on Windows). It has to be noted that boost::context used to support Windows Fibers; because of undocumented TEB-swapping related issues, that support was dropped (even after fixes were deployed) with the introduction of execution_context v2.

In this article we'll go for an fcontext implementation in assembly on an x86_64 Unix system (no system calls involved).

Saving the state

When a coroutine yields, an fcontext switch should occur, i.e. we should save whatever state the routine has at that point in time and JMP to another context. On recent Unix systems, calling conventions, object and executable file formats and other low-level ABI issues are defined by the System V ABI. On the x86_64 architecture the stack grows downwards and parameters to functions are passed in registers rdi, rsi, rdx, rcx, r8, r9, plus additional stack space if needed. The stack is always 16-byte aligned before a call instruction is issued. Registers rbx, rsp, rbp, r12, r13, r14 and r15 are preserved across function calls, while rax, rdi, rsi, rdx, rcx, r8, r9, r10 and r11 are scratch registers:

Return value:        rax, rdx
Parameter registers: rdi, rsi, rdx, rcx, r8, r9 (+ additional stack)
Scratch registers:   rax, rdi, rsi, rdx, rcx, r8, r9, r10, r11
Preserved registers: rbx, rsp, rbp, r12, r13, r14, r15

Therefore, following in boost::context's footsteps, a reasonable memory layout is the following:

/****************************************************************************************
 *                                                                                      *
 *  ----------------------------------------------------------------------------------  *
 *  |    0    |    1    |    2    |    3    |    4     |    5    |    6    |    7    |  *
 *  ----------------------------------------------------------------------------------  *
 *  |   0x0   |   0x4   |   0x8   |   0xc   |   0x10   |   0x14  |   0x18  |   0x1c  |  *
 *  ----------------------------------------------------------------------------------  *
 *  |        R12        |         R13       |         R14        |        R15        |  *
 *  ----------------------------------------------------------------------------------  *
 *  ----------------------------------------------------------------------------------  *
 *  |    8    |    9    |   10    |   11    |    12    |    13   |    14   |    15   |  *
 *  ----------------------------------------------------------------------------------  *
 *  |   0x20  |   0x24  |   0x28  |  0x2c   |   0x30   |   0x34  |   0x38  |   0x3c  |  *
 *  ----------------------------------------------------------------------------------  *
 *  |        RBX        |         RBP       |         RIP        |       EXIT        |  *
 *  ----------------------------------------------------------------------------------  *
 *                                                                                      *
 ****************************************************************************************/

The EXIT field is unused for our purposes, but it is kept in place anyway.

The first thing we need is to allocate space to store the context data and make sure it has a valid alignment for the architecture we’re dealing with

// Allocate context-stack space
context_stack = (void*)malloc(64_Kb);

std::size_t space = UNIX_CONTEXT_DATA_SIZE + 64;
sp = static_cast<char*>(context_stack) + 64_Kb - space;

sp = std::align(64, UNIX_CONTEXT_DATA_SIZE, sp, space);
assert(sp != nullptr && space >= UNIX_CONTEXT_DATA_SIZE);
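
The 64_Kb user-defined literal and the UNIX_CONTEXT_DATA_SIZE constant are defined in the article's full sources; plausible definitions, inferred from the memory layout shown above rather than copied from the repository, would be:

// Inferred helpers (not the article's exact code)
constexpr std::size_t operator"" _Kb(unsigned long long n) { return n * 1024; }

constexpr std::size_t UNIX_CONTEXT_DATA_SIZE       = 0x40; // 8 slots of 8 bytes: R12..R15, RBX, RBP, RIP, EXIT
constexpr std::size_t UNIX_CONTEXT_DATA_RIP_OFFSET = 0x30; // offset of the RIP slot in the layout above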

boost::context offers both memory-efficient, on-demand growing stacks and fixed-size stack allocations (cf. the boost docs). In this example code we'll go for a fixed stack allocation.
Since we can't deal with registers directly in C++ we'll have to fall back on a pure assembly routine; the GAS assembler is the logical tool of choice for this work. We therefore declare an external function with C linkage to link against our executable:

extern "C" fcontext_t jump_to_context(fcontext_t context);

What is an fcontext_t? In the x86_64 world it is just the content of a register, i.e. a pointer-sized value:

using fcontext_t = void*;

Luckily for us, RIP will already have been saved by the CALL instruction we use to invoke jump_to_context, so we get an instruction pointer on the stack for free in our assembly code:

.text
.globl jump_to_context
.type jump_to_context,@function
.align 16
jump_to_context:
    pushq  %rbp  /* save RBP */
    pushq  %rbx  /* save RBX */
    pushq  %r15  /* save R15 */
    pushq  %r14  /* save R14 */
    pushq  %r13  /* save R13 */
    pushq  %r12  /* save R12 */

    /* omissis */

    /* restore RSP (pointing to context-data) from RDI */
    movq  %rdi, %rsp

    popq  %r12  /* restore R12 */
    popq  %r13  /* restore R13 */
    popq  %r14  /* restore R14 */
    popq  %r15  /* restore R15 */
    popq  %rbx  /* restore RBX */
    popq  %rbp  /* restore RBP */

    /* continue... */

.size jump_to_context,.-jump_to_context

/* Mark that we don't need executable stack. */
.section .note.GNU-stack,"",%progbits

Using CMake putting everything together becomes quite easy:

cmake_minimum_required(VERSION 2.8.12)
project(simple_crts CXX ASM)
set (CMAKE_CXX_STANDARD 14)

set_source_files_properties(jump_to_context_x86_64_elf_gas.S 
                            PROPERTIES COMPILE_FLAGS "-x assembler-with-cpp")

add_executable(simple_crts simple_crts.cpp jump_to_context_x86_64_elf_gas.S)
target_link_libraries(simple_crts ${CONAN_LIBS} pthread)

Trampolines to coroutines

Something is missing at this point: we need a valid RIP pointer to the coroutine to jump to. We could enter the coroutine and have another function store this information for us, but there’s a better way which avoids cluttering the coroutine code entirely: using a trampoline function.

Just as in boost::context, we define a trampoline function ourselves which, when jumped to, re-jumps to the caller and saves its context as a pre-stage for the coroutine:

void trampoline(fcontext_t ctx) {

  yield_ctx = jump_to_context(ctx);

  wannabe_coroutine();
}
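
The snippets in this section rely on a handful of file-scope variables to keep the walkthrough readable. A plausible, simplified set of declarations (the full sources organize this differently) is:

// Simplified globals assumed by the snippets in this section
void*      context_stack = nullptr; // heap block used as the coroutine's stack
void*      sp            = nullptr; // aligned pointer into context_stack where the context data lives
fcontext_t yield_ctx     = nullptr; // context to jump to in order to yield back to the caller
fcontext_t coroutine_ctx = nullptr; // context to jump to in order to resume the coroutine

void wannabe_coroutine(); // the routine we want to drive as a coroutine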

What we have to do now is a simplified version of the make_context routine: set the initial RIP to point at the trampoline's prologue:

// Do make_context's work (simplified)
// Do *NOT* try this at home (or even worse in the office)
void** addr = reinterpret_cast<void**>(static_cast<char*>(sp) +
                                         UNIX_CONTEXT_DATA_RIP_OFFSET);
*addr = reinterpret_cast<void*>(&trampoline);

// In a more complex case there might be additional initialization and
// frame adjustments going on
coroutine_ctx = jump_to_context(sp);

So right now we have a valid trampoline RIP set in place:

/****************************************************************************************
 *                                                                                      *
 *  ----------------------------------------------------------------------------------  *
 *  |    0    |    1    |    2    |    3    |    4     |    5    |    6    |    7    |  *
 *  ----------------------------------------------------------------------------------  *
 *  |   0x0   |   0x4   |   0x8   |   0xc   |   0x10   |   0x14  |   0x18  |   0x1c  |  *
 *  ----------------------------------------------------------------------------------  *
 *  |        R12        |         R13       |         R14        |        R15        |  *
 *  ----------------------------------------------------------------------------------  *
 *  ----------------------------------------------------------------------------------  *
 *  |    8    |    9    |   10    |   11    |    12    |    13   |    14   |    15   |  *
 *  ----------------------------------------------------------------------------------  *
 *  |   0x20  |   0x24  |   0x28  |  0x2c   |   0x30   |   0x34  |   0x38  |   0x3c  |  *
 *  ----------------------------------------------------------------------------------  *
 *  |        RBX        |         RBP       |         RIP        |       EXIT        |  *
 *  ----------------------------------------------------------------------------------  *
 *                                           ^^^^^^^^^^^^^^^^^^^^                       *
 ****************************************************************************************/

This kickstarts the bouncing to/from the trampoline:

.text
.globl jump_to_context
.type jump_to_context,@function
.align 16
jump_to_context:
    pushq  %rbp
    pushq  %rbx 
    pushq  %r15 
    pushq  %r14 
    pushq  %r13 
    pushq  %r12 

    /* store RSP (pointing to context-data) in RAX */
    movq  %rsp, %rax

    movq  %rdi, %rsp

    popq  %r12 
    popq  %r13 
    popq  %r14 
    popq  %r15 
    popq  %rbx 
    popq  %rbp 

    /* restore return-address (must have been put on the new stack) */
    popq  %r8

    /*
       pass the old context as first parameter (if we're headed
       towards a landing function)
    */
    movq  %rax, %rdi

    /* indirect jump to new context */
    jmp  *%r8

.size jump_to_context,.-jump_to_context
.section .note.GNU-stack,"",%progbits

It is important to note that we're keeping the stack aligned during this entire process (recall that the stack has to be 16-byte aligned before a call instruction is issued).

The process roughly goes on like this:
[Figure: coroutines_graph1]

It has to be noted that the trampoline function might reserve stack space for its parameters as well. In the code above we allocated 64 Kb of heap space to be used as stack space for context operations, so after the first jump the sp automatic variable is no longer reliable; coroutine_ctx should be used instead.

Resuming fcontext

Resuming the trampoline's fcontext requires another call, i.e. another RIP save and a stack pointer switch to coroutine_ctx. The trampoline's old RIP will be available for free once we've restored the first 48 bytes of the fcontext.

Execution can afterwards continue into the designated coroutine. At this point the coroutine should somehow be given access to the yield_ctx context pointer: that is the gateway to our caller's context (in the asymmetric view).

Each time we want to yield execution back to the caller we’ll have to jump_to_context to the yield_ctx:

void yield() {
  yield_ctx = jump_to_context(yield_ctx);
}

void wannabe_coroutine() {
  std::cout << "I wanna be a coroutine when I grow my stack space up\n";
  yield();
  std::cout << "Hooray!\n";
  yield();
}

Notice that we’re also reassigning the variable with the return value provided by jump_to_context. This assignment is not executed until the control flow comes back to the yield() function:

.. save

/* store RSP (pointing to context-data) in RAX */
movq  %rsp, %rax

.. restore

This is a cooperative behavior example: each jump_to_context() invocation from this point onward actually returns fcontext data for the previous invocation.

The rest of the code bounces back and forth through the contexts resulting in the following sequence:

[Figure: coroutines_graph2]

At the end of the day the stack is freed (sounds weird to say) and the program terminates.
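
Putting the pieces together, a minimal driver could look like the following sketch. It assumes the globals, the inferred helpers and the assembly routine shown above are linked in, and it performs exactly two resumptions because wannabe_coroutine() yields twice:

#include <cassert>
#include <cstdlib>
#include <memory> // std::align

int main() {
  // Allocate and align the context-stack area (as shown earlier)
  context_stack = std::malloc(64_Kb);
  std::size_t space = UNIX_CONTEXT_DATA_SIZE + 64;
  sp = static_cast<char*>(context_stack) + 64_Kb - space;
  sp = std::align(64, UNIX_CONTEXT_DATA_SIZE, sp, space);
  assert(sp != nullptr);

  // Plant the trampoline's address into the RIP slot (simplified make_context)
  void** rip_slot = reinterpret_cast<void**>(static_cast<char*>(sp) +
                                             UNIX_CONTEXT_DATA_RIP_OFFSET);
  *rip_slot = reinterpret_cast<void*>(&trampoline);

  // First jump: enters the trampoline, which immediately bounces back here
  coroutine_ctx = jump_to_context(sp);

  // Each subsequent jump resumes the coroutine up to its next yield()
  coroutine_ctx = jump_to_context(coroutine_ctx); // prints the first message
  coroutine_ctx = jump_to_context(coroutine_ctx); // prints "Hooray!"

  std::free(context_stack);
}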

Exercise: wrapping up

As a didactic exercise (i.e. do not EVER use this code in a production environment) we can use some metaprogramming to wrap our coroutines and avoid polluting our code with stack adjustments and cleanup boilerplate. Eventually we’d like to end up with code like this

int g_fib = -1;

void fibonacci_gen() {

  int first = 1;
  g_fib = first;
  yield();

  int second = 1;
  g_fib = second;
  yield();

  for (int i = 0; i < 9; ++i) {

    int third = first + second;
    first = second;
    second = third;
    g_fib = third;
    yield();

  }

}

int main() {

  // Outputs the first 10 Fibonacci numbers

  coroutine<void(void)> coro(fibonacci_gen);

  for (int i = 0; i <= 10; ++i) {
      coro.resume();

    if(i) std::cout << g_fib << " ";
  }

}

To do this we create a templated wrapper class:

template <typename Signature> class coroutine;

that will handle the stack and trampoline setup for us. One difference from the first example is the need for a wrapper that will handle the trampoline invocation (shields us from implementation-dependent issues):

template <typename Signature>
void call_member_trampoline(coroutine<Signature> *instance, fcontext_t ctx) {
  instance->trampoline(ctx);
}

The trampoline is therefore modified as follows:

void trampoline(fcontext_t ctx) {

  size_t index = yield_ctx.size() - 1;
  yield_ctx[index] = jump_to_context(this, ctx);

  this->m_fn();
}

The only difference in the jump_to_context() function is in handling its new arity:

/* restore RSP (pointing to context-data) from RSI */
movq  %rsi, %rsp

and the promotion of %rdi from scratch-register to parameter-register (since we’re directly jumping to a destination context’s RIP).

The rest of the code remains largely unchanged.

Back to boost::context

If you've followed through the entire article and made it here, you should by now know what the following boost::context program does:

#include <boost/context/all.hpp>
#include <iostream>
#include <array>

namespace ctx = boost::context::detail;

class Coroutine {
public:
  Coroutine() {
    my_context = ctx::make_fcontext(
      stack.data() + stack.size(),
      stack.size(),
      &Coroutine::dispatch
    );
  }
  virtual ~Coroutine() {}

  void operator()() {
    auto transfer_ctx = ctx::jump_fcontext(my_context, this);
    my_context = transfer_ctx.fctx;
  }

protected:
  void yield() {
    auto transfer_ctx = ctx::jump_fcontext(yield_context, 0);
    my_context = transfer_ctx.fctx;
  }

  virtual void call() = 0;

private:
  static void dispatch(ctx::transfer_t coroutine_ptr) {
    Coroutine *coroutine = reinterpret_cast<Coroutine *>(coroutine_ptr.data);
    coroutine->yield_context = coroutine_ptr.fctx;
    coroutine->call();
    while(true)
      coroutine->yield();
  }

private:
  ctx::fcontext_t my_context;
  ctx::fcontext_t yield_context;
  std::array<intptr_t, 66 * 1024> stack;
};

struct A : public Coroutine {
  void call() {
    std::cout << " __________________________________ " << std::endl;
    yield();
    std::cout << "|    _       _       |_|    _| |_  |" << std::endl;
    yield();
    std::cout << "|   |_|     |_|      | |     | |_  |" << std::endl;
  }
};

struct B : public Coroutine {
  void call() {
    std::cout << "|                                  |" << std::endl;
    yield();
    std::cout << "|  _| |_   _| |_      _    |_   _| |" << std::endl;
    yield();
    std::cout << "|                    |_|     |___| |" << std::endl;
    yield();
    std::cout << "|                                  |" << std::endl;
    yield();
    std::cout << "|__________________________________|" << std::endl;
  }
};

struct C : public Coroutine {
  void call() {
    std::cout << "|                     _       _    |" << std::endl;
    yield();
    std::cout << "| |_   _| |_   _|    | |     | |   |" << std::endl;
  }

  void operator++(int) {
    std::cout << "| ++It - The Italian C++ Community |" << std::endl;
    std::cout << "|__________________________________|" << std::endl;
  }
};


int main() {

  A a;
  B b;
  C c;
  for (size_t i = 0; i<10; ++i) {
    a();
    b();
    c();
  }

  c++; // An entire operator overloading to write 'c++'? Worth it!
}

Final words

Coroutines provide a powerful abstraction that offers the same level of concurrency one would get with asynchronous callbacks while at the same time giving us a chance to write more maintainable code. At the time of writing, boost offers the context, coroutine and fiber libraries. As we've seen, boost::context provides the foundation for userland context switches; boost::coroutine2 offers coroutine support (which conceptually needs no synchronization whatsoever, since coroutines implement nonpreemptive cooperative multitasking); and boost::fiber builds on boost::context to add a scheduling mechanism: each time a fiber yields, control is given back to a scheduler which decides the next execution strategy (cf. N4024).
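
To get a feel for the scheduling flavor boost::fiber adds on top of boost::context, here is a minimal sketch (not taken from the article's repository; it requires linking against Boost.Fiber and Boost.Context). Each yield() hands control back to the fiber scheduler, which picks the next runnable fiber:

#include <boost/fiber/all.hpp>
#include <iostream>

int main() {
  boost::fibers::fiber ping([] {
    for (int i = 0; i < 3; ++i) {
      std::cout << "ping\n";
      boost::this_fiber::yield(); // back to the scheduler
    }
  });
  boost::fibers::fiber pong([] {
    for (int i = 0; i < 3; ++i) {
      std::cout << "pong\n";
      boost::this_fiber::yield();
    }
  });
  ping.join(); // the two fibers roughly alternate: ping pong ping pong ...
  pong.join();
}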

As usual it is up to the programmer to carefully choose which abstractions are to be used in a specific context.

References and credits

Special thanks to the ++it community and Oliver Kowalke for providing insights and reviews of parts of this article.

Unicode, localization and C++ support

This document doesn't attempt to be yet another Unicode article, but rather targets the fundamental points that should be of interest to a C++ programmer diving into the new standards and into the Unicode subject, either for the first time or as a refresher. It isn't by any means a complete source of information on Unicode (there's no mention of versioning, BOM sequences or other advanced topics like cluster composition algorithms or techniques for addressing language-specific drawbacks), but it is meant to provide insights into features that might be relevant for a programmer.

What is Unicode

Unicode is an industry standard to encode text independently of language, platform or program. The idea is to assign a unique number (called a code point) to each character used in writing. E.g. the code point for the Latin capital letter A is 0041, and to indicate that this is a Unicode code point a U+ prefix is commonly added: U+0041. One might notice that the same number is also the hex value for the character A in the ASCII table. This is by design, since the first 256 code points were made identical to the content of the ISO-8859-1 ASCII-based encoding standard in order to make it easy to port existing Western texts to Unicode. Attention has to be paid to an often misunderstood aspect: Unicode was originally designed to use 2 bytes for its code points, but this is no longer the case (e.g. emoji characters have been added to the Unicode pages and mapped from code point U+1F600 onward, which cannot be represented with just 2 bytes).

The takeaway from this paragraph is that Unicode is a standard way to assign a unique number, called a code point, to each writing character.

Implementing unicode

Since Unicode maps each character to a (potentially very large) number, different encodings were developed to map this number (the code point) to a sequence that can be efficiently transmitted or stored. Business interests and technical arguments caused standardization issues, and therefore several different encodings came into use.

UCS

UCS stands for Universal Coded Character Set and it was originally a standard with goals similar to those of the Unicode consortium. Nowadays efforts have been made to (more or less) synchronize Unicode and UCS. Some of the original encodings are still used though: for instance, what is now called UCS-2 consists of encoding a code point in two bytes. UCS-2 is now deprecated since, again, 2 bytes cannot suffice to represent every code point in Unicode. Another UCS encoding is UCS-4 (now also called UTF-32), which uses 32 bits per code point (and this suffices to represent Unicode code points).

UTF

UTF stands for Unicode Transformation Format. Instead of a fixed-length encoding that uses the same number of bytes per code point, as UCS-2 or UCS-4 do, many UTF encodings prefer a variable-width encoding. UTF-8 is one of the most used and best-known variable-length encodings. It uses 8-bit code units (the code unit is the basic unit of an encoded sequence; in UTF-8 it corresponds to one byte, which encodes part of the character's code point together with some encoding-specific marker bits), so for example the "Hello" string in UTF-8 would be stored as:

 Hello
 0x48 0x65 0x6C 0x6C 0x6F

which, again, is exactly the byte sequence you would get by translating the string through the ASCII table (5 bytes). Characters whose code points are below U+0080 are represented with a single byte; these are also the first 128 ASCII characters. If the code point is higher, as in the case of the € euro sign U+20AC, the most significant bit of every byte in the sequence is set to 1.

 €
 0xE2        0x82        0xAC
 1110 0010   10 000010   10 101100
 ^^^^        ^^          ^^
 bits not part of the character codepoint but only related to the encoding

In this case three bytes are necessary to represent the € character in UTF-8. The number of bytes needed is prescribed by the specification, and decoding a sequence is a straightforward procedure.
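
To make the bit layout above concrete, here is a small hand-rolled encoder (a didactic sketch only: it performs no validation, ignores surrogate values and real code should rely on a proper library or on the standard facilities shown later):

#include <iostream>
#include <string>

// Encode a single code point into UTF-8, following the scheme described above
std::string encode_utf8(char32_t cp) {
  std::string out;
  if (cp < 0x80) {                      // 1 byte:  0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                              // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}

int main() {
  for (unsigned char c : encode_utf8(0x20AC)) // €
    std::cout << std::hex << static_cast<int>(c) << ' '; // e2 82 ac
}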

UTF-8 therefore takes a variable number of bytes (1 up to 4 by design) to encode a code point. UTF-16 is also variable-length: it uses one or two 16-bit code units and was originally developed as a successor to the now obsolete UCS-2 (code points in the so-called BMP plane can be represented by UCS-2; the other code points cannot, and in UTF-16 they are encoded with two 16-bit code units commonly referred to as a surrogate pair). UTF-32 uses 32 bits per code unit, but since 4 bytes is also the defined maximum needed to encode a code point, this effectively makes UTF-32 a fixed-length encoding. Its usage is discouraged by the W3C.

C++11 and Unicode support

Since the C++11 standard some additional Unicode facilities have been integrated into the language. The fundamental C++ type wchar_t is dedicated to storing any supported code unit (usually 32 bits on systems that support Unicode, with the exception of Windows, where it is 16 bits). C++11 introduced char16_t and char32_t, types large enough to store UTF-16 code units (2 bytes each) and UTF-32 code units (4 bytes each). The char type remains dedicated to whatever representation can be most efficiently processed on the system and, on a machine where char is 8 bits, it is used for 8-bit code units. One should not assume that a char sequence contains plain ASCII characters, but rather treat it as containing 8-bit code units.

The header <string> provides useful typedefs to work with specializations of the base string class template:

std::string              std::basic_string<char>
std::wstring             std::basic_string<wchar_t>
std::u16string (C++11)   std::basic_string<char16_t>
std::u32string (C++11)   std::basic_string<char32_t>

Conversions between a byte string std::string and a wide string std::basic_string<wchar_t> (std::wstring) are also supported natively starting with C++11. An example follows:

#include <iostream>
#include <codecvt>
#include <string>
#include <locale>
#include <iomanip>

int main() {
  using namespace std;

  ios_base::sync_with_stdio(false); // Avoids synchronization with C stdio on gcc         
                                    // (either localize both or disable sync)

  wcout.imbue(locale("en_US.UTF-8")); // change default locale

  unsigned char euroUTF8[] = { 0xE2, 0x82, 0xAC, 0x00 }; // Euro sign UTF8

  wstring_convert<codecvt_utf8<wchar_t>> converter_UTF8_wchar;
  wstring euroWideStr = converter_UTF8_wchar.from_bytes((char*)euroUTF8);
  wcout << euroWideStr << endl;

  string euroNarrowStr = converter_UTF8_wchar.to_bytes(euroWideStr);
  cout << euroNarrowStr << endl;
}

A locale is an immutable set of so-called facets that help writing locale-aware code (i.e. code that renders some features in a way specific to a geographic area / culture). Examples are formatting time and date in a specific format (US or EU) or parsing currency. Each feature is represented by a facet class that encapsulates the locale-specific logic.

In the code above the default locale for the wcout global object (output for wide strings) is changed to English – US region with UTF-8 code pages.

After the setup, a sequence of UTF-8 encoded bytes representing the € euro character's code point is stored into an array.

The class template std::wstring_convert accepts a code conversion facet to perform the conversion to a wide string. The facets provided by the standard library that are suitable here are std::codecvt_utf8 and std::codecvt_utf8_utf16. std::codecvt_utf8 manages conversions between UTF-8 and UCS-2 and between UTF-8 and UCS-4. In order to understand why these conversions are available, recall that the fundamental type wchar_t is usually 32 bits wide, with the exception of Windows systems (16 bits). std::codecvt_utf8_utf16 provides conversions between UTF-8 and UTF-16.
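
As an aside on the facet mechanism described a few paragraphs above, the same value can be rendered differently simply by imbuing a stream with different locales; date formatting, for instance, goes through the locale's time facets. A small sketch (the locale names are platform dependent and std::locale will throw if they are not installed):

#include <ctime>
#include <iomanip>
#include <iostream>
#include <locale>

int main() {
  std::time_t t = std::time(nullptr);
  std::tm tm = *std::localtime(&t);

  // Same date, rendered according to two different locales
  std::cout.imbue(std::locale("en_US.UTF-8"));
  std::cout << std::put_time(&tm, "%x") << '\n'; // e.g. 04/20/2016

  std::cout.imbue(std::locale("de_DE.UTF-8"));
  std::cout << std::put_time(&tm, "%x") << '\n'; // e.g. 20.04.2016
}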

Starting from C++11, new string literal prefixes were also added to specify the encoding and type of a literal: L for wchar_t, u8 for UTF-8, u for UTF-16 and U for UTF-32. Escape sequences for 16-bit and 32-bit code points were also added.

unsigned char euroUTF8_1[] = { 0xE2, 0x82, 0xAC, 0x00 };
unsigned char euroUTF8_2[] = u8"\U000020AC"; // Null character is always appended to the literal          
assert(memcmp(euroUTF8_1, euroUTF8_2, sizeof(euroUTF8_2) / sizeof(unsigned char)) == 0);

Using a 32 bit escape sequence to encode a 32 bit codepoint into a UTF-16 sequence will generate a sequence of UTF-16 code units that can be later converted or used:

#include <iostream>
#include <codecvt>
#include <string>
#include <locale>
#include <iomanip>

void hex_print(const std::string& s) {
  std::cout << std::hex << std::setfill('0');
  for(unsigned char c : s)
    std::cout << std::setw(2) << static_cast<int>(c) << ' ';
  std::cout << std::dec << '\n';
}

int main() {
  using namespace std;
  ios_base::sync_with_stdio(false);
  cout.imbue(locale("C.UTF-8"));

  u16string globeUTF16 = u"\U0001F30D"; // Globe                                             

  wstring_convert<codecvt_utf8_utf16<char16_t>, char16_t> convert_UTF16_UTF8;

  string utf8Str = convert_UTF16_UTF8.to_bytes(globeUTF16);
  cout << "UTF-8 code units: ";
  hex_print(utf8Str);

  cout << utf8Str << endl;
}

It has to be noted that even when string literals are used, compile-time error checking isn't available, and failures should be expected when doing nonsensical operations like converting to UTF-8 a UCS-2 sequence containing code units in the range 0xD800-0xDFFF: by design, those values are reserved for the surrogate pairs that UTF-16 uses to map code points outside the BMP plane, so they do not represent characters on their own:

char16_t globeUTF16[] = u"\U0001F34C";
wstring_convert<codecvt_utf8<char16_t>, char16_t> convert_UTF8_UCS2;
// std::string u8Str = convert_UTF8_UCS2.to_bytes(globeUTF16); // range_error                  

auto globeCodepoint = (globeUTF16[0] - 0xD800) * 0x400 + globeUTF16[1] - 0xDC00 + 0x10000;

cout << hex << globeCodepoint << endl; // 1F34C

At the time of writing, few compilers actually implement this behavior, while others either completely lack these facets or allow the illegal conversion.

Compiler settings

A few words are worth spending on compiler behavior across different systems and architectures. Some compilers document their default execution character set (gcc has the -fexec-charset option, while MSVC has no command-line equivalent but rather a preprocessor directive

#pragma execution_character_set("utf-8")

). In MSVC this setting is not affected by the "multi-byte character set" or "Unicode" option in the project property pane, which instructs the compiler whether to use the wide-character APIs or not.

Although there are still some rough edges, an ongoing effort to fix standard library implementations and to provide more features and algorithms for operating on Unicode text is being carried out, and support will continue to improve with future C++ versions.

A big thanks to Stefano Saraulli, Alessandro Vergani and the reddit community for reviewing this article.

Folding Expressions

People familiar with the new features C++11 brought to the C++ programming language should know what variadic templates are and why they're important. Variadic templates can have a variable number of parameters of any type:

template <typename... Types> class tuple;

This not only brings type safety to the code but also ensures that all of the variadic argument handling is performed at compile time. Before their introduction, in order to have a template accepting a variable number of template parameters, programmers used to write verbose code like:

template<typename T0>
void function( T0 arg0 );

template<typename T0, typename T1>
void function( T0 arg0, T1 arg1 );

template<typename T0, typename T1, typename T2>
void function( T0 arg0, T1 arg1, T2 arg2 );

template<typename T0, typename T1, typename T2, typename T3>
void function( T0 arg0, T1 arg1, T2 arg2, T3 arg3 );

...

Template parameter packs went hand-in-hand with variadic templates and together with constant expressions they enabled a recursive-like style of coding to create more complex compile-time operations:

#include <iostream>
#include <array>

template<size_t... Is> struct seq{};
template<size_t N, size_t... Is>
struct gen_seq : gen_seq<N-1, N-1, Is...>{};
template<size_t... Is>
struct gen_seq<0, Is...> : seq<Is...>{};

template<size_t N1, size_t... I1, size_t N2, size_t... I2>
// Expansion pack
constexpr std::array<int, N1+N2> concat(const std::array<int, N1>& a1, 
        const std::array<int, N2>& a2, seq<I1...>, seq<I2...>){
  return {{ a1[I1]..., a2[I2]... }};
}

template<size_t N1, size_t N2>
// Initializer for the recursion
constexpr std::array<int, N1+N2> concat(const std::array<int, N1>& a1, 
                                       const std::array<int, N2>& a2){
  return concat(a1, a2, gen_seq<N1>{}, gen_seq<N2>{});
}

int main() {
    constexpr std::array<int, 3> a1 = {{1,2,3}};
    constexpr std::array<int, 2> a2 = {{4,5}};

    constexpr std::array<int,5> res = concat(a1,a2);
    for(int i=0; i<res.size(); ++i)
        std::cout << res[i] << " "; // 1 2 3 4 5

    return 0;
}

Exploiting a constexpr overload of operator[], the code above generates an integer sequence, aggregate-initializes a std::array and concatenates the input arrays at compile time (details of the operations involved are available at this link).

The integer sequence generation construct was later provided by the standard library itself in C++14 with std::integer_sequence.
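
With std::make_index_sequence the compile-time array concatenation shown above can drop the hand-rolled gen_seq machinery entirely. A C++14 sketch:

#include <array>
#include <iostream>
#include <utility>

template<std::size_t N1, std::size_t... I1, std::size_t N2, std::size_t... I2>
constexpr std::array<int, N1+N2> concat(const std::array<int, N1>& a1,
        const std::array<int, N2>& a2, std::index_sequence<I1...>, std::index_sequence<I2...>){
  return {{ a1[I1]..., a2[I2]... }};
}

template<std::size_t N1, std::size_t N2>
constexpr std::array<int, N1+N2> concat(const std::array<int, N1>& a1,
                                        const std::array<int, N2>& a2){
  return concat(a1, a2, std::make_index_sequence<N1>{}, std::make_index_sequence<N2>{});
}

int main() {
    constexpr std::array<int, 3> a1 = {{1,2,3}};
    constexpr std::array<int, 2> a2 = {{4,5}};
    constexpr auto res = concat(a1, a2);
    for (std::size_t i = 0; i < res.size(); ++i)
        std::cout << res[i] << " "; // 1 2 3 4 5
}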

These features allowed new ways to exploit templates; however, parameter packs could only be used and expanded in a strictly defined set of contexts. For instance, something like the following wasn't allowed:

template<typename T>
void printer(T arg) {
    std::cout << arg << " ";
}

template<typename... Args>
static void function(Args &&... args) {
    (printer(std::forward<Args>(args)) , ...);
}

One of the contexts where expansion was allowed, though, was brace-init-lists, therefore workarounds to have parameter packs expanded were immediately deployed:

template<typename T>
void printer(T arg) {
    std::cout << arg << " ";
}

template<typename... Args>
static void function(Args &&... args) {
    // Expand the pack into a brace-init-list while discarding the return
    // values and filling an unused array
    int unusedVar[] = { 0, 
              ( (void) printer(std::forward<Args>(args)), 0) ... };
}


C++17 ~ fold expressions

C++17, scheduled for 2017 at the time of writing, will bring fold expressions into play and significantly broaden the contexts in which parameter packs can be used (cf. the N4191 paper).

As listed in cppreference, at the time of writing, there are four kinds of fold expressions:

  • Unary right fold
    ( pack op ... )
  • Unary left fold
    ( ... op pack )
  • Binary right fold
    ( pack op ... op init )
  • Binary left fold
    ( init op ... op pack )

being their respective expansions:

  • E_1 op (... op (E_(N-1) op E_N))
  • ((E_1 op E_2) op ...) op E_N
  • E_1 op (... op (E_(N-1) op (E_N op init)))
  • (((init op E_1) op E_2) op ...) op E_N

In binary folds the two op operators must be the same, and init represents an expression that does not contain an unexpanded parameter pack (i.e. the initial value for the expanded expression).
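
For instance, a binary right fold over + with an init value of 0 yields a sum that also works for an empty pack (a minimal C++17 sketch):

#include <iostream>

template<typename... Ts>
constexpr auto sum(Ts... args) {
    return (args + ... + 0); // binary right fold: the 0 handles the empty pack
}

int main() {
    std::cout << sum(1, 2, 3, 4) << '\n'; // 10
    std::cout << sum() << '\n';           // 0
}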

With fold expressions writing a printer construct becomes straightforward:

#include <iostream>

template<typename F, typename... T>
void for_each(F fun, T&&... args)
{
    (fun (std::forward<T>(args)), ...);
}

int main() {
     for_each([](auto i) { std::cout << i << " "; }, 4, 5, 6); // 4 5 6
}

The sample above uses fold expressions together with the comma operator to create a simple function that calls the provided lambda for each one of the supplied arguments, with perfect forwarding.

A caveat about the previous example, though: a unary right fold and a unary left fold over the comma operator yield different expressions, but their evaluation order remains the same (left to right), e.g.

#include <iostream>
#include <memory>

template<typename F, typename... T>
void for_each1(F fun, T&&... args)
{
    (fun (std::forward<T>(args)), ...);
}

template<typename F, typename... T>
void for_each2(F fun, T&&... args)
{
    (..., fun (std::forward<T>(args)));
}

int main()
{
     for_each1([](auto i) { std::cout << i << " "; }, 4, 5, 6); // 4 5 6
     std::cout << std::endl;
     for_each2([](auto i) { std::cout << i << " "; }, 4, 5, 6); // 4 5 6
}

It has to be noted that one of the main reasons fold expressions were accepted as a C++17 proposal is their use in concepts:

template <typename T>
  concept bool Integral = std::is_integral<T>::value;

template <Integral... Ts> // A constrained-parameter pack
  void foo(Ts...);

template <typename... Ts>
  requires Integral<Ts>... // error: requirement is ill-formed
void foo(Ts...);

The problem boils down to the same issue we talked about a few paragraphs ago: the parameter pack cannot be expanded in that context. Fold expressions provide an elegant and effective way to deal with this issue instead of resorting to other constexpr machinery to ensure requirements are met:

template<typename... Ts>
  requires (Integral<Ts> && ...)
void foo(Ts...);


 

References and sources:
N4191 – Folding expressions
cppreference
constexpr flatten list of std::array into array
Variadic template pack expansion
Why doesn’t a left fold expression invert the output of a right fold expression

Thanks to Marco Arena and Andy for the quick review of this article.
