Unicode, localization and C++ support

Marco Alesiani — Wed, 20 Apr 2016 10:06:20 +0000

This document doesn’t attempt to be yet another Unicode article; rather, it targets the fundamental points that should interest a C++ programmer diving into the new standards and into Unicode, either for the first time or as a refresher. It is by no means a complete source of information on Unicode (there’s no mention of versioning, BOM sequences or advanced topics like cluster composition algorithms or techniques for addressing language-specific drawbacks); it is only meant to provide insights on features that might be relevant to a programmer.

What is Unicode

Unicode is an industry standard for encoding text independently of language, platform or program. The idea is to assign a unique number (called a code point) to each character used in writing. For example, the code point for the Latin capital letter A is 0041, and to indicate that this is a Unicode code point a U+ prefix is commonly added: U+0041. One might notice that the same number is also the hex value of the character A in the ASCII table. This is by design: the first 256 code points were made identical to the content of the ASCII-based ISO-8859-1 encoding standard in order to make it easy to port existing Western texts to Unicode. Attention has to be paid to an often misunderstood aspect: Unicode was originally designed to use 2 bytes for its code points, but this is no longer the case (e.g. emoji characters have been added to the Unicode pages and mapped from code point U+1F600 onward, which cannot be represented with just 2 bytes).

The takeaway from this paragraph is that Unicode is a standard way to assign a unique number, called a code point, to each written character.
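As a small illustration of the idea that a code point is just a number, the following C++11 sketch stores two code points in char32_t constants and checks them at compile time. The names kLatinCapitalA, kGrinningFace and fits_in_16_bits are made up for this example and are not part of any library.

```cpp
// A code point is just a number; char32_t is wide enough to hold any of them.
constexpr char32_t kLatinCapitalA = U'A';           // U+0041
constexpr char32_t kGrinningFace  = U'\U0001F600';  // U+1F600

// True when the code point fits in 16 bits, i.e. it lies in the range
// that the original 2-byte design could address.
constexpr bool fits_in_16_bits(char32_t cp) {
  return cp <= 0xFFFF;
}

static_assert(kLatinCapitalA == 0x41, "U+0041 has the same value as ASCII 'A'");
static_assert(fits_in_16_bits(kLatinCapitalA), "U+0041 fits in 2 bytes");
static_assert(!fits_in_16_bits(kGrinningFace), "U+1F600 does not fit in 2 bytes");
```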

Implementing Unicode

Since Unicode maps each character to a (potentially very large) number, different encodings were developed to map this number (the code point) to a sequence of bytes that can be transmitted or stored efficiently. Business interests and technical arguments caused standardization issues, so many different encodings were developed.

UCS

UCS stands for Universal Coded Character Set and was originally a standard with goals similar to those of the Unicode consortium. Nowadays Unicode and UCS are kept (more or less) synchronized. Some of the original encodings are still in use though: for instance, what is now called UCS-2 encodes every code point in two bytes. UCS-2 is now deprecated since, again, 2 bytes do not suffice to represent every code point in Unicode. Another UCS encoding is UCS-4 (now also called UTF-32), which uses 32 bits per code point (and this suffices to represent all Unicode code points).

UTF

UTF stands for Unicode Transformation Format. Instead of a fixed-length encoding that uses the same number of bytes for every code point, as UCS-2 and UCS-4 do, many UTF encodings are variable-width. UTF-8 is one of the best known and most widely used variable-length encodings. It uses 8-bit code units (the code unit is the basic unit of an encoded sequence; in UTF-8 it is one byte, which carries bits of the character code point together with some encoding-specific data), so for example the “Hello” string in UTF-8 would be stored as:

 Hello
 0x48 0x65 0x6C 0x6C 0x6F

which, again, matches the result of translating the string directly through the ASCII table (5 bytes). Characters whose code points are below U+0080 are represented with a single byte; these are also the 128 characters of the ASCII table. If the code point is higher, as is the case for the € euro sign U+20AC, several bytes are used and each of them has its most significant bit set to 1. The leading bits of the first byte tell how many bytes the sequence contains, and every following byte starts with the bits 10.

 €
 0xE2        0x82        0xAC
 1110 0010   10 000010   10 101100
 ^^^^        ^^          ^^
 bits not part of the character codepoint but only related to the encoding

In this case three bytes are necessary to represent the € character in UTF-8. The number of bytes needed for each range of code points is described in the specification, and decoding is a straightforward procedure.
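The bit layout shown above for the euro sign can be sketched as a small encoder. This is a minimal illustration rather than a production routine: encode_utf8 is a hypothetical helper name, and the function assumes that its input is a valid code point (nothing above U+10FFFF and nothing in the surrogate range).

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point as a sequence of UTF-8 code units (bytes).
// Hypothetical helper for illustration; it assumes cp is at most U+10FFFF
// and is not in the surrogate range U+D800..U+DFFF.
std::string encode_utf8(char32_t cp) {
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
  std::string out;
  if (cp < 0x80) {                 // 0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {         // 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                         // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

With this sketch, encode_utf8(0x20AC) yields the three bytes E2 82 AC from the diagram, while any code point below U+0080 comes back as a single byte.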

UTF-8 therefore takes a variable number of bytes (1 up to 4 by design) to encode a code point. UTF-16 is also variable-length: it uses one or two 16-bit code units and was originally developed as a successor to the now obsolete UCS-2. Code points in the so-called Basic Multilingual Plane (BMP) can be represented by UCS-2 in a single 16-bit unit; the others cannot, so UTF-16 encodes each of them as two 16-bit code units, commonly referred to as a surrogate pair. UTF-32 uses 32 bits per code unit, and since 4 bytes is also the defined maximum needed to encode a code point, UTF-32 is effectively a fixed-length encoding. Its usage is discouraged by the W3C.
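The surrogate pair mechanism can be sketched in the same way. The helper below is hypothetical (encode_utf16 is not a standard function) and assumes a valid code point as input: code points in the BMP become one 16-bit code unit, and everything else is split into a high and a low surrogate.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point as one or two UTF-16 code units.
// Hypothetical helper for illustration; it assumes cp is at most U+10FFFF
// and is not itself in the surrogate range U+D800..U+DFFF.
std::u16string encode_utf16(char32_t cp) {
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
  std::u16string out;
  if (cp < 0x10000) {
    // Inside the BMP: the code unit is the code point itself.
    out += static_cast<char16_t>(cp);
  } else {
    // Outside the BMP: 20 bits remain after subtracting 0x10000.
    const char32_t offset = cp - 0x10000;
    out += static_cast<char16_t>(0xD800 + (offset >> 10));    // high surrogate: top 10 bits
    out += static_cast<char16_t>(0xDC00 + (offset & 0x3FF));  // low surrogate: bottom 10 bits
  }
  return out;
}
```

For U+1F30D this produces the two code units D83C DF0D, whereas the euro sign U+20AC stays a single code unit.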

C++11 and Unicode support

Since the C++11 standard some additional Unicode facilities have been integrated into the language. The fundamental type wchar_t is dedicated to storing any supported code unit (it is usually 32 bits wide on systems that support Unicode, with the exception of Windows, where it is 16 bits). C++11 introduced char16_t and char32_t, types large enough to store UTF-16 code units (2 bytes each) and UTF-32 code units (4 bytes each). The char type remains dedicated to whatever representation can be processed most efficiently on the system and, on a machine where char is 8 bits, it is used for 8-bit code units. One should not assume that a char sequence holds plain ASCII characters but rather treat it as a sequence of 8-bit code units.

The <string> header provides useful typedefs for specializations of the std::basic_string class template:

std::string	std::basic_string<char>
std::wstring	std::basic_string<wchar_t>
std::u16string (C++11)	std::basic_string<char16_t>
std::u32string (C++11)	std::basic_string<char32_t>
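A point worth keeping in mind with these typedefs is that size() counts code units, not characters. The sketch below stores the same code point, U+1F30D, in three of the string types; code_units and check_code_unit_counts are names invented for this example, and the UTF-8 bytes are written out by hand so that the result does not depend on compiler settings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of code units held by any basic_string specialization.
template <typename CharT>
std::size_t code_units(const std::basic_string<CharT>& s) {
  return s.size();
}

// The same code point, U+1F30D, needs a different number of code units
// depending on the string type (and therefore the encoding) used.
void check_code_unit_counts() {
  const std::string    utf8  = "\xF0\x9F\x8C\x8D";  // UTF-8 bytes spelled out by hand
  const std::u16string utf16 = u"\U0001F30D";       // stored as a surrogate pair
  const std::u32string utf32 = U"\U0001F30D";       // one 32-bit code unit

  assert(code_units(utf8)  == 4);
  assert(code_units(utf16) == 2);
  assert(code_units(utf32) == 1);
}
```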

Conversions between a byte string (std::string) and a wide string (std::basic_string<wchar_t>, i.e. std::wstring) are also supported natively starting with C++11. An example follows:

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
  using namespace std;

  ios_base::sync_with_stdio(false); // Avoids synchronization with C stdio on gcc
                                    // (either localize both or disable sync)

  wcout.imbue(locale("en_US.UTF-8")); // change default locale (throws if it is not installed)

  unsigned char euroUTF8[] = { 0xE2, 0x82, 0xAC, 0x00 }; // Euro sign UTF8

  // UTF-8 <-> wide string (UCS-4 or UCS-2, depending on the size of wchar_t)
  wstring_convert<codecvt_utf8<wchar_t>> converter_UTF8_wchar;
  wstring euroWideStr = converter_UTF8_wchar.from_bytes(reinterpret_cast<const char*>(euroUTF8));
  wcout << euroWideStr << endl;

  string euroNarrowStr = converter_UTF8_wchar.to_bytes(euroWideStr);
  cout << euroNarrowStr << endl;
}

A locale is an immutable set of so-called facets that help in writing locale-aware code (i.e. code that renders some features in a way specific to a geographic area or culture). Examples are formatting time and date in a specific format (US or EU) or parsing currency. Each feature is represented by a facet class that encapsulates the locale-specific logic.

In the code above the default locale for the wcout global object (output for wide strings) is changed to English, US region, with UTF-8 encoding.

After the setup, a sequence of UTF-8 encoded bytes that represents the € euro character code point is stored into an array.
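To make the notion of a facet a little more concrete, here is a minimal sketch that queries one standard facet, std::numpunct, for the decimal separator of a locale. The helper names decimal_separator, locale_or_classic and check_classic_separator are invented for this example. Since named locales such as en_US.UTF-8 are not guaranteed to be installed, the sketch falls back to the classic "C" locale when construction fails.

```cpp
#include <cassert>
#include <locale>
#include <stdexcept>
#include <string>

// Returns the decimal separator reported by the numpunct facet of a locale.
char decimal_separator(const std::locale& loc) {
  return std::use_facet<std::numpunct<char>>(loc).decimal_point();
}

// Tries to construct a named locale; falls back to the classic "C" locale
// when that name is not available on the system.
std::locale locale_or_classic(const std::string& name) {
  try {
    return std::locale(name);
  } catch (const std::runtime_error&) {
    return std::locale::classic();
  }
}

// The classic locale always uses '.' as the decimal separator.
void check_classic_separator() {
  assert(decimal_separator(std::locale::classic()) == '.');
}
```

Calling decimal_separator(locale_or_classic("de_DE.UTF-8")) would typically give ',' on a system where that locale is installed and '.' otherwise.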

The class template std::wstring_convert accepts a code conversion facet to perform the conversion to a wide string. The facets provided by the standard library that are suitable for this are std::codecvt_utf8 and std::codecvt_utf8_utf16. std::codecvt_utf8 handles conversions between UTF-8 and UCS-2 and between UTF-8 and UCS-4. To understand why both conversions are available, recall that the fundamental type wchar_t is usually 32 bits wide, with the exception of Windows systems (16 bits). std::codecvt_utf8_utf16 handles conversions between UTF-8 and UTF-16.

In addition to the existing L prefix for wchar_t literals, C++11 added new string literal prefixes that specify the encoding and type of a literal: u8 for UTF-8 encoded, u for UTF-16 encoded and U for UTF-32 encoded. Escape sequences for 16-bit and 32-bit code points (\uXXXX and \UXXXXXXXX) were also added.

// Requires <cassert> and <cstring>
unsigned char euroUTF8_1[] = { 0xE2, 0x82, 0xAC, 0x00 };
unsigned char euroUTF8_2[] = u8"\U000020AC"; // Null character is always appended to the literal
assert(memcmp(euroUTF8_1, euroUTF8_2, sizeof(euroUTF8_2)) == 0);

Using a 32-bit escape sequence for a code point outside the BMP in a UTF-16 literal generates a sequence of UTF-16 code units (a surrogate pair) that can later be converted or used:

#include <codecvt>
#include <iomanip>
#include <iostream>
#include <locale>
#include <string>

// Prints each byte of s as two hex digits
void hex_print(const std::string& s) {
  std::cout << std::hex << std::setfill('0');
  for(unsigned char c : s)
    std::cout << std::setw(2) << static_cast<int>(c) << ' ';
  std::cout << std::dec << '\n';
}

int main() {
  using namespace std;
  ios_base::sync_with_stdio(false);
  cout.imbue(locale("C.UTF-8"));

  u16string globeUTF16 = u"\U0001F30D"; // Globe

  wstring_convert<codecvt_utf8_utf16<char16_t>, char16_t> convert_UTF16_UTF8;

  string utf8Str = convert_UTF16_UTF8.to_bytes(globeUTF16);
  cout << "UTF-8 code units: ";
  hex_print(utf8Str);

  cout << utf8Str << endl;
}

It has to be noted that even when string literals are used, compile-time error checking isn’t available, and failures should be expected when doing nonsensical operations such as converting to UTF-8 a UCS-2 sequence containing code units in the range 0xD800-0xDFFF. By design this range holds no characters in UCS-2: it is reserved for the surrogate pairs that UTF-16 uses to reach code points outside the BMP:

char16_t globeUTF16[] = u"\U0001F34C";
wstring_convert<codecvt_utf8<char16_t>, char16_t> convert_UTF8_UCS2;
// std::string u8Str = convert_UTF8_UCS2.to_bytes(globeUTF16); // range_error

// Decode the surrogate pair by hand: high code unit first, then low code unit
auto globeCodepoint = (globeUTF16[0] - 0xD800) * 0x400 + globeUTF16[1] - 0xDC00 + 0x10000;

cout << hex << globeCodepoint << endl; // 1f34c
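The manual decoding above can be wrapped in a function that also validates its input. The sketch below is one possible shape for such a helper; the names is_high_surrogate, is_low_surrogate, decode_surrogate_pair and check_decode are invented here, and throwing std::range_error is simply a choice made to mirror the error mentioned in the snippet.

```cpp
#include <cassert>
#include <stdexcept>

// High surrogates occupy 0xD800..0xDBFF, low surrogates 0xDC00..0xDFFF.
constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combines a UTF-16 surrogate pair into the code point it encodes.
// Throws std::range_error when the two code units do not form a valid pair.
char32_t decode_surrogate_pair(char16_t high, char16_t low) {
  if (!is_high_surrogate(high) || !is_low_surrogate(low))
    throw std::range_error("not a valid UTF-16 surrogate pair");
  return ((static_cast<char32_t>(high) - 0xD800) << 10)
       + (static_cast<char32_t>(low) - 0xDC00)
       + 0x10000;
}

// Self-check against the code point used in the snippet above (U+1F34C).
void check_decode() {
  assert(decode_surrogate_pair(0xD83C, 0xDF4C) == 0x1F34C);
}
```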

At the time of writing, few compilers actually implement this behavior, while others either completely lack the facet features or allow the illegal conversion.

Compiler settings

A few words are worth spending on compiler behavior across different systems and architectures. Compilers usually document their default execution character set and how to change it: gcc has the -fexec-charset option, while MSVC accepts an /execution-charset option in recent versions (Visual Studio 2015 Update 2 onward) and before that only offered a preprocessor directive:

#pragma execution_character_set("utf-8")

In MSVC this setting is not affected by the “multi-byte character set” or “unicode” option in the project property pane, which only tells the compiler whether to use the wide APIs or not.
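One way to see which execution character set a given build actually used is to inspect how the compiler encoded a known character in an ordinary string literal. The sketch below is only a probe under a stated assumption: it checks whether the euro sign came out as the UTF-8 bytes E2 82 AC and tells nothing about other encodings. The function names are invented for this example.

```cpp
#include <cassert>
#include <cstring>

// True when the compiler encoded the ordinary literal "\u20AC" as the three
// UTF-8 bytes E2 82 AC, i.e. the execution character set behaves like UTF-8
// for this character.
bool execution_charset_is_utf8() {
  const char euro[] = "\u20AC";
  return sizeof(euro) == 4 && std::memcmp(euro, "\xE2\x82\xAC", 4) == 0;
}

// Can be called early in main() to fail fast in debug builds when the
// program relies on a UTF-8 execution character set.
void require_utf8_execution_charset() {
  assert(execution_charset_is_utf8());
}
```

With gcc the result is expected to follow the -fexec-charset setting, which defaults to UTF-8.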

Although there are still some rough edges, an ongoing effort to fix standard library implementations and provide more features and algorithms to operate with Unicode is being carried out and will continue to improve with future C++ versions.

A big thanks to Stefano Saraulli, Alessandro Vergani and the reddit community for reviewing this article.

ucs – Italian C++ Community
