11. Programming languages - Programming with Unicode

11.1. C language

The C language is a low-level language, close to the hardware. It has a built-in character string type (wchar_t*), but only a few libraries support this type. C is usually used as the first “layer” between the kernel (system calls, e.g. open a file) and applications, higher-level libraries and other programming languages. This first layer uses the same type as the kernel: except on Windows, all kernels use byte strings.

There are higher-level libraries, like glib or Qt, offering a Unicode API, even if the underlying kernel uses byte strings. Such libraries use a codec to encode data sent to the kernel and to decode data received from the kernel. The codec is usually the current locale encoding.

Because there is no Unicode standard library, most third-party libraries chose the simple solution: use byte strings. For example, the OpenSSL library, an open source cryptography toolkit, expects filenames as byte strings. On Windows, you have to encode Unicode filenames to the current ANSI code page, which is a small subset of the Unicode charset.

11.1.1. Byte API (char)

char

For historical reasons, char is the C type for a character (“char” as in “character”). In practice, this is only true for 7-bit and 8-bit encodings like ASCII or ISO 8859-1. With multibyte encodings, a char is only one byte of a character. For example, the character “é” (U+00E9) is encoded as two bytes (0xC3 0xA9) in UTF-8.
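As a minimal sketch of the byte/character distinction (written in Python 3 for brevity rather than C; the variable names are illustrative only), the same character occupies one byte in ISO 8859-1 and two bytes in UTF-8:

```python
# One character, different byte lengths depending on the encoding.
char = "\u00e9"  # the character "é" (U+00E9)

latin1_bytes = char.encode("iso-8859-1")  # 8-bit encoding: one byte
utf8_bytes = char.encode("utf-8")         # multibyte encoding: two bytes

print(len(char), len(latin1_bytes), len(utf8_bytes))
print(latin1_bytes, utf8_bytes)
```

A C char can hold the single ISO 8859-1 byte, but only one of the two UTF-8 bytes at a time.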

char is an 8-bit integer; whether it is signed depends on the operating system and the compiler. On Linux, the GNU compiler (gcc) uses a signed type on Intel CPUs. It defines __CHAR_UNSIGNED__ if the char type is unsigned. To check whether char is unsigned, test if the CHAR_MAX constant from <limits.h> is equal to 255.

A literal byte is written between apostrophes, e.g. 'a'. Some control characters can be written with a backslash plus a letter (e.g. '\n' = 10). It’s also possible to write the value in octal (e.g. '\033' = 27) or hexadecimal (e.g. '\x20' = 32). An apostrophe can be written '\'' or '\x27'. A backslash is written '\\'.

<ctype.h> contains functions to manipulate bytes, like toupper() or isprint().

11.1.2. Byte string API (char*)

char*

char* is a byte string. This type is used in many places in the C standard library. For example, fopen() uses char* for the filename.

<string.h> is the byte string library. Most functions start with the “str” (string) prefix: strlen(), strcat(), etc. <stdio.h> contains useful string functions like snprintf() to format a message.

The length of a string is not stored explicitly: the end of the string is marked by a nul byte. This is a problem with encodings using nul bytes (e.g. UTF-16 and UTF-32): strlen() cannot be used to get the length of the string, whereas most C functions suppose that strlen() gives the length of the string. To support such encodings, the length should be stored differently (e.g. in another variable or function argument) and str*() functions should be replaced by mem*() functions (e.g. replace strcmp(a, b) == 0 by memcmp(a, b, size) == 0, where size is the common length in bytes of both strings).
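The nul byte problem can be illustrated with a short Python 3 sketch. The c_strlen() helper is hypothetical: it only mimics what C strlen() does (count the bytes before the first nul byte) and is not part of any library.

```python
def c_strlen(data: bytes) -> int:
    """Mimic C strlen(): number of bytes before the first nul byte."""
    end = data.find(b"\x00")
    return len(data) if end == -1 else end


text = "abc"
utf8 = text.encode("utf-8")       # no nul byte inside the string
utf16 = text.encode("utf-16-le")  # every ASCII character adds a nul byte

print(c_strlen(utf8), len(utf8))    # both lengths agree
print(c_strlen(utf16), len(utf16))  # strlen() stops after the first byte
```

With UTF-16, the real length has to be carried separately, as len() does here.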

A literal byte string is written between quotes, e.g. "Hello World!". As with byte literals, it’s possible to add control characters and characters in octal or hexadecimal, e.g. "Hello World!\n".

11.1.3. Character API (wchar_t)

type wchar_t

With ISO C99 comes wchar_t: the character type. It can be used to store Unicode characters. Like char, it has a library: <wctype.h> contains functions like towupper() or iswprint() to manipulate characters.

wchar_t is a 16-bit or 32-bit integer, signed or not. Linux uses a 32-bit signed integer. Mac OS X uses a 32-bit integer. Windows and AIX use a 16-bit integer (BMP only). To check whether wchar_t is a 16-bit unsigned integer, test if the WCHAR_MAX constant from <wchar.h> is equal to 0xFFFF.

A literal character is written between apostrophes with the L prefix, e.g. L'a'. As with byte literals, it’s possible to write a control character with a backslash, and a character with its value in octal or hexadecimal. For codes bigger than 255, the '\uHHHH' syntax can be used. For codes bigger than 65535, the '\UHHHHHHHH' syntax can be used with a 32-bit wchar_t.

11.1.4. Character string API (wchar_t*)

wchar_t*

With ISO C99 comes wchar_t*: the character string type. The standard library <wchar.h> contains character string functions like wcslen() or wprintf(), and constants like WCHAR_MAX. If wchar_t is 16 bits long, non-BMP characters are encoded to UTF-16 as surrogate pairs.

A literal character string is written between quotes with the L prefix, e.g. L"Hello World!\n". Like character literals, it also supports control characters, codes written in octal or hexadecimal, L"\uHHHH" and L"\UHHHHHHHH".

POSIX.1-2001 has no function ignoring case to compare character strings. POSIX.1-2008, a recent standard, adds wcscasecmp(): the GNU libc has it as an extension (if _GNU_SOURCE is defined). Windows has the _wcsnicmp() function.

Windows uses (UTF-16) wchar_t* strings for its Unicode API.

11.1.5. printf functions family

int printf(const char *format, ...)

int wprintf(const wchar_t *format, ...)

Formats of string arguments for the printf functions:

"%s": literal byte string (char*)

"%ls": literal character string (wchar_t*)

printf("%ls") is strict: it stops immediately if a character string argument cannot be encoded to the locale encoding. For example, the following code prints the truncated string “Latin capital letter L with stroke: [” if Ł (U+0141) cannot be encoded to the locale encoding.

printf("Latin capital letter L with stroke: [%ls]\n", L"\u0141");

wprintf("%s") and wprintf("%.<length>s") are strict: they stop immediately if a byte string argument cannot be decoded from the locale encoding. For example, the following code prints the truncated string “Latin capital letter L with stroke: [” if 0xC5 0x81 (U+0141 encoded to UTF-8) cannot be decoded from the locale encoding.

wprintf(L"Latin capital letter L with stroke: [%s]\n", "\xC5\x81");
wprintf(L"Latin capital letter L with stroke: [%.10s]\n", "\xC5\x81");

wprintf("%ls") replaces unencodable character string arguments by ? (U+003F). For example, the following code prints “Latin capital letter L with stroke: [?]” if Ł (U+0141) cannot be encoded to the locale encoding:

wprintf(L"Latin capital letter L with stroke: [%ls]\n", L"\u0141");

So to avoid truncated strings, try to use only wprintf() with character string arguments.

Note

There is also a "%S" format, which is a deprecated alias of the "%ls" format: don’t use it.

11.2. C++

std::wstring: character string using the wchar_t type, Unicode version of std::string (byte string)

std::wcin, std::wcout and std::wcerr: standard input, output and error output; Unicode version of std::cin, std::cout and std::cerr

std::wostringstream: character stream buffer; Unicode version of std::ostringstream.

To initialize the locales, equivalent to setlocale(LC_ALL, ""), use:

#include <locale>
std::locale::global(std::locale(""));

If you use both C and C++ functions (e.g. printf() and std::cout) to access the standard streams, you may have issues with non-ASCII characters. To avoid these issues, you can disable the automatic synchronization between C (std*) and C++ (std::c*) streams using:

#include <iostream>
std::ios_base::sync_with_stdio(false);

Note

Use typedef basic_ostringstream<wchar_t> wostringstream; if wostringstream is not available.

11.3. Python

Python has supported Unicode since version 2.0, released in October 2000. Byte and Unicode strings store their length, so it’s possible to embed nul bytes/characters.

Python can be compiled in two modes: narrow (UTF-16) and wide (UCS-4). The sys.maxunicode constant is 0xFFFF in a narrow build, and 0x10FFFF in a wide build. Python is compiled in narrow mode on Windows, because wchar_t is also 16 bits on Windows and so it is possible to use Python Unicode strings as wchar_t* strings without any (expensive) conversion.

11.3.1. Python 2

str is the byte string type and unicode is the character string type. Literal byte strings are written b'abc' (syntax compatible with Python 3) or 'abc' (legacy syntax), \xHH can be used to write a byte by its hexadecimal value (e.g. b'\x80' for 128). Literal Unicode strings are written with the prefix u: u'abc'. Code points can be written as hexadecimal: \xHH (U+0000–U+00FF), \uHHHH (U+0000–U+FFFF) or \UHHHHHHHH (U+0000–U+10FFFF), e.g. u'euro sign:\u20AC'.

In Python 2, str + unicode gives unicode: the byte string is decoded from the default encoding (ASCII). This coercion was a bad design idea because it was the source of a lot of confusion. At the same time, it was not possible to switch completely to Unicode in 2000: computers were slower and there were fewer Python core developers. It took 8 years to switch completely to Unicode: Python 3 was released in December 2008.

Narrow builds of Python 2 have partial support for non-BMP characters. The unichr() function raises an error for codes bigger than U+FFFF, whereas literal strings support non-BMP characters (e.g. u'\U0010FFFF'). Non-BMP characters are encoded as surrogate pairs. The disadvantage is that len(u'\U00010000') is 2, and u'\U0010FFFF'[0] is u'\uDBFF' (a lone surrogate character).
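The surrogate pair arithmetic behind these values can be sketched in Python 3. This is an illustration only: to_surrogate_pair() is a hypothetical helper, not a standard library function, and since current Python 3 versions no longer have narrow builds, the UTF-16 codec is used to cross-check the result.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point (U+10000..U+10FFFF) into two UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # lead (high) surrogate
    low = 0xDC00 + (offset & 0x3FF)  # trail (low) surrogate
    return high, low


high, low = to_surrogate_pair(0x10FFFF)
print(hex(high), hex(low))

# Cross-check with the UTF-16 codec: two 16-bit units, i.e. four bytes.
encoded = "\U0010FFFF".encode("utf-16-be")
utf16_units = len(encoded) // 2
print(encoded, utf16_units)
```

The first unit is the lone surrogate that a narrow build returns when indexing the string.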

Note

DO NOT CHANGE THE DEFAULT ENCODING! Calling sys.setdefaultencoding() is a very bad idea because it impacts all libraries which suppose that the default encoding is ASCII.

11.3.2. Python 3

bytes is the byte string type and str is the character string type. Literal byte strings are written with the b prefix: b'abc'. \xHH can be used to write a byte by its hexadecimal value, e.g. b'\x80' for 128. Literal Unicode strings are written 'abc'. Code points can be used directly in hexadecimal: \xHH (U+0000–U+00FF), \uHHHH (U+0000–U+FFFF) or \UHHHHHHHH (U+0000–U+10FFFF), e.g. 'euro sign:\u20AC'. Each item of a byte string is an integer in range 0–255: b'abc'[0] gives 97, whereas 'abc'[0] gives 'a'.

Python 3 has full support for non-BMP characters, in narrow and wide builds. But as in Python 2, chr(0x10FFFF) creates a string of 2 characters (a UTF-16 surrogate pair) in a narrow build. chr() and ord() support non-BMP characters in both modes.

Python 3 uses the U+DC80–U+DCFF character range to store undecodable bytes with the surrogateescape error handler, described in PEP 383 (Non-decodable Bytes in System Character Interfaces). It is used for filenames and environment variables on UNIX and BSD systems. Example: b'abc\xff'.decode('ASCII', 'surrogateescape') gives 'abc\uDCFF'.
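A short Python 3 sketch of the round trip that surrogateescape makes possible (the variable names are illustrative):

```python
raw = b"abc\xff"  # 0xFF cannot be decoded from ASCII

# Decoding maps the undecodable byte 0xFF to the lone surrogate U+DCFF.
text = raw.decode("ascii", "surrogateescape")

# Encoding with the same error handler restores the original byte.
restored = text.encode("ascii", "surrogateescape")

print(ascii(text), restored == raw)
```

Because the mapping is reversible, a filename that cannot be decoded can still be passed back to the operating system unchanged.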

11.3.3. Differences between Python 2 and Python 3

str + unicode gives unicode in Python 2 (the byte string is decoded from the default encoding, ASCII), whereas bytes + str raises a TypeError in Python 3. In Python 3, comparing bytes and str gives False, emits a BytesWarning warning or raises a BytesWarning exception depending on the bytes warning flag (-b or -bb option passed to the Python program). In Python 2, the byte string is decoded from the default encoding (ASCII) to Unicode before being compared.
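The Python 3 behaviour for concatenation can be checked with a minimal sketch (the variable names are illustrative; the comparison case is left out because its result depends on the -b and -bb options):

```python
data = b"abc"  # byte string
text = "def"   # character string

try:
    data + text  # no implicit decoding in Python 3
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The byte string has to be decoded explicitly before concatenation.
combined = data.decode("ascii") + text

print(mixed_ok, combined)
```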

The UTF-8 decoder of Python 2 accepts surrogate characters, even if they are invalid, to keep backward compatibility with Python 2.0. In Python 3, the UTF-8 decoder is strict: it rejects surrogate characters.

It is possible to make Python 2 behave more like Python 3 with from __future__ import unicode_literals.

11.3.4. Codecs

The codecs and encodings modules provide text encodings. They support a lot of encodings. Some examples: ASCII, ISO-8859-1, UTF-8, UTF-16-LE, ShiftJIS, Big5, cp037, cp950, EUC_JP, etc.

UTF-8, UTF-16-LE, UTF-16-BE, UTF-32-LE and UTF-32-BE don’t use a BOM, whereas UTF-8-SIG, UTF-16 and UTF-32 use a BOM. mbcs is only available on Windows: it is the ANSI code page.
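The difference between the codecs with and without a BOM can be observed with a small Python 3 sketch (the variable names are illustrative; the byte order chosen by the UTF-16 codec depends on the platform, so only its BOM prefix is checked):

```python
import codecs

text = "a"

without_bom = text.encode("utf-8")      # no BOM
with_bom = text.encode("utf-8-sig")     # BOM followed by the data
utf16_fixed = text.encode("utf-16-le")  # explicit byte order, no BOM
utf16_bom = text.encode("utf-16")       # native byte order, with BOM

print(without_bom, with_bom, utf16_fixed, utf16_bom)

# UTF-8-SIG strips the BOM on decoding; plain UTF-8 keeps it as U+FEFF.
print(ascii(with_bom.decode("utf-8-sig")), ascii(with_bom.decode("utf-8")))
```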

Python also provides many error handlers used to specify how to handle undecodable byte sequences and unencodable characters:

strict (default): raise a UnicodeDecodeError or a UnicodeEncodeError

replace: replace undecodable bytes by � (U+FFFD) and unencodable characters by ? (U+003F)

ignore: ignore undecodable bytes and unencodable characters

backslashreplace (only encode): replace unencodable characters by \xHH, \uHHHH or \UHHHHHHHH

Python 3 has three more error handlers:

surrogateescape: replace undecodable bytes (non-ASCII: 0x80–0xFF) by surrogate characters (in U+DC80–U+DCFF) on decoding, replace characters in range U+DC80–U+DCFF by bytes in 0x80–0xFF on encoding. Read PEP 383 (Non-decodable Bytes in System Character Interfaces) for the details.

surrogatepass, specific to the UTF-8 codec: allow encoding/decoding surrogate characters in UTF-8. It is required because the UTF-8 decoder of Python 3 rejects surrogate characters by default.

backslashreplace (for decode): replace undecodable bytes by \xHH
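The surrogatepass handler can be tried with a short Python 3 sketch (the variable names are illustrative):

```python
lone = "\udc80"  # a lone surrogate character

try:
    lone.encode("utf-8")  # the strict UTF-8 encoder rejects surrogates
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# surrogatepass encodes the surrogate like any other 3-byte code point.
encoded = lone.encode("utf-8", "surrogatepass")
decoded = encoded.decode("utf-8", "surrogatepass")

print(strict_ok, encoded, decoded == lone)
```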

Decoding examples in Python 3:

b'abc\xff'.decode('ASCII') uses the strict error handler and raises a UnicodeDecodeError

b'abc\xff'.decode('ASCII', 'ignore') gives 'abc'

b'abc\xff'.decode('ASCII', 'replace') gives 'abc\uFFFD'

b'abc\xff'.decode('ASCII', 'surrogateescape') gives 'abc\uDCFF'

Encoding examples in Python 3:

'\u20ac'.encode('UTF-8') gives b'\xe2\x82\xac'

'abc\xff'.encode('ASCII') uses the strict error handler and raises a UnicodeEncodeError

'abc\xff'.encode('ASCII', 'backslashreplace') gives b'abc\\xff'

11.3.5. String methods

Byte string (str in Python 2, bytes in Python 3) methods:

.decode(encoding, errors='strict'): decode from the specified encoding and (optional) error handler.

Character string (unicode in Python 2, str in Python 3) methods:

.encode(encoding, errors='strict'): encode to the specified encoding with an (optional) error handler

.isprintable(): False if the character category is other (Cc, Cf, Cn, Co, Cs) or separator (Zl, Zp, Zs), True otherwise. There is an exception: even if U+0020 is a separator, ' '.isprintable() gives True.

.upper(): convert to uppercase
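The isprintable() rule can be compared with the Unicode categories in a small Python 3 sketch (the sample characters are chosen for illustration):

```python
import unicodedata

samples = [" ", "a", "\n", "\u00a0", "\u200b"]

# Map each character to (category, isprintable()).
results = {
    ch: (unicodedata.category(ch), ch.isprintable())
    for ch in samples
}

for ch, (category, printable) in results.items():
    print(ascii(ch), category, printable)
```

The space (U+0020) and the no-break space (U+00A0) are both in the Zs category, but only the former is reported as printable.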

11.3.6. Filesystem

Python decodes bytes filenames and encodes Unicode filenames using the filesystem encoding, sys.getfilesystemencoding():

mbcs (ANSI code page) on Windows

UTF-8 on Mac OS X

locale encoding otherwise

Python uses the strict error handler in Python 2, and surrogateescape (PEP 383) in Python 3. In Python 2, if os.listdir(u'.') cannot decode a filename, it keeps the bytes filename unchanged. Thanks to surrogateescape, decoding a filename never fails in Python 3. But encoding a filename can fail in Python 2 and 3 depending on the filesystem encoding. For example, on Linux with the C locale, the Unicode filename "h\xe9.py" cannot be encoded because the filesystem encoding is ASCII.
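Both cases can be reproduced in Python 3 without touching the filesystem. The sketch below assumes an ASCII filesystem encoding and applies the codec calls by hand, which only approximates what Python does internally for filenames:

```python
# Decoding never fails: the undecodable byte 0xE9 becomes U+DCE9.
undecodable = b"h\xe9.py"
name = undecodable.decode("ascii", "surrogateescape")
roundtrip = name.encode("ascii", "surrogateescape")

# Encoding can fail: U+00E9 is a real character, not an escaped byte,
# and it has no ASCII representation.
try:
    "h\xe9.py".encode("ascii", "surrogateescape")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(ascii(name), roundtrip == undecodable, encodable)
```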

In Python 2, use os.getcwdu() to get the current directory as Unicode.

11.3.7. Windows

Encodings used on Windows:

locale.getpreferredencoding(): ANSI code page

'mbcs' codec: ANSI code page

sys.stdout.encoding, sys.stderr.encoding: encoding of the Windows console.

sys.argv, os.environ, subprocess.Popen(args): native Unicode support (no encoding)

11.3.8. Modules

codecs module:

BOM_UTF8, BOM_UTF16_BE, BOM_UTF32_LE, …: Byte order marks (BOM) constants

lookup(name): get a Python codec. lookup(name).name gets the Python normalized name of a codec, e.g. codecs.lookup('ANSI_X3.4-1968').name gives 'ascii'.

open(filename, mode='rb', encoding=None, errors='strict', ...): legacy API to open a binary or text file. To open a file in Unicode mode, use io.open() instead

io module:

open(name, mode='r', buffering=-1, encoding=None, errors=None, ...): open a binary or text file in read and/or write mode. For text files, encoding and errors can be used to specify the encoding and the error handler. By default, it opens text files with the locale encoding in strict mode.

TextIOWrapper(): wrapper to read and/or write text files, encode to/decode from the specified encoding (and error handler) and normalize newlines (\r\n and \r are replaced by \n). It requires a buffered file. Don’t use it directly to open a text file: use open() instead.
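A minimal Python 3 sketch of io.open() with an explicit encoding follows. The file name euro.txt is a placeholder, and newline="" is passed on writing so that the \r\n sequence reaches the file untouched:

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "euro.txt")  # placeholder file name

    # Text mode: characters are encoded to UTF-8 on write.
    with io.open(path, "w", encoding="utf-8", newline="") as f:
        f.write("euro sign: \u20ac\r\n")

    # Binary mode: the raw encoded bytes, including \r\n.
    with io.open(path, "rb") as f:
        raw = f.read()

    # Text mode: bytes are decoded and \r\n is normalized to \n.
    with io.open(path, "r", encoding="utf-8") as f:
        text = f.read()

print(raw, ascii(text))
```

Passing the encoding explicitly avoids depending on the locale encoding of the machine running the code.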

locale module (locales):

LC_ALL, LC_CTYPE, …: locale categories

getlocale(category): get the value of a locale category as the tuple (language code, encoding name)

getpreferredencoding(): get the locale encoding

setlocale(category, value): set the value of a locale category

sys module:

getdefaultencoding(): get the default encoding, e.g. used by 'abc'.encode(). In Python 3, the default encoding is fixed to 'utf-8'; in Python 2, it is 'ascii' by default.

getfilesystemencoding(): get the filesystem encoding used to decode and encode filenames

maxunicode: biggest Unicode code point storable in a single Python Unicode character, 0xFFFF in narrow build or 0x10FFFF in wide build.

unicodedata module:

category(char): get the category of a character

name(char): get the name of a character

normalize(form, string): normalize a string to the NFC, NFD, NFKC or NFKD form
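The three unicodedata functions can be tried in a short Python 3 sketch (the variable names are illustrative; note that normalize() takes the form name as its first argument):

```python
import unicodedata

euro = "\u20ac"
category = unicodedata.category(euro)  # general category of the character
name = unicodedata.name(euro)          # official Unicode name

# NFD splits "é" into "e" plus a combining acute accent (U+0301);
# NFC composes the two code points back into one.
decomposed = unicodedata.normalize("NFD", "\u00e9")
composed = unicodedata.normalize("NFC", decomposed)

print(category, name, ascii(decomposed), ascii(composed))
```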

11.4. PHP

In PHP 5, a literal string (e.g. "abc") is a byte string. PHP has no character string type, only a “string” type which is a byte string.

PHP has “multibyte” functions to manipulate byte strings using their encoding. These functions have an optional encoding argument. If the encoding is not specified, PHP uses the default encoding (called “internal encoding”). Some multibyte functions:

mb_internal_encoding(): get or set the internal encoding

mb_substitute_character(): change how to handle unencodable characters:

"none": ignore unencodable characters

"long": escape as hexadecimal value, e.g. "U+E9" or "JIS+7E7E"

"entity": escape as HTML entities, e.g. "&#xE9;"

mb_convert_encoding(): decode from an encoding and encode to another encoding

mb_ereg(): search a pattern using a regular expression

mb_strlen(): get the length in characters

mb_detect_encoding(): guess the encoding of a byte string

Perl compatible regular expressions (PCRE) have a u flag (PCRE_UTF8) to process byte strings as UTF-8 encoded strings.

PHP also includes a binding for the iconv library.

iconv(): decode a byte string from an encoding and encode to another encoding; you can use the //IGNORE or //TRANSLIT suffix to choose the error handler

iconv_mime_decode(): decode a MIME header field

PHP 6 was a project to improve the Unicode support of PHP. This project died at the beginning of 2010. Read The Death of PHP 6/The Future of PHP 6 (May 25, 2010 by Larry Ullman) and Future of PHP6 (March 2010 by Johannes Schlüter) for more information.

11.5. Perl

Write a character using its code point written in hexadecimal:

chr(0x1F4A9)

"\x{2639}"

"\N{U+A0}"

Using use charnames qw( :full );, you can use a Unicode character in a string using "\N{name}" syntax. Example:

say "\N{long s} \N{ae} \N{Omega} \N{omega} \N{UPWARDS ARROW}"

Declare that filehandles opened within this lexical scope but not elsewhere are in UTF-8, until and unless you say otherwise. The :std option adds STDIN, STDOUT and STDERR. This critical step implicitly decodes incoming data and encodes outgoing data as UTF-8:

use open qw( :encoding(UTF-8) :std );

If the PERL_UNICODE environment variable is set to AS, the following data will use UTF-8:

@ARGV

STDIN, STDOUT, STDERR

If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTF-8, then say:

binmode(DATA, ":encoding(UTF-8)");

Misc:

use feature qw< unicode_strings >;
use Unicode::Normalize qw< NFD NFC >;
use Encode qw< encode decode >;
@ARGV = map { decode("UTF-8", $_) } @ARGV;
open(OUTPUT, "> :raw :encoding(UTF-16LE) :crlf", $filename);

Misc:

Encode

Unicode::Normalize

Unicode::Collate

Unicode::Collate::Locale

Unicode::UCD

DBM_Filter::utf8

History:

Perl 5.6 (2000): initial Unicode support, support character strings

Perl 5.8 (2002): regex supports Unicode

Use the “use utf8;” pragma to specify that your Perl script is encoded in UTF-8.

Read perluniintro, perlunicode and perlunifaq manuals.

See Tom Christiansen’s Materials for OSCON 2011 for more information.

11.6. Java

char is a character type able to store only Unicode BMP characters (U+0000–U+FFFF), whereas Character is a wrapper of char with static helper functions. Character methods:

.getType(ch): get the category of a character

.isWhitespace(ch): test if a character is a whitespace according to Java

.toUpperCase(ch): convert to uppercase

.codePointAt(CharSequence, int): return the code point at the given index of the CharSequence

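String is a character string implemented using a char array and UTF-16. String methods:

String(bytes, encoding): decode a byte string from the specified encoding. When a Charset object is passed, undecodable byte sequences are replaced by the replacement string of the charset; for strict decoding, use the CharsetDecoder class, which throws a CharacterCodingException if a byte sequence cannot be decoded.

.getBytes(encoding): encode to the specified encoding. When a Charset object is passed, unencodable characters are replaced by the replacement byte array of the charset; for strict encoding, use the CharsetEncoder class, which throws a CharacterCodingException if a character cannot be encoded.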

.length(): get the length in UTF-16 units.

As in Python compiled in narrow mode, non-BMP characters are stored as UTF-16 surrogate pairs and the length of a string is the number of UTF-16 units, not the number of Unicode characters.

Java, like the Tcl language, uses a variant of UTF-8 which encodes the nul character (U+0000) as the overlong byte sequence 0xC0 0x80, instead of 0x00. So it is possible to use C functions like strlen() on byte strings with embedded nul characters.
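The nul character trick can be sketched in Python 3. The encode_nul_overlong() helper is hypothetical and covers only the U+0000 case described above; it is not a complete implementation of the encoding used by Java or Tcl.

```python
def encode_nul_overlong(text: str) -> bytes:
    """Encode to UTF-8, but write U+0000 as the overlong pair 0xC0 0x80."""
    # In UTF-8, the byte 0x00 only ever encodes U+0000, so a plain
    # byte replacement is enough.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


encoded = encode_nul_overlong("a\x00b")
has_nul_byte = b"\x00" in encoded  # no nul byte left for strlen() to hit

# A strict UTF-8 decoder, like the one in Python 3, rejects the sequence.
try:
    encoded.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

print(encoded, has_nul_byte, strict_accepts)
```

The result is therefore not valid UTF-8, which is why it has to be treated as a separate variant.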

11.7. Go and D

The Go and D languages use UTF-8 as internal encoding to store Unicode strings.