UTF-8 Encoding
Summary
UTF-8 is a compromise character encoding: it is as compact as ASCII when the file is plain English text, but it can also represent any Unicode character, at the cost of some increase in file size.
UTF stands for Unicode Transformation Format. The '8' means it uses 8-bit blocks to represent a character. The number of blocks needed to represent a character varies from 1 to 4.
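As a quick illustration of the variable length, the sketch below uses Python's built-in encoder on four sample characters. The samples are chosen here for illustration and are not from the original text.

```python
# One sample character from each length class (illustrative choices):
# U+0041 'A', U+00E9 e-acute, U+20AC euro sign, U+1F600 an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Map each character to the number of bytes in its UTF-8 form.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s)")
```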
One of the really nice features of UTF-8 is that it is compatible with nul-terminated strings. No character other than NUL itself (U+0000) produces a nul (0) byte when encoded. This means that C code that deals with char[] will "just work".
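The nul-byte property can be checked directly. This is a minimal sketch with an arbitrary sample string; the UTF-16 comparison is added here only for contrast.

```python
# An arbitrary sample mixing 1-, 2-, 3- and 4-byte characters.
text = "na\u00efve caf\u00e9 \u20ac5 \U0001F600"

# No byte of the UTF-8 form is zero, so a C-style strlen() would not stop early.
has_nul = 0 in text.encode("utf-8")

# The same text in UTF-16 contains zero bytes for every ASCII character.
utf16_has_nul = 0 in text.encode("utf-16-le")

# The only source of a zero byte in UTF-8 is the NUL character itself.
nul_encoded = "\x00".encode("utf-8")

print(has_nul, utf16_has_nul, nul_encoded)
```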
You can try the UTF-8 Test Page to see how well your browser (and default font) support UTF-8.
If you are an application developer, this Joel On Software article on Unicode is a pretty good summary of all you need to know.
More links:
- If you are into the gory details, the official spec is RFC 3629
- Markus Kuhn's FAQ
- Rob Pike's story about the invention of it
Detail
For any character equal to or below 127 (hex 0x7F), the UTF-8 representation is one byte. It is just the lowest 7 bits of the full Unicode value. This is also the same as the ASCII value.
For characters from 128 up to and including 2047 (hex 0x07FF), the UTF-8 representation is spread across two bytes. The first byte will have the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The second byte will have the top bit set and the second bit clear (i.e. 0x80 to 0xBF).
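The two-byte case can be worked through by hand. The sketch below splits U+00E9 (an example value chosen here) into its top 5 and low 6 bits and compares the result with Python's built-in encoder.

```python
cp = 0xE9  # U+00E9; as 11 bits: 00011 101001

# First byte: marker 110 followed by the top 5 bits of the value.
first = 0b11000000 | (cp >> 6)

# Second byte: marker 10 followed by the low 6 bits of the value.
second = 0b10000000 | (cp & 0b00111111)

two_bytes = bytes([first, second])
print(two_bytes.hex())  # c3a9
```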
For all characters from 2048 up to and including 65535 (0xFFFF), the UTF-8 representation is spread across three bytes. Characters from 65536 (0x10000) up to the Unicode maximum of 1,114,111 (0x10FFFF) take four bytes.
The following table shows the format of such UTF-8 byte sequences (where the "free bits" shown by x's in the table are combined in the order shown, and interpreted from most significant to least significant).
Binary format of bytes in sequence
1st Byte | 2nd Byte | 3rd Byte | 4th Byte | Number of Free Bits | Maximum Expressible Unicode Value |
---|---|---|---|---|---|
0xxxxxxx | | | | 7 | 007F hex (127) |
110xxxxx | 10xxxxxx | | | (5+6)=11 | 07FF hex (2047) |
1110xxxx | 10xxxxxx | 10xxxxxx | | (4+6+6)=16 | FFFF hex (65535) |
11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | (3+6+6+6)=21 | 10FFFF hex (1,114,111) |
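The table translates fairly directly into code. The following is a minimal sketch of an encoder for a single code point; `encode_code_point` is a hypothetical helper name. Rejecting the surrogate range D800 to DFFF follows RFC 3629 rather than anything stated in the table.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative helper)."""
    # Assumption from RFC 3629: surrogates and values above 10FFFF are not encodable.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:  # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:  # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:  # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(encode_code_point(0x20AC).hex())  # e282ac
```

In practice `chr(cp).encode("utf-8")` does the same job; the hand-written version is only meant to make the bit layout visible.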
The value of each individual byte indicates its UTF-8 function, as follows:
- 00 to 7F hex (0 to 127): first and only byte of a sequence.
- 80 to BF hex (128 to 191): continuing byte in a multi-byte sequence.
- C2 to DF hex (194 to 223): first byte of a two-byte sequence.
- E0 to EF hex (224 to 239): first byte of a three-byte sequence.
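- F0 to F4 hex (240 to 244): first byte of a four-byte sequence.
- C0, C1 and F5 to FF hex never appear in valid UTF-8: C0 and C1 could only start an overlong two-byte form, and F5 to FF would encode values beyond 10FFFF.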
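Because each byte announces its own role, a byte can be classified without looking at its neighbours. The sketch below uses a hypothetical helper, `byte_role`, and follows RFC 3629 in treating C0, C1 and F5 to FF as never valid (C0 and C1 could only start overlong forms, and F5 to FF would encode values beyond 10FFFF).

```python
def byte_role(b: int) -> str:
    """Classify a single byte value (0..255) by its function in UTF-8."""
    if b <= 0x7F:
        return "single"
    if b <= 0xBF:
        return "continuation"
    if b <= 0xC1:
        return "invalid"  # C0, C1: would only start an overlong sequence
    if b <= 0xDF:
        return "lead-2"
    if b <= 0xEF:
        return "lead-3"
    if b <= 0xF4:
        return "lead-4"
    return "invalid"  # F5..FF: beyond U+10FFFF


def count_characters(data: bytes) -> int:
    """Count characters in well-formed UTF-8 by skipping continuation bytes."""
    return sum(1 for b in data if byte_role(b) != "continuation")


euro_roles = [byte_role(b) for b in "\u20ac".encode("utf-8")]
print(euro_roles)  # ['lead-3', 'continuation', 'continuation']
```

The same property lets a program that lands in the middle of a buffer find the next character boundary by skipping forward past continuation bytes.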
UTF-8 remains a simple, single-byte, ASCII-compatible encoding method, as long as no characters greater than 127 are directly present. This means that an HTML document technically declared to be encoded as UTF-8 can remain a normal single-byte ASCII file. The document can remain so even though it may contain Unicode characters above 127, as long as all characters above 127 are referred to indirectly by character references (ampersand entities such as &#233; or &eacute;).
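One way to produce such an ASCII-only document is Python's `xmlcharrefreplace` error handler, which swaps each non-ASCII character for a decimal numeric reference. This is a sketch with a made-up HTML fragment, not a recommendation for any particular workflow.

```python
from html import unescape

# A made-up fragment containing U+00E9 and U+20AC.
page = "<p>caf\u00e9 costs \u20ac5</p>"

# Characters above 127 become numeric character references; the rest stay ASCII.
ascii_bytes = page.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)  # b'<p>caf&#233; costs &#8364;5</p>'

# The result is valid ASCII and valid UTF-8 at the same time.
same_in_both = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# Resolving the references recovers the original text.
round_trip = unescape(ascii_bytes.decode("utf-8"))
```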
Examples of encoded Unicode characters (in hexadecimal notation)
Unicode Code Point | UTF-8 Sequence |
---|---|
0001 | 01 |
007F | 7F |
0080 | C2 80 |
07FF | DF BF |
0800 | E0 A0 80 |
FFFF | EF BF BF |
010000 | F0 90 80 80 |
10FFFF | F4 8F BF BF |
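The example rows can be checked by decoding each sequence back into its code point. The sketch below uses a hypothetical helper, `decode_sequence`, that handles exactly one character. It is not a full validator: it does not reject overlong three- and four-byte forms or surrogates.

```python
def decode_sequence(data: bytes) -> int:
    """Decode the bytes of exactly one UTF-8 character into its code point."""
    first = data[0]
    if first <= 0x7F:
        length, cp = 1, first
    elif 0xC2 <= first <= 0xDF:
        length, cp = 2, first & 0x1F  # keep the 5 free bits
    elif 0xE0 <= first <= 0xEF:
        length, cp = 3, first & 0x0F  # keep the 4 free bits
    elif 0xF0 <= first <= 0xF4:
        length, cp = 4, first & 0x07  # keep the 3 free bits
    else:
        raise ValueError(f"invalid first byte: {first:#04x}")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this first byte")
    for b in data[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError(f"expected a continuation byte, got {b:#04x}")
        cp = (cp << 6) | (b & 0x3F)  # append 6 free bits
    return cp


# Rows of the examples table: UTF-8 sequence -> expected code point.
table = {
    "01": 0x0001, "7F": 0x007F, "C2 80": 0x0080, "DF BF": 0x07FF,
    "E0 A0 80": 0x0800, "EF BF BF": 0xFFFF,
    "F0 90 80 80": 0x010000, "F4 8F BF BF": 0x10FFFF,
}

decoded = {seq: decode_sequence(bytes.fromhex(seq)) for seq in table}
for seq, cp in decoded.items():
    print(f"{seq:<11} -> U+{cp:04X}")
```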