5 Lexical conventions [lex]
5.3 Characters [lex.char]
5.3.1 Character sets [lex.charset]
The translation character set consists of the following elements:
- each abstract character assigned a code point in the Unicode codespace as specified in the Unicode Standard, and
- a distinct character for each Unicode scalar value not assigned to an abstract character.
[Note 1:
Unicode code points are integers in the range [0, 10FFFF] (hexadecimal).
A surrogate code point is a value in the range [D800, DFFF] (hexadecimal).
A Unicode scalar value is any code point that is not a surrogate code point.
— _end note_]
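The ranges in Note 1 can be written as simple predicates. The following is a minimal sketch, not part of the standard text; the function names `is_code_point`, `is_surrogate`, and `is_scalar_value` are hypothetical and chosen only for illustration. It assumes C++17 or later.

```cpp
// Hypothetical helpers illustrating Note 1; names are not from the standard.

// Unicode code points are integers in the range [0, 10FFFF] (hexadecimal).
constexpr bool is_code_point(char32_t c) { return c <= 0x10FFFF; }

// A surrogate code point is a value in the range [D800, DFFF] (hexadecimal).
constexpr bool is_surrogate(char32_t c) { return c >= 0xD800 && c <= 0xDFFF; }

// A Unicode scalar value is any code point that is not a surrogate code point.
constexpr bool is_scalar_value(char32_t c) {
    return is_code_point(c) && !is_surrogate(c);
}

static_assert(is_scalar_value(U'A'));
static_assert(!is_scalar_value(0xD800));   // first surrogate
static_assert(is_scalar_value(0x10FFFF));  // last code point
static_assert(!is_code_point(0x110000));   // outside the codespace
```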
The basic character set is a subset of the translation character set, consisting of 99 characters as specified in Table 1.
[Note 2:
Unicode short names are given only as a means of identifying the character; the numerical value has no other meaning in this context.
— _end note_]
Table 1 — Basic character set [tab:lex.charset.basic]
| code point | short name | glyph |
|---|---|---|
| U+0009 | character tabulation | |
| U+000b | line tabulation | |
| U+000c | form feed | |
| U+0020 | space | |
| U+000a | line feed | new-line |
| U+0021 | exclamation mark | ! |
| U+0022 | quotation mark | " |
| U+0023 | number sign | # |
| U+0024 | dollar sign | $ |
| U+0025 | percent sign | % |
| U+0026 | ampersand | & |
| U+0027 | apostrophe | ' |
| U+0028 | left parenthesis | ( |
| U+0029 | right parenthesis | ) |
| U+002a | asterisk | * |
| U+002b | plus sign | + |
| U+002c | comma | , |
| U+002d | hyphen-minus | - |
| U+002e | full stop | . |
| U+002f | solidus | / |
| U+0030 .. U+0039 | digit zero .. nine | 0 1 2 3 4 5 6 7 8 9 |
| U+003a | colon | : |
| U+003b | semicolon | ; |
| U+003c | less-than sign | < |
| U+003d | equals sign | = |
| U+003e | greater-than sign | > |
| U+003f | question mark | ? |
| U+0040 | commercial at | @ |
| U+0041 .. U+005a | latin capital letter a .. z | A B C D E F G H I J K L M N O P Q R S T U V W X Y Z |
| U+005b | left square bracket | [ |
| U+005c | reverse solidus | \ |
| U+005d | right square bracket | ] |
| U+005e | circumflex accent | ^ |
| U+005f | low line | _ |
| U+0060 | grave accent | ` |
| U+0061 .. U+007a | latin small letter a .. z | a b c d e f g h i j k l m n o p q r s t u v w x y z |
| U+007b | left curly bracket | { |
| U+007c | vertical line | \| |
| U+007d | right curly bracket | } |
| U+007e | tilde | ~ |
The basic literal character set consists of all characters of the basic character set, plus the control characters specified in Table 2.
Table 2 — Additional control characters in the basic literal character set [tab:lex.charset.literal]
| code point | short name |
|---|---|
| U+0000 | null |
| U+0007 | alert |
| U+0008 | backspace |
| U+000d | carriage return |
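Membership in the two sets can be checked by code point. The sketch below is illustrative only: the function names are hypothetical, and the contiguous ranges are a reading of Tables 1 and 2 (U+0009 through U+000c, plus U+0020 through U+007e, with the four control characters added for the literal set). The final assertions confirm that the count matches the 99 characters stated above. It assumes C++17 or later.

```cpp
// Hypothetical helpers; names and range grouping are illustrative assumptions.

// Table 1: U+0009..U+000c (tabulations, line feed, form feed) and U+0020..U+007e.
constexpr bool in_basic_character_set(char32_t c) {
    return (c >= 0x09 && c <= 0x0C) || (c >= 0x20 && c <= 0x7E);
}

// Table 2 adds null, alert, backspace, and carriage return.
constexpr bool in_basic_literal_character_set(char32_t c) {
    return in_basic_character_set(c) ||
           c == 0x00 || c == 0x07 || c == 0x08 || c == 0x0D;
}

// Counts the code points in [0, 7F] accepted by a predicate.
template <class Pred>
constexpr int count_members(Pred pred) {
    int n = 0;
    for (char32_t c = 0; c <= 0x7F; ++c) {
        if (pred(c)) ++n;
    }
    return n;
}

static_assert(count_members(in_basic_character_set) == 99);
static_assert(count_members(in_basic_literal_character_set) == 103);
```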
The ordinary literal encoding is the encoding applied to an ordinary character or string literal.
The wide literal encoding is the encoding applied to a wide character or string literal.
A literal encoding or a locale-specific encoding of one of the execution character sets ([character.seq]) encodes each element of the basic literal character set as a single code unit with non-negative value, distinct from the code unit for any other such element.
[Note 3:
A character not in the basic literal character set can be encoded with more than one code unit; the value of such a code unit can be the same as that of a code unit for an element of the basic literal character set.
— _end note_]
The U+0000 null character is encoded as the value 0.
No other element of the translation character set is encoded with a code unit of value 0.
The code unit value of each decimal digit character after the digit 0 (U+0030) shall be one greater than the value of the previous.
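These guarantees are what make common digit arithmetic portable across ordinary and wide literal encodings. The following sketch shows them as compile-time checks; the helper names `is_decimal_digit`, `digit_value`, and `digit_char` are hypothetical. It assumes C++17 or later.

```cpp
// The U+0000 null character is encoded as the value 0.
static_assert('\0' == 0);
static_assert(L'\0' == 0);

// Elements of the basic literal character set are single code units with
// non-negative values; only null has the value 0, and values are distinct.
static_assert('a' > 0 && '~' > 0 && '\a' > 0);
static_assert('a' != 'A');

// Each decimal digit after '0' is one greater than the previous one.
static_assert('9' - '0' == 9);
static_assert(L'9' - L'0' == 9);

// Hypothetical helpers that rely on the digit guarantee.
constexpr bool is_decimal_digit(char c) { return c >= '0' && c <= '9'; }

// Precondition: is_decimal_digit(c).
constexpr int digit_value(char c) { return c - '0'; }

// Precondition: 0 <= d && d <= 9.
constexpr char digit_char(int d) { return static_cast<char>('0' + d); }

static_assert(digit_value('7') == 7);
static_assert(digit_char(4) == '4');
```

No comparable guarantee is stated for letters, so an expression such as `c - 'a'` is not covered by the rules above.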
The ordinary and wide literal encodings are otherwise implementation-defined.
For a UTF-8, UTF-16, or UTF-32 literal, the implementation shall encode the Unicode scalar value corresponding to each character of the translation character set as specified in the Unicode Standard for the respective Unicode encoding form.
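Unlike the ordinary and wide encodings, the code units of UTF-8, UTF-16, and UTF-32 literals are fixed by the Unicode Standard, so they can be checked at compile time. The sketch below uses universal-character-names so that it does not depend on the source file encoding; the helper `code_units` is hypothetical. It assumes C++17 or later.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <class T, std::size_t N>
constexpr std::size_t code_units(const T (&)[N]) { return N - 1; }

// U+00E9 (latin small letter e with acute): two UTF-8 code units, C3 A9.
static_assert(code_units(u8"\u00E9") == 2);
static_assert(static_cast<unsigned char>(u8"\u00E9"[0]) == 0xC3);
static_assert(static_cast<unsigned char>(u8"\u00E9"[1]) == 0xA9);

// U+1F600 lies outside the Basic Multilingual Plane.
static_assert(code_units(u8"\U0001F600") == 4);   // UTF-8: F0 9F 98 80
static_assert(code_units(u"\U0001F600") == 2);    // UTF-16: surrogate pair
static_assert(u"\U0001F600"[0] == 0xD83D);
static_assert(u"\U0001F600"[1] == 0xDE00);
static_assert(code_units(U"\U0001F600") == 1);    // UTF-32: one code unit
static_assert(U"\U0001F600"[0] == 0x1F600);
```

The surrogate values D83D and DE00 appear here only as UTF-16 code units; the character being encoded is still the single scalar value U+1F600.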