
5 Lexical conventions [lex]

5.3 Characters [lex.char]

5.3.1 Character sets [lex.charset]

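The translation character set consists of the following elements:

- each abstract character assigned a code point in the Unicode codespace, and
- a distinct character for each Unicode scalar value not assigned to an abstract character.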

[Note 1:

Unicode code points are integers in the range [0, 10FFFF] (hexadecimal).

A surrogate code point is a value in the range [D800, DFFF] (hexadecimal).

A Unicode scalar value is any code point that is not a surrogate code point.

— _end note_]
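The three definitions in Note 1 can be sketched as predicates. This is only an illustration: the helper names `is_code_point`, `is_surrogate`, and `is_scalar_value` are hypothetical and do not appear in the standard; only the numeric ranges are taken from the note.

```cpp
#include <cstdint>

// Hypothetical helpers; the ranges are those given in Note 1.

// A Unicode code point is an integer in [0, 10FFFF] (hexadecimal).
constexpr bool is_code_point(std::uint32_t v) { return v <= 0x10FFFF; }

// A surrogate code point lies in [D800, DFFF] (hexadecimal).
constexpr bool is_surrogate(std::uint32_t v) { return v >= 0xD800 && v <= 0xDFFF; }

// A Unicode scalar value is any code point that is not a surrogate.
constexpr bool is_scalar_value(std::uint32_t v) {
    return is_code_point(v) && !is_surrogate(v);
}

static_assert(is_scalar_value(0x41), "U+0041 is a scalar value");
static_assert(!is_scalar_value(0xD800), "surrogates are not scalar values");
```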

The basic character set is a subset of the translation character set, consisting of 99 characters as specified in Table 1.

[Note 2:

Unicode short names are given only as a means of identifying the character; the numerical value has no other meaning in this context.

— _end note_]

Table 1 — Basic character set [tab:lex.charset.basic]

character                                         glyph
U+0009            character tabulation
U+000b            line tabulation
U+000c            form feed
U+0020            space
U+000a            line feed                       new-line
U+0021            exclamation mark                !
U+0022            quotation mark                  "
U+0023            number sign                     #
U+0024            dollar sign                     $
U+0025            percent sign                    %
U+0026            ampersand                       &
U+0027            apostrophe                      '
U+0028            left parenthesis                (
U+0029            right parenthesis               )
U+002a            asterisk                        *
U+002b            plus sign                       +
U+002c            comma                           ,
U+002d            hyphen-minus                    -
U+002e            full stop                       .
U+002f            solidus                         /
U+0030 .. U+0039  digit zero .. nine              0 1 2 3 4 5 6 7 8 9
U+003a            colon                           :
U+003b            semicolon                       ;
U+003c            less-than sign                  <
U+003d            equals sign                     =
U+003e            greater-than sign               >
U+003f            question mark                   ?
U+0040            commercial at                   @
U+0041 .. U+005a  latin capital letter a .. z     A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
U+005b            left square bracket             [
U+005c            reverse solidus                 \
U+005d            right square bracket            ]
U+005e            circumflex accent               ^
U+005f            low line                        _
U+0060            grave accent                    `
U+0061 .. U+007a  latin small letter a .. z       a b c d e f g h i j k l m n o p q r s t u v w x y z
U+007b            left curly bracket              {
U+007c            vertical line                   |
U+007d            right curly bracket             }
U+007e            tilde                           ~
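As a cross-check on the count of 99, Table 1 amounts to four control characters plus the contiguous range U+0020 through U+007e. A minimal sketch of a membership test follows; the function names are hypothetical, and the loop in a `constexpr` function assumes C++14 or later.

```cpp
// Hypothetical helper: true if the code point is listed in Table 1.
constexpr bool in_basic_character_set(char32_t c) {
    return c == 0x09 || c == 0x0A || c == 0x0B || c == 0x0C  // tab, LF, VT, FF
        || (c >= 0x20 && c <= 0x7E);                          // space through tilde
}

// Count the members by scanning the 7-bit range.
constexpr int basic_character_set_size() {
    int n = 0;
    for (char32_t c = 0; c < 0x80; ++c)
        if (in_basic_character_set(c)) ++n;
    return n;
}

static_assert(basic_character_set_size() == 99, "Table 1 lists 99 characters");
```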

The basic literal character set consists of all characters of the basic character set, plus the control characters specified in Table 2.

Table 2 — Additional control characters in the basic literal character set [tab:lex.charset.literal]

U+0000 null
U+0007 alert
U+0008 backspace
U+000d carriage return

The ordinary literal encoding is the encoding applied to an ordinary character or string literal.

The wide literal encoding is the encoding applied to a wide character or string literal.

A literal encoding or a locale-specific encoding of one of the execution character sets ([character.seq]) encodes each element of the basic literal character set as a single code unit with non-negative value, distinct from the code unit for any other such element.

[Note 3:

A character not in the basic literal character set can be encoded with more than one code unit; the value of such a code unit can be the same as that of a code unit for an element of the basic literal character set.

— _end note_]
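The single-code-unit requirement can be probed at compile time for the ordinary literal encoding of whichever compiler is in use. The sketch below is an assumption-laden illustration, not a conformance test: the array name `basic_literal` and both checking functions are hypothetical, it says nothing about the wide or locale-specific encodings, and it assumes a compiler that accepts `$`, `@`, and `` ` `` in string literals (C++14 or later for the loops).

```cpp
#include <cstddef>

// Every element of the basic literal character set, one char each.
// The terminating null of the string literal supplies U+0000.
constexpr char basic_literal[] =
    "\a\b\t\n\v\f\r"
    " !\"#$%&'()*+,-./0123456789:;<=>?@"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`"
    "abcdefghijklmnopqrstuvwxyz{|}~";

// 99 characters from Table 1 plus 4 from Table 2.
static_assert(sizeof(basic_literal) == 99 + 4, "one code unit per element");

// Each code unit must have a non-negative value.
constexpr bool all_non_negative() {
    for (char c : basic_literal)
        if (c < 0) return false;
    return true;
}

// Each code unit must differ from the code unit of every other element.
constexpr bool all_distinct() {
    for (std::size_t i = 0; i < sizeof(basic_literal); ++i)
        for (std::size_t j = i + 1; j < sizeof(basic_literal); ++j)
            if (basic_literal[i] == basic_literal[j]) return false;
    return true;
}

static_assert(all_non_negative(), "code units are non-negative");
static_assert(all_distinct(), "code units are pairwise distinct");
```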

The U+0000 null character is encoded as the value 0.

No other element of the translation character set is encoded with a code unit of value 0.

The code unit value of each decimal digit character after the digit 0 (U+0030) shall be one greater than the value of the previous.
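These guarantees are what make the familiar digit arithmetic portable across literal encodings. A short sketch, with hypothetical helper names:

```cpp
// The null character is encoded as 0 in both ordinary and wide literals.
static_assert('\0' == 0, "ordinary null is 0");
static_assert(L'\0' == 0, "wide null is 0");

// Because '0'..'9' have consecutive code unit values, a range check
// identifies exactly the decimal digits.
constexpr bool is_decimal_digit(char c) { return c >= '0' && c <= '9'; }

// Converting between a digit character and its numeric value is a
// subtraction or addition relative to '0'.
constexpr int digit_value(char c) { return c - '0'; }
constexpr char digit_char(int d) { return static_cast<char>('0' + d); }

static_assert(digit_value('7') == 7, "digit to value");
static_assert(digit_char(3) == '3', "value to digit");
```

No comparable guarantee is stated in this subclause for letters, so `c - 'a'` is not portable in the same way.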

The ordinary and wide literal encodings are otherwise implementation-defined.

For a UTF-8, UTF-16, or UTF-32 literal, the implementation shall encode the Unicode scalar value corresponding to each character of the translation character set as specified in the Unicode Standard for the respective Unicode encoding form.
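Unlike the ordinary and wide encodings, the code units of UTF-8, UTF-16, and UTF-32 literals are fully determined, so they can be checked directly. The following sketch uses two arbitrarily chosen characters, U+00E9 and U+1F600; the casts to `unsigned char` are there so the checks read the same whether `u8` literals have element type `char` or `char8_t`.

```cpp
// U+00E9 in UTF-8 is the two code units C3 A9 (plus the terminating null).
static_assert(sizeof(u8"\u00E9") == 3, "two UTF-8 code units and a null");
static_assert(static_cast<unsigned char>(u8"\u00E9"[0]) == 0xC3, "first UTF-8 code unit");
static_assert(static_cast<unsigned char>(u8"\u00E9"[1]) == 0xA9, "second UTF-8 code unit");

// U+1F600 in UTF-16 is the surrogate pair D83D DE00.
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) == 3, "two UTF-16 code units and a null");
static_assert(u"\U0001F600"[0] == 0xD83D, "high surrogate");
static_assert(u"\U0001F600"[1] == 0xDE00, "low surrogate");

// In UTF-32 the single code unit is the scalar value itself.
static_assert(sizeof(U"\U0001F600") / sizeof(char32_t) == 2, "one UTF-32 code unit and a null");
static_assert(U"\U0001F600"[0] == 0x1F600, "scalar value");
```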