5 Lexical conventions [lex]

5.3 Characters [lex.char]

5.3.1 Character sets [lex.charset]

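The translation character set consists of the following elements: each abstract character assigned a code point in the Unicode codespace, and a distinct character for each Unicode scalar value not assigned to an abstract character.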

[Note 1:

Unicode code points are integers in the range [0, 10FFFF] (hexadecimal).

A surrogate code point is a value in the range [D800, DFFF] (hexadecimal).

A Unicode scalar value is any code point that is not a surrogate code point.

— _end note_]
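The three ranges in Note 1 can be written as predicates. The following is a minimal sketch and not text from the standard; the names `is_code_point`, `is_surrogate`, and `is_scalar_value` are hypothetical and chosen only for illustration.

```cpp
// Hypothetical helpers mirroring the ranges given in Note 1.

// Code points are the integers in [0, 10FFFF] (hexadecimal).
constexpr bool is_code_point(char32_t v) { return v <= 0x10FFFF; }

// Surrogate code points are the values in [D800, DFFF] (hexadecimal).
constexpr bool is_surrogate(char32_t v) { return v >= 0xD800 && v <= 0xDFFF; }

// A scalar value is any code point that is not a surrogate code point.
constexpr bool is_scalar_value(char32_t v) {
    return is_code_point(v) && !is_surrogate(v);
}

static_assert(is_scalar_value(U'A'), "U+0041 is a scalar value");
static_assert(is_scalar_value(0x10FFFF), "the last code point is a scalar value");
static_assert(!is_scalar_value(0xD800), "surrogates are not scalar values");
static_assert(!is_code_point(0x110000), "outside the Unicode codespace");
```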

The basic character set is a subset of the translation character set, consisting of 99 characters as specified in Table 1.

[Note 2:

Unicode short names are given only as a means of identifying the character; the numerical value has no other meaning in this context.

— _end note_]

Table 1 — Basic character set [tab:lex.charset.basic]

character                                        glyph
U+0009            character tabulation
U+000b            line tabulation
U+000c            form feed
U+0020            space
U+000a            line feed                      new-line
U+0021            exclamation mark               !
U+0022            quotation mark                 "
U+0023            number sign                    #
U+0024            dollar sign                    $
U+0025            percent sign                   %
U+0026            ampersand                      &
U+0027            apostrophe                     '
U+0028            left parenthesis               (
U+0029            right parenthesis              )
U+002a            asterisk                       *
U+002b            plus sign                      +
U+002c            comma                          ,
U+002d            hyphen-minus                   -
U+002e            full stop                      .
U+002f            solidus                        /
U+0030 .. U+0039  digit zero .. nine             0 1 2 3 4 5 6 7 8 9
U+003a            colon                          :
U+003b            semicolon                      ;
U+003c            less-than sign                 <
U+003d            equals sign                    =
U+003e            greater-than sign              >
U+003f            question mark                  ?
U+0040            commercial at                  @
U+0041 .. U+005a  latin capital letter a .. z    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
U+005b            left square bracket            [
U+005c            reverse solidus                \
U+005d            right square bracket           ]
U+005e            circumflex accent              ^
U+005f            low line                       _
U+0060            grave accent                   `
U+0061 .. U+007a  latin small letter a .. z      a b c d e f g h i j k l m n o p q r s t u v w x y z
U+007b            left curly bracket             {
U+007c            vertical line                  |
U+007d            right curly bracket            }
U+007e            tilde                          ~

The basic literal character set consists of all characters of the basic character set, plus the control characters specified in Table 2.

Table 2 — Additional control characters in the basic literal character set [tab:lex.charset.literal]

U+0000 null
U+0007 alert
U+0008 backspace
U+000d carriage return

The ordinary literal encoding is the encoding applied to an ordinary character or string literal.

The wide literal encoding is the encoding applied to a wide character or string literal.

A literal encoding or a locale-specific encoding of one of the execution character sets ([character.seq]) encodes each element of the basic literal character set as a single code unit with non-negative value, distinct from the code unit for any other such element.

[Note 3:

A character not in the basic literal character set can be encoded with more than one code unit; the value of such a code unit can be the same as that of a code unit for an element of the basic literal character set.

— _end note_]
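The single-code-unit, non-negative, and distinctness requirements can be checked at compile time for the ordinary literal encoding. The following is a sketch under stated assumptions: the array `basic_literal_chars` and the function `nonnegative_and_distinct` are hypothetical names, and the loops assume a compiler that supports C++14 `constexpr`. The array spells out the 99 basic characters, then alert, backspace, and carriage return; the terminating null of the string literal supplies U+0000.

```cpp
#include <cstddef>

// The 99 characters of Table 1 followed by three of the four control
// characters of Table 2. The fourth, U+0000 null, is the string terminator.
constexpr char basic_literal_chars[] =
    "\t\v\f \n"
    "!\"#$%&'()*+,-./0123456789:;<=>?@"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`"
    "abcdefghijklmnopqrstuvwxyz{|}~"
    "\a\b\r";

// Hypothetical checker: true if no code unit is negative and none repeats.
constexpr bool nonnegative_and_distinct(const char* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (s[i] < 0) return false;
        for (std::size_t j = i + 1; j < n; ++j) {
            if (s[i] == s[j]) return false;
        }
    }
    return true;
}

// One code unit per element: 99 basic characters plus 4 control characters.
static_assert(sizeof(basic_literal_chars) == 99 + 4,
              "each element occupies exactly one code unit");
static_assert(nonnegative_and_distinct(basic_literal_chars,
                                       sizeof(basic_literal_chars)),
              "code units are non-negative and pairwise distinct");
```

The same check does not extend to characters outside the basic literal character set: as Note 3 says, those may take more than one code unit, and how many depends on the implementation-defined encoding.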

The U+0000 null character is encoded as the value 0.

No other element of the translation character set is encoded with a code unit of value 0.

The code unit value of each decimal digit character after the digit 0 (U+0030) shall be one greater than the value of the previous.
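The rules for the null character and the decimal digits are what make the common `c - '0'` conversion portable. The following is a minimal sketch; `is_decimal_digit`, `digit_value`, and `digit_char` are hypothetical helper names, not library functions.

```cpp
// Hypothetical helpers that rely only on the guarantees stated above.

// Digits are consecutive, so a range test identifies them.
constexpr bool is_decimal_digit(char c) { return c >= '0' && c <= '9'; }

// Precondition: is_decimal_digit(c). The distance from '0' is the value.
constexpr int digit_value(char c) { return c - '0'; }

// Precondition: 0 <= v && v <= 9.
constexpr char digit_char(int v) { return static_cast<char>('0' + v); }

static_assert('\0' == 0, "U+0000 null is encoded as the value 0");
static_assert('1' == '0' + 1, "each digit is one greater than the previous");
static_assert(digit_value('7') == 7, "");
static_assert(digit_char(9) == '9', "");
static_assert(L'9' - L'0' == 9, "the wide literal encoding obeys the same rule");
```

No corresponding guarantee is stated here for letters, so a similar `c - 'a'` computation is not covered by this rule.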

The ordinary and wide literal encodings are otherwise implementation-defined.

For a UTF-8, UTF-16, or UTF-32 literal, the implementation shall encode the Unicode scalar value corresponding to each character of the translation character set as specified in the Unicode Standard for the respective Unicode encoding form.
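Unlike the ordinary and wide encodings, the code units of UTF-8, UTF-16, and UTF-32 literals are fixed by the Unicode Standard, so they can be checked at compile time. The following sketch uses two sample characters, U+00E9 and U+1F600; the variable names are hypothetical.

```cpp
// References to string literals; `auto&` keeps the array type, so sizeof
// reports the number of code units including the terminating null.
constexpr auto& e_acute_utf8 = u8"\u00e9";      // U+00E9
constexpr auto& grin_utf16   = u"\U0001F600";   // U+1F600
constexpr auto& grin_utf32   = U"\U0001F600";   // U+1F600

// UTF-8: U+00E9 takes two code units, C3 A9.
static_assert(sizeof(e_acute_utf8) == 3, "two code units plus null");
static_assert(static_cast<unsigned char>(e_acute_utf8[0]) == 0xC3, "");
static_assert(static_cast<unsigned char>(e_acute_utf8[1]) == 0xA9, "");

// UTF-16: U+1F600 lies above U+FFFF and takes a surrogate pair, D83D DE00.
static_assert(sizeof(grin_utf16) == 3 * sizeof(char16_t), "pair plus null");
static_assert(grin_utf16[0] == 0xD83D && grin_utf16[1] == 0xDE00, "");

// UTF-32: every scalar value is a single code unit.
static_assert(sizeof(grin_utf32) == 2 * sizeof(char32_t), "one unit plus null");
static_assert(grin_utf32[0] == 0x1F600, "");
```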

5.3.2 Universal character names [lex.universal.char]

n-char:
any member of the translation character set except the U+007d right curly bracket or new-line character

The universal-character-name construct provides a way to name any element in the translation character set using just the basic character set.

A universal-character-name of the form \u hex-quad, \U hex-quad hex-quad, or \u{simple-hexadecimal-digit-sequence} designates the character in the translation character set whose Unicode scalar value is the hexadecimal number represented by the sequence of hexadecimal-digits in the universal-character-name.

The program is ill-formed if that number is not a Unicode scalar value.
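As an illustration, the \u and \U forms designate the same character whenever their hexadecimal digits represent the same number. The sketch below uses UTF-32 character literals so that each value can be compared directly with its scalar value; the variable names are hypothetical. The delimited \u{...} form is left out of the code only because it needs a compiler that already implements it.

```cpp
// Four hexadecimal digits after \u, eight after \U.
constexpr char32_t e_acute_short = U'\u00e9';
constexpr char32_t e_acute_long  = U'\U000000e9';
constexpr char32_t grin          = U'\U0001F600';

static_assert(e_acute_short == 0xE9, "the digits give the scalar value");
static_assert(e_acute_short == e_acute_long, "both forms name U+00E9");
static_assert(grin == 0x1F600, "values above FFFF need the \\U form here");

// A universal-character-name whose number falls in [D800, DFFF] or above
// 10FFFF is not a Unicode scalar value, so a program containing one is
// ill-formed and is deliberately not shown as code.
```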

A universal-character-name that is a named-universal-character designates the corresponding character in the Unicode Standard (chapter 4.8 Name) if the n-char-sequence is equal to its character name or to one of its character name aliases of type “control”, “correction”, or “alternate”; otherwise, the program is ill-formed.

[Note 2:

These aliases are listed in the Unicode Character Database's NameAliases.txt.

None of these names or aliases have leading or trailing spaces.

— _end note_]
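A named-universal-character can be compared against the numeric form of the same character. The sketch below makes two assumptions: the compiler reports support through the feature-test macro `__cpp_named_character_escapes`, and the variable names are hypothetical. Where the macro is absent, the numeric forms are used so that the example still compiles.

```cpp
#if defined(__cpp_named_character_escapes)
// Character name of U+00E9.
constexpr char32_t e_acute_by_name = U'\N{LATIN SMALL LETTER E WITH ACUTE}';
// U+000A is matched through its alias of type "control".
constexpr char32_t line_feed_by_alias = U'\N{LINE FEED}';
#else
// Fallback for compilers without named escapes: same characters, numeric form.
constexpr char32_t e_acute_by_name = U'\u00e9';
constexpr char32_t line_feed_by_alias = U'\n';
#endif

static_assert(e_acute_by_name == U'\u00e9', "the name designates U+00E9");
static_assert(line_feed_by_alias == U'\n', "the alias designates U+000A");
```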