5 Lexical conventions [lex]

5.3 Characters [lex.char]

5.3.1 Character sets [lex.charset]

The translation character set consists of the following elements:

each abstract character assigned a code point in the Unicode codespace as specified in the Unicode Standard, and
a distinct character for each Unicode scalar value not assigned to an abstract character.

[Note 1:

Unicode code points are integers in the range [0, 10FFFF] (hexadecimal).

A surrogate code point is a value in the range [D800, DFFF] (hexadecimal).

A Unicode scalar value is any code point that is not a surrogate code point.

— _end note_]
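
The three definitions in the note can be written as predicates. This is a minimal sketch: the helper names `is_code_point`, `is_surrogate`, and `is_scalar_value` are hypothetical and do not come from the standard library.

```cpp
#include <cstdint>

// Hypothetical helpers; the names are illustrative only.

// A code point is an integer in the range [0, 10FFFF] (hexadecimal).
constexpr bool is_code_point(std::uint32_t v) { return v <= 0x10FFFF; }

// A surrogate code point lies in the range [D800, DFFF] (hexadecimal).
constexpr bool is_surrogate(std::uint32_t v) { return v >= 0xD800 && v <= 0xDFFF; }

// A scalar value is any code point that is not a surrogate code point.
constexpr bool is_scalar_value(std::uint32_t v) {
    return is_code_point(v) && !is_surrogate(v);
}

static_assert(is_scalar_value(0x0041), "U+0041 is a scalar value");
static_assert(!is_scalar_value(0xD800), "U+D800 is a surrogate code point");
static_assert(!is_code_point(0x110000), "0x110000 lies outside the codespace");
```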

The basic character set is a subset of the translation character set, consisting of 99 characters as specified in Table 1.

[Note 2:

Unicode short names are given only as a means of identifying the character; the numerical value has no other meaning in this context.

— _end note_]

Table 1 — Basic character set [tab:lex.charset.basic]

code point	character name	glyph
U+0009	character tabulation
U+000b	line tabulation
U+000c	form feed
U+0020	space
U+000a	line feed	new-line
U+0021	exclamation mark	!
U+0022	quotation mark	"
U+0023	number sign	#
U+0024	dollar sign	$
U+0025	percent sign	%
U+0026	ampersand	&
U+0027	apostrophe	'
U+0028	left parenthesis	(
U+0029	right parenthesis	)
U+002a	asterisk	*
U+002b	plus sign	+
U+002c	comma	,
U+002d	hyphen-minus	-
U+002e	full stop	.
U+002f	solidus	/
U+0030 .. U+0039	digit zero .. nine	0 1 2 3 4 5 6 7 8 9
U+003a	colon	:
U+003b	semicolon	;
U+003c	less-than sign	<
U+003d	equals sign	=
U+003e	greater-than sign	>
U+003f	question mark	?
U+0040	commercial at	@
U+0041 .. U+005a	latin capital letter a .. z	A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
U+005b	left square bracket	[
U+005c	reverse solidus	\
U+005d	right square bracket	]
U+005e	circumflex accent	^
U+005f	low line	_
U+0060	grave accent	`
U+0061 .. U+007a	latin small letter a .. z	a b c d e f g h i j k l m n o p q r s t u v w x y z
U+007b	left curly bracket	{
U+007c	vertical line	|
U+007d	right curly bracket	}
U+007e	tilde	~

The basic literal character set consists of all characters of the basic character set, plus the control characters specified in Table 2.

Table 2 — Additional control characters in the basic literal character set [tab:lex.charset.literal]

U+0000	null
U+0007	alert
U+0008	backspace
U+000d	carriage return
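
Membership in the two sets can be expressed as predicates over code points. This is a sketch only: the function names are hypothetical, and the loop in `count_basic` assumes C++14 or later.

```cpp
// Hypothetical helpers; the names are illustrative only.

// Table 1: four control characters plus every code point from U+0020 to U+007E.
constexpr bool in_basic_character_set(char32_t c) {
    return c == 0x09 || c == 0x0A || c == 0x0B || c == 0x0C ||
           (c >= 0x20 && c <= 0x7E);
}

// Table 2 adds null, alert, backspace, and carriage return.
constexpr bool in_basic_literal_character_set(char32_t c) {
    return in_basic_character_set(c) ||
           c == 0x00 || c == 0x07 || c == 0x08 || c == 0x0D;
}

// Counts the members of the basic character set (requires C++14 constexpr loops).
constexpr int count_basic() {
    int n = 0;
    for (char32_t c = 0; c <= 0x7F; ++c)
        if (in_basic_character_set(c)) ++n;
    return n;
}

static_assert(count_basic() == 99, "Table 1 lists 99 characters");
static_assert(!in_basic_character_set(0x0D), "carriage return is not in Table 1");
static_assert(in_basic_literal_character_set(0x0D), "carriage return is in Table 2");
```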

The ordinary literal encoding is the encoding applied to an ordinary character or string literal.

The wide literal encoding is the encoding applied to a wide character or string literal.

A literal encoding or a locale-specific encoding of one of the execution character sets ([character.seq]) encodes each element of the basic literal character set as a single code unit with non-negative value, distinct from the code unit for any other such element.

[Note 3:

A character not in the basic literal character set can be encoded with more than one code unit; the value of such a code unit can be the same as that of a code unit for an element of the basic literal character set.

— _end note_]
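
The single-code-unit guarantee can be observed with compile-time checks. In this sketch the last assertion uses a UTF-8 literal rather than an ordinary literal, purely because its encoding is fixed and so the result does not depend on the implementation; that substitution is an assumption made for illustration.

```cpp
// Each element of the basic literal character set occupies one code unit,
// so "a" is one code unit plus the terminating null.
static_assert(sizeof("a") == 2, "one code unit plus the null terminator");
static_assert(sizeof(L"a") == 2 * sizeof(wchar_t), "same for the wide encoding");

// Code units for these elements are non-negative and pairwise distinct.
static_assert('a' >= 0 && '~' >= 0, "non-negative code unit values");
static_assert('{' != '}', "distinct elements have distinct code units");

// A character outside the basic literal character set may need several
// code units: U+00E9 takes two in UTF-8, plus the terminating null.
static_assert(sizeof(u8"\u00e9") == 3, "two code units plus the null terminator");
```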

The U+0000 null character is encoded as the value 0.

No other element of the translation character set is encoded with a code unit of value 0.
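
Because only the null character is encoded as a code unit of value 0, a zero code unit can serve as an unambiguous terminator. A minimal sketch follows, with a hypothetical helper named `code_unit_count`:

```cpp
#include <cstddef>

// Hypothetical helper: counts code units up to, but not including,
// the first code unit of value 0.
constexpr std::size_t code_unit_count(const char* s) {
    return *s == 0 ? 0 : 1 + code_unit_count(s + 1);
}

static_assert('\0' == 0, "the null character is encoded as the value 0");
static_assert(L'\0' == 0, "likewise in the wide literal encoding");
static_assert(code_unit_count("abc") == 3, "three code units before the terminator");
static_assert(sizeof("abc") == 4, "the array includes the terminating null");
```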

The code unit value of each decimal digit character after the digit 0 (U+0030) shall be one greater than the value of the previous.
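
The digit rule is what makes the common `c - '0'` idiom portable. A sketch with hypothetical helper names is shown here; note that the text above states this guarantee for decimal digits only, not for letters.

```cpp
// Hypothetical helpers; the names are illustrative only.
constexpr bool is_decimal_digit(char c) { return c >= '0' && c <= '9'; }

// Valid only when is_decimal_digit(c) holds.
constexpr int digit_value(char c) { return c - '0'; }

// Valid only for 0 <= n <= 9.
constexpr char digit_char(int n) { return static_cast<char>('0' + n); }

static_assert('9' - '0' == 9, "digits are consecutive in the ordinary encoding");
static_assert(L'9' - L'0' == 9, "and in the wide encoding");
static_assert(digit_value('7') == 7, "code unit to numeric value");
static_assert(digit_char(3) == '3', "numeric value to code unit");
```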

The ordinary and wide literal encodings are otherwise implementation-defined.

For a UTF-8, UTF-16, or UTF-32 literal, the implementation shall encode the Unicode scalar value corresponding to each character of the translation character set as specified in the Unicode Standard for the respective Unicode encoding form.
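
Since the Unicode encoding forms are fixed, the code units of UTF-8, UTF-16, and UTF-32 literals can be checked at compile time. In this sketch the characters U+00E9 and U+1F600 are arbitrary choices for illustration, and the casts to `unsigned char` are there only so the comparison works whether the UTF-8 code unit type is `char` or `char8_t`.

```cpp
// UTF-8: U+00E9 is encoded as the two code units C3 A9.
static_assert(sizeof(u8"\u00e9") == 3, "two code units plus the null terminator");
static_assert(static_cast<unsigned char>(u8"\u00e9"[0]) == 0xC3, "lead byte");
static_assert(static_cast<unsigned char>(u8"\u00e9"[1]) == 0xA9, "continuation byte");

// UTF-16: U+1F600 lies outside the BMP, so it is encoded as a surrogate pair.
static_assert(sizeof(u"\U0001F600") == 3 * sizeof(char16_t), "pair plus null");
static_assert(u"\U0001F600"[0] == 0xD83D, "high surrogate");
static_assert(u"\U0001F600"[1] == 0xDE00, "low surrogate");

// UTF-32: one code unit per scalar value.
static_assert(U'\U0001F600' == 0x1F600, "code unit equals the scalar value");
```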