5 Lexical conventions [lex]

5.13 Literals [lex.literal]

5.13.5 String literals [lex.string]

string-literal:
    encoding-prefix(opt) " s-char-sequence(opt) "
    encoding-prefix(opt) R raw-string

s-char-sequence:
    s-char
    s-char-sequence s-char

s-char:
    any member of the basic source character set except the double-quote ", backslash \, or new-line character
    escape-sequence
    universal-character-name

raw-string:
    " d-char-sequence(opt) ( r-char-sequence(opt) ) d-char-sequence(opt) "

r-char-sequence:
    r-char
    r-char-sequence r-char

r-char:
    any member of the source character set, except a right parenthesis ) followed by the initial d-char-sequence (which may be empty) followed by a double quote ".

d-char-sequence:
    d-char
    d-char-sequence d-char

d-char:
    any member of the basic source character set except: space, the left parenthesis (, the right parenthesis ), the backslash \, and the control characters representing horizontal tab, vertical tab, form feed, and newline.

A string-literal that has an R in the prefix is a raw string literal.

The d-char-sequence serves as a delimiter.

The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence.

A d-char-sequence shall consist of at most 16 characters.

[ Note

The characters '(' and ')' are permitted in a raw-string.

Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".

— end note

]
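
The delimiter rule can be checked at compile time. The following is a minimal sketch, assuming a C++17 or later compiler; the names raw, ordinary, raw_path, escaped_path, and check_raw_literals are illustrative and not part of the standard text.

```cpp
#include <cassert>
#include <string_view>

// Delimiter "delimiter": the inner parentheses need no escaping.
constexpr std::string_view raw = R"delimiter((a|b))delimiter";
constexpr std::string_view ordinary = "(a|b)";
static_assert(raw == ordinary);

// An empty d-char-sequence is allowed, and backslashes are taken literally.
constexpr std::string_view raw_path = R"(C:\temp\new)";
constexpr std::string_view escaped_path = "C:\\temp\\new";
static_assert(raw_path == escaped_path);

// Runtime version of the same checks (hypothetical helper).
inline void check_raw_literals() {
    assert(raw == ordinary);
    assert(raw_path.size() == 11);
}
```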

[ Note

A source-file new-line in a raw string literal results in a new-line in the resulting execution string literal.

Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:

const char* p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);

— end note

]

[ Example

The raw string

R"a(
)\
a"
)a"

is equivalent to "\n)\\\na\"\n".

The raw string

R"(x = "\"y\"")"

is equivalent to "x = \"\\\"y\\\"\"".

— end example

]

After translation phase 6, a string-literal that does not begin with an encoding-prefix is an ordinary string literal.

An ordinary string literal has type “array of n const char” where n is the size of the string as defined below, has static storage duration ([basic.stc]), and is initialized with the given characters.

A string-literal that begins with u8, such as u8"asdf", is a UTF-8 string literal.

A UTF-8 string literal has type “array of n const char8_t”, where n is the size of the string as defined below; each successive element of the object representation ([basic.types]) has the value of the corresponding code unit of the UTF-8 encoding of the string.

Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals.

A string-literal that begins with u, such as u"asdf", is a UTF-16 string literal.

A UTF-16 string literal has type “array of n const char16_t”, where n is the size of the string as defined below; each successive element of the array has the value of the corresponding code unit of the UTF-16 encoding of the string.

[ Note

A single c-char may produce more than one char16_t character in the form of surrogate pairs.

A surrogate pair is a representation for a single code point as a sequence of two 16-bit code units.

— end note

]
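
The surrogate-pair behavior can be observed through the array size of a UTF-16 string literal. This is a small sketch, assuming C++17 or later; element_count and check_surrogate_pairs are hypothetical helpers introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of elements in the array a literal denotes.
template <class T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) { return N; }

// U+00E9 fits in one UTF-16 code unit: 1 unit + terminating u'\0'.
static_assert(element_count(u"\u00E9") == 2);

// U+1F600 lies outside the Basic Multilingual Plane:
// surrogate pair (2 units) + terminating u'\0'.
static_assert(element_count(u"\U0001F600") == 3);

// The pair for U+1F600 is 0xD83D followed by 0xDE00.
static_assert(u"\U0001F600"[0] == 0xD83D);
static_assert(u"\U0001F600"[1] == 0xDE00);

inline void check_surrogate_pairs() {
    assert(element_count(u"\U0001F600") == 3);
    assert(u"\U0001F600"[2] == u'\0');
}
```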

A string-literal that begins with U, such as U"asdf", is a UTF-32 string literal.

A UTF-32 string literal has type “array of n const char32_t”, where n is the size of the string as defined below; each successive element of the array has the value of the corresponding code unit of the UTF-32 encoding of the string.

A string-literal that begins with L, such as L"asdf", is a wide string literal.

A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it is initialized with the given characters.
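
The array types described above can be confirmed with decltype. The sketch below assumes C++17 or later, guards the char8_t check with the __cpp_char8_t feature-test macro because that type requires C++20, and uses the hypothetical helpers element_count and check_element_counts.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// A string literal is an lvalue, so decltype yields a reference to the array type.
static_assert(std::is_same_v<decltype("asdf"), const char (&)[5]>);
static_assert(std::is_same_v<decltype(u"asdf"), const char16_t (&)[5]>);
static_assert(std::is_same_v<decltype(U"asdf"), const char32_t (&)[5]>);
static_assert(std::is_same_v<decltype(L"asdf"), const wchar_t (&)[5]>);
#ifdef __cpp_char8_t
// Only checked when the compiler provides char8_t (C++20).
static_assert(std::is_same_v<decltype(u8"asdf"), const char8_t (&)[5]>);
#endif

// Hypothetical helper: number of elements in the array a literal denotes.
template <class T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) { return N; }

inline void check_element_counts() {
    // Four characters plus the terminating null character in every case.
    assert(element_count("asdf") == 5);
    assert(element_count(L"asdf") == 5);
}
```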

In translation phase 6 ([lex.phases]), adjacent string-literals are concatenated.

If both string-literals have the same encoding-prefix, the resulting concatenated string-literal has that encoding-prefix.

If one string-literal has no encoding-prefix, it is treated as a string-literal of the same encoding-prefix as the other operand.

If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed.

Any other concatenations are conditionally-supported with implementation-defined behavior.

[ Note

This concatenation is an interpretation, not a conversion.

Because the interpretation happens in translation phase 6 (after each character from a string-literal has been translated into a value from the appropriate character set), a string-literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation.

— end note

]

Table 11 has some examples of valid concatenations.

Table 11: String literal concatenations [tab:lex.string.concat]

Source	Means	Source	Means	Source	Means
u"a" u"b"	u"ab"	U"a" U"b"	U"ab"	L"a" L"b"	L"ab"
u"a" "b"	u"ab"	U"a" "b"	U"ab"	L"a" "b"	L"ab"
"a" u"b"	u"ab"	"a" U"b"	U"ab"	"a" L"b"	L"ab"

Characters in concatenated strings are kept distinct.

[ Example

"\xA" "B"

contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB').

— end example

]
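
Some of the combinations in Table 11 and the "kept distinct" rule can be verified with static_assert. This is an illustrative sketch, assuming C++17 or later; check_concatenation is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstring>
#include <string_view>
#include <type_traits>

// Same prefix on both sides, or a prefix on only one side:
// the concatenated literal takes that prefix.
static_assert(std::is_same_v<decltype(u"a" u"b"), const char16_t (&)[3]>);
static_assert(std::is_same_v<decltype(u"a" "b"), const char16_t (&)[3]>);
static_assert(std::is_same_v<decltype("a" U"b"), const char32_t (&)[3]>);
static_assert(std::is_same_v<decltype("a" L"b"), const wchar_t (&)[3]>);

// Characters stay distinct: '\xA' and 'B' are not fused into '\xAB'.
static_assert(sizeof("\xA" "B") == 3);   // two characters + '\0'
static_assert(("\xA" "B")[0] == '\xA');
static_assert(("\xA" "B")[1] == 'B');
static_assert(sizeof("\xAB") == 2);      // one character + '\0'

inline void check_concatenation() {
    assert(std::u16string_view(u"a" "b") == u"ab");
    assert(std::wstring_view(L"a" "b") == L"ab");
    assert(std::strlen("\xA" "B") == 2);
}
```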

After any necessary concatenation, in translation phase 7 ([lex.phases]), '\0' is appended to every string-literal so that programs that scan a string can find its end.

Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character-literals ([lex.ccon]), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \, and except that a universal-character-name in a UTF-16 string literal may yield a surrogate pair.

In a narrow string literal, a universal-character-name may map to more than one char or char8_t element due to multibyte encoding.

The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'.

The size of a UTF-16 string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'.

[ Note

The size of a char16_t string literal is the number of code units, not the number of characters.

— end note

]

The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.
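
The size rules can be illustrated with a few compile-time checks. This is a sketch under the assumption of a C++17 or later compiler; element_count and length_without_terminator are hypothetical helpers, and UTF-8 literals are used for the multibyte cases because the encoding of ordinary string literals is implementation-defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: number of elements in the array a literal denotes.
template <class T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) { return N; }

static_assert(element_count("abc") == 4);          // 3 characters + '\0'
static_assert(element_count("a\tb") == 4);         // an escape sequence counts once
static_assert(element_count(U"\U0001F600") == 2);  // one code point + U'\0'
static_assert(element_count(u"\U0001F600") == 3);  // surrogate pair + u'\0'
static_assert(element_count(u8"\u00E9") == 3);     // two UTF-8 code units + '\0'
static_assert(element_count(u8"\u20AC") == 4);     // three UTF-8 code units + '\0'

// strlen stops at the appended terminator, so it reports one less than the size.
inline std::size_t length_without_terminator() {
    assert(sizeof("abc") == 4);
    return std::strlen("abc");
}
```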

Evaluating a string-literal results in a string literal object with static storage duration, initialized from the given characters as specified above.

Whether all string-literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.

[ Note

The effect of attempting to modify a string-literal is undefined.

— end note

]
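
In practice, these two points suggest copying a literal into an array before modifying it and comparing literals by content, not by address. The sketch below shows one way to do both; modify_copy and same_contents are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstring>

// Initializing an array from a literal creates a separate, modifiable object.
inline bool modify_copy() {
    char buffer[] = "hello";  // array of 6 char, copied from the literal
    buffer[0] = 'j';          // fine: changes the copy, not the string literal object
    return std::strcmp(buffer, "jello") == 0;
}

// Whether two identical literals share storage is unspecified,
// so compare the characters and do not rely on pointer equality.
inline bool same_contents(const char* a, const char* b) {
    assert(a != nullptr && b != nullptr);
    return std::strcmp(a, b) == 0;
}
```

A declaration such as const char* p = "hello"; followed by a write through a cast of p would have undefined behavior, which is why the sketch writes only to buffer.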