5.13.5 String literals [lex.string]
string-literal:
    encoding-prefix(opt) " s-char-sequence(opt) "
    encoding-prefix(opt) R raw-string
s-char-sequence:
    s-char
    s-char-sequence s-char
s-char:
    any member of the basic source character set except the double-quote ", backslash \, or new-line character
    escape-sequence
    universal-character-name
raw-string:
    " d-char-sequence(opt) ( r-char-sequence(opt) ) d-char-sequence(opt) "
r-char-sequence:
    r-char
    r-char-sequence r-char
r-char:
    any member of the source character set, except a right parenthesis ) followed by the initial d-char-sequence (which may be empty) followed by a double quote ".
d-char-sequence:
    d-char
    d-char-sequence d-char
d-char:
    any member of the basic source character set except: space, the left parenthesis (, the right parenthesis ), the backslash \, and the control characters representing horizontal tab, vertical tab, form feed, and newline.
A string-literal that has an R in the prefix is a raw string literal.
The d-char-sequence serves as a delimiter.
The terminating d-char-sequence of a raw-string is the same sequence of characters as the initial d-char-sequence.
A d-char-sequence shall consist of at most 16 characters.
[ Note:
The characters '(' and ')' are permitted in a raw-string.
Thus, R"delimiter((a|b))delimiter" is equivalent to "(a|b)".
— end note
]
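The delimiter rules above can be checked at compile time. The following is a minimal sketch, not part of the normative text; `same_chars` is a hypothetical helper introduced only for this illustration, and the literals other than the one from the note are made-up examples.

```cpp
// Hypothetical helper: constexpr character-by-character comparison of
// two null-terminated strings (written recursively so it is valid C++11).
constexpr bool same_chars(const char* a, const char* b) {
    return *a == *b && (*a == '\0' || same_chars(a + 1, b + 1));
}

// The equivalence stated in the note: parentheses need no escaping.
static_assert(same_chars(R"delimiter((a|b))delimiter", "(a|b)"),
              "raw string with a named delimiter");

// A non-empty delimiter lets the body contain )" without ending the literal.
static_assert(same_chars(R"x(say ")" here)x", "say \")\" here"),
              "delimiter protects an embedded )\"");

// Backslashes are ordinary characters inside a raw string.
static_assert(same_chars(R"(C:\new\table)", "C:\\new\\table"),
              "no escape sequences in a raw string");

// The delimiter and the body may both be empty.
static_assert(same_chars(R"()", ""), "empty raw string");
```

Choosing a delimiter that does not occur after a `)` in the body is the only constraint on the body's contents.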
[ Note:
A source-file new-line in a raw string literal results in a new-line in the resulting execution string literal.
Assuming no whitespace at the beginning of lines in the following example, the assert will succeed:
const char* p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
— end note
]
[ Example:
The raw string
R"a(
)\
a"
)a"
is equivalent to "\n)\\\na\"\n".
The raw string
R"(x = "\"y\"")"
is equivalent to "x = \"\\\"y\\\"\"".
— end example
]
After translation phase 6, a string-literal that does not begin with an encoding-prefix is an ordinary string literal.
An ordinary string literal has type “array of n const char” where n is the size of the string as defined below, has static storage duration ([basic.stc]), and is initialized with the given characters.
A string-literal that begins with u8, such as u8"asdf", is a UTF-8 string literal.
A UTF-8 string literal has type “array of n const char8_t”, where n is the size of the string as defined below; each successive element of the object representation ([basic.types]) has the value of the corresponding code unit of the UTF-8 encoding of the string.
Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals.
A string-literal that begins with u, such as u"asdf", is a UTF-16 string literal.
A UTF-16 string literal has type “array of n const char16_t”, where n is the size of the string as defined below; each successive element of the array has the value of the corresponding code unit of the UTF-16 encoding of the string.
[ Note:
A single c-char may produce more than one char16_t character in the form of surrogate pairs.
A surrogate pair is a representation for a single code point as a sequence of two 16-bit code units.
— end note
]
A string-literal that begins with U, such as U"asdf", is a UTF-32 string literal.
A UTF-32 string literal has type “array of n const char32_t”, where n is the size of the string as defined below; each successive element of the array has the value of the corresponding code unit of the UTF-32 encoding of the string.
A string-literal that begins with L, such as L"asdf", is a wide string literal.
A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it is initialized with the given characters.
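The array types described above can be observed with `decltype`. The following is a sketch under stated assumptions: `is_array_of_const` is a hypothetical helper, not a library facility, and the UTF-8 check is guarded by the `__cpp_char8_t` feature-test macro because `char8_t` is only available where the implementation provides it.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper: true if the argument is an lvalue of type
// "array of N const Elem". Lit is deduced from the literal itself.
template <class Elem, std::size_t N, class Lit>
constexpr bool is_array_of_const(Lit&) {
    return std::is_same<Lit, const Elem[N]>::value;
}

// "asdf" has four characters plus the terminator, so n is 5 in each case.
static_assert(is_array_of_const<char, 5>("asdf"), "ordinary string literal");
static_assert(is_array_of_const<char16_t, 5>(u"asdf"), "UTF-16 string literal");
static_assert(is_array_of_const<char32_t, 5>(U"asdf"), "UTF-32 string literal");
static_assert(is_array_of_const<wchar_t, 5>(L"asdf"), "wide string literal");

#if defined(__cpp_char8_t)
static_assert(is_array_of_const<char8_t, 5>(u8"asdf"), "UTF-8 string literal");
#endif

// A string literal is an lvalue, so decltype yields a reference to the array.
static_assert(std::is_same<decltype("asdf"), const char (&)[5]>::value,
              "decltype of a string literal");
```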
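In translation phase 6 ([lex.phases]), adjacent string-literals are concatenated.
If both string-literals have the same encoding-prefix, the resulting concatenated string literal has that encoding-prefix.
If one string-literal has no encoding-prefix, it is treated as a string-literal of the same encoding-prefix as the other operand.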
If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed.
Any other concatenations are conditionally-supported with implementation-defined behavior.
[ Note:
This concatenation is an interpretation, not a conversion.
Because the interpretation happens in translation phase 6 (after each character from a string-literal has been translated into a value from the appropriate character set), a string-literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation.
— end note
]
Table 11 has some examples of valid concatenations.
Table 11: String literal concatenations [tab:lex.string.concat]
| Source | Means | Source | Means | Source | Means |
|---|---|---|---|---|---|
| u"a" u"b" | u"ab" | U"a" U"b" | U"ab" | L"a" L"b" | L"ab" |
| u"a" "b" | u"ab" | U"a" "b" | U"ab" | L"a" "b" | L"ab" |
| "a" u"b" | u"ab" | "a" U"b" | U"ab" | "a" L"b" | L"ab" |
Characters in concatenated strings are kept distinct.
[ Example:
"\xA" "B"
contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB').
— end example
]
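The concatenation rules can be illustrated with a few checks. This is a minimal sketch rather than an exhaustive test: the function names are hypothetical, and the mixed-prefix cases mirror rows of Table 11.

```cpp
#include <type_traits>

// An unprefixed literal takes on the prefix of its neighbor (Table 11).
static_assert(std::is_same<decltype(u"a" "b"), const char16_t (&)[3]>::value,
              "u\"a\" \"b\" means u\"ab\"");
static_assert(std::is_same<decltype("a" U"b"), const char32_t (&)[3]>::value,
              "\"a\" U\"b\" means U\"ab\"");
static_assert(std::is_same<decltype(L"a" L"b"), const wchar_t (&)[3]>::value,
              "L\"a\" L\"b\" means L\"ab\"");

// "\xA" "B" holds two characters plus the terminator,
// not the single character '\xAB'.
inline bool kept_distinct() {
    const char s[] = "\xA" "B";
    return sizeof(s) == 3 && s[0] == '\xA' && s[1] == 'B' && s[2] == '\0';
}

// Rawness is a property of each operand only: the raw part keeps its
// backslash and 'n', the non-raw part contributes a new-line character.
inline bool raw_and_escaped_mix() {
    const char s[] = R"(\n)" "\n";
    return sizeof(s) == 4 && s[0] == '\\' && s[1] == 'n' &&
           s[2] == '\n' && s[3] == '\0';
}
```

Splitting a literal after a hexadecimal escape, as in `kept_distinct`, is a common way to stop the escape from absorbing a following hexadecimal digit.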
After any necessary concatenation, in translation phase 7 ([lex.phases]), '\0' is appended to every string-literal so that programs that scan a string can find its end.
Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character-literals ([lex.ccon]), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \, and except that a universal-character-name in a UTF-16 string literal may yield a surrogate pair.
In a narrow string literal, a universal-character-name may map to more than one char or char8_t element due to multibyte encoding.
The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'.
The size of a UTF-16 string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'.
[ Note:
The size of a char16_t string literal is the number of code units, not the number of characters.
— end note
]
The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating '\0'.
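The size rules can be made concrete by counting array elements. The sketch below assumes a hypothetical helper `element_count`; the code points U+00E9 and U+1F600 are chosen only as examples of a character inside and outside the Basic Multilingual Plane. UTF-8 literals are used for the narrow case because the encoding of an ordinary string literal is implementation-defined.

```cpp
#include <cstddef>

// Hypothetical helper: number of array elements, terminator included.
template <class T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) { return N; }

// An escape sequence counts as one element: 'a', '\t', 'b', '\0'.
static_assert(element_count("a\tb") == 4, "escape sequence counts once");

// U+00E9 needs one code unit in UTF-16 and UTF-32, two in UTF-8.
static_assert(element_count(u"\u00E9") == 2, "one UTF-16 code unit + u'\\0'");
static_assert(element_count(U"\u00E9") == 2, "one UTF-32 code unit + U'\\0'");
static_assert(element_count(u8"\u00E9") == 3, "two UTF-8 code units + '\\0'");

// U+1F600 needs a surrogate pair in UTF-16 and four code units in UTF-8.
static_assert(element_count(u"\U0001F600") == 3, "surrogate pair + u'\\0'");
static_assert(element_count(U"\U0001F600") == 2, "one UTF-32 code unit + U'\\0'");
static_assert(element_count(u8"\U0001F600") == 5, "four UTF-8 code units + '\\0'");
```

The counts are array elements, so multiplying by the size of the element type gives the result of `sizeof` on the literal.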
Evaluating a string-literal results in a string literal object with static storage duration, initialized from the given characters as specified above.
Whether all string-literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.
[ Note:
The effect of attempting to modify a string-literal is undefined.
— end note
]
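In practice these two rules suggest comparing string literals by content rather than by address, and copying a literal before changing it. The following sketch uses hypothetical function names and is one way to stay within defined behavior, not the only one.

```cpp
#include <cstring>

// Whether a and b point to the same object is unspecified, so the
// comparison is done on the characters, never on the addresses.
inline bool same_contents() {
    const char* a = "hello";
    const char* b = "hello";
    // (a == b) may be true or false depending on the implementation.
    return std::strcmp(a, b) == 0;
}

// Initializing a char array from a literal copies the characters into a
// separate, modifiable object; the literal itself is never written to.
inline bool modify_copy() {
    char buf[] = "hello";   // array of 6 char, including the terminator
    buf[0] = 'J';
    return sizeof(buf) == 6 && std::strcmp(buf, "Jello") == 0;
}
```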