Null-terminated multibyte strings - cppreference.com

Null-terminated multibyte strings

A null-terminated multibyte string (NTMBS), or "multibyte string", is a sequence of nonzero bytes followed by a byte with value zero (the terminating null character).

Each character stored in the string may occupy more than one byte. The encoding used to represent characters in a multibyte character string is locale-specific: it may be UTF-8, GB18030, EUC-JP, Shift-JIS, etc. For example, the char array {'\xe4','\xbd','\xa0','\xe5','\xa5','\xbd','\0'} is an NTMBS holding the string "你好" in UTF-8 multibyte encoding: the first three bytes encode the character 你, the next three bytes encode the character 好. The same string encoded in GB18030 is the char array {'\xc4', '\xe3', '\xba', '\xc3', '\0'}, where each of the two characters is encoded as a two-byte sequence.
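The difference between byte count and character count can be sketched with the two arrays above. The helper name utf8_char_count is hypothetical, and it assumes its input is valid UTF-8: it counts every byte that is not a UTF-8 continuation byte (bit pattern 10xxxxxx). It does not consult the locale, so it is not a general NTMBS character counter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// "你好" as a UTF-8 NTMBS: two characters, three bytes each.
const char kHelloUtf8[] = {'\xe4', '\xbd', '\xa0', '\xe5', '\xa5', '\xbd', '\0'};

// The same string as a GB18030 NTMBS: two characters, two bytes each.
const char kHelloGb18030[] = {'\xc4', '\xe3', '\xba', '\xc3', '\0'};

// Hypothetical helper: counts characters in a UTF-8 NTMBS.
// Assumes valid UTF-8; continuation bytes (10xxxxxx) are skipped.
std::size_t utf8_char_count(const char* s) {
    assert(s != nullptr);
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++count;  // lead byte or single-byte character
        }
    }
    return count;
}
```

Here std::strlen(kHelloUtf8) reports the number of bytes (6), while utf8_char_count(kHelloUtf8) reports the number of characters (2).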

In some multibyte encodings, a given multibyte character sequence may represent different characters depending on the byte sequences that precede it, known as "shift sequences". Such encodings are called state-dependent: knowledge of the current shift state is required to interpret each character. An NTMBS is valid only if it begins and ends in the initial shift state: if a shift sequence was used, the corresponding unshift sequence must be present before the terminating null character. Examples of such encodings are 7-bit JIS, BOCU-1, and SCSU.
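One possible way to check the validity rule is to walk the string with std::mbrtowc and then ask std::mbsinit whether the conversion state has returned to the initial shift state. This is a minimal sketch: the helper name ends_in_initial_shift_state is hypothetical, and the result depends on the encoding of whatever C locale is active when it is called.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical helper: returns true if s decodes cleanly in the current
// C locale and the shift state is initial just before the terminating null.
bool ends_in_initial_shift_state(const char* s) {
    assert(s != nullptr);
    std::mbstate_t state{};  // a zero-initialized state is the initial shift state
    const char* end = s + std::strlen(s);
    while (s < end) {
        // A null destination means the character is decoded but not stored.
        std::size_t len = std::mbrtowc(nullptr, s,
                                       static_cast<std::size_t>(end - s), &state);
        if (len == static_cast<std::size_t>(-1) ||   // invalid sequence
            len == static_cast<std::size_t>(-2)) {   // truncated character
            return false;
        }
        if (len == 0) {
            break;  // null character; not expected before end
        }
        s += len;
    }
    return std::mbsinit(&state) != 0;
}
```

The terminating null is deliberately not passed to std::mbrtowc, because converting a null character resets the state to initial and would hide a missing unshift sequence.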

A multibyte character string is layout-compatible with a null-terminated byte string (NTBS): it can be stored, copied, and examined using the same facilities, except for calculating the number of characters. If the correct locale is in effect, I/O functions also handle multibyte strings. Multibyte strings can be converted to and from wide strings using the std::codecvt member functions, std::wstring_convert (deprecated in C++17), or the following locale-dependent conversion functions:


Functions

Multibyte/wide character conversions

Defined in header <cstdlib>
mblen returns the number of bytes in the next multibyte character (function)
mbtowc converts the next multibyte character to wide character (function)
wctomb converts a wide character to its multibyte representation (function)
mbstowcs converts a narrow multibyte character string to wide string (function)
wcstombs converts a wide string to narrow multibyte character string (function)

Defined in header <cwchar>
mbrlen returns the number of bytes in the next multibyte character, given state (function)
mbsinit checks if the std::mbstate_t object represents initial shift state (function)
btowc widens a single-byte narrow character to wide character, if possible (function)
wctob narrows a wide character to a single-byte narrow character, if possible (function)
mbrtowc converts the next multibyte character to wide character, given state (function)
wcrtomb converts a wide character to its multibyte representation, given state (function)
mbsrtowcs converts a narrow multibyte character string to wide string, given state (function)
wcsrtombs converts a wide string to narrow multibyte character string, given state (function)

Defined in header <cuchar>
mbrtoc8 (C++20) converts a narrow multibyte character to UTF-8 encoding (function)
c8rtomb (C++20) converts UTF-8 string to narrow multibyte encoding (function)
mbrtoc16 (C++11) converts a narrow multibyte character to UTF-16 encoding (function)
c16rtomb (C++11) converts a UTF-16 character to narrow multibyte encoding (function)
mbrtoc32 (C++11) converts a narrow multibyte character to UTF-32 encoding (function)
c32rtomb (C++11) converts a UTF-32 character to narrow multibyte encoding (function)
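As an illustration of the restartable conversion functions, the following sketch converts an NTMBS to a std::wstring with std::mbsrtowcs. The helper name to_wide is hypothetical. The conversion uses the encoding of the current C locale, so a program working with non-ASCII text would normally call std::setlocale first with a locale name that exists on its platform; that name is an assumption the sketch does not make.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical helper: converts the NTMBS s to a wide string using the
// current C locale. Returns false if s contains an invalid sequence.
bool to_wide(const char* s, std::wstring& out) {
    assert(s != nullptr);

    // First pass: with a null destination, only the length is computed.
    std::mbstate_t state{};
    const char* src = s;
    std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1)) {
        return false;  // invalid multibyte sequence
    }

    // Second pass: convert into storage of exactly that length,
    // restarting from the initial shift state.
    out.resize(len);
    if (len > 0) {
        state = std::mbstate_t{};
        src = s;
        std::mbsrtowcs(&out[0], &src, len, &state);
    }
    return true;
}
```

The two-pass shape avoids guessing a buffer size: the number of wide characters can be smaller than the number of bytes, and only the conversion function knows by how much.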

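Types

Defined in header <cwchar>
mbstate_t conversion state information necessary to iterate multibyte character strings (class)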

Macros

Defined in header <climits>
MB_LEN_MAX maximum number of bytes in a multibyte character (macro constant)

Defined in header <cstdlib>
MB_CUR_MAX maximum number of bytes in a multibyte character in the current C locale (macro variable)

Defined in header <cuchar>
__STDC_UTF_16__ (C++11) indicates that UTF-16 encoding is used by mbrtoc16 and c16rtomb (macro constant)
__STDC_UTF_32__ (C++11) indicates that UTF-32 encoding is used by mbrtoc32 and c32rtomb (macro constant)
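A common use of these macros is sizing a buffer for a single multibyte character. MB_LEN_MAX is a compile-time constant and can be used as an array bound, while MB_CUR_MAX is evaluated at run time and depends on the current C locale. The sketch below assumes a hypothetical helper named encode_one that encodes one wide character with std::wcrtomb.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdlib>
#include <cwchar>
#include <string>

// Hypothetical helper: encodes one wide character in the current C locale.
// Returns an empty string if wc has no multibyte representation.
std::string encode_one(wchar_t wc) {
    char buf[MB_LEN_MAX];    // large enough for any supported locale
    std::mbstate_t state{};  // start in the initial shift state
    std::size_t len = std::wcrtomb(buf, wc, &state);
    if (len == static_cast<std::size_t>(-1)) {
        return std::string();  // wc cannot be represented in this locale
    }
    // In the current locale, no character needs more than MB_CUR_MAX bytes.
    assert(len <= static_cast<std::size_t>(MB_CUR_MAX));
    return std::string(buf, len);
}
```

For example, encode_one(L'A') yields the one-byte string "A" in the default "C" locale.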

See also