[text.encoding.class]

28.4.2.1 Overview [text.encoding.overview]

The class text_encoding describes an interface for accessing the IANA Character Sets registry[bib].

namespace std {
  struct text_encoding {
    static constexpr size_t max_name_length = 63;

    enum class id : int_least32_t { see below };
    using enum id;

    constexpr text_encoding() = default;
    constexpr explicit text_encoding(string_view enc) noexcept;
    constexpr text_encoding(id i) noexcept;

    constexpr id mib() const noexcept;
    constexpr const char* name() const noexcept;

    struct aliases_view;
    constexpr aliases_view aliases() const noexcept;

    friend constexpr bool operator==(const text_encoding& a,
                                     const text_encoding& b) noexcept;
    friend constexpr bool operator==(const text_encoding& encoding, id i) noexcept;

    static consteval text_encoding literal() noexcept;
    static text_encoding environment();
    template<id i> static bool environment_is();

  private:
    id mib_ = id::unknown;
    char name_[max_name_length + 1] = {0};
    static constexpr bool comp-name(string_view a, string_view b);
  };
}

28.4.2.2 General [text.encoding.general]

A registered character encoding is a character encoding scheme in the IANA Character Sets registry.

[Note 1:

The IANA Character Sets registry uses the term “character sets” to refer to character encodings.

— _end note_]

The primary name of a registered character encoding is the name of that encoding specified in the IANA Character Sets registry.

The set of known registered character encodings contains every registered character encoding specified in the IANA Character Sets registry except for the following:

NATS-DANO (33)
NATS-DANO-ADD (34)

Each known registered character encoding is identified by an enumerator in text_encoding::id, and has a set of zero or more aliases.

The set of aliases of a known registered character encoding is an implementation-defined superset of the aliases specified in the IANA Character Sets registry.

The set of aliases for US-ASCII includes “ASCII”.

No two aliases or primary names of distinct registered character encodings are equivalent when compared by text_encoding::comp-name.

How a text_encoding object is determined to be representative of a character encoding scheme implemented in the translation or execution environment is implementation-defined.

An object e of type text_encoding such that e.mib() == text_encoding::id::unknown is false and e.mib() == text_encoding::id::other is false maintains the following invariants:

*e.name() == '\0' is false, and
e.mib() == text_encoding(e.name()).mib() is true.

Recommended practice:

Implementations should not consider registered encodings to be interchangeable.
[Example 1:
Shift_JIS and Windows-31J denote different encodings.
— _end example_]
Implementations should not use the name of a registered encoding to describe another similar yet different non-registered encoding unless there is a precedent on that implementation.
[Example 2:
Big5
— _end example_]

28.4.2.3 Members [text.encoding.members]

constexpr explicit text_encoding(string_view enc) noexcept;

Preconditions:

enc represents a string in the ordinary literal encoding consisting only of elements of the basic character set ([lex.charset]).
enc.size() <= max_name_length is true.
enc.contains('\0') is false.

Postconditions:

If there exists a primary name or alias a of a known registered character encoding such that _comp-name_(a, enc) is true, mib_ has the value of the enumerator of id associated with that registered character encoding.
Otherwise, mib_ == id::other is true.
enc.compare(name_) == 0 is true.

constexpr text_encoding(id i) noexcept;

Preconditions: i has the value of one of the enumerators of id.

Postconditions:

If (mib_ == id::unknown || mib_ == id::other) is true, strlen(name_) == 0 is true.
Otherwise, ranges::contains(aliases(), string_view(name_)) is true.

constexpr id mib() const noexcept;

constexpr const char* name() const noexcept;

Remarks: name() is an ntbs and accessing elements of name_ outside of the range is undefined behavior.

constexpr aliases_view aliases() const noexcept;

Let r denote an instance of aliases_view.

If *this represents a known registered character encoding, then:

r.front() is the primary name of the registered character encoding,
r contains the aliases of the registered character encoding, and
r does not contain duplicate values when compared with strcmp.

Otherwise, r is an empty range.

Each element in r is a non-null, non-empty ntbs encoded in the literal character encoding and comprising only characters from the basic character set.

[Note 1:

The order of aliases in r is unspecified.

— _end note_]

static consteval text_encoding literal() noexcept;

Mandates: CHAR_BIT == 8 is true.

Returns: A text_encoding object representing the ordinary character literal encoding ([lex.charset]).

static text_encoding environment();

Mandates: CHAR_BIT == 8 is true.

Returns: A text_encoding object representing the implementation-defined character encoding scheme of the environment.

On a POSIX implementation, this is the encoding scheme associated with the POSIX locale denoted by the empty string "".

[Note 2:

This function is not affected by calls to setlocale.

— _end note_]

Recommended practice: Implementations should return a value that is not affected by calls to the POSIX function setenv and other functions which can modify the environment ([support.runtime]).

template<id i> static bool environment_is();

Mandates: CHAR_BIT == 8 is true.

Returns: environment() == i.

static constexpr bool _comp-name_(string_view a, string_view b);

Returns: true if the two strings a and b encoded in the ordinary literal encoding are equal, ignoring, from left-to-right,

all elements that are not digits or letters ([character.seq.general]),
character case, and
any sequence of one or more 0 characters not immediately preceded by a numeric prefix, where a numeric prefix is a sequence consisting of a digit in the range [1, 9] optionally followed by one or more elements which are not digits or letters,

and false otherwise.

[Note 3:

This comparison is identical to the “Charset Alias Matching” algorithm described in the Unicode Technical Standard 22[bib].

— _end note_]

[Example 1:
static_assert(comp-name("UTF-8", "utf8") == true);
static_assert(comp-name("u.t.f-008", "utf8") == true);
static_assert(comp-name("ut8", "utf8") == false);
static_assert(comp-name("utf-80", "utf8") == false);
— _end example_]
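The three rules above can be sketched as a small standalone function. This is only an illustration under stated assumptions, not the implementation used by any standard library: the names comp_name_sketch, next_significant, is_digit, is_letter, and to_lower are hypothetical, the code follows the UTS 22 formulation (drop everything that is not a letter or digit, fold case, then drop each 0 that is not preceded by a digit), and it assumes an ASCII-compatible ordinary literal encoding in which the letters are contiguous.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helpers; they assume letters and digits are contiguous
// in the ordinary literal encoding (true for ASCII-compatible encodings).
constexpr bool is_digit(char c) { return c >= '0' && c <= '9'; }
constexpr bool is_letter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
constexpr char to_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Returns the next character that takes part in the comparison (lowercased),
// or '\0' once the input is exhausted. `after_digit` records whether the
// previously returned character was a digit.
constexpr char next_significant(std::string_view s, std::size_t& pos,
                                bool& after_digit) {
    while (pos < s.size()) {
        const char c = s[pos++];
        if (is_digit(c)) {
            if (c == '0' && !after_digit) continue;  // zero with no digit before it
            after_digit = true;
            return c;
        }
        if (is_letter(c)) {
            after_digit = false;
            return to_lower(c);
        }
        // Any other element is ignored and leaves after_digit unchanged.
    }
    return '\0';
}

// Sketch of the comparison: walk both names in lockstep, one significant
// character at a time, without building normalized copies.
constexpr bool comp_name_sketch(std::string_view a, std::string_view b) {
    std::size_t i = 0, j = 0;
    bool digit_a = false, digit_b = false;
    for (;;) {
        const char ca = next_significant(a, i, digit_a);
        const char cb = next_significant(b, j, digit_b);
        if (ca != cb) return false;
        if (ca == '\0') return true;  // both inputs exhausted together
    }
}

// The four cases from Example 1.
static_assert(comp_name_sketch("UTF-8", "utf8"));
static_assert(comp_name_sketch("u.t.f-008", "utf8"));
static_assert(!comp_name_sketch("ut8", "utf8"));
static_assert(!comp_name_sketch("utf-80", "utf8"));
```

Because the zero rule depends only on whether the last kept character was a digit, a single flag per input is enough, and the function needs no temporary storage.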

28.4.2.4 Comparison functions [text.encoding.cmp]

friend constexpr bool operator==(const text_encoding& a, const text_encoding& b) noexcept;

Returns: If a.mib_ == id::other && b.mib_ == id::other is true, then comp-name(a.name_, b.name_).

Otherwise, a.mib_ == b.mib_.

friend constexpr bool operator==(const text_encoding& encoding, id i) noexcept;

Returns: encoding.mib_ == i.

Remarks: This operator induces an equivalence relation on its arguments if and only if i != id::other is true.
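The Remarks above can be illustrated with a reduced model of the two comparison operators. Everything in this sketch is hypothetical: encoding_model and its members are stand-ins for the exposition-only members mib_ and name_, the enumeration lists only three of the values from text_encoding::id, exact string equality replaces comp-name, and the names "x-foo" and "x-bar" are invented unregistered encoding names.

```cpp
#include <cstdint>
#include <string_view>

// A reduced copy of the enumeration, with values taken from text_encoding::id.
enum class id : std::int_least32_t { other = 1, unknown = 2, UTF8 = 106 };

// Hypothetical model of the state compared by the two operators.
struct encoding_model {
    id mib;
    std::string_view name;
};

// Models operator==(const text_encoding&, const text_encoding&).
// Assumption: exact equality stands in for comp-name.
constexpr bool operator==(const encoding_model& a, const encoding_model& b) {
    if (a.mib == id::other && b.mib == id::other)
        return a.name == b.name;
    return a.mib == b.mib;
}

// Models operator==(const text_encoding&, id).
constexpr bool operator==(const encoding_model& e, id i) { return e.mib == i; }

constexpr encoding_model enc_a{id::other, "x-foo"};
constexpr encoding_model enc_b{id::other, "x-bar"};

// Both objects compare equal to id::other ...
static_assert(enc_a == id::other);
static_assert(enc_b == id::other);
// ... yet they are not equal to each other.
static_assert(!(enc_a == enc_b));
```

In this model, comparing against id::other groups together objects that the object-to-object comparison keeps apart, which is why the equivalence relation holds only when i != id::other.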

28.4.2.5 Class text_encoding::aliases_view [text.encoding.aliases]

struct text_encoding::aliases_view : ranges::view_interface<text_encoding::aliases_view> {
  constexpr _implementation-defined_ begin() const;
  constexpr _implementation-defined_ end() const;
};

Both ranges::range_value_t<text_encoding::aliases_view> and ranges::range_reference_t<text_encoding::aliases_view> denote const char*.

28.4.2.6 Enumeration text_encoding::id [text.encoding.id]

namespace std {
  enum class text_encoding::id : int_least32_t {
    other = 1, unknown = 2, ASCII = 3, ISOLatin1 = 4, ISOLatin2 = 5, ISOLatin3 = 6,
    ISOLatin4 = 7, ISOLatinCyrillic = 8, ISOLatinArabic = 9, ISOLatinGreek = 10,
    ISOLatinHebrew = 11, ISOLatin5 = 12, ISOLatin6 = 13, ISOTextComm = 14,
    HalfWidthKatakana = 15, JISEncoding = 16, ShiftJIS = 17, EUCPkdFmtJapanese = 18,
    EUCFixWidJapanese = 19, ISO4UnitedKingdom = 20, ISO11SwedishForNames = 21,
    ISO15Italian = 22, ISO17Spanish = 23, ISO21German = 24, ISO60DanishNorwegian = 25,
    ISO69French = 26, ISO10646UTF1 = 27, ISO646basic1983 = 28, INVARIANT = 29,
    ISO2IntlRefVersion = 30, NATSSEFI = 31, NATSSEFIADD = 32, ISO10Swedish = 35,
    KSC56011987 = 36, ISO2022KR = 37, EUCKR = 38, ISO2022JP = 39, ISO2022JP2 = 40,
    ISO13JISC6220jp = 41, ISO14JISC6220ro = 42, ISO16Portuguese = 43,
    ISO18Greek7Old = 44, ISO19LatinGreek = 45, ISO25French = 46, ISO27LatinGreek1 = 47,
    ISO5427Cyrillic = 48, ISO42JISC62261978 = 49, ISO47BSViewdata = 50, ISO49INIS = 51,
    ISO50INIS8 = 52, ISO51INISCyrillic = 53, ISO54271981 = 54, ISO5428Greek = 55,
    ISO57GB1988 = 56, ISO58GB231280 = 57, ISO61Norwegian2 = 58,
    ISO70VideotexSupp1 = 59, ISO84Portuguese2 = 60, ISO85Spanish2 = 61,
    ISO86Hungarian = 62, ISO87JISX0208 = 63, ISO88Greek7 = 64, ISO89ASMO449 = 65,
    ISO90 = 66, ISO91JISC62291984a = 67, ISO92JISC62991984b = 68,
    ISO93JIS62291984badd = 69, ISO94JIS62291984hand = 70,
    ISO95JIS62291984handadd = 71, ISO96JISC62291984kana = 72, ISO2033 = 73,
    ISO99NAPLPS = 74, ISO102T617bit = 75, ISO103T618bit = 76, ISO111ECMACyrillic = 77,
    ISO121Canadian1 = 78, ISO122Canadian2 = 79, ISO123CSAZ24341985gr = 80,
    ISO88596E = 81, ISO88596I = 82, ISO128T101G2 = 83, ISO88598E = 84, ISO88598I = 85,
    ISO139CSN369103 = 86, ISO141JUSIB1002 = 87, ISO143IECP271 = 88, ISO146Serbian = 89,
    ISO147Macedonian = 90, ISO150 = 91, ISO151Cuba = 92, ISO6937Add = 93,
    ISO153GOST1976874 = 94, ISO8859Supp = 95, ISO10367Box = 96, ISO158Lap = 97,
    ISO159JISX02121990 = 98, ISO646Danish = 99, USDK = 100, DKUS = 101, KSC5636 = 102,
    Unicode11UTF7 = 103, ISO2022CN = 104, ISO2022CNEXT = 105, UTF8 = 106,
    ISO885913 = 109, ISO885914 = 110, ISO885915 = 111, ISO885916 = 112, GBK = 113,
    GB18030 = 114, OSDEBCDICDF0415 = 115, OSDEBCDICDF03IRV = 116, OSDEBCDICDF041 = 117,
    ISO115481 = 118, KZ1048 = 119, UCS2 = 1000, UCS4 = 1001, UnicodeASCII = 1002,
    UnicodeLatin1 = 1003, UnicodeJapanese = 1004, UnicodeIBM1261 = 1005,
    UnicodeIBM1268 = 1006, UnicodeIBM1276 = 1007, UnicodeIBM1264 = 1008,
    UnicodeIBM1265 = 1009, Unicode11 = 1010, SCSU = 1011, UTF7 = 1012, UTF16BE = 1013,
    UTF16LE = 1014, UTF16 = 1015, CESU8 = 1016, UTF32 = 1017, UTF32BE = 1018,
    UTF32LE = 1019, BOCU1 = 1020, UTF7IMAP = 1021, Windows30Latin1 = 2000,
    Windows31Latin1 = 2001, Windows31Latin2 = 2002, Windows31Latin5 = 2003,
    HPRoman8 = 2004, AdobeStandardEncoding = 2005, VenturaUS = 2006,
    VenturaInternational = 2007, DECMCS = 2008, PC850Multilingual = 2009,
    PC8DanishNorwegian = 2012, PC862LatinHebrew = 2013, PC8Turkish = 2014,
    IBMSymbols = 2015, IBMThai = 2016, HPLegal = 2017, HPPiFont = 2018, HPMath8 = 2019,
    HPPSMath = 2020, HPDesktop = 2021, VenturaMath = 2022, MicrosoftPublishing = 2023,
    Windows31J = 2024, GB2312 = 2025, Big5 = 2026, Macintosh = 2027, IBM037 = 2028,
    IBM038 = 2029, IBM273 = 2030, IBM274 = 2031, IBM275 = 2032, IBM277 = 2033,
    IBM278 = 2034, IBM280 = 2035, IBM281 = 2036, IBM284 = 2037, IBM285 = 2038,
    IBM290 = 2039, IBM297 = 2040, IBM420 = 2041, IBM423 = 2042, IBM424 = 2043,
    PC8CodePage437 = 2011, IBM500 = 2044, IBM851 = 2045, PCp852 = 2010, IBM855 = 2046,
    IBM857 = 2047, IBM860 = 2048, IBM861 = 2049, IBM863 = 2050, IBM864 = 2051,
    IBM865 = 2052, IBM868 = 2053, IBM869 = 2054, IBM870 = 2055, IBM871 = 2056,
    IBM880 = 2057, IBM891 = 2058, IBM903 = 2059, IBM904 = 2060, IBM905 = 2061,
    IBM918 = 2062, IBM1026 = 2063, IBMEBCDICATDE = 2064, EBCDICATDEA = 2065,
    EBCDICCAFR = 2066, EBCDICDKNO = 2067, EBCDICDKNOA = 2068, EBCDICFISE = 2069,
    EBCDICFISEA = 2070, EBCDICFR = 2071, EBCDICIT = 2072, EBCDICPT = 2073,
    EBCDICES = 2074, EBCDICESA = 2075, EBCDICESS = 2076, EBCDICUK = 2077,
    EBCDICUS = 2078, Unknown8BiT = 2079, Mnemonic = 2080, Mnem = 2081, VISCII = 2082,
    VIQR = 2083, KOI8R = 2084, HZGB2312 = 2085, IBM866 = 2086, PC775Baltic = 2087,
    KOI8U = 2088, IBM00858 = 2089, IBM00924 = 2090, IBM01140 = 2091, IBM01141 = 2092,
    IBM01142 = 2093, IBM01143 = 2094, IBM01144 = 2095, IBM01145 = 2096,
    IBM01146 = 2097, IBM01147 = 2098, IBM01148 = 2099, IBM01149 = 2100,
    Big5HKSCS = 2101, IBM1047 = 2102, PTCP154 = 2103, Amiga1251 = 2104,
    KOI7switched = 2105, BRF = 2106, TSCII = 2107, CP51932 = 2108, windows874 = 2109,
    windows1250 = 2250, windows1251 = 2251, windows1252 = 2252, windows1253 = 2253,
    windows1254 = 2254, windows1255 = 2255, windows1256 = 2256, windows1257 = 2257,
    windows1258 = 2258, TIS620 = 2259, CP50220 = 2260
  };
}

[Note 1:

The text_encoding::id enumeration contains an enumerator for each known registered character encoding.

For each encoding, the corresponding enumerator is derived from the alias beginning with “cs”, as follows:

csUnicode is mapped to text_encoding::id::UCS2,
csIBBM904 is mapped to text_encoding::id::IBM904, and
the “cs” prefix is removed from other names.

— _end note_]
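The naming rule in the note can be sketched as a small helper that turns a “cs” alias into the spelling of its enumerator. This is an illustrative sketch only: the function name enumerator_from_cs_alias is hypothetical and is not part of the library, and returning an input that lacks the “cs” prefix unchanged is a choice made for this sketch rather than something the note specifies.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: maps a registry alias beginning with "cs" to the
// spelling of the corresponding text_encoding::id enumerator.
std::string enumerator_from_cs_alias(std::string_view cs_alias) {
    // The two special cases listed in the note.
    if (cs_alias == "csUnicode") return "UCS2";
    if (cs_alias == "csIBBM904") return "IBM904";

    // General rule: remove the "cs" prefix.
    if (cs_alias.substr(0, 2) == "cs") cs_alias.remove_prefix(2);

    // Assumption of this sketch: names without the prefix pass through as-is.
    return std::string(cs_alias);
}
```

For instance, the aliases csUTF8 and csShiftJIS yield UTF8 and ShiftJIS, which match the enumerators with values 106 and 17 in the enumeration above.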

28.4.2.7 Hash support [text.encoding.hash]

template<> struct hash<text_encoding>;
