Encoding.h

* Return value from `Decoder`/`Encoder` to indicate that input

* was exhausted.

const uint32_t kInputEmpty = INPUT_EMPTY;

* Return value from `Decoder`/`Encoder` to indicate that output

* space was insufficient.

const uint32_t kOutputFull = OUTPUT_FULL;
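A caller of a streaming converter typically loops until it sees the "input empty" result, draining its output buffer each time it sees "output full". The following is a minimal sketch of that loop under stated assumptions: `ToyCopy`, `ToyResult`, `ConvertAll` and the two `kToy...` constants are hypothetical stand-ins invented for illustration, not the real `Decoder`/`Encoder` API, and the constant values are placeholders rather than the real `INPUT_EMPTY` and `OUTPUT_FULL` values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Placeholder result codes for this sketch only.
constexpr uint32_t kToyInputEmpty = 0;
constexpr uint32_t kToyOutputFull = 1;

struct ToyResult {
  uint32_t status;  // Why the converter stopped.
  size_t read;      // Bytes consumed from the source.
  size_t written;   // Bytes produced into the destination.
};

// Hypothetical stand-in for a streaming converter: copies as many bytes as
// fit into the destination and reports why it stopped.
inline ToyResult ToyCopy(const char* aSrc, size_t aSrcLen, char* aDst,
                         size_t aDstLen) {
  assert(aDstLen > 0);  // Without output space no progress is possible.
  size_t n = aSrcLen < aDstLen ? aSrcLen : aDstLen;
  for (size_t i = 0; i < n; ++i) {
    aDst[i] = aSrc[i];
  }
  return {n == aSrcLen ? kToyInputEmpty : kToyOutputFull, n, n};
}

// Drives the converter with a small fixed buffer until all input is consumed.
inline std::string ConvertAll(const std::string& aInput) {
  std::string result;
  char buffer[4];
  size_t offset = 0;
  for (;;) {
    ToyResult r = ToyCopy(aInput.data() + offset, aInput.size() - offset,
                          buffer, sizeof(buffer));
    result.append(buffer, r.written);
    offset += r.read;
    if (r.status == kToyInputEmpty) {
      return result;  // Everything was consumed.
    }
    // kToyOutputFull: the buffer has been drained into `result`; go again.
  }
}
```

The caller owns the output buffer here, which is the property that lets streaming code control allocation.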

* An encoding as defined in the Encoding Standard

* An _encoding_ defines a mapping from a byte sequence to a Unicode code point

* sequence and, in most cases, vice versa. Each encoding has a name, an output

* encoding, and one or more labels.

* _Labels_ are ASCII-case-insensitive strings that are used to identify an

* encoding in formats and protocols. The _name_ of the encoding is the

* preferred label in the case appropriate for returning from the

* `characterSet` property of the `Document` DOM interface, except for

* the replacement encoding whose name is not one of its labels.

* The _output encoding_ is the encoding used for form submission and URL

* parsing on Web pages in the encoding. This is UTF-8 for the replacement,

* UTF-16LE and UTF-16BE encodings and the encoding itself for other

* encodings.

* # Streaming vs. Non-Streaming

* When you have the entire input in a single buffer, you can use the

* methods `Decode()`, `DecodeWithBOMRemoval()`,

* `DecodeWithoutBOMHandling()`,

* `DecodeWithoutBOMHandlingAndWithoutReplacement()` and

* `Encode()`. Unlike the rest of the API (apart from the `NewDecoder()` and

* `NewEncoder()` methods), these methods perform heap allocations. You should

* use the `Decoder` and `Encoder` objects when your input is split into multiple

* buffers or when you want to control the allocation of the output buffers.

* All instances of `Encoding` are statically allocated and have the process's

* lifetime. There is precisely one unique `Encoding` instance for each

* encoding defined in the Encoding Standard.

* To obtain a reference to a particular encoding whose identity you know at

* compile time, use a `static` that refers to the encoding. There is a `static`

* for each encoding. The `static`s are named in all caps with hyphens

* replaced with underscores and with `_ENCODING` appended to the

* name. For example, if you know at compile time that you will want to

* decode using the UTF-8 encoding, use the `UTF_8_ENCODING` `static`.
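The naming rule above is mechanical, so it can be sketched as a small helper. `StaticNameFor` is a hypothetical function written for illustration; it is not part of this header, and it only derives the identifier text from an encoding name.

```cpp
#include <cctype>
#include <string>

// Derives the identifier of the `static` from an encoding name: upper-case
// every character, turn hyphens into underscores, then append `_ENCODING`.
inline std::string StaticNameFor(const std::string& aEncodingName) {
  std::string result;
  for (unsigned char c : aEncodingName) {
    result.push_back(c == '-' ? '_' : static_cast<char>(std::toupper(c)));
  }
  return result + "_ENCODING";
}
```

For example, the name `windows-1252` yields `WINDOWS_1252_ENCODING`, which matches the identifier used elsewhere in this file.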

* If you don't know what encoding you need at compile time and need to

* dynamically get an encoding by label, use `Encoding::ForLabel()`.

* Pointers to `Encoding` can be compared with `==` to check for the sameness

* of two encodings.

* A pointer to a `mozilla::Encoding` in C++ is the same thing as a pointer

* to an `encoding_rs::Encoding` in Rust. When writing FFI code, use

* `const mozilla::Encoding*` in the C++ signature and

* `*const encoding_rs::Encoding` in the corresponding Rust signature.

* Implements the _get an encoding_ algorithm

* If, after ASCII-lowercasing and removing leading and trailing

* whitespace, the argument matches a label defined in the Encoding

* Standard, `const Encoding*` representing the corresponding

* encoding is returned. If there is no match, `nullptr` is returned.

* This is the right method to use if the action upon the method returning

* `nullptr` is to use a fallback encoding (e.g. `WINDOWS_1252_ENCODING`)

* instead. When the action upon the method returning `nullptr` is not to

* proceed with a fallback but to refuse processing,

* `ForLabelNoReplacement()` is more appropriate.

static inline const Encoding* ForLabel(Span<const char> aLabel) {
  return encoding_for_label(
      reinterpret_cast<const uint8_t*>(aLabel.Elements()), aLabel.Length());
}

* `nsAString` argument version. See above for docs.

static inline const Encoding* ForLabel(const nsAString& aLabel) {
  return Encoding::ForLabel(NS_ConvertUTF16toUTF8(aLabel));
}
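The label handling described for `ForLabel()` (trim ASCII whitespace, ASCII-lowercase, look up, return `nullptr` on no match) can be sketched in standalone form. Everything here is an assumption made for illustration: `ToyEncoding`, `NormalizeLabel`, `ToyForLabel` and `ToyForLabelOrFallback` are hypothetical names, and the lookup knows only four labels instead of the full table from the Encoding Standard.

```cpp
#include <cstddef>
#include <string>

struct ToyEncoding {
  const char* name;
};

// One statically allocated instance per encoding, as in the real API.
static const ToyEncoding kToyUtf8 = {"UTF-8"};
static const ToyEncoding kToyWindows1252 = {"windows-1252"};

// ASCII whitespace per the Encoding Standard: TAB, LF, FF, CR and SPACE.
inline bool IsAsciiWhitespace(char c) {
  return c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

// Trims ASCII whitespace from both ends and ASCII-lowercases the rest.
inline std::string NormalizeLabel(const std::string& aLabel) {
  size_t start = 0;
  size_t end = aLabel.size();
  while (start < end && IsAsciiWhitespace(aLabel[start])) ++start;
  while (end > start && IsAsciiWhitespace(aLabel[end - 1])) --end;
  std::string result = aLabel.substr(start, end - start);
  for (char& c : result) {
    if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
  }
  return result;
}

// Toy lookup over a handful of labels; returns nullptr when nothing matches.
inline const ToyEncoding* ToyForLabel(const std::string& aLabel) {
  std::string label = NormalizeLabel(aLabel);
  if (label == "utf-8" || label == "utf8") return &kToyUtf8;
  if (label == "windows-1252" || label == "latin1") return &kToyWindows1252;
  return nullptr;
}

// The fallback pattern: an unknown label selects a default encoding.
inline const ToyEncoding* ToyForLabelOrFallback(const std::string& aLabel) {
  const ToyEncoding* encoding = ToyForLabel(aLabel);
  return encoding ? encoding : &kToyWindows1252;
}
```

Because each encoding has exactly one instance, callers can compare the returned pointer with `==` against a known `static`.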

* This method behaves the same as `ForLabel()`, except when `ForLabel()`

* would return `REPLACEMENT_ENCODING`, this method returns `nullptr` instead.

* This method is useful in scenarios where a fatal error is required

* upon invalid label, because in those cases the caller typically wishes

* to treat the labels that map to the replacement encoding as fatal

* errors, too.

* It is not OK to use this method when the action upon the method returning

* `nullptr` is to use a fallback encoding (e.g. `WINDOWS_1252_ENCODING`). In

* such a case, the `ForLabel()` method should be used instead in order to

* avoid unsafe fallback for labels that `ForLabel()` maps to

* `REPLACEMENT_ENCODING`.

static inline const Encoding* ForLabelNoReplacement(Span<const char> aLabel) {
  return encoding_for_label_no_replacement(
      reinterpret_cast<const uint8_t*>(aLabel.Elements()), aLabel.Length());
}

* `nsAString` argument version. See above for docs.

static inline const Encoding* ForLabelNoReplacement(const nsAString& aLabel) {
  return Encoding::ForLabelNoReplacement(NS_ConvertUTF16toUTF8(aLabel));
}
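The difference between the two lookups can be sketched as a thin filter over the plain lookup. This is an illustration only: `ToyEncoding`, `ToyForLabel` and `ToyForLabelNoReplacement` are hypothetical names, the toy lookup expects an already normalized label, and it knows only two of the labels that the Encoding Standard maps to the replacement encoding.

```cpp
#include <string>

struct ToyEncoding {
  const char* name;
};

static const ToyEncoding kToyUtf8 = {"UTF-8"};
static const ToyEncoding kToyReplacement = {"replacement"};

// Toy lookup on already normalized labels; nullptr means "no such label".
inline const ToyEncoding* ToyForLabel(const std::string& aLabel) {
  if (aLabel == "utf-8") return &kToyUtf8;
  if (aLabel == "hz-gb-2312" || aLabel == "iso-2022-kr") {
    return &kToyReplacement;
  }
  return nullptr;
}

// Same lookup, but labels that map to the replacement encoding are reported
// as "no match" so that the caller can treat them as a fatal error.
inline const ToyEncoding* ToyForLabelNoReplacement(const std::string& aLabel) {
  const ToyEncoding* encoding = ToyForLabel(aLabel);
  return encoding == &kToyReplacement ? nullptr : encoding;
}
```

A caller that falls back on `nullptr` should use the plain lookup: with the filtered one, `hz-gb-2312` would silently end up on the fallback encoding instead of the replacement encoding.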

* Performs non-incremental BOM sniffing.

* The argument must either be a buffer representing the entire input

* stream (non-streaming case) or a buffer representing at least the first

* three bytes of the input stream (streaming case).

* Returns `{UTF_8_ENCODING, 3}`,

* `{UTF_16LE_ENCODING, 2}` or

* `{UTF_16BE_ENCODING, 2}` if the argument starts with the

* UTF-8, UTF-16LE or UTF-16BE BOM or `{nullptr, 0}` otherwise.

static inline std::tuple<const Encoding*, size_t> ForBOM(
    Span<const uint8_t> aBuffer) {
  // `len` is an in/out parameter: buffer length in, BOM length out.
  size_t len = aBuffer.Length();
  const Encoding* encoding = encoding_for_bom(aBuffer.Elements(), &len);
  return {encoding, len};
}
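For reference, the byte patterns involved are EF BB BF for UTF-8 (three bytes), FF FE for UTF-16LE (two bytes) and FE FF for UTF-16BE (two bytes). A minimal standalone sketch of non-incremental BOM sniffing follows; `ToyBom`, `ToyBomResult` and `SniffBom` are hypothetical names used for illustration and do not reflect how `encoding_for_bom` is implemented.

```cpp
#include <cstddef>
#include <cstdint>

enum class ToyBom { None, Utf8, Utf16Le, Utf16Be };

struct ToyBomResult {
  ToyBom bom;
  size_t length;  // Number of BOM bytes the caller should skip.
};

// Checks the start of the buffer against the three BOMs.
inline ToyBomResult SniffBom(const uint8_t* aBytes, size_t aLength) {
  if (aLength >= 3 && aBytes[0] == 0xEF && aBytes[1] == 0xBB &&
      aBytes[2] == 0xBF) {
    return {ToyBom::Utf8, 3};
  }
  if (aLength >= 2 && aBytes[0] == 0xFF && aBytes[1] == 0xFE) {
    return {ToyBom::Utf16Le, 2};
  }
  if (aLength >= 2 && aBytes[0] == 0xFE && aBytes[1] == 0xFF) {
    return {ToyBom::Utf16Be, 2};
  }
  return {ToyBom::None, 0};
}
```

The three-byte minimum for the streaming case follows from the UTF-8 BOM being the longest of the three patterns.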

* Writes the name of this encoding into `aName`.

* This name is appropriate to return as-is from the DOM

* `document.characterSet` property.

inline void Name(nsACString& aName) const {
  // Reserve the maximum name length, write the name, then shrink to fit.
  aName.SetLength(ENCODING_NAME_MAX_LENGTH);
  size_t length =
      encoding_name(this, reinterpret_cast<uint8_t*>(aName.BeginWriting()));
  aName.SetLength(length);  // truncation in the 64-bit case is OK
}

* Checks whether the _output encoding_ of this encoding can encode every

* Unicode code point. (Only true if the output encoding is UTF-8.)

inline bool CanEncodeEverything() const {
  return encoding_can_encode_everything(this);
}

* Checks whether this encoding maps one byte to one Basic Multilingual

* Plane code point (i.e. byte length equals decoded UTF-16 length) and

* vice versa (for mappable characters).

* `true` iff this encoding is on the list of legacy single-byte encodings

* in the spec or is x-user-defined.

inline bool IsSingleByte() const { return encoding_is_single_byte(this); }

* Checks whether the bytes 0x00...0x7F map exclusively to the characters

* U+0000...U+007F and vice versa.

inline bool IsAsciiCompatible() const {
  return encoding_is_ascii_compatible(this);
}

* Checks whether this is a Japanese legacy encoding.

inline bool IsJapaneseLegacy() const {
  return this == SHIFT_JIS_ENCODING || this == EUC_JP_ENCODING ||
         this == ISO_2022_JP_ENCODING;
}

* Returns the _output encoding_ of this encoding. This is UTF-8 for

* UTF-16BE, UTF-16LE and replacement and the encoding itself otherwise.

inline NotNull<const mozilla::Encoding*> OutputEncoding() const {
  return WrapNotNull(encoding_output_encoding(this));
}
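The output-encoding rule stated above is small enough to sketch directly with pointer comparisons. The types and functions below (`ToyEncoding`, `ToyOutputEncoding`, `ToyCanEncodeEverything`) are hypothetical illustrations, not the implementation behind `encoding_output_encoding`.

```cpp
struct ToyEncoding {
  const char* name;
};

static const ToyEncoding kToyUtf8 = {"UTF-8"};
static const ToyEncoding kToyUtf16Le = {"UTF-16LE"};
static const ToyEncoding kToyUtf16Be = {"UTF-16BE"};
static const ToyEncoding kToyReplacement = {"replacement"};
static const ToyEncoding kToyWindows1252 = {"windows-1252"};

// UTF-16LE, UTF-16BE and replacement use UTF-8 as their output encoding;
// every other encoding is its own output encoding.
inline const ToyEncoding* ToyOutputEncoding(const ToyEncoding* aEncoding) {
  if (aEncoding == &kToyUtf16Le || aEncoding == &kToyUtf16Be ||
      aEncoding == &kToyReplacement) {
    return &kToyUtf8;
  }
  return aEncoding;
}

// Mirrors the remark that only a UTF-8 output encoding can encode every
// code point.
inline bool ToyCanEncodeEverything(const ToyEncoding* aEncoding) {
  return ToyOutputEncoding(aEncoding) == &kToyUtf8;
}
```

Under this rule a UTF-16LE page can encode everything on form submission, while a windows-1252 page cannot.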

* Decode complete input to `nsACString` _with BOM sniffing_ and with

* malformed sequences replaced with the REPLACEMENT CHARACTER when the

* entire input is available as a single buffer (i.e. the end of the

* buffer marks the end of the stream).

* This method implements (the non-streaming version of) the _decode_ spec

* concept.

* The second item in the returned tuple is the encoding that was actually

* used (which may differ from this encoding thanks to BOM sniffing).

* Returns `NS_ERROR_OUT_OF_MEMORY` upon OOM, `NS_OK_HAD_REPLACEMENTS`

* if there were malformed sequences (that were replaced with the

* REPLACEMENT CHARACTER) and `NS_OK` otherwise as the first item of the

* tuple.

* The backing buffer of the string isn't copied if the input buffer

* is heap-allocated and decoding from UTF-8 and the input is valid

* BOMless UTF-8, decoding from an ASCII-compatible encoding and

* the input is valid ASCII or decoding from ISO-2022-JP and the

* input stays in the ASCII state of ISO-2022-JP. It is OK to pass

* the same string as both arguments.

* _Note:_ It is wrong to use this when the input buffer represents only

* a segment of the input instead of the whole input. Use `NewDecoder()`

* when decoding segmented input.

inline std::tuple<nsresult, NotNull<const mozilla::Encoding*>> Decode(
    const nsACString& aBytes, nsACString& aOut) const {
  const Encoding* encoding = this;
  const nsACString* bytes = &aBytes;
  nsACString* out = &aOut;
  nsresult rv;
  if (bytes == out) {
    // Input and output alias each other, so decode from a copy.
    nsAutoCString temp(aBytes);
    rv = mozilla_encoding_decode_to_nscstring(&encoding, &temp, out);
  } else {
    rv = mozilla_encoding_decode_to_nscstring(&encoding, bytes, out);
  }
  return {rv, WrapNotNull(encoding)};
}
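Allowing the same string as both arguments requires a guard, because writing the output would otherwise overwrite the input while it is still being read. The sketch below shows that guard with `std::string`; `ToyDecodeInto` and `ToyDecode` are hypothetical names, and upper-casing stands in for real decoding purely for illustration.

```cpp
#include <string>

// Toy "decoder": clears the output, then writes the input upper-cased.
// Calling it with aliased arguments would lose the input at `clear()`.
inline void ToyDecodeInto(const std::string& aBytes, std::string& aOut) {
  aOut.clear();
  for (char c : aBytes) {
    if (c >= 'a' && c <= 'z') c = static_cast<char>(c - 'a' + 'A');
    aOut.push_back(c);
  }
}

// Wrapper that makes aliasing safe by decoding from a temporary copy.
inline void ToyDecode(const std::string& aBytes, std::string& aOut) {
  if (&aBytes == &aOut) {
    std::string temp(aBytes);
    ToyDecodeInto(temp, aOut);
  } else {
    ToyDecodeInto(aBytes, aOut);
  }
}
```

The copy is made only in the aliased case, so the common case of distinct strings pays nothing extra.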

* Decode complete input to `nsAString` _with BOM sniffing_ and with