unicode NSE Library — Nmap Scripting Engine documentation

Library methods for handling Unicode strings.

Author:

Copyright © Same as Nmap--See https://nmap.org/book/man-legal.html

Source: https://svn.nmap.org/nmap/nselib/unicode.lua

Functions

chardet (buf, len)

Determine (poorly) the character encoding of a string

cp437_dec (buf, pos)

Decodes a CP437 character

cp437_enc (cp)

Encode a Unicode code point to CP437

decode (buf, decoder, bigendian)

Decode a buffer containing Unicode data.

encode (list, encoder, bigendian)

Encode a list of Unicode code points

transcode (buf, decoder, encoder, bigendian_dec, bigendian_enc)

Transcode a string from one format to another

utf16_dec (buf, pos, bigendian)

Decodes a UTF-16 character.

utf16_enc (cp, bigendian)

Encode a Unicode code point to UTF-16. See RFC 2781.

utf16to8 (from)

Helper function for the common case of UTF-16 to UTF-8 transcoding, such as from a Windows/SMB Unicode string to a printable ASCII (subset of UTF-8) string.

utf8_dec (buf, pos)

Decodes a UTF-8 character.

utf8_enc (cp)

Encode a Unicode code point to UTF-8. See RFC 3629.

utf8to16 (from)

Helper function for the common case of UTF-8 to UTF-16 transcoding, such as from a printable ASCII (subset of UTF-8) string to a Windows/SMB Unicode string.

Functions

chardet (buf, len)

Determine (poorly) the character encoding of a string

First, the string is checked for a Byte-order Mark (BOM). A BOM identifies the encoding as UTF-16 (including its endianness) or UTF-8. If no BOM is found, the first len bytes of the string are examined heuristically.

If null bytes are encountered, UTF-16 is assumed. Endianness is determined by byte position, assuming the null is the high-order byte. Otherwise, if byte values over 127 are found, UTF-8 decoding is attempted. If this fails, the result is 'other', otherwise it is 'utf-8'. If no high bytes are found, the result is 'ascii'.

Parameters

buf

The string/buffer to be identified

len

The number of bytes to inspect in order to identify the string. Default: 100

Return value:

A string describing the encoding: 'ascii', 'utf-8', 'utf-16be', 'utf-16le', or 'other' meaning some unidentified 8-bit encoding
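The heuristic described above can be sketched in Python. This is an illustrative model of the documented behaviour, not the library's Lua source: the name `chardet_sketch` is hypothetical, offsets are 0-based, and Python's UTF-8 decoder is stricter than the library's decoder may be, so edge cases can differ.

```python
import codecs


def chardet_sketch(buf: bytes, length: int = 100) -> str:
    """Guess an encoding label for buf (hypothetical helper, not the NSE API)."""
    head = buf[:min(length, len(buf))]

    # Step 1: a byte-order mark settles the question immediately.
    if head.startswith(b"\xff\xfe"):
        return "utf-16le"
    if head.startswith(b"\xfe\xff"):
        return "utf-16be"
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8"

    # Step 2: a null byte suggests UTF-16. The null is assumed to be the
    # high-order byte, so a null at an even 0-based offset means the high
    # byte comes first (big-endian); an odd offset means little-endian.
    nul = head.find(b"\x00")
    if nul != -1:
        return "utf-16be" if nul % 2 == 0 else "utf-16le"

    # Step 3: no bytes above 127 means plain ASCII.
    if all(b < 0x80 for b in head):
        return "ascii"

    # Step 4: high bytes are present, so try UTF-8. The incremental decoder
    # with final=False tolerates a multi-byte sequence cut off at the end of
    # the inspected window but still raises on malformed bytes.
    try:
        codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
    except UnicodeDecodeError:
        return "other"
    return "utf-8"
```

For example, `chardet_sketch(b"h\x00i\x00")` reports 'utf-16le' because the first null sits at an odd offset, while a lone Latin-1 byte such as `b"h\xe9llo"` fails UTF-8 decoding and is reported as 'other'.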

cp437_dec (buf, pos)

Decodes a CP437 character

Parameters

buf

A string containing the character

pos

The index in the string where the character begins

Return values:

  1. pos The index in the string where the character ended
  2. cp The code point of the character as a number

cp437_enc (cp)

Encode a Unicode code point to CP437

Returns nil if the code point cannot be found in CP437.

Parameters

cp

The Unicode code point as a number

Return value:

A string containing the corresponding CP437 character, or nil if the code point has no CP437 equivalent
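The encode-or-nil contract can be illustrated with Python's built-in `cp437` codec. This is a sketch under assumptions: `cp437_enc_sketch` is a hypothetical name, Python's `None` stands in for Lua's nil, and Python's mapping table (which treats bytes 0x00 to 0x1F as control characters) may not match the library's table byte for byte.

```python
from typing import Optional


def cp437_enc_sketch(cp: int) -> Optional[bytes]:
    """Encode one code point to CP437, or return None if it has no mapping."""
    try:
        return chr(cp).encode("cp437")
    except UnicodeEncodeError:
        # Mirrors the documented "returns nil" behaviour for unmappable input.
        return None
```

Here `cp437_enc_sketch(0xE9)` (U+00E9, "é") yields the single byte 0x82, while `cp437_enc_sketch(0x20AC)` (the euro sign) yields `None` because CP437 has no such character.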

decode (buf, decoder, bigendian)

Decode a buffer containing Unicode data.

Parameters

buf

The string/buffer to be decoded

decoder

A Unicode decoder function (such as utf8_dec)

bigendian

For encodings that care about byte-order (such as UTF-16), set this to true to force big-endian byte order. Default: false (little-endian)

Return value:

A list-table containing the code points as numbers

encode (list, encoder, bigendian)

Encode a list of Unicode code points

Parameters

list

A list-table of code points as numbers

encoder

A Unicode encoder function (such as utf8_enc)

bigendian

For encodings that care about byte-order (such as UTF-16), set this to true to force big-endian byte order. Default: false (little-endian)

Return value:

An encoded string

transcode (buf, decoder, encoder, bigendian_dec, bigendian_enc)

Transcode a string from one format to another

The string will be decoded and re-encoded in one pass. This saves some overhead compared with simply passing the output of unicode.decode to unicode.encode.

Parameters

buf

The string/buffer to be transcoded

decoder

A Unicode decoder function (such as utf16_dec)

encoder

A Unicode encoder function (such as utf8_enc)

bigendian_dec

Set this to true to force big-endian decoding.

bigendian_enc

Set this to true to force big-endian encoding.

Return value:

An encoded string
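The relationship between decode, encode, and transcode can be modelled in Python as follows. This is an illustrative sketch, not the library's Lua source: the `_sketch` names, the 0-based offsets, the convention that a decoder returns the offset just past the character, and the two toy codecs (which only handle the Basic Multilingual Plane) are all assumptions made for the example.

```python
import struct
from typing import Callable, List, Tuple

Decoder = Callable[[bytes, int, bool], Tuple[int, int]]
Encoder = Callable[[int, bool], bytes]


def decode_sketch(buf: bytes, decoder: Decoder, bigendian: bool = False) -> List[int]:
    """Decode a whole buffer into a list of code points."""
    cps, pos = [], 0
    while pos < len(buf):
        pos, cp = decoder(buf, pos, bigendian)
        cps.append(cp)
    return cps


def encode_sketch(cps: List[int], encoder: Encoder, bigendian: bool = False) -> bytes:
    """Encode a list of code points into one byte string."""
    return b"".join(encoder(cp, bigendian) for cp in cps)


def transcode_sketch(buf: bytes, decoder: Decoder, encoder: Encoder,
                     bigendian_dec: bool = False, bigendian_enc: bool = False) -> bytes:
    """Decode and re-encode in one pass, without building the code point list."""
    out, pos = [], 0
    while pos < len(buf):
        pos, cp = decoder(buf, pos, bigendian_dec)
        out.append(encoder(cp, bigendian_enc))
    return b"".join(out)


# Toy codecs for the demonstration (hypothetical, BMP only).
def ucs2_dec_sketch(buf: bytes, pos: int, bigendian: bool = False) -> Tuple[int, int]:
    (cp,) = struct.unpack_from(">H" if bigendian else "<H", buf, pos)
    return pos + 2, cp


def utf8_enc_sketch(cp: int, bigendian: bool = False) -> bytes:
    return chr(cp).encode("utf-8")  # byte order is irrelevant for UTF-8


data = "h\u00e9llo".encode("utf-16-le")
two_pass = encode_sketch(decode_sketch(data, ucs2_dec_sketch), utf8_enc_sketch)
one_pass = transcode_sketch(data, ucs2_dec_sketch, utf8_enc_sketch)
```

Both `two_pass` and `one_pass` hold the same UTF-8 bytes; the one-pass version simply avoids materialising the intermediate list of code points.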

utf16_dec (buf, pos, bigendian)

Decodes a UTF-16 character.

Does not check that the returned code point is a real character. Specifically, it can be fooled by out-of-order lead- and trail-surrogate characters.

Parameters

buf

A string containing the character

pos

The index in the string where the character begins

bigendian

Set this to true to decode big-endian UTF-16. Default is false (little-endian)

Return values:

  1. pos The index in the string where the character ended
  2. cp The code point of the character as a number

utf16_enc (cp, bigendian)

Encode a Unicode code point to UTF-16. See RFC 2781.

Windows versions prior to Windows 2000 support only UCS-2, so be careful when using this function to encode code points above 0xFFFF.

Parameters

cp

The Unicode code point as a number

bigendian

Set this to true to encode big-endian UTF-16. Default is false (little-endian)

Return value:

A string containing the code point in UTF-16 encoding.
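For readers unfamiliar with RFC 2781, the surrogate-pair arithmetic for code points above 0xFFFF can be sketched in Python. This is an illustration of the RFC's encoding rule, not the library's Lua source; `utf16_enc_sketch` is a hypothetical name and the function performs no validation of its input.

```python
import struct


def utf16_enc_sketch(cp: int, bigendian: bool = False) -> bytes:
    """Encode one code point as UTF-16 (one or two 16-bit code units)."""
    fmt = ">H" if bigendian else "<H"
    if cp <= 0xFFFF:
        # Basic Multilingual Plane: a single code unit, identical to UCS-2.
        return struct.pack(fmt, cp)
    # Supplementary planes: subtract 0x10000 to get a 20-bit value, then put
    # the top 10 bits in a lead surrogate and the low 10 bits in a trail.
    v = cp - 0x10000
    lead = 0xD800 | (v >> 10)
    trail = 0xDC00 | (v & 0x3FF)
    return struct.pack(fmt, lead) + struct.pack(fmt, trail)
```

U+1F600, for instance, becomes the code units D83D DE00. A UCS-2-only consumer would read those as two separate, meaningless characters.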

utf16to8 (from)

Helper function for the common case of UTF-16 to UTF-8 transcoding, such as from a Windows/SMB Unicode string to a printable ASCII (subset of UTF-8) string.

Parameters

from

A string in UTF-16, little-endian

Return value:

The string in UTF-8

utf8_dec (buf, pos)

Decodes a UTF-8 character.

Does not check that the returned code point is a real character.

Parameters

buf

A string containing the character

pos

The index in the string where the character begins

Return values:

  1. pos The index in the string where the character ended or nil on error
  2. cp The code point of the character as a number, or an error string
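The two-value return convention (a position and a code point on success, nil and an error string on failure) can be modelled in Python. This is a sketch of how such a decoder might look, not the library's Lua source: `utf8_dec_sketch`, the 0-based offsets, the returned offset pointing just past the character, and the error messages are all assumptions. Like the documented function, it does not reject overlong forms or surrogates.

```python
from typing import Optional, Tuple, Union


def utf8_dec_sketch(buf: bytes, pos: int = 0) -> Tuple[Optional[int], Union[int, str]]:
    """Decode one UTF-8 character; return (next offset, cp) or (None, error)."""
    first = buf[pos]
    if first < 0x80:
        return pos + 1, first  # single-byte (ASCII) character
    if 0xC0 <= first < 0xE0:
        extra, cp = 1, first & 0x1F
    elif 0xE0 <= first < 0xF0:
        extra, cp = 2, first & 0x0F
    elif 0xF0 <= first < 0xF8:
        extra, cp = 3, first & 0x07
    else:
        return None, "invalid leading byte"
    if pos + 1 + extra > len(buf):
        return None, "truncated sequence"
    for b in buf[pos + 1:pos + 1 + extra]:
        if b & 0xC0 != 0x80:
            return None, "invalid continuation byte"
        cp = (cp << 6) | (b & 0x3F)  # append the 6 payload bits
    return pos + 1 + extra, cp
```

A caller checks the first return value before trusting the second, for example `pos, cp = utf8_dec_sketch(data, pos)` followed by `if pos is None: ...`.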

utf8_enc (cp)

Encode a Unicode code point to UTF-8. See RFC 3629.

Does not check that cp is a real character; that is, doesn't exclude the surrogate range U+D800 - U+DFFF and a handful of others.

Parameters

cp

The Unicode code point as a number

Return value:

A string containing the code point in UTF-8 encoding.
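The byte layout defined by RFC 3629 can be sketched in Python. This illustrates the RFC's bit-packing rule and is not the library's Lua source; `utf8_enc_sketch` is a hypothetical name, and raising `ValueError` for out-of-range input is a choice made for the example. As in the documented function, surrogate code points are not excluded.

```python
def utf8_enc_sketch(cp: int) -> bytes:
    """Encode one code point as 1 to 4 UTF-8 bytes."""
    if cp < 0x80:
        return bytes([cp])  # 0xxxxxxx
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),           # 110xxxxx
                      0x80 | (cp & 0x3F)])        # 10xxxxxx
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),          # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:
        return bytes([0xF0 | (cp >> 18),          # 11110xxx
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of Unicode range")
```

Because no range check is made, `utf8_enc_sketch(0xD800)` happily returns the bytes ED A0 80, which strict UTF-8 decoders reject.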

utf8to16 (from)

Helper function for the common case of UTF-8 to UTF-16 transcoding, such as from a printable ASCII (subset of UTF-8) string to a Windows/SMB Unicode string.

Parameters

from

A string in UTF-8

Return value:

The string in UTF-16, little-endian
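To make the byte-level effect of the two helpers concrete, here is a Python approximation using the standard codecs. The `_sketch` names are hypothetical, and Python's codecs are strict (they reject lone surrogates and malformed input), so this only models the behaviour for well-formed strings.

```python
def utf8to16_sketch(s: bytes) -> bytes:
    """UTF-8 in, little-endian UTF-16 out (no byte-order mark is added)."""
    return s.decode("utf-8").encode("utf-16-le")


def utf16to8_sketch(s: bytes) -> bytes:
    """Little-endian UTF-16 in, UTF-8 out."""
    return s.decode("utf-16-le").encode("utf-8")


wire = utf8to16_sketch(b"WORKGROUP")
back = utf16to8_sketch(wire)
```

For pure ASCII input, the UTF-16 form is simply each byte followed by a null byte, which is why such strings are easy to spot in SMB traffic; `back` equals the original `b"WORKGROUP"`.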