Issue 1706460: access to unicodedata (via codepoints or 2-char surrogates)

Currently, most functions of the unicodedata module require a unicode string of length 1 (such as unichr returns) as a parameter. For most uses this is fine, but when working with characters outside the BMP (code points above 0xFFFF) on a narrow Python build, it would be quite handy to access the properties of these characters simply by code point (ordinal). On a narrow build, unichr(x) only works for x <= 0xFFFF, so the other Unicode planes are inaccessible this way.

I believe the Unicode database is already indexed internally by some numerical value such as the code point, or is that not the case?

With this improvement, the whole database would be effectively accessible on narrow Python builds as well, where it isn't possible to pass a one-character string for code points above 0xFFFF. Even if the explicit limitation of unichr is bypassed, e.g. by writing a unicode literal u'\Uxxxxxxxx', the resulting string consists of a surrogate pair and therefore has length 2.
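To illustrate why such a string ends up with length 2, here is a minimal sketch of the standard UTF-16 arithmetic that splits a code point above 0xFFFF into two code units. The function name to_surrogate_pair is only made up for this example; it is not part of unicodedata or any other module.

```python
def to_surrogate_pair(codepoint):
    """Split a supplementary-plane code point into (high, low) surrogates.

    Hypothetical helper for illustration only; follows the UTF-16
    encoding rule for code points in the range 0x10000..0x10FFFF.
    """
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = codepoint - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low
```

On a narrow build each of the two values is stored as a separate string element, which is why the literal above is not a valid one-character argument.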

Alternatively, the respective functions could also accept a two-character string, provided that the sequence can be correctly interpreted as a surrogate-pair representation of a valid Unicode code point.
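The check I have in mind for the two-character case could look roughly like the sketch below. The name codepoint_from_pair is hypothetical and only shows the validation and the arithmetic; how the unicodedata functions would do this internally is a separate question.

```python
def codepoint_from_pair(pair):
    """Return the code point encoded by a two-character surrogate pair.

    Hypothetical helper for illustration only. Raises ValueError if
    the argument is not a high surrogate followed by a low surrogate.
    """
    if len(pair) != 2:
        raise ValueError("expected a string of length 2")
    high, low = ord(pair[0]), ord(pair[1])
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Undo the UTF-16 split: 10 bits from each surrogate plus 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

Any other two-character string (two BMP characters, or surrogates in the wrong order) would be rejected, so the existing behaviour for ordinary strings would not change.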

Currently such behaviour (e.g. code point access) can be emulated with custom datasets derived from the Unicode database, but I believe it should be possible to access the already present data somehow (also on narrow builds), rather than having to duplicate it.