[Python-Dev] Help with Unicode arrays in NumPy (original) (raw)

Stephen J. Turnbull stephen at xemacs.org
Wed Feb 8 08:19:30 CET 2006


"Travis" == Travis E Oliphant <oliphant.travis at ieee.org> writes:

Travis> Numpy supports arrays of arbitrary fixed-length "records".
Travis> It is much more than numeric-only data now.  One of the
Travis> fields that a record can contain is a string.  If strings
Travis> are supported, it makes sense to support unicode strings
Travis> as well.

That is not obvious. A string is really an array of bytes, which for historical reasons in some places (primarily the U.S. of A.) can be used to represent text. Unicode, on the other hand, is intended to represent text streams robustly and does so in a universal but flexible way ... but all of the different Unicode transformation formats are considered to represent the identical text stream. Some applications may specify a transformation format, others will not.

In any case, internally Python is only going to support one; all the others must be read in through codecs anyway. See below.

Travis> This allows NumPy to memory-map arbitrary data-files on
Travis> disk.

In the case where a transformation format is specified, I don't see why you can't use a byte array field (ie, ordinary "string") of appropriate size for this purpose, and read it through a codec when it needs to be treated as text. This is going to be necessary in essentially all of the cases I encounter, because the files are UTF-8 and sane internal representations are either UTF-16 or UTF-32. In particular, Python's internal representation is 16 or 32 bits wide.

Travis> Perhaps you should explain why you think NumPy "shouldn't
Travis> support Unicode"

Because it can't, not in the way you would like to, if I understand you correctly. Python chooses one of the many standard representations for internal use, and because of the way the standard is specified, it doesn't matter which one! And none of the others can be represented directly, all must be decoded for internal use and encoded when written back to external media. So any memory mapping application is inherently nonportable, even across Python implementations.

Travis> And Python does not support arbitrary Unicode characters
Travis> on narrow builds?  Then how is \U0010FFFF represented?

In a way incompatible with the concept of character array. Now what do you do?

The point is that Unicode is intentionally designed in such a way that a plethora of representations is possible, but all are easily and reliably interconverted. Implementations are then free to choose an appropriate internal representation, knowing that conversion from external representations is "cheap" and standardized.

-- School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN Ask not how you can "do" free software business; ask what your business can "do for" free software.



More information about the Python-Dev mailing list