[Python-Dev] Help with Unicode arrays in NumPy
Travis E. Oliphant oliphant.travis at ieee.org
Tue Feb 7 21:23:13 CET 2006
Martin v. Löwis wrote:
> Travis E. Oliphant wrote:
> > Currently that means that they are "unicode" strings of basic size UCS2 or UCS4 depending on the platform. It is this duality that has some people concerned. For all other data-types, NumPy allows the user to explicitly request a bit-width for the data-type.
>
> Why is that a desirable property? Also: why does NumPy have support for Unicode arrays in the first place?
NumPy supports arrays of arbitrary fixed-length "records"; it handles much more than numeric-only data now. One of the fields a record can contain is a string. If strings are supported, it makes sense to support Unicode strings as well.
This allows NumPy to memory-map arbitrary data-files on disk.
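To make this concrete, here is a minimal sketch (the file name and field layout are made up, and the spelling uses today's NumPy dtype codes) of a record array with a fixed-width Unicode field that is memory-mapped from disk:

    import numpy as np

    # Hypothetical record layout: a 4-character Unicode tag plus a float value.
    # 'U4' stores exactly 4 UCS-4 code points (16 bytes) per element.
    rec_dtype = np.dtype([('tag', 'U4'), ('value', 'f8')])

    # Write a tiny sample file so the memory-map below has something to read.
    np.array([(u'spam', 1.0), (u'eggs', 2.0)], dtype=rec_dtype).tofile('records.dat')

    # Memory-map the file as an array of records; the data is not copied.
    recs = np.memmap('records.dat', dtype=rec_dtype, mode='r')
    print(recs['tag'])    # ['spam' 'eggs']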
Perhaps you should explain why you think NumPy "shouldn't support Unicode".
> My initial reaction is: use whatever Python uses in "NumPy Unicode". Upon closer inspection, it is not all that clear what operations are supported on a Unicode array, and how these operations relate to the Python Unicode type.
That is currently what is done. The current unicode data-type is exactly what Python uses.
The chararray subclass gives unicode and string arrays all of the methods of Python unicode and string objects (operating on an element-by-element basis).
When you extract an element from the unicode data-type you get a Python unicode object (every NumPy data-type has a corresponding "type-object" that determines what is returned when an element is extracted). All of these types are in a hierarchy of data-types which inherit from the basic Python types when available.
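A short sketch of what that looks like in practice (using current NumPy on Python 3, where the scalar type is numpy.str_, a subclass of the built-in str):

    import numpy as np

    # Element-wise string methods via the chararray subclass.
    names = np.char.array([u'alpha', u'beta'])
    print(names.upper())            # ['ALPHA' 'BETA']

    # Extracting an element yields a scalar that inherits from the Python
    # unicode/str type, so it behaves like an ordinary Python string.
    elem = np.array([u'abc'], dtype='U3')[0]
    print(type(elem))               # <class 'numpy.str_'>
    print(isinstance(elem, str))    # True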
> In any case, I think NumPy should have only a single "Unicode array" type (please do explain why having zero of them is insufficient).
Please explain why having zero of them is sufficient.
> If the purpose of the type is to interoperate with a Python unicode object, it should use the same width (as this will allow for memcpy).
> If the purpose is to support arbitrary Unicode characters, it should use 4 bytes (as two bytes are insufficient to represent arbitrary Unicode characters).
And Python does not support arbitrary Unicode characters on narrow builds? Then how is \U0010FFFF represented?
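For reference, the narrow-build behaviour being alluded to is the surrogate-pair representation: a character outside the BMP is stored as two 16-bit code units, the same pair you get by encoding to UTF-16. A small sketch, runnable on any current Python:

    # \U0010FFFF does not fit in a single 16-bit code unit, so a narrow
    # (UCS-2) Python build stores it as a surrogate pair.  Encoding to
    # UTF-16 (big-endian, no BOM) shows the two code units involved.
    ch = u'\U0010FFFF'
    units = ch.encode('utf-16-be')
    print(len(units) // 2)    # 2 code units
    print([hex(int.from_bytes(units[i:i+2], 'big')) for i in (0, 2)])
    # ['0xdbff', '0xdfff'] -- the high and low surrogates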
> If the purpose is something else, please explain what the purpose is.
The purpose is to represent bytes as they might exist in a file or data-stream according to the user's specification. The purpose is whatever the user wants them for. It's the same purpose as having an unsigned 64-bit data-type --- because users may need it to represent data as it exists in a file.
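To illustrate that analogy (the field names and layout here are made up), the same explicit-width, explicit-byte-order spelling works for the unsigned 64-bit field and the Unicode field alike:

    import numpy as np

    # '<u8' : little-endian unsigned 64-bit integer (8 bytes)
    # '<U5' : 5-character little-endian UCS-4 Unicode field (20 bytes)
    spec = np.dtype([('tag', '<U5'), ('count', '<u8')])

    # Round-trip through a raw byte string, as if it came from a file or stream.
    raw = np.array([(u'hello', 42)], dtype=spec).tobytes()
    rec = np.frombuffer(raw, dtype=spec)
    print(rec['tag'][0], rec['count'][0])    # hello 42
    print(spec.itemsize)                     # 28 bytes per record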