[Python-Dev] Help with Unicode arrays in NumPy

"Martin v. Löwis" martin at v.loewis.de
Tue Feb 7 21:53:16 CET 2006

Travis E. Oliphant wrote:

Numpy supports arrays of arbitrary fixed-length "records". It is much more than numeric-only data now. One of the fields that a record can contain is a string. If strings are supported, it makes sense to support unicode strings as well.

Hmm. How do you support strings in fixed-length records? Strings are variable-sized, after all.

One common application is that you have a C struct in some API which has a fixed-size array for string data (either with a length field, or null-terminated). In this case, it is moderately useful to model such a struct in Python. However, transferring this to Unicode is pointless - there aren't any similar Unicode structs that need support.
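A minimal sketch of the "C struct" case, assuming a hypothetical record with a 32-bit integer followed by an 8-byte, null-terminated name field. The class name `Record` and its field names are invented for illustration; only the standard `ctypes` module is used.

```python
import ctypes


class Record(ctypes.LittleEndianStructure):
    # Hypothetical layout: int32 id, then a fixed 8-byte char array.
    _fields_ = [
        ("ident", ctypes.c_int32),
        ("name", ctypes.c_char * 8),
    ]


# Raw bytes as they might sit in a file: id 42, name "spam" padded with nulls.
raw = (42).to_bytes(4, "little") + b"spam" + b"\x00" * 4

rec = Record.from_buffer_copy(raw)
print(rec.ident, rec.name)  # the char array reads back up to the first null
```

The name field comes back as a byte string, not as text. Turning it into text is a separate decoding step that depends on the encoding the file actually uses.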

This allows NumPy to memory-map arbitrary data-files on disk.

Ok, so this is the "C struct" case. Then why do you need Unicode support there? Which common file format has embedded fixed-size Unicode data?

Perhaps you should explain why you think NumPy "shouldn't support Unicode"

I think I said "Unicode arrays", not Unicode. Unicode arrays are a pointless data type, IMO. Unicode always comes in strings (i.e. variable-sized, either null-terminated or with a leading length field). On disk/on the wire Unicode comes as UTF-8 more often than not.

Using UCS-2/UCS-4 as an on-disk representation is also questionable practice (although admittedly Microsoft uses that a lot).

That is currently what is done. The current unicode data-type is exactly what Python uses.

Then I wonder how this goes along with the use case "allow to map arbitrary files".

The chararray subclass gives to unicode and string arrays all the methods of unicode and strings (operating on an element-by-element basis).

For strings, I can see use cases (although I wonder how you deal with data formats that also support variable-sized strings, as most data formats supporting strings do).

Please explain why having zero of them is sufficient.

Because I (still) cannot imagine any specific application that might need such a feature (IOW, YAGNI).

If the purpose is to support arbitrary Unicode characters, it should use 4 bytes (as two bytes are insufficient to represent arbitrary Unicode characters).

And Python does not support arbitrary Unicode characters on narrow builds? Then how is \U0010FFFF represented?

It's represented using UTF-16. Try this for yourself:

py> len(u"\U0010FFFF")
2
py> u"\U0010FFFF"[0]
u'\udbff'
py> u"\U0010FFFF"[1]
u'\udfff'

This has all kinds of non-obvious implications.
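The same surrogate pair can be reproduced on any current Python 3 build, where `len("\U0010FFFF")` is 1, by encoding the character to UTF-16 explicitly. This is a sketch for illustration; the helper names `utf16_code_units` and `surrogate_pair` are invented here.

```python
def utf16_code_units(ch):
    """Return the 16-bit UTF-16 code units that encode the string ch."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def surrogate_pair(code_point):
    """Compute the (high, low) surrogates for a code point above U+FFFF."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


units = utf16_code_units("\U0010FFFF")
print([hex(u) for u in units])                      # ['0xdbff', '0xdfff']
print([hex(u) for u in surrogate_pair(0x10FFFF)])   # ['0xdbff', '0xdfff']
```

One character occupying two array slots is the kind of non-obvious implication at stake: indexing and slicing a 2-byte representation can split a character in half.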

The purpose is to represent bytes as they might exist in a file or data-stream according to the user's specification.

See, and this is precisely the statement that I challenge. Sure, they "might" exist - but I'd rather expect that they don't.

If they exist, "Unicode" might come as variable-sized UTF-8, UTF-16, or UTF-32. In any case, NumPy should already support that by mapping a string object onto the encoded bytes, to which you then can apply .decode() should you need to process the actual Unicode data.
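The work-around can be sketched with the standard `struct` module: treat the field as a fixed-size byte string and decode it only when the text is needed. The record layout (a 32-bit id plus a 16-byte, null-padded UTF-8 name) and the helper names are assumptions made for this example.

```python
import struct

# Hypothetical record: unsigned 32-bit id + 16-byte, null-padded UTF-8 name.
RECORD = struct.Struct("<I16s")


def pack_record(ident, name):
    """Encode the name and store it in the fixed-size byte field."""
    encoded = name.encode("utf-8")
    if len(encoded) > 16:
        raise ValueError("name does not fit in 16 bytes")
    return RECORD.pack(ident, encoded)  # "16s" pads short values with nulls


def unpack_record(raw):
    """Read the raw bytes back, then decode them to text on demand."""
    ident, name_bytes = RECORD.unpack(raw)
    return ident, name_bytes.rstrip(b"\x00").decode("utf-8")


blob = pack_record(7, "Löwis")
print(unpack_record(blob))
```

The field size counts encoded bytes, not characters, so "Löwis" takes six of the sixteen bytes here.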

The purpose is whatever the user wants them for. It's the same purpose as having an unsigned 64-bit data-type --- because users may need it to represent data as it exists in a file.

No. I would expect you have 64-bit longs because users do need them, and because there wouldn't be an easy work-around if users didn't have them. For Unicode, it's different: users don't directly need them (at least not many users), and if they do, there is an easy work-around for their absence.

Say I want to process NTFS run lists. In NTFS run lists, there are 24-bit integers, 40-bit integers, and 4-bit integers (i.e. nibbles). Can I represent them all in NumPy? Can I have NumPy transparently map a sequence of run list records (which are variable-sized) as an array of run list records?
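To make the question concrete, here is a sketch of parsing such records by hand in plain Python. The layout is an assumption for illustration, loosely modeled on run lists and not a verified description of NTFS: a header byte whose low nibble gives the byte size of an unsigned length field and whose high nibble gives the byte size of a signed offset field, with a zero header byte ending the list. The function names are invented.

```python
def split_nibbles(byte):
    """Split one byte into its (high, low) 4-bit halves."""
    return byte >> 4, byte & 0x0F


def read_int(data, signed=False):
    """Decode a little-endian integer of any byte width (3 bytes = 24 bits)."""
    return int.from_bytes(data, "little", signed=signed)


def parse_runs(buf):
    """Parse variable-sized (length, offset) records under the assumed layout."""
    runs = []
    pos = 0
    while pos < len(buf) and buf[pos] != 0:
        offset_size, length_size = split_nibbles(buf[pos])
        pos += 1
        length = read_int(buf[pos:pos + length_size])
        pos += length_size
        offset = read_int(buf[pos:pos + offset_size], signed=True)
        pos += offset_size
        runs.append((length, offset))
    return runs


# Two records (one with a 24-bit offset, one with a 16-bit offset), then a terminator.
sample = bytes([0x31, 0x10, 0x00, 0x00, 0x01, 0x21, 0x08, 0xF0, 0xFF, 0x00])
print(parse_runs(sample))  # [(16, 65536), (8, -16)]
```

Each record's size is known only after its header byte has been read, so the records cannot be described by a single fixed-size element type.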

Regards, Martin
