[Python-Dev] Unicode support in getargs.c

M.-A. Lemburg mal@lemburg.com
Sun, 06 Jan 2002 17:58:41 +0100


Jack Jansen wrote:

I'm going to jump out of this discussion for a while. Martin and Mark have a completely different view on Unicode than I do, apparently, and I think I should first try and see if I can use the current implementation.

For the record: my view of Unicode is really "ascii done right", i.e. a datatype that allows you to get richer characters than what 1960s ascii gives you. For this it should be as backward-compatible as possible, i.e. if some API expects a unicode filename and I pass "a.out" it should interpret it as u"a.out". All the converting to different charsets is icing on the cake, the number one priority should be that unicode is as compatible as possible with the 8-bit convention used on the platform (whatever it may be). No, make that the number 2 priority: the number one priority is compatibility with 7-bit ascii. Using Python StringObjects as binary buffers is also far less common than using StringObjects to store plain old strings, so if either of these uses bites the other it's the binary buffer that needs to suffer. UnicodeObjects and StringObjects should behave pretty orthogonal to how FloatObjects and IntObjects behave.

It would be nice if Unicode could be made to behave that way, but unfortunately, the 8-bit world is so differentiated with lots of different encodings that not even Harry Potter would have much luck finding the right magic to apply.
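To make the point above concrete, here is a minimal sketch, written for Python 3, where `bytes` stands in for the 8-bit string type under discussion. The helper name `decode_all` and the three code pages are assumptions chosen purely for illustration. It shows that a 7-bit ASCII name decodes identically everywhere, while a single high byte does not.

```python
# Hypothetical helper: decode the same bytes under several 8-bit encodings.
ENCODINGS = ("latin-1", "cp1251", "cp437")

def decode_all(data, encodings=ENCODINGS):
    """Return {encoding: decoded text} for the given byte string."""
    return {enc: data.decode(enc) for enc in encodings}

ascii_name = decode_all(b"a.out")
high_byte = decode_all(b"\xe4")

# All three encodings agree on 7-bit ASCII ...
print(set(ascii_name.values()))

# ... but each assigns a different character to byte 0xE4.
for enc, text in high_byte.items():
    print(enc, "U+%04X" % ord(text))
```

Without knowing which encoding produced the bytes, no automatic conversion can pick the right character for the high byte.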

Another problem is that of the getargs.c API itself: since it returns pointers to data buffers, auto-conversions (if at all possible) which involve temporary objects must be handled differently than normal Python string objects.

Now, the question is whether you are willing to pay for the comfort of getting direct access to a Py_UNICODE buffer (or char buffer) with extra copy-action and additional PyMem_Free() cleanup overhead or not. The "O" parser marker doesn't provide any magic on its own, but it also reduces the need for copying data and handling memory management in your APIs.

In my last message on this thread, I proposed to add "eu#" which returns a Py_UNICODE buffer, possibly decoding a string object using the given encoding first. As Martin noted, this option requires extra copying but simplifies the C coding somewhat.
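The decision logic behind the proposed "eu#" marker can be modelled at the Python level. This is only a rough sketch under stated assumptions: the function name `to_unicode` is hypothetical, Python 3 `str` and `bytes` stand in for Unicode and 8-bit string objects, and the real marker would be C code that fills a Py_UNICODE buffer which the caller then releases with PyMem_Free().

```python
def to_unicode(obj, encoding):
    """Hypothetical model of the proposed "eu#" conversion."""
    if isinstance(obj, str):
        # Already Unicode: hand it back as-is, no decoding needed.
        return obj
    if isinstance(obj, bytes):
        # 8-bit string: decode it using the encoding named by the caller.
        return obj.decode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(obj).__name__)

# A plain ASCII filename passes through either way.
print(to_unicode(b"a.out", "ascii"))
print(to_unicode("a.out", "ascii"))
```

In this model a byte string that is not valid in the given encoding raises UnicodeDecodeError instead of being guessed at, which mirrors the need for the caller to name the encoding explicitly.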

-- Marc-Andre Lemburg CEO eGenix.com Software GmbH


Company & Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/