[Python-Dev] Unicode support in getargs.c

M.-A. Lemburg mal@lemburg.com
Thu, 03 Jan 2002 11:34:17 +0100


"Martin v. Loewis" wrote:

>> I have a number of MacOSX API's that expect Unicode buffers, passed as "long count, UniChar *buffer".
>
> Well, my first question would be: Are you sure that UniChar has the same underlying integral type as Py_UNICODE? If not, you lose. So you may need to do even more conversion.

This should be the first thing to check. Also note that Python has two different flavors of Unicode support: UCS-2 and UCS-4, so you'll have to be careful about this too.
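To illustrate the kind of extra conversion a width mismatch implies, here is a minimal sketch. It assumes a UCS-4 build (32-bit code units) on the Python side and a 16-bit UniChar on the Mac side. The names `unichar16`, `ucs4_unit` and `ucs4_to_utf16` are hypothetical stand-ins and not part of Python or the Mac OS X headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins (assumptions, not real API names):
   unichar16 models a 16-bit UniChar, ucs4_unit models Py_UNICODE
   on a UCS-4 build. */
typedef uint16_t unichar16;
typedef uint32_t ucs4_unit;

/* Convert n 32-bit code points to UTF-16 code units.
   Returns the number of 16-bit units written, or (size_t)-1 if the
   output buffer is too small or a value lies above U+10FFFF.
   Values below 0x10000 are copied through unchanged. */
static size_t ucs4_to_utf16(const ucs4_unit *in, size_t n,
                            unichar16 *out, size_t cap)
{
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        ucs4_unit c = in[i];
        if (c < 0x10000) {
            if (used + 1 > cap)
                return (size_t)-1;
            out[used++] = (unichar16)c;
        } else if (c <= 0x10FFFF) {
            /* Needs a surrogate pair: high unit first, then low unit. */
            if (used + 2 > cap)
                return (size_t)-1;
            c -= 0x10000;
            out[used++] = (unichar16)(0xD800 | (c >> 10));
            out[used++] = (unichar16)(0xDC00 | (c & 0x3FF));
        } else {
            return (size_t)-1;
        }
    }
    return used;
}
```

On a UCS-2 build, where both types are 16 bits wide, no such step would be needed and the buffer could be passed through directly.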

>> I have the machinery in bgen to generate code for this, iff "u#" (or something else) would work the same as "s#", i.e. it returns you a pointer and a size, and it would work equally well for unicode objects as for classic strings (after conversion).
>
> I see. u# could be made to work for Unicode objects alone, but it would have to reject string objects.

Martin, I don't agree here: string objects could hold binary UCS-2/UCS-4 data.

Jack, u# cannot auto-convert strings to Unicode since this would require allocation of a temporary object and there's no logic there to free that object after usage.

es# has logic in place which allows either copying the raw data to a buffer you provide or having it allocate a buffer of the right size for you. That's why I proposed to extend it to support Unicode raw data as well.
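The two buffer modes can be modelled roughly as follows. This is only a sketch of the calling convention, not the code in getargs.c: the function name `copy_or_allocate` is hypothetical, `size_t` is used for the length for simplicity, and plain `malloc()`/`free()` stand in for the PyMem allocator.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of the two "es#" buffer modes (an assumption for
   illustration, not the getargs.c implementation).
   - *buffer == NULL: a buffer of the right size is allocated and the
     caller must release it (free() here stands in for PyMem_Free()).
   - *buffer != NULL: the data is copied into the caller's buffer and
     *length holds its capacity on entry.
   On success *length is set to the data length without the trailing
   NUL and 0 is returned; -1 signals an error. */
static int copy_or_allocate(const char *src, size_t src_len,
                            char **buffer, size_t *length)
{
    if (*buffer == NULL) {
        char *fresh = malloc(src_len + 1);
        if (fresh == NULL)
            return -1;              /* out of memory */
        *buffer = fresh;
    } else if (*length < src_len + 1) {
        return -1;                  /* caller's buffer is too small */
    }
    memcpy(*buffer, src, src_len);
    (*buffer)[src_len] = '\0';
    *length = src_len;
    return 0;
}
```

Setting the pointer to NULL before the call selects the allocating mode, so no buffer has to be sized in advance. The cost is a copy of the data and an explicit free afterwards.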

>> But as a general solution it doesn't look right: "How do I call a C routine with a string parameter?" "Use the "s" format and you get the string pointer to pass". "How do I call a C routine with a unicode string parameter?"
>
> For that, the answer is u. But you want the length also. So for that, the answer is u#. But your question is "How do I call a C routine with either a Unicode object or a string object, getting a reasonable Py_UNICODE* and the length?". For that, I'd recommend to use O&, with a conversion function
>
>     PyObject *PyUnicodeOrString(PyObject *o, void *ignored)
>     {
>         if (PyUnicode_Check(o)) {
>             Py_INCREF(o);
>             return o;
>         }
>         if (PyString_Check(o)) {
>             return PyUnicode_FromObject(o);
>         }
>         PyErr_SetString(PyExc_TypeError, "unicode object expected");
>         return NULL;
>     }

Martin, note that PyUnicode_FromObject() already does the Unicode pass-through (even more: it makes sure that you get a true Unicode object, not a subclass).

>> "Use O and PyUnicode_FromObject() and PyUnicode_AsUnicode and make sure you get all your decrefs right and.....".
>
> With the function above, this becomes
>
>     Use O&, passing a PyObject**, the function, and a NULL pointer, using PyUnicode_AS_UNICODE and PyUnicode_GET_SIZE, performing a single DECREF at the end [allowing to specify an encoding is optional]
>
> In this scenario, somebody has to deallocate memory, you cannot get around this. It is your choice whether this is Py_DECREF or PyMem_Free that you have to call (as with the "esomething" conversions); the DECREF is more efficient as it will not copy a Unicode object.
>
>> The "es#" is a very strange beast, and a similar "eu#" would help me a little, but it has some serious drawbacks. Aside from it being completely different from the other converters (being a prefix operator instead of a postfix one, and having a value-return argument) I would also have to pre-allocate the buffer in advance, and that sort of defeats the purpose.
>
> You don't. If you set the buffer to NULL before invoking getargs, you have to PyMem_Free it afterwards.

Right.

Let me see if I can summarize this:

Jack wants to get string and Unicode objects converted to Unicode automagically and then receive a pointer to a Py_UNICODE buffer and a size.

The current solution for this is to use the "O" parser, fetch the object, pass it through PyUnicode_FromObject(), then use PyUnicode_GET_SIZE() and PyUnicode_AS_UNICODE() to access the Py_UNICODE buffer and finally to Py_DECREF() the object returned by PyUnicode_FromObject().

What I proposed was to extend the "es#" parser marker with a new modifier, "eu#", which does all of the above except that it copies the Py_UNICODE data either to a buffer you provide or to a newly allocated buffer, which you then have to PyMem_Free() after use.

How does this sound?

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH

Company & Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/
