[Python-Dev] Encoding of PyFrameObject members

Chris Angelico rosuav at gmail.com
Fri Feb 6 01:56:36 CET 2015


On Fri, Feb 6, 2015 at 10:27 AM, Francis Giraldeau <francis.giraldeau at gmail.com> wrote:

> Instead, I access members directly:
> char *str = PyUnicode_DATA(frame->f_code->co_filename);
> size_t len = PyUnicode_GET_DATA_SIZE(frame->f_code->co_filename);
>
> Is it safe to assume that unicode objects co_filename and co_name are always UTF-8 data for loaded code? I looked at the PyTokenizer_FromString() and it seems to convert everything to UTF-8 upfront, and I would like to make sure this assumption is valid.

I don't think you should be using _GET_DATA_SIZE with _DATA - they're mix-and-matched from old and new APIs. If you want a raw, no-allocation look at the data, you'd need to check PyUnicode_KIND and then read Latin-1, UCS-2, or UCS-4 data:

https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_1BYTE_DATA
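
To make the kind check concrete, here is a minimal sketch in plain C that models the three storage widths and transcodes them to UTF-8. It deliberately does not include Python.h: the names model_kind, model_read and model_to_utf8 are hypothetical stand-ins, not CPython API. In real extension code the corresponding pieces would be PyUnicode_KIND, PyUnicode_DATA, PyUnicode_READ and PyUnicode_GET_LENGTH.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND
   and PyUnicode_4BYTE_KIND; each value is the bytes per code point. */
enum model_kind { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

/* Read the code point at `index`, in the spirit of PyUnicode_READ(). */
uint32_t model_read(enum model_kind kind, const void *data, size_t index)
{
    assert(kind == KIND_1BYTE || kind == KIND_2BYTE || kind == KIND_4BYTE);
    switch (kind) {
    case KIND_1BYTE: return ((const uint8_t *)data)[index];   /* Latin-1 */
    case KIND_2BYTE: return ((const uint16_t *)data)[index];  /* UCS-2 */
    default:         return ((const uint32_t *)data)[index];  /* UCS-4 */
    }
}

/* Encode `length` code points as UTF-8 into `out` (capacity `cap`).
   Returns the number of bytes written, or (size_t)-1 if `out` is too
   small. No validation of surrogates or range is attempted. */
size_t model_to_utf8(enum model_kind kind, const void *data, size_t length,
                     unsigned char *out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < length; i++) {
        uint32_t c = model_read(kind, data, i);
        size_t need = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
        if (n + need > cap)
            return (size_t)-1;
        if (need == 1) {
            out[n++] = (unsigned char)c;
        } else if (need == 2) {
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (need == 3) {
            out[n++] = (unsigned char)(0xE0 | (c >> 12));
            out[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            out[n++] = (unsigned char)(0xF0 | (c >> 18));
            out[n++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return n;
}
```

In this model, a filename such as "é.py" held as Latin-1 occupies four bytes starting with 0xE9, while its UTF-8 form is five bytes starting with 0xC3 0xA9. The raw buffer therefore matches UTF-8 only when every character is ASCII.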

(By the way, I don't think the name "UCS-1" is part of the Unicode spec. But it's an obvious parallel to UCS-2 and UCS-4.)

Getting UTF-8 data out of the structure, if it had indeed been cached, ought to be possible. But I can't see a documented function or macro for doing it. Is there a way? Reaching into the structure and grabbing the utf8 and utf8_length members seems like a bad idea.

ChrisA


