[Python-Dev] Unicode source code
M.-A. Lemburg mal@lemburg.com
Sun, 09 Feb 2003 17:39:59 +0100
Just van Rossum wrote:
> M.-A. Lemburg wrote:
>> Just van Rossum wrote:
>>> Now that PEP 263 is in place (yet hotly debated on c.l.py ;-), wouldn't it be a fairly small step to fully support unicode strings in compile(), eval() and exec? I notice these still attempt to convert unicode to 8 bit with the default encoding, which isn't very useful.
>> Patches are most welcome.
> Some guidance on where to look is more than welcome.
The tokenizer/compiler works as follows (quote from another email):
"""
source code using encoding ENC
  -> via codec for ENC into Unicode
  -> via UTF-8 codec into UTF-8 string
  -> tokenizer
  -> compiler
     for 8-bit string literals in the source code
     -> UTF-8 string is converted back into encoding ENC

Provided that the encoding ENC is roundtrip safe for all 256 base character ordinals, 8-bit strings will turn out as-is in the compiled byte code.
"""
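The roundtrip condition in the quote can be checked outside the interpreter. The following is only a rough sketch in present-day Python 3, not the tokenizer's actual code; the helper names `roundtrip_literal` and `is_roundtrip_safe` are made up for illustration.

```python
def roundtrip_literal(raw: bytes, enc: str) -> bytes:
    """Mimic the chain for the bytes of an 8-bit string literal (illustrative only)."""
    text = raw.decode(enc)            # via codec for ENC into Unicode
    utf8 = text.encode("utf-8")       # via UTF-8 codec into UTF-8 string
    # ... the tokenizer and compiler would operate on `utf8` here ...
    return utf8.decode("utf-8").encode(enc)  # literal converted back into ENC


def is_roundtrip_safe(enc: str) -> bool:
    """True if all 256 byte values survive the chain unchanged under `enc`."""
    all_bytes = bytes(range(256))
    try:
        return roundtrip_literal(all_bytes, enc) == all_bytes
    except UnicodeError:
        # Some byte value has no mapping in this encoding.
        return False


if __name__ == "__main__":
    for name in ("latin-1", "ascii"):
        print(name, is_roundtrip_safe(name))
```

Latin-1 maps every byte value to a code point and back, so it passes; ASCII fails as soon as a byte above 127 is decoded.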
Now, to accept Unicode it would probably be worthwhile to hook into this chain at step 2 rather than step 1 (the code for the tokenizer is in Parser/tokenizer.c, the compiler code in Python/compile.c). However, this is difficult because most APIs for compiling code are built on char* buffers.
A short-term solution would probably be to convert Unicode to UTF-8 and prepend a UTF-8 BOM so that the tokenizer knows that it is getting UTF-8. I haven't tested this, though.
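A minimal sketch of that idea, written against present-day Python 3, where compile() accepts a bytes object and recognizes a leading UTF-8 BOM. The wrapper name `compile_unicode` is hypothetical, and in Python 3 a str can of course be passed to compile() directly, so this only illustrates the BOM trick rather than anything the interpreter needs.

```python
import codecs


def compile_unicode(source: str, filename: str = "<unicode>", mode: str = "exec"):
    """Encode Unicode source as UTF-8, prepend the UTF-8 BOM, and compile it.

    Hypothetical helper: the BOM tells the tokenizer that the byte buffer
    is UTF-8, so no coding declaration is needed in the source itself.
    """
    data = codecs.BOM_UTF8 + source.encode("utf-8")
    return compile(data, filename, mode)


if __name__ == "__main__":
    namespace = {}
    exec(compile_unicode("greeting = 'caf\u00e9'"), namespace)
    print(namespace["greeting"])
```

Because the buffer handed to compile() is still a plain byte string, this route would not need any change to the char*-based compiling APIs.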
A slightly better solution (on narrow Unicode Python builds) would be to use UTF-16 for this. The UTF-16 support in the tokenizer would have to be enabled for this, though. It is currently disabled for some reason I don't remember. Martin should know... but he's on vacation.
-- Marc-Andre Lemburg eGenix.com