[Python-Dev] PEP 528: Change Windows console encoding to UTF-8

Martin Panter vadmium+py at gmail.com
Tue Sep 6 06:34:01 EDT 2016

On 5 September 2016 at 21:40, eryk sun <eryksun at gmail.com> wrote:

> On Mon, Sep 5, 2016 at 7:54 PM, Steve Dower <steve.dower at python.org> wrote:
>> On 05Sep2016 1234, eryk sun wrote:
>>> It would probably be simpler to use UTF-16 in the main pipeline and implement Martin's suggestion to mix in a UTF-8 buffer. The UTF-16 buffer could be renamed as "wbuffer", for expert use. However, if you're fully committed to transcoding in the raw layer, I'm certain that these problems can be addressed with small buffers and using Python's codec machinery for a flexible mix of "surrogatepass" and "replace" error handling.
>>
>> I don't think it actually makes things simpler. Having two buffers is generally a bad idea unless they are perfectly synced, which would be impossible here without data corruption (if you read half a UTF-8 character sequence and then read the wide buffer, do you get that character or not?).
>
> Martin's idea, as I understand it, is a UTF-8 buffer that reads from and writes to the text wrapper.

Yes, that was basically it. Though I had only thought as far as simple encodings like ASCII, where one byte corresponds to one character. I wonder if you really need UTF-8 support. Are the encoding values currently encountered for Windows consoles all single-byte encodings or are they more complicated?
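
As a rough way to probe the single-byte question, Python's bundled codecs can report how many bytes a character occupies in a given code page. This is only a sketch: the code pages listed (cp437, cp1252, cp932) are examples chosen for illustration, not a claim about which ones any particular console is configured to use, and `encoded_width` is a hypothetical helper name.

```python
def encoded_width(char, encoding):
    """Return the number of bytes `char` occupies in `encoding`."""
    return len(char.encode(encoding))


# Example code pages only; the active console code page depends on the system.
samples = [
    ("A", "cp437"),          # OEM United States
    ("\u00e9", "cp1252"),    # Western European, e with acute accent
    ("\u3042", "cp932"),     # Japanese (Shift-JIS based), hiragana "a"
]

widths = {(ch, enc): encoded_width(ch, enc) for ch, enc in samples}
for (ch, enc), width in widths.items():
    print(f"U+{ord(ch):04X} in {enc}: {width} byte(s)")
```

At least for cp932, one character can take two bytes, so a byte-oriented buffer on top of the text wrapper could not assume a one-to-one mapping for that code page.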

> It necessarily consumes at least one character and buffers it to allow reading per byte. Likewise for writing, it buffers bytes until it can write a character to the text wrapper. ISTM, it has to look for incomplete lead-continuation byte sequences at the tail end, to hold them until the sequence is complete, at which time it either decodes to a valid character or the U+FFFD replacement character.

This buffering behaviour would be necessary for a multi-byte encoding like UTF-8.
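
A minimal sketch of the buffering described above, assuming the byte-level buffer simply sits on top of any text stream. The class names `Utf8ToTextWriter` and `TextToUtf8Reader` are hypothetical and are not taken from the PEP or its implementation. The write side relies on the standard incremental UTF-8 decoder, which holds an incomplete trailing sequence until it is completed and, with errors="replace", yields U+FFFD for invalid input. The read side pulls one character at a time and hands out its bytes individually.

```python
import codecs
import io


class Utf8ToTextWriter:
    """Hypothetical adapter: accepts UTF-8 bytes, writes str to a text stream."""

    def __init__(self, text_stream):
        self._text = text_stream
        # The decoder keeps an incomplete tail internally between calls.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def write(self, data):
        chars = self._decoder.decode(data)
        if chars:
            self._text.write(chars)
        return len(data)

    def flush(self):
        # final=True turns a still-incomplete sequence into U+FFFD.
        chars = self._decoder.decode(b"", final=True)
        if chars:
            self._text.write(chars)
        self._text.flush()


class TextToUtf8Reader:
    """Hypothetical adapter: reads str from a text stream, returns UTF-8 bytes."""

    def __init__(self, text_stream):
        self._text = text_stream
        self._pending = b""

    def read(self, size):
        # Consume whole characters until enough bytes are buffered.
        while len(self._pending) < size:
            char = self._text.read(1)
            if not char:
                break
            self._pending += char.encode("utf-8", errors="replace")
        result, self._pending = self._pending[:size], self._pending[size:]
        return result


out = io.StringIO()
writer = Utf8ToTextWriter(out)
writer.write(b"\xe2\x82")      # first two bytes of the euro sign: held back
held_back = out.getvalue()
writer.write(b"\xac")          # sequence completed: one character written
writer.write(b"\xc3")          # incomplete lead byte at the tail
writer.flush()                 # flushed as U+FFFD

reader = TextToUtf8Reader(io.StringIO("\u20aca"))
first = reader.read(1)         # one byte of a three-byte character
rest = reader.read(10)
```

The two directions stay independent here; the sketch does not attempt to keep a separate wide-character buffer in sync, which is the problem raised earlier in the thread.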
