[Python-Dev] PEP 528: Change Windows console encoding to UTF-8

eryk sun eryksun at gmail.com
Mon Sep 5 17:40:32 EDT 2016


On Mon, Sep 5, 2016 at 7:54 PM, Steve Dower <steve.dower at python.org> wrote:

> On 05Sep2016 1234, eryk sun wrote:

>> Also, the console is UCS-2, which can't be transcoded between UTF-16 and UTF-8. Supporting UCS-2 in the console would integrate nicely with the filesystem PEP. It makes it always possible to print os.listdir('.'), copy and paste, and read it back without data loss.
>
> Supporting UTF-8 actually works better for this. We already use surrogatepass explicitly (on the filesystem side, with PEP 529) and implicitly (on the console side, using the Windows conversion API).

Conversion to and from CP_UTF8 requires valid UTF-16 text, so MultiByteToWideChar and WideCharToMultiByte are of no practical use here; a lone surrogate doesn't round-trip, it comes back as the U+FFFD replacement character. For example:

>>> raw_read = sys.stdin.buffer.raw.read
>>> _ = write_console_input('\ud800\ud800\r\n'); raw_read(16)
��
b'\xef\xbf\xbd\xef\xbf\xbd\r\n'
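
For the record, here's roughly how the conversion API behaves (an untested ctypes sketch; the kernel32 wrapper and the helper name are mine): with CP_UTF8, WideCharToMultiByte either replaces a lone surrogate or fails outright, so it can't round-trip arbitrary console data.

import ctypes

kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
CP_UTF8 = 65001
WC_ERR_INVALID_CHARS = 0x0080

def utf16_to_utf8(text, flags=0):
    # Ask for the required buffer size, then convert for real.
    n = kernel32.WideCharToMultiByte(CP_UTF8, flags, text, len(text),
                                     None, 0, None, None)
    if not n:
        raise ctypes.WinError(ctypes.get_last_error())
    buf = ctypes.create_string_buffer(n)
    kernel32.WideCharToMultiByte(CP_UTF8, flags, text, len(text),
                                 buf, n, None, None)
    return buf.raw

# Expected behaviour, per the WideCharToMultiByte documentation:
#   utf16_to_utf8('\ud800a')                       -> b'\xef\xbf\xbda'
#       (the lone surrogate becomes U+FFFD)
#   utf16_to_utf8('\ud800a', WC_ERR_INVALID_CHARS) -> OSError
#       (ERROR_NO_UNICODE_TRANSLATION)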

Round-tripping lone surrogates requires Python's "surrogatepass" error handler. That handler is also needed to decode UTF-8 that's potentially WTF-8, such as the result of os.listdir(b'.'). Bytes coming from the wild may contain invalid sequences other than lone surrogates, so the decoder also needs to fall back on "replace" for errors that "surrogatepass" doesn't handle.
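
As a rough illustration of that fallback (untested; the registered handler name is made up), one could chain the two error handlers: try "surrogatepass" first and fall back on "replace" for anything it can't handle.

import codecs

def sp_or_replace(exc):
    # Pass encoded lone surrogates through; for any other invalid
    # sequence, substitute U+FFFD the way "replace" does.
    try:
        return codecs.lookup_error('surrogatepass')(exc)
    except UnicodeError:
        return codecs.lookup_error('replace')(exc)

codecs.register_error('surrogatepass_or_replace', sp_or_replace)

# b'\xed\xa0\x80' is WTF-8 for a lone U+D800; b'\xff' is plain garbage:
#   b'\xed\xa0\x80\xff'.decode('utf-8', 'surrogatepass_or_replace')
#   should give '\ud800\ufffd'.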

> Writing a partial character is easily avoidable by the user. We can either fail with an error or print garbage, and currently printing garbage is the most compatible behaviour. (Also occurs on Linux - I have a VM running this week for testing this stuff.)

Are you sure about that? The internal screen buffer of a Linux terminal is bytes; it doesn't transcode to a wide-character format. In the Unix world, almost everything is "get a byte, get a byte, get a byte, byte, byte". Here's what I see in Ubuntu using GNOME Terminal, for example:

>>> raw_write = sys.stdout.buffer.raw.write
>>> b = 'αβψδε\n'.encode()
>>> b
b'\xce\xb1\xce\xb2\xcf\x88\xce\xb4\xce\xb5\n'
>>> for c in b: _ = raw_write(bytes([c]))
...
αβψδε

Here it is on Windows with your patch:

>>> raw_write = sys.stdout.buffer.raw.write
>>> b = 'αβψδε\n'.encode()
>>> b
b'\xce\xb1\xce\xb2\xcf\x88\xce\xb4\xce\xb5\n'
>>> for c in b: _ = raw_write(bytes([c]))
...
����������

For the write case, this can be addressed by identifying an incomplete sequence at the tail end and either buffering it as 'written' or rejecting it so the user/buffer can try again with the complete sequence. Rejection isn't a good option when the incomplete sequence starts at index 0; that case should be buffered, and I prefer buffering in all cases.
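
Just to sketch what I mean (untested; the helper name is mine), identifying an incomplete tail is a short backwards scan:

def split_incomplete_tail(data):
    """Split `data` into (complete, tail), where `tail` is a trailing
    UTF-8/WTF-8 sequence that is still missing continuation bytes
    (possibly empty). Validation is left to the decoder; a real
    implementation would also avoid holding back impossible lead bytes
    such as 0xFF."""
    for i in range(1, min(4, len(data)) + 1):
        b = data[-i]
        if b < 0x80:            # ASCII: nothing incomplete here
            break
        if b >= 0xC0:           # lead byte: length implied by high bits
            need = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            if need > i:        # the sequence is cut short at the tail
                return data[:-i], data[-i:]
            break
        # else 0x80-0xBF is a continuation byte; keep scanning backwards
    return data, b''

The write can then hand `complete` to the decoder and hold `tail` for the next call (or a flush), while reporting the full length as written.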

>> It would probably be simpler to use UTF-16 in the main pipeline and implement Martin's suggestion to mix in a UTF-8 buffer. The UTF-16 buffer could be renamed as "wbuffer", for expert use. However, if you're fully committed to transcoding in the raw layer, I'm certain that these problems can be addressed with small buffers and using Python's codec machinery for a flexible mix of "surrogatepass" and "replace" error handling.
>
> I don't think it actually makes things simpler. Having two buffers is generally a bad idea unless they are perfectly synced, which would be impossible here without data corruption (if you read half a utf-8 character sequence and then read the wide buffer, do you get that character or not?).

Martin's idea, as I understand it, is a UTF-8 buffer that reads from and writes to the text wrapper. For reading, it necessarily consumes at least one character from the text wrapper and buffers the encoded bytes to allow reading per byte. Likewise for writing, it buffers bytes until it can write a complete character to the text wrapper. ISTM it has to look for an incomplete lead/continuation byte sequence at the tail end and hold it until the sequence is complete, at which point it decodes either to a valid character or to the U+FFFD replacement character.
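
Something along these lines is what I have in mind for the write side (an untested sketch, names are mine); an incremental decoder gives the incomplete-tail buffering for free:

import codecs

# One incremental decoder per output stream; it holds an incomplete
# trailing sequence internally until the rest of the bytes arrive.
# (The combined surrogatepass/replace handler sketched earlier could be
# used here instead of plain "surrogatepass".)
_decoder = codecs.getincrementaldecoder('utf-8')('surrogatepass')

def wtf8_write(text_stream, data):
    """Write UTF-8/WTF-8 bytes to `text_stream`, buffering an incomplete
    trailing sequence for the next call."""
    text = _decoder.decode(bytes(data))
    if text:
        text_stream.write(text)
    return len(data)    # the buffered tail still counts as 'written'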

Also, I found that read(n) has to read a character at a time. That's the only way to emulate line-input mode: detect "\n" and stop reading. Technically this is implemented in a RawIOBase, which dictates that operations should use a single system call, but since it's interfacing with a text wrapper around a buffer around the actual UCS-2 raw console stream, any notion of a 'system call' would be a sham.
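
The read side of the same sketch might look roughly like this (again untested), pulling one character at a time so that "\n" can end the read the way line-input mode does:

def wtf8_read(text_stream, size):
    """Read up to about `size` UTF-8/WTF-8 bytes from `text_stream`, one
    character at a time, stopping at a newline."""
    out = bytearray()
    while len(out) < size:
        ch = text_stream.read(1)
        if not ch:              # EOF, e.g. Ctrl+Z at the console
            break
        out += ch.encode('utf-8', 'surrogatepass')
        if ch == '\n':
            break
    # A real implementation would stash anything beyond `size` (the tail
    # of a multibyte character) for the next call instead of returning
    # more bytes than were requested.
    return bytes(out)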

Because of the UTF-8 buffering there is a synchronization issue, but it has character granularity. For example, when decoding UTF-8, you don't get half of a surrogate pair. You decode the full character, and write that as a discrete unit to the text wrapper. I'd have to experiment to see how bad this can get. If it's too confusing the idea isn't practical.

On the plus side, when working with text it's all native UCS-2 up to the TextIOWrapper, so it's as efficient and as simple as possible. You don't have to worry about transcoding and dealing with partial surrogate pairs and partial UTF-8 sequences. All of that complexity is exported to the pure-Python UTF-8 buffer mixin, and it's not as bad there either, because the interface is Text <=> WTF-8 instead of UCS-2 <=> WTF-8, and you don't have to limit yourself to a single read or write. But that's a drawback for anyone using the buffer's raw stream on the presumption that it makes only a single, thread-safe system call.


