[Python-Dev] PEP 528: Change Windows console encoding to UTF-8

Adam Bartoš drekin at gmail.com
Sat Sep 3 05:48:58 EDT 2016

Paul Moore (p.f.moore at gmail.com) on Fri Sep 2 05:23:04 EDT 2016 wrote:

On 2 September 2016 at 03:35, Steve Dower <steve.dower at python.org> wrote:
> I'd need to test to be sure, but writing an incomplete code point should
> just truncate to before that point. It may currently raise OSError if that
> truncated to zero length, as I believe that's not currently distinguished
> from an error. What behavior would you propose?

For "correct" behaviour, you should retain the unwritten bytes, and write them as part of the next call (essentially making the API stateful, in the same way that incremental codecs work). I'm pretty sure that this could cause actual problems, for example I think invoke (https://github.com/pyinvoke/invoke) gets byte streams from subprocesses and dumps them direct to stdout in blocks (so could easily end up splitting multibyte sequences). It's arguable that it should be decoding the bytes from the subprocess and then re-encoding them, but that gets us into "guess the encoding used by the subprocess" territory.

The problem is that we're not going to simply drop some bad data in the common case - it's not so much the dropping of the start of an incomplete code point that bothers me, as the encoding error you hit at the start of the next block of data you send. So people will get random, unexplained, encoding errors.

I don't see an easy answer here other than a stateful API.

Isn't the buffered IO wrapper for this?
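The stateful behaviour Paul describes (retain the unwritten bytes and emit them with the next call) could look roughly like the sketch below. It is only an illustration built on the standard incremental UTF-8 decoder; the class name StatefulUtf8Writer and the write_text callback are hypothetical and are not part of PEP 528 or of any console implementation.

```python
import codecs


class StatefulUtf8Writer:
    """Hypothetical helper: holds back an incomplete trailing UTF-8 sequence."""

    def __init__(self, write_text):
        # write_text is any callable that accepts str (an assumption of this sketch).
        self._write_text = write_text
        # The incremental decoder keeps undecoded trailing bytes between calls.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

    def write(self, data):
        # final=False tells the decoder that more bytes may follow.
        text = self._decoder.decode(data, final=False)
        if text:
            self._write_text(text)
        # All bytes count as consumed: any remainder is stored in the decoder.
        return len(data)


chunks = []
writer = StatefulUtf8Writer(chunks.append)
payload = "na\u00efve \u20ac".encode("utf-8")

# Split the payload in the middle of the two-byte sequence for U+00EF.
writer.write(payload[:3])
writer.write(payload[3:])
print(chunks)
```

The first call emits only the complete characters, and the dangling lead byte is combined with the next block instead of producing an encoding error.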

> Reads of less than four bytes fail instantly, as in the worst case we need
> four bytes to represent one Unicode character. This is an unfortunate
> reality of trying to limit it to one system call - you'll never get a full
> buffer from a single read, as there is no simple mapping between
> length-as-utf8 and length-as-utf16 for an arbitrary string.

And here - "read a single byte" is a not uncommon way of getting some data. Once again see invoke:

https://github.com/pyinvoke/invoke/blob/master/invoke/platform.py#L147

used at

https://github.com/pyinvoke/invoke/blob/master/invoke/runners.py#L548

I'm not saying that there's an easy answer here, but this will break code. And actually, it's in violation of the documentation: see https://docs.python.org/3/library/io.html#io.RawIOBase.read

"""
read(size=-1)

Read up to size bytes from the object and return them. As a convenience, if size is unspecified or -1, readall() is called. Otherwise, only one system call is ever made. Fewer than size bytes may be returned if the operating system call returns fewer than size bytes.

If 0 bytes are returned, and size was not 0, this indicates end of file. If the object is in non-blocking mode and no bytes are available, None is returned.
"""

You're not allowed to return 0 bytes if the requested size was not 0, and you're not at EOF.
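The length mismatch Steve mentions, which is what makes a one-byte read hard to satisfy in a single call, can be seen with a few sample characters. This is just an illustration of the encodings; the sample characters are arbitrary choices and nothing here comes from the console code itself.

```python
# Characters chosen to cover 1-, 2-, 3- and 4-byte UTF-8 encodings.
samples = ["a", "\u00e9", "\u20ac", "\U0001f600"]

lengths = []
for ch in samples:
    utf8_bytes = len(ch.encode("utf-8"))
    # UTF-16 code units are 2 bytes each; "utf-16-le" avoids the BOM.
    utf16_units = len(ch.encode("utf-16-le")) // 2
    lengths.append((utf8_bytes, utf16_units))
    print(f"U+{ord(ch):04X}: {utf8_bytes} UTF-8 byte(s), {utf16_units} UTF-16 unit(s)")
```

One UTF-16 code unit may need one, two or three UTF-8 bytes, and a surrogate pair needs four, so a byte count cannot be translated into a code-unit count without looking at the data.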

That's why it should rather be signaled by an exception. Even when one doesn't transcode UTF-16 to UTF-8, reading just one byte is still impossible, and I would argue that returning 0 bytes is also incorrect there. I raise ValueError in win_unicode_console.
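A rough sketch of that combination: a raw stream that refuses too-small reads with ValueError, wrapped in io.BufferedReader so that read(1) still works for callers. FakeConsoleRaw is a hypothetical stand-in written for this example; it is not the code from win_unicode_console or from the PEP 528 implementation, and the four-byte minimum is taken from Steve's description above.

```python
import io


class FakeConsoleRaw(io.RawIOBase):
    """Hypothetical raw stream that hands out whole UTF-8 characters only."""

    def __init__(self, text):
        super().__init__()
        self._pending = list(text)  # characters not yet delivered

    def readable(self):
        return True

    def readinto(self, b):
        # Signal an unusable request with an exception, not a 0-byte result.
        if len(b) < 4:
            raise ValueError("buffer must hold at least 4 bytes")
        out = bytearray()
        while self._pending:
            encoded = self._pending[0].encode("utf-8")
            if len(out) + len(encoded) > len(b):
                break
            out += encoded
            self._pending.pop(0)
        b[:len(out)] = out
        return len(out)  # 0 only when no characters are left (EOF)


# A direct one-byte read on the raw stream is rejected.
try:
    FakeConsoleRaw("x").read(1)
    small_read_error = None
except ValueError as exc:
    small_read_error = exc

# The buffered wrapper asks the raw stream for a full buffer and then
# serves single bytes from its own buffer.
buffered = io.BufferedReader(FakeConsoleRaw("h\u00e9"))
first = buffered.read(1)
rest = buffered.read()
print(first, rest)
```

With this arrangement, code that reads one byte at a time keeps working as long as it goes through the buffered layer, and only direct use of the raw stream sees the exception.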

Adam Bartoš
