[Python-Dev] PEP 528: Change Windows console encoding to UTF-8
Steve Dower steve.dower at python.org
Mon Sep 5 01:54:32 EDT 2016
I posted a minor update to PEP 528 at https://github.com/python/peps/blob/master/pep-0528.txt and a diff below.
While there are likely to be technical and compatibility issues to resolve after the changes are applied, I don't believe they impact the decision to accept the change at the PEP-level (everyone who has raised potential issues has also been supportive of the change). Without real experience during the beta period, it's really hard to determine whether fixes should be made on our side or their side, so I think it's worth going ahead with the change, even if specific implementation details change between now and release.
Cheers, Steve
@@ -21,8 +21,7 @@
 This PEP proposes changing the default standard stream implementation on Windows
 to use the Unicode APIs. This will allow users to print and input the full range
 of Unicode characters at the default Windows console. This also requires a
-subtle change to how the tokenizer parses text from readline hooks, that should
-have no backwards compatibility issues.
+subtle change to how the tokenizer parses text from readline hooks.
Specific Changes
@@ -46,7 +45,7 @@
 The use of an ASCII compatible encoding is required to maintain compatibility
 with code that bypasses the ``TextIOWrapper`` and directly writes ASCII bytes to
-the standard streams (for example, [process_stdinreader.py]_). Code that assumes
+the standard streams (for example, `Twisted's process_stdinreader.py`_). Code that assumes
 a particular encoding for the standard streams other than ASCII will likely
 break.
@@ -78,8 +77,9 @@ Alternative Approaches

-The ``win_unicode_console`` package [win_unicode_console]_ is a pure-Python
-alternative to changing the default behaviour of the console.
+The `win_unicode_console package`_ is a pure-Python alternative to changing the
+default behaviour of the console. It implements essentially the same
+modifications as described here using pure Python code.
Code that may break
@@ -94,21 +94,21 @@
 Code that assumes that the encoding required by ``sys.stdin.buffer`` or
 ``sys.stdout.buffer`` is ``'mbcs'`` or a more specific encoding may currently be
-working by chance, but could encounter issues under this change. For example::
+working by chance, but could encounter issues under this change. For example:

-    sys.stdout.buffer.write(text.encode('mbcs'))
-    r = sys.stdin.buffer.read(16).decode('cp437')
+    >>> sys.stdout.buffer.write(text.encode('mbcs'))
+    >>> r = sys.stdin.buffer.read(16).decode('cp437')

 To correct this code, the encoding specified on the ``TextIOWrapper`` should be
-used, either implicitly or explicitly::
+used, either implicitly or explicitly:

-    # Fix 1: Use wrapper correctly
-    sys.stdout.write(text)
-    r = sys.stdin.read(16)
+    >>> # Fix 1: Use wrapper correctly
+    >>> sys.stdout.write(text)
+    >>> r = sys.stdin.read(16)

-    # Fix 2: Use encoding explicitly
-    sys.stdout.buffer.write(text.encode(sys.stdout.encoding))
-    r = sys.stdin.buffer.read(16).decode(sys.stdin.encoding)
+    >>> # Fix 2: Use encoding explicitly
+    >>> sys.stdout.buffer.write(text.encode(sys.stdout.encoding))
+    >>> r = sys.stdin.buffer.read(16).decode(sys.stdin.encoding)

Incorrectly using the raw object
@@ -117,32 +117,57 @@
 writes may be affected. This is particularly important for reads, where the
 number of characters read will never exceed one-fourth of the number of bytes
 allowed, as there is no feasible way to prevent input from encoding as much
-longer utf-8 strings::
+longer utf-8 strings.

-    >>> stdin = open(sys.stdin.fileno(), 'rb')
-    >>> data = stdin.raw.read(15)
+    >>> raw_stdin = sys.stdin.buffer.raw
+    >>> data = raw_stdin.read(15)
     abcdefghijklm
     b'abc'
     # data contains at most 3 characters, and never more than 12 bytes
     # error, as "defghijklm\r\n" is passed to the interactive prompt

 To correct this code, the buffered reader/writer should be used, or the caller
-should continue reading until its buffer is full.::
+should continue reading until its buffer is full.

-    # Fix 1: Use the buffered reader/writer
-    >>> stdin = open(sys.stdin.fileno(), 'rb')
+    >>> # Fix 1: Use the buffered reader/writer
+    >>> stdin = sys.stdin.buffer
     >>> data = stdin.read(15)
     abcedfghijklm
     b'abcdefghijklm\r\n'

-    # Fix 2: Loop until enough bytes have been read
-    >>> stdin = open(sys.stdin.fileno(), 'rb')
+    >>> # Fix 2: Loop until enough bytes have been read
+    >>> raw_stdin = sys.stdin.buffer.raw
     >>> b = b''
     >>> while len(b) < 15:
-    ...     b += stdin.raw.read(15)
+    ...     b += raw_stdin.read(15)
     abcedfghijklm
     b'abcdefghijklm\r\n'

+Using the raw object with small buffers
+---------------------------------------
+
+Code that uses the raw IO object and attempts to read less than four characters
+will now receive an error. Because it's possible that any single character may
+require up to four bytes when represented in utf-8, requests must fail.
+
+    >>> raw_stdin = sys.stdin.buffer.raw
+    >>> data = raw_stdin.read(3)
+    Traceback (most recent call last):
+      File "<stdin>", line 1, in <module>
+    ValueError: must read at least 4 bytes
+
+The only workaround is to pass a larger buffer.
+
+    >>> # Fix: Request at least four bytes
+    >>> raw_stdin = sys.stdin.buffer.raw
+    >>> data = raw_stdin.read(4)
+    a
+    b'a'
+    >>> >>>
+
+(The extra ``>>>`` is due to the newline remaining in the input buffer and is
+expected in this situation.)
+
Copyright
@@ -151,7 +176,5 @@ References

-.. [process_stdinreader.py] Twisted's process_stdinreader.py
-   (https://github.com/twisted/twisted/blob/trunk/src/twisted/test/process_stdinreader.py)
-.. [win_unicode_console] win_unicode_console package
-   (https://pypi.org/project/win_unicode_console/)
+.. _Twisted's process_stdinreader.py: https://github.com/twisted/twisted/blob/trunk/src/twisted/test/process_stdinreader.py
+.. _win_unicode_console package: https://pypi.org/project/win_unicode_console/
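
For illustration only, and not part of the PEP text: the "loop until enough bytes have been read" fix from the diff above could be wrapped in a small helper along these lines. The names read_at_least, MIN_RAW_READ and ShortRaw are hypothetical. ShortRaw is just a stand-in that imitates the behaviour the diff describes (short reads, and a ValueError for requests under four bytes), so the sketch can be run without a Windows console.

```python
import io

# Assumption taken from the diff above: raw console reads of fewer than
# four bytes fail with ValueError, so never request less than this.
MIN_RAW_READ = 4


def read_at_least(raw, n):
    """Collect bytes from a raw stream until n bytes are read or EOF is hit.

    May return slightly more than n bytes, because the final request is
    rounded up to MIN_RAW_READ.
    """
    chunks = []
    total = 0
    while total < n:
        chunk = raw.read(max(n - total, MIN_RAW_READ))
        if not chunk:
            # b'' signals EOF; None means no data on a non-blocking stream.
            break
        chunks.append(chunk)
        total += len(chunk)
    return b"".join(chunks)


class ShortRaw(io.RawIOBase):
    """Hypothetical stand-in for the raw console reader (not a real API)."""

    def __init__(self, data):
        super().__init__()
        self._data = data

    def readable(self):
        return True

    def readinto(self, buf):
        if len(buf) < MIN_RAW_READ:
            raise ValueError("must read at least 4 bytes")
        # Return at most 3 bytes per call to imitate short reads.
        n = min(3, len(self._data), len(buf))
        buf[:n] = self._data[:n]
        self._data = self._data[n:]
        return n


collected = read_at_least(ShortRaw(b"abcdefghijklm\r\n"), 15)
```

On a real console the caller would pass sys.stdin.buffer.raw instead of the stand-in. Using sys.stdin.buffer directly, as in Fix 1, remains the simpler option where it is available.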