Issue 5110: Printing Unicode chars from the interpreter in a non-UTF8 terminal raises an error (Py3)

Created on 2009-01-30 15:26 by ezio.melotti, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
display_hook_ascii.patch vstinner,2009-01-30 16:04
issue5110.txt ezio.melotti,2009-01-31 03:35 Some tests to show the effects of the patch
Messages (15)
msg80820 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-01-30 15:26
In Py2.x

>>> u'\u2620'

outputs u'\u2620', whereas

>>> print u'\u2620'

raises an error. In Py3.x, by contrast, both

>>> '\u2620'

and

>>> print('\u2620')

raise an error if the terminal doesn't use an encoding able to display the character (e.g. the Windows terminal used for these examples). This is caused by the new string representation defined in the PEP3138[1].

Consider also the following example:

Py2:
>>> [u'\u2620']
[u'\u2620']

Py3:
>>> ['\u2620']
UnicodeEncodeError: 'charmap' codec can't encode character '\u2620' in position 9: character maps to <undefined>

This means that there is no way to display lists (or other objects) that contain characters that can't be encoded. Two workarounds may be:
1) encode all the elements of the list, but it's not practical;
2) use ascii(), but it adds extra "" around the output and escapes backslashes and apostrophes (and it won't be possible to use _[0] in the next line).

Also note that in Py3

>>> ['\ud800']
['\ud800']
>>> _[0]
'\ud800'

works, because U+D800 belongs to the category "Cs (Other, Surrogate)" and it is escaped[2].

The best solution is probably to change the default error-handler of the Python3 interactive interpreter to 'backslashreplace' in order to avoid this behavior, but I don't know if it's possible only for ">>> foo" and not for ">>> print(foo)" (print() should still raise an error as it does in Py2).

This proposal has already been refused in the PEP3138[3] but there are no links to the discussion that led to this decision. I think this should be rediscussed and possibly changed, because, even if I can't see the "listOfJapaneseStrings"[4], I still prefer to see a sequence of escaped chars than a UnicodeEncodeError.

[1]: http://www.python.org/dev/peps/pep-3138/
[2]: http://www.python.org/dev/peps/pep-3138/#specification
[3]: http://www.python.org/dev/peps/pep-3138/#rejected-proposals
[4]: http://www.python.org/dev/peps/pep-3138/#motivation
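The failure described in this message can be reproduced without a Windows console by encoding the repr by hand. The following is only a minimal sketch: cp437 is an assumed stand-in for the terminal encoding (the report only says "the windows terminal"), and the variable names are illustrative.

```python
# Assumption: cp437 stands in for a legacy console code page that lacks U+2620.
TERMINAL_ENCODING = "cp437"

# In Py3, repr() keeps printable non-ASCII characters as-is (PEP 3138).
text = repr(["\u2620"])

# Writing to a stream with the default "strict" error handler amounts to this:
try:
    text.encode(TERMINAL_ENCODING)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The 'backslashreplace' error handler escapes only what cannot be encoded.
escaped = text.encode(TERMINAL_ENCODING, "backslashreplace").decode(TERMINAL_ENCODING)

# ascii() escapes every non-ASCII character up front, so encoding cannot fail.
ascii_text = ascii(["\u2620"])
```

Both `escaped` and `ascii_text` contain only characters the terminal can show, which is the trade-off the rest of this thread is about.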
msg80822 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-01-30 15:43
To be clear, this issue only affects the interpreter.

> 2) use ascii(), but it adds extra "" around the output

It doesn't add extra "" if you replace repr() by ascii() in the interpreter code (sys.displayhook)?

> The best solution is probably to change the default error-handler
> of the Python3 interactive interpreter to 'backslashreplace'
> in order to avoid this behavior, (...)

Hum, it implies that sys.stdout has a different behaviour in the interpreter and when running a script. We can expect many bug reports from newbies: "the example works in the terminal/IDLE, but not in my script, HELP!". So I prefer ascii().
msg80823 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-01-30 15:46
You can change the display hook with a site.py script (which has to be in sys.path):
---------
import sys

def hook(message):
    print(ascii(message))

sys.displayhook = hook
---------

Example (run python in an empty environment to get the ASCII charset):
---------
$ env -i PYTHONPATH=$PWD ./python
Python 3.1a0 (py3k:69105M, Jan 30 2009, 10:36:27)
>>> import sys
>>> sys.stdout.encoding
'ANSI_X3.4-1968'
>>> "\xe9"
'\xe9'
>>> print("\xe9")
Traceback (most recent call last):
  (...)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' (...)
---------
msg80824 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-01-30 15:55
This seems to solve the problem, but apparently the interactive "_" doesn't work anymore.
msg80825 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-01-30 15:57
Oh yeah, the original sys.displayhook uses a special hack for the _ global variable:
---------
import sys
import builtins

def hook(message):
    if message is None:
        return
    builtins._ = message
    print(ascii(message))

sys.displayhook = hook
---------
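To check what such a hook prints without installing it, its output can be captured. This is only a sketch: `ascii_hook` mirrors the hook above, while `show` is a hypothetical helper that is not part of the interpreter or of the patch.

```python
import builtins
import contextlib
import io

def ascii_hook(value):
    """Display hook that escapes all non-ASCII characters via ascii()."""
    if value is None:
        return  # the interpreter shows nothing for None and leaves _ untouched
    builtins._ = value
    print(ascii(value))

def show(value, hook):
    """Hypothetical helper: call `hook` as the REPL would and capture its output."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        hook(value)
    return buffer.getvalue()

out = show(["\u2620", "\xe8"], ascii_hook)
# To use the hook for real: import sys; sys.displayhook = ascii_hook
```

Note that this hook escapes '\xe8' as well, even on terminals that could display it.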
msg80826 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-01-30 16:04
Here is a patch to use ascii() directly in sys_displayhook() (with a unit test!).
msg80845 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-01-31 00:57
Victor, I'm not sure whether you are proposing that display_hook_ascii.patch is included into Python. IIUC, this patch breaks PEP3138, so it clearly must be rejected. Overall, I fail to see the bug in this report. Python 3.0 works as designed as shown here.
msg80852 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-01-31 03:35
This seems to fix the problem:
------------------------------
import sys
import builtins

def hook(message):
    if message is None:
        return
    builtins._ = message
    try:
        print(repr(message))
    except UnicodeEncodeError:
        print(ascii(message))

sys.displayhook = hook
------------------------------

Just to clarify:
* The current Py3 behavior works fine in UTF8 terminals
* It doesn't work on non-UTF8 terminals if they can't encode the chars (they raise an error)
* It only affects the interactive interpreter
* This new patch escapes the chars instead of raising an error, only on non-UTF8 terminals and only when printed as ">>> foo" (without print()), and leaves the other behaviors unchanged
* This is related to Py3 only

Apparently the patch provided by Victor always escapes the non-ascii chars. This new hook function prints the Unicode chars if possible and escapes them if not. On a UTF8 terminal the behavior is unchanged; on a non-UTF8 terminal all the chars that can not be encoded will now be escaped.

This only changes the behavior of ">>> foo", so it can not lead to confusion ("It works in the interpreter but not in the script"). In a script one can't write "foo" alone but "print(foo)", and the behavior of "print(foo)" is the same in both the interpreter and the scripts (with the patch applied):

>>> ['\u2620']
['\u2620']
>>> print(['\u2620'])
UnicodeEncodeError: 'charmap' codec can't encode character '\u2620' in position 2: character maps to <undefined>

I think that the PEP3138 didn't consider this issue. Its purpose is to have a better output (Unicode chars instead of escaped chars), but it only works with UTF8 terminals; on non-UTF8 terminals the output is worse (UnicodeEncodeError instead of escaped chars). This is an improvement and I can't see any negative side-effect.

Attached is a txt with more examples, on Py2 and Py3, on Windows (non-UTF8 terminal) and Linux (UTF8 terminal), with and without my patch.
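The fallback can be exercised against simulated terminals. In the sketch below, `make_terminal` is a hypothetical stand-in for sys.stdout, and `fallback_hook` takes the stream as an explicit parameter (unlike the real hook) so that the encoding can be varied; latin-1 is an assumed example of a non-UTF8 terminal.

```python
import builtins
import io

def make_terminal(encoding):
    """Hypothetical stand-in for sys.stdout on a terminal using `encoding`."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors="strict", newline="\n")
    return raw, stream

def fallback_hook(value, stream):
    """Write repr(value); if the stream cannot encode it, write ascii(value)."""
    if value is None:
        return
    builtins._ = value
    try:
        stream.write(repr(value) + "\n")
    except UnicodeEncodeError:
        stream.write(ascii(value) + "\n")
    stream.flush()

# A UTF-8 terminal shows the character itself.
utf8_raw, utf8_term = make_terminal("utf-8")
fallback_hook(["\u2620"], utf8_term)

# A latin-1 terminal cannot encode U+2620, so the whole value is escaped.
latin_raw, latin_term = make_terminal("latin-1")
fallback_hook(["\u2620 \xe8"], latin_term)
```

As in the hook above, one unencodable character causes every non-ASCII character in the value to be escaped, including '\xe8', which latin-1 could have shown.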
msg81056 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-02-03 13:48
> Victor, I'm not sure whether you are proposing that
> display_hook_ascii.patch is included into Python. IIUC, this patch
> breaks PEP3138, so it clearly must be rejected.
>
> Overall, I fail to see the bug in this report. Python 3.0 works as
> designed as shown here.

The idea is to avoid a unicode error (by replacing non-printable characters with their code in hexadecimal) when the display hook tries to display a message which is not printable in the terminal charset. It's just to make the Python3 interpreter a little bit more "user friendly" on Windows.

Problem: using a different (encoding) rule for the display hook and for print() may disturb new users (Why does ">>> chr(...)" work whereas ">>> print(chr(...))" fails?).
msg81059 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-02-03 14:01
> Problem: use different (encoding) rule for the display hook and for
> print() may disturb new users (Why does ">>> chr(...)" work whereas
> ">>> print(chr(...))" fails?).

This is the same behavior that Python2.x has (with the only difference that Py2 always shows the char as u'\uXXXX' if >0x7F whereas Py3 /tries/ to display it):

>>> unichr(0x0100)
u'\u0100'
>>> print unichr(0x0100)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0100' in position 0: character maps to <undefined>
msg81841 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-02-13 00:17
I've also noticed that if an error contains non-encodable characters, they are escaped:

>>> raise ValueError("\u2620 can't be printed here, but '\u00e8' works fine!")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: \u2620 can't be printed here, but 'è' works fine!

but:

>>> "\u2620 can't be printed here, but '\u00e8' works fine!"
UnicodeEncodeError: 'charmap' codec can't encode character '\u2620' in position 1: character maps to <undefined>

The mechanism used to escape errors is even better than my patch, because it escapes only the chars that can't be encoded, whereas my patch escapes every non-ascii char when at least one char can't be encoded:

>>> "\u2620 can't be printed here, but '\u00e8' works fine!"
"\u2620 can't be printed here, but '\xe8' works fine!"

I wonder if we can reuse the same mechanism here.

By the way, the patch I proposed earlier is just a proof of concept; if you think it's OK, someone will probably have to implement it in C.
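Escaping only the characters that can't be encoded, as tracebacks do, corresponds to the 'backslashreplace' error handler. A minimal sketch of applying it to a repr follows; `escape_unencodable` is a hypothetical helper, and cp1252 is an assumed terminal encoding that contains 'è' but not U+2620.

```python
def escape_unencodable(text, encoding):
    """Escape only the characters that `encoding` cannot represent."""
    return text.encode(encoding, "backslashreplace").decode(encoding)

message = "\u2620 can't be printed here, but '\u00e8' works fine!"

# Assumption: cp1252 plays the role of the non-UTF8 terminal encoding.
shown = escape_unencodable(repr(message), "cp1252")
```

Here `shown` keeps 'è' readable and escapes only U+2620, matching the traceback output rather than the all-or-nothing ascii() fallback.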
msg84248 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-03-27 01:34
martin> IIUC, this patch breaks PEP3138,
martin> so it clearly must be rejected.

After reading the PEP3138, it's clear that this issue is not a bug, and that we can not accept any patch fixing the issue without breaking the PEP.

Windows users who want to get the Python2 behaviour can use my display hook proposed in msg80823.

We can not fix this issue, so I choose to close it. If anyone wants to change the PEP, start a discussion on python-dev first.
msg84965 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-04-01 02:42
In the first message I said that this breaks the PEP3138 because I thought that the solution was to change the default error-handler to 'backslashreplace', but this was already proposed and refused.

sys.displayhook provides a way to change the behavior of the interactive interpreter only when ">>> foo" is used. The PEP doesn't seem to say anything about how ">>> foo" should behave.

Moreover, in the alternate solutions [1] they considered using sys.displayhook (and sys.excepthook) but they didn't because "these hooks are called only when printing the result of evaluating an expression entered in an interactive Python session, and doesn't work for the print() function, for non-interactive sessions or for logging.debug("%r", ...), etc."

This is exactly the behavior I intended to have, and, being a unique feature of the interactive interpreter, it doesn't lead to inconsistency with other situations.

[1]: http://www.python.org/dev/peps/pep-3138/#alternate-solutions
msg84986 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2009-04-01 07:15
My proposal to make backslashreplace the default error handler for interactive sessions was rejected by Guido [1].

Does something like PYTHONIOENCODING=ascii:backslashreplace work for you? With PYTHONIOENCODING, you can effectively make backslashreplace the default error handler for your environment.

[1]: http://mail.python.org/pipermail/python-3000/2008-May/013928.html
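The effect of that setting can be observed by starting a child interpreter with the variable set. This is only a sketch: the child command and the variable names are illustrative, and it assumes a Python recent enough to support `capture_output` in subprocess.run().

```python
import os
import subprocess
import sys

# Copy the current environment and add the setting suggested above.
env = dict(os.environ, PYTHONIOENCODING="ascii:backslashreplace")

# The child prints a list holding U+2620; its stdout is ASCII with
# backslashreplace, so the character is escaped instead of raising.
result = subprocess.run(
    [sys.executable, "-c", "print(['\\u2620'])"],
    env=env,
    capture_output=True,
    check=True,
)
output = result.stdout.decode("ascii").strip()
```

Unlike a display hook, this changes print() as well, in scripts and in the interactive interpreter alike.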
msg85372 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-04-04 06:10
What I'm proposing is not to change the default error handler to 'backslashreplace', but just the behavior of sys.displayhook.
History
Date User Action Args
2022-04-11 14:56:45 admin set github: 49360
2010-08-07 11:44:42 eric.araujo set nosy: + eric.araujo
2009-04-04 11:02:35 vstinner set nosy: - vstinner
2009-04-04 06:10:28 ezio.melotti set messages: +
2009-04-01 07:15:11 ishimoto set nosy: + ishimoto; messages: +
2009-04-01 02:42:01 ezio.melotti set nosy: + atsuoi; messages: +
2009-03-27 01:34:13 vstinner set status: open -> closed; resolution: not a bug; messages: +
2009-02-13 00:17:26 ezio.melotti set messages: +
2009-02-03 14:01:49 ezio.melotti set messages: +; title: Printing Unicode chars from the interpreter in a non-UTF8 terminal (Py3) -> Printing Unicode chars from the interpreter in a non-UTF8 terminal raises an error (Py3)
2009-02-03 13:48:14 vstinner set messages: +
2009-01-31 03:35:36 ezio.melotti set files: + issue5110.txt; messages: +
2009-01-31 00:57:30 loewis set nosy: + loewis; messages: +
2009-01-30 18:46:03 giampaolo.rodola set nosy: + giampaolo.rodola
2009-01-30 16:04:12 vstinner set files: + display_hook_ascii.patch; keywords: + patch; messages: +
2009-01-30 15:57:25 vstinner set messages: +
2009-01-30 15:55:32 ezio.melotti set messages: +
2009-01-30 15:46:50 vstinner set messages: +
2009-01-30 15:43:18 vstinner set nosy: + vstinner; messages: +
2009-01-30 15:26:47 ezio.melotti create