Issue 29907: Unicode encoding failure
Created on 2017-03-26 02:26 by Robert Baker, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Messages (5) | ||
---|---|---|
msg290503 - (view) | Author: Robert Baker (Robert Baker) | Date: 2017-03-26 02:26 |
Using Python 2.7 (not IDLE) on Windows 10. I have tried to use a Python 2.7 program to print the name of Czech composer Antonín Dvořák. I remembered to add the "u" before the string, but regardless of whether I encode the caron-r as a literal character (pasted from Windows Character Map) or as \u0159, it gives the error that character 0159 is undefined. This is incorrect; that character has been defined as "lower case r with caron above" for several years now. (The interpreter has no problem with the ANSI characters in the string.) | ||
msg290506 - (view) | Author: Martin Panter (martin.panter) | Date: 2017-03-26 06:06 |
I presume you are trying to print to the normal Windows console. I understand the console was not well supported until Python 3.6 (see Issue 1602). Have you tried that version? I’ll leave this open for someone more experienced to confirm, but I suspect what you want may not be possible with 2.7. | ||
msg290513 - (view) | Author: Paul Moore (paul.moore) | Date: 2017-03-26 07:41 |
Also, you need to: 1. Ensure you are using characters that are available in the encoding that sys.stdout uses - in Python prior to 3.6, this would be your Windows *console* code page, and in 3.6+ would be UTF-8. 2. Declare the encoding of your source code if you are not using the default (which is ASCII in Python 2, and UTF-8 in Python 3). Specifically, if you write your source in UTF-8, or use an encoding declaration or \u escapes, and you use Python 3.6, this problem will likely have gone away. | ||
msg290517 - (view) | Author: STINNER Victor (vstinner) | Date: 2017-03-26 08:24 |
For Python 2, there is https://pypi.python.org/pypi/win_unicode_console | ||
msg290528 - (view) | Author: Eryk Sun (eryksun) | Date: 2017-03-26 13:40 |
I'm closing this issue since Python's encodings in this case -- 852 (OEM) and 1250 (ANSI) -- both correctly map U+0159:

>>> u'\u0159'.encode('852')
'\xfd'
>>> u'\u0159'.encode('1250')
'\xf8'

You must be using an encoding that doesn't map U+0159. If you're using the console's default codepage (i.e. you haven't run chcp.com, mode.com, or called SetConsoleOutputCP), then Python started with stdout.encoding set to your locale's OEM codepage encoding. For example, if you're using a U.S. locale, it's cp437, and if you're using a Western Europe locale, it's cp850. Neither of these includes U+0159.

We're presented with this codepage hell because the WriteFile and WriteConsoleA functions write a stream of bytes to the console, and it needs to be told how to decode these bytes to get Unicode text. It would be nice if the console's UTF-8 implementation (codepage 65001) wasn't buggy, but Microsoft has never cared enough to fix it (at least not completely; it's still broken for input in Windows 10). That leaves the wide-character UTF-16 function, WriteConsoleW, as the best alternative. Using this function requires bypassing Python's normal standard I/O implementation. This has been implemented as of 3.6. But for older versions you'll need to install and enable win_unicode_console. |
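The code page point in the messages above can be checked directly with Python's codecs, without a Windows console. The following is a minimal sketch, not something posted in the thread: `can_encode` and `safe_text` are hypothetical helper names, and the `cp`-prefixed codec names are assumed to be equivalent to the bare `'852'` and `'1250'` aliases used in the last message. It reports which of the four code pages mentioned there can represent U+0159, then prints the composer's name with any unencodable character replaced by `?` rather than raising `UnicodeEncodeError`.

```python
import sys

R_CARON = u"\u0159"  # LATIN SMALL LETTER R WITH CARON
NAME = u"Anton\u00edn Dvo\u0159\u00e1k"  # \u escapes avoid any source-encoding issue


def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def safe_text(text, encoding):
    """Hypothetical helper: substitute '?' for characters that the target
    encoding cannot represent, so printing does not raise."""
    return text.encode(encoding, "replace").decode(encoding)


# Code pages named in the thread: two that map U+0159 and two that do not.
results = dict(
    (cp, can_encode(R_CARON, cp))
    for cp in ("cp852", "cp1250", "cp437", "cp850")
)

# sys.stdout.encoding can be None (for example when output is piped under
# Python 2); falling back to ASCII is an assumption made for this sketch.
stdout_encoding = getattr(sys.stdout, "encoding", None) or "ascii"

for cp in sorted(results):
    print("%s maps U+0159: %s" % (cp, results[cp]))
print(safe_text(NAME, stdout_encoding))
```

If `stdout_encoding` turns out to be cp437 or cp850, the last line prints the name with `?` in place of the caron-r. That is the lossy workaround; it does not replace moving to Python 3.6+ or enabling win_unicode_console as suggested above.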
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:58:44 | admin | set | github: 74093 |
2017-03-26 13:40:02 | eryksun | set | status: open -> closed; nosy: + eryksun; messages: +; stage: resolved |
2017-03-26 08:24:40 | vstinner | set | messages: + |
2017-03-26 07:41:15 | paul.moore | set | messages: + |
2017-03-26 06:06:32 | martin.panter | set | nosy: + ezio.melotti, paul.moore, tim.golden, vstinner, martin.panter, zach.ware, steve.dower; messages: +; superseder: windows console doesn't print or input Unicode; components: + Unicode, Windows; resolution: out of date |
2017-03-26 02:26:08 | Robert Baker | create |