Message 235655 - Python tracker

This isn't a Python bug. The Windows console doesn't properly support UTF-8. See issue 1602 and Drekin's win-unicode-console, an alternative REPL based on the wide-character (UCS-2) console API.

FWIW, I attached a debugger to conhost.exe under Windows 7 to inspect what's happening here. In the client, the CRT's read() function calls WinAPI ReadFile. For a console handle this calls either ReadConsoleA or (in Windows 8+) NtReadFile. Either way, most of the action happens in the server process, conhost.exe.

The server's input buffer is Unicode, which gets encoded to CP 65001 (UTF-8) by calling WideCharToMultiByte. However, the server incorrectly assumes the current codepage is a Windows ANSI codepage with a one-to-one mapping, i.e. that each 16-bit wchar_t maps to a single 8-bit char in the current codepage. Since 'ł' gets UTF-8 encoded as the two-byte string b'\xc5\x82', the allocated buffer is too small by a byte. The server doesn't recover from this failure by allocating a larger buffer. It just reports back to the client process that it read 0 bytes. The CRT in turn treats the 0-byte read as end of input and sets the end-of-file (EOF) flag on the stdin FILE stream, which causes Python to exit 'normally'.
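
To make the size mismatch concrete, here is a rough Python model of the behaviour described above. It is only a sketch: simulate_console_read is a hypothetical name, not anything in conhost.exe or the CRT, and the sizing rule (one byte per UTF-16 code unit) is my reading of what the debugger showed rather than the actual server code.

```python
def simulate_console_read(text: str, encoding: str = "utf-8") -> bytes:
    """Hypothetical model of the server-side conversion described above."""
    # Assumed faulty sizing rule: one output byte per 16-bit wchar_t.
    utf16_units = len(text.encode("utf-16-le")) // 2
    buffer_size = utf16_units

    encoded = text.encode(encoding)
    if len(encoded) > buffer_size:
        # The conversion does not fit and there is no retry with a larger
        # buffer, so the client is told that 0 bytes were read.
        return b""
    return encoded


print("ł".encode("utf-8"))                       # two bytes for one wchar_t
print(simulate_console_read("abc\r\n"))          # ASCII fits the buffer
print(simulate_console_read("ł\r\n"))            # too small by a byte
print(simulate_console_read("ł\r\n", "cp1250"))  # one-to-one codepage fits
```

In this model the 0-byte result for the UTF-8 case is what the client would see as EOF, while the same line under a single-byte ANSI codepage such as cp1250 fits, which matches the assumption the server appears to be making.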