Issue 28246: Unable to read simple text file

Created on 2016-09-22 08:15 by AndreyTomsk, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description
ResourceStrings.rc AndreyTomsk,2016-09-22 08:15 problematic text file
Messages (5)
msg277206 - Author: (AndreyTomsk) Date: 2016-09-22 08:15
File read operation fails when it hits a specific Cyrillic symbol. Tested with script:

    testFile = open('ResourceStrings.rc', 'r')
    for line in testFile:
        print(line)

Exception message:

    Traceback (most recent call last):
      File "min_test.py", line 6, in <module>
        for line in testFile:
      File "C:\Users\afi\AppData\Local\Programs\Python\Python36\lib\encodings\cp1251.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 24: character maps to <undefined>
msg277207 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-09-22 08:29
The default encoding on your system is Windows codepage 1251. However, your file is encoded using UTF-8:

    >>> lines = open('ResourceStrings.rc', 'rb').read().splitlines()
    >>> print(*lines, sep='\n')
    b'\xef\xbb\xbf\xd0\x90 (cyrillic A)'
    b'\xd0\x98 (cyrillic I) <<< line read fails'
    b'\xd0\x91 (cyrillic B)'

It even has a UTF-8 BOM (i.e. b'\xef\xbb\xbf'). You need to pass the encoding to built-in open():

    >>> print(open('ResourceStrings.rc', encoding='utf-8').read())
    А (cyrillic A)
    И (cyrillic I) <<< line read fails
    Б (cyrillic B)
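The advice above can be tried without the original attachment. What follows is only a minimal sketch: it assumes a stand-in file with the same three lines and a leading BOM, written to a temporary directory, and the helper names (write_sample, read_lines, demo) are hypothetical. It also reads the file with the standard 'utf-8-sig' codec, which is not mentioned in the thread. That codec drops the BOM, whereas plain 'utf-8' keeps it as a leading '\ufeff' character.

```python
import os
import tempfile

# Stand-in for the attached file: three Cyrillic lines, UTF-8 with a BOM.
SAMPLE = "А (cyrillic A)\nИ (cyrillic I)\nБ (cyrillic B)\n"


def write_sample(path):
    """Write SAMPLE as UTF-8 bytes preceded by the UTF-8 BOM."""
    with open(path, "wb") as f:
        f.write(b"\xef\xbb\xbf" + SAMPLE.encode("utf-8"))


def read_lines(path, encoding):
    """Read the file as text with an explicit encoding."""
    with open(path, "r", encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


def demo():
    """Return the lines read as utf-8, as utf-8-sig, and any cp1251 error."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ResourceStrings.rc")
        write_sample(path)
        plain = read_lines(path, "utf-8")      # BOM survives as '\ufeff'
        sig = read_lines(path, "utf-8-sig")    # BOM is stripped
        try:
            # Passing cp1251 explicitly mimics the reporter's system default.
            read_lines(path, "cp1251")
            cp1251_error = None
        except UnicodeDecodeError as exc:
            cp1251_error = exc
    return plain, sig, cp1251_error


if __name__ == "__main__":
    plain, sig, err = demo()
    print(repr(plain[0]))
    print(repr(sig[0]))
    print(err)
```

Passing the encoding explicitly makes the result independent of the machine's default codepage, which is why the same script behaves differently on systems with different locale settings.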
msg277210 - Author: SilentGhost (SilentGhost) * (Python triager) Date: 2016-09-22 08:50
It would be good to add a FAQ / HowTo entry for this question.
msg277214 - Author: (AndreyTomsk) Date: 2016-09-22 10:18
Thanks for the quick reply. I'm new to Python, just used the tutorial docs and didn't read carefully enough to notice the encoding info. Still, IMHO the behaviour is not consistent. Of three sequential symbols in the Russian alphabet (З, И, К), it crashes on И and displays the others in two-byte form.
msg277215 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-09-22 10:33
Codepage 1251 is a single-byte encoding and a superset of ASCII (i.e. ordinals 0-127). UTF-8 is also a superset of ASCII, so there's no problem as long as the encoded text is strictly ASCII. But decoding non-ASCII UTF-8 as codepage 1251 produces nonsense, otherwise known as mojibake. It happens that codepage 1251 maps every one of the 256 possible byte values, except for 0x98 (152). The exception can't be made any clearer.
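Both claims in this message can be checked with a short script. This is a sketch only, and the function names (unmapped_bytes, misdecode) are hypothetical. The first function probes every byte value against the cp1251 codec. The second encodes text as UTF-8 and decodes it as cp1251 with errors='replace', so an unmapped byte shows up as U+FFFD instead of raising.

```python
def unmapped_bytes(encoding="cp1251"):
    """Return the byte values that the given single-byte codec cannot decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


def misdecode(text, encoding="cp1251"):
    """Encode text as UTF-8, then decode it with the wrong codec.

    errors='replace' turns any unmapped byte into U+FFFD rather than
    raising UnicodeDecodeError, which makes the mojibake visible.
    """
    return text.encode("utf-8").decode(encoding, errors="replace")


if __name__ == "__main__":
    print([hex(b) for b in unmapped_bytes()])
    for letter in "ЗИК":
        print(letter, letter.encode("utf-8"), ascii(misdecode(letter)))
```

The three letters encode in UTF-8 as b'\xd0\x97', b'\xd0\x98' and b'\xd0\x9a'. Only И contains the byte 0x98, which is why it alone raises an error while З and К silently decode to two unrelated characters each.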
History
Date User Action Args
2022-04-11 14:58:37 admin set github: 72433
2016-09-22 10:33:53 eryksun set messages: +
2016-09-22 10:18:17 AndreyTomsk set messages: +
2016-09-22 08:50:38 SilentGhost set nosy: + SilentGhost; messages: +
2016-09-22 08:29:10 eryksun set status: open -> closed; nosy: + eryksun; messages: +; resolution: not a bug; stage: resolved
2016-09-22 08:15:11 AndreyTomsk create