Issue 28246: Unable to read simple text file

Created on 2016-09-22 08:15 by AndreyTomsk, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description
ResourceStrings.rc	AndreyTomsk,2016-09-22 08:15	problematic text file

Messages (5)
msg277206	Author: (AndreyTomsk)	Date: 2016-09-22 08:15
The file read operation fails when it encounters a specific Cyrillic symbol. Tested with script:

testFile = open('ResourceStrings.rc', 'r')
for line in testFile:
    print(line)

Exception message:

Traceback (most recent call last):
  File "min_test.py", line 6, in <module>
    for line in testFile:
  File "C:\Users\afi\AppData\Local\Programs\Python\Python36\lib\encodings\cp1251.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 24: character maps to <undefined>
msg277207	Author: Eryk Sun (eryksun) *	Date: 2016-09-22 08:29
The default encoding on your system is Windows codepage 1251. However, your file is encoded using UTF-8:

>>> lines = open('ResourceStrings.rc', 'rb').read().splitlines()
>>> print(*lines, sep='\n')
b'\xef\xbb\xbf\xd0\x90 (cyrillic A)'
b'\xd0\x98 (cyrillic I) <<< line read fails'
b'\xd0\x91 (cyrillic B)'

It even has a UTF-8 BOM (i.e. b'\xef\xbb\xbf'). You need to pass the encoding to built-in open():

>>> print(open('ResourceStrings.rc', encoding='utf-8').read())
А (cyrillic A)
И (cyrillic I) <<< line read fails
Б (cyrillic B)
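A minimal sketch of the fix described above, under a few assumptions: it writes a stand-in file (`sample_utf8_bom.txt`, a hypothetical name, not the attachment from this issue) containing the bytes shown in the session, with `\n` line endings assumed since the real ones are not visible after `splitlines()`. It then reads the file back twice. Plain `encoding='utf-8'` decodes the text but leaves the BOM as U+FEFF at the start of the first line, while `encoding='utf-8-sig'` also drops a leading BOM.

```python
import os
import tempfile

# Stand-in for ResourceStrings.rc, built from the bytes shown in the session.
# The file name and the '\n' line endings are assumptions for illustration.
data = (b'\xef\xbb\xbf\xd0\x90 (cyrillic A)\n'
        b'\xd0\x98 (cyrillic I) <<< line read fails\n'
        b'\xd0\x91 (cyrillic B)\n')

path = os.path.join(tempfile.mkdtemp(), 'sample_utf8_bom.txt')
with open(path, 'wb') as f:
    f.write(data)

# 'utf-8' decodes the text but keeps the BOM as U+FEFF on the first line.
with open(path, encoding='utf-8') as f:
    lines_utf8 = f.read().splitlines()

# 'utf-8-sig' decodes the text and strips a leading BOM if one is present.
with open(path, encoding='utf-8-sig') as f:
    lines_sig = f.read().splitlines()

# ascii() escapes non-ASCII characters, so printing works on any console.
print(ascii(lines_utf8[0]))
print(ascii(lines_sig[0]))
```

Passing the encoding explicitly makes the result independent of the system default, which is what differed between the file and the reporter's machine.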
msg277210	Author: SilentGhost (SilentGhost) *	Date: 2016-09-22 08:50
It would be good to add a FAQ / HowTo entry for this question.
msg277214	Author: (AndreyTomsk)	Date: 2016-09-22 10:18
Thanks for the quick reply. I'm new to Python; I just used the tutorial docs and didn't read carefully enough to notice the encoding info. Still, IMHO the behaviour is not consistent. Of three sequential letters in the Russian alphabet (З, И, К), it crashes on И and displays the others in two-byte form.
msg277215	Author: Eryk Sun (eryksun) *	Date: 2016-09-22 10:33
Codepage 1251 is a single-byte encoding and a superset of ASCII (i.e. ordinals 0-127). UTF-8 is also a superset of ASCII, so there's no problem as long as the encoded text is strictly ASCII. But decoding non-ASCII UTF-8 as codepage 1251 produces nonsense, otherwise known as mojibake. It happens that codepage 1251 maps every one of the 256 possible byte values, except for 0x98 (152). The exception can't be made any clearer.
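To make this concrete, here is a small sketch (the variable names are illustrative, not from the report) that encodes З, И and К as UTF-8 and decodes each result as cp1251. Each letter becomes two bytes starting with 0xD0; the second byte is 0x97, 0x98 and 0x9A respectively, so only И hits the unmapped 0x98, while the other two decode to two unrelated characters each. The second loop checks every byte value against Python's cp1251 codec to list the unmapped ones.

```python
# Encode three consecutive Cyrillic letters as UTF-8, then decode as cp1251.
results = {}
for letter in 'ЗИК':
    raw = letter.encode('utf-8')
    try:
        shown = raw.decode('cp1251')      # succeeds, but yields mojibake
    except UnicodeDecodeError as exc:
        shown = 'error: ' + exc.reason    # raised when a byte is unmapped
    results[letter] = (raw, shown)
    # ascii() escapes non-ASCII characters, so printing works on any console.
    print(ascii(letter), raw, ascii(shown))

# Find every single byte value that Python's cp1251 codec cannot decode.
unmapped = []
for value in range(256):
    try:
        bytes([value]).decode('cp1251')
    except UnicodeDecodeError:
        unmapped.append(value)

print([hex(v) for v in unmapped])
```

So the behaviour is consistent at the byte level: the outcome depends only on whether each individual byte has a mapping in the codepage, not on which letter it originally belonged to.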

History
Date	User	Action	Args
2022-04-11 14:58:37	admin	set	github: 72433
2016-09-22 10:33:53	eryksun	set	messages: +
2016-09-22 10:18:17	AndreyTomsk	set	messages: +
2016-09-22 08:50:38	SilentGhost	set	nosy: + SilentGhost; messages: +
2016-09-22 08:29:10	eryksun	set	status: open -> closed; nosy: + eryksun; messages: +; resolution: not a bug; stage: resolved
2016-09-22 08:15:11	AndreyTomsk	create