File read operation fails when it gets a specific Cyrillic symbol. Tested with this script:

```python
testFile = open('ResourceStrings.rc', 'r')
for line in testFile:
    print(line)
```

Exception message:

```
Traceback (most recent call last):
  File "min_test.py", line 6, in <module>
    for line in testFile:
  File "C:\Users\afi\AppData\Local\Programs\Python\Python36\lib\encodings\cp1251.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 24: character maps to <undefined>
```
The default encoding on your system is Windows codepage 1251. However, your file is encoded using UTF-8:

```python
>>> lines = open('ResourceStrings.rc', 'rb').read().splitlines()
>>> print(*lines, sep='\n')
b'\xef\xbb\xbf\xd0\x90 (cyrillic A)'
b'\xd0\x98 (cyrillic I) <<< line read fails'
b'\xd0\x91 (cyrillic B)'
```

It even has a UTF-8 BOM (i.e. b'\xef\xbb\xbf'). You need to pass the encoding to the built-in open():

```python
>>> print(open('ResourceStrings.rc', encoding='utf-8').read())
А (cyrillic A)
И (cyrillic I) <<< line read fails
Б (cyrillic B)
```
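One detail worth knowing: with plain `encoding='utf-8'` the BOM survives decoding as an invisible U+FEFF character at the start of the text. Python's `'utf-8-sig'` codec strips it on read (and emits it on write). A minimal sketch, using a throwaway filename rather than your actual file:

```python
# Write a small UTF-8 file with a BOM, then read it back two ways.
path = 'bom_demo.txt'  # hypothetical demo file, not your ResourceStrings.rc

# 'utf-8-sig' writes the BOM (b'\xef\xbb\xbf') at the start of the file.
with open(path, 'w', encoding='utf-8-sig') as f:
    f.write('А (cyrillic A)\n')

# Plain 'utf-8' keeps the BOM as U+FEFF in the decoded text.
with open(path, encoding='utf-8') as f:
    print(repr(f.read()[0]))  # '\ufeff'

# 'utf-8-sig' strips it, so the text starts with the real first character.
with open(path, encoding='utf-8-sig') as f:
    print(repr(f.read()[0]))  # 'А'
```

If the stray `'\ufeff'` never bothers you (e.g. you only ever read whole lines and compare substrings), plain `'utf-8'` is fine too; `'utf-8-sig'` just makes round-tripping BOM-prefixed files cleaner.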
Thanks for the quick reply. I'm new to Python; I just used the tutorial docs and didn't read them carefully enough to notice the encoding info. Still, IMHO the behaviour is not consistent: for three consecutive letters of the Russian alphabet, З, И, К, it crashes on И but displays the others in two-byte form.
Codepage 1251 is a single-byte encoding and a superset of ASCII (i.e. ordinals 0-127). UTF-8 is also a superset of ASCII, so there's no problem as long as the encoded text is strictly ASCII. But decoding non-ASCII UTF-8 as codepage 1251 produces nonsense, otherwise known as mojibake. It just so happens that codepage 1251 maps every one of the 256 possible byte values except 0x98 (152), so that is the only byte that raises an error instead of silently decoding to the wrong character. The exception can't be made any clearer.
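You can verify this directly: UTF-8 encodes З, И, К as the byte pairs b'\xd0\x97', b'\xd0\x98', b'\xd0\x9a', and only the pair containing 0x98 fails to decode as codepage 1251; the other two decode "successfully" to two unrelated cp1251 characters each. A quick sketch:

```python
# Each of these Cyrillic letters is two bytes in UTF-8.
# Decoding those bytes as cp1251 either yields two-character mojibake
# or, for the one pair containing the unmapped byte 0x98, raises an error.
for ch in 'ЗИК':
    data = ch.encode('utf-8')
    try:
        print(ch, data, '->', data.decode('cp1251'))  # mojibake
    except UnicodeDecodeError as e:
        print(ch, data, '-> fails:', e.reason)
```

This is exactly the "inconsistency" observed in the comment above: З and К happen to avoid byte 0x98, so they silently come out as two wrong characters, while И contains it and crashes.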