Message 228191 - Python tracker
I'm new here, but I think this is the right issue for asking about this Unicode problem. On the Windows console:
chcp
Active code page: 437

type utf.txt
Привет

chcp 65001
Active code page: 65001

type utf.txt
Привет

python --version
Python 3.5.0a0

cat utf.py
f = open('utf.txt')
l = f.readline()
print(l)
print(len(l))

python utf.py
Привет �²ÐµÑ‚ �‚
13

cat utf_explicit.py
import codecs
f = codecs.open('utf.txt', encoding='utf-8', mode='r')
l = f.readline()
print(l)
print(len(l))

python utf_explicit.py
Привет ет
7

I read through part of this page, but these things are a bit above my head. Could anyone explain:
- how to figure out which codec is used by the files returned by open()?
- is there a way to change it globally to utf-8?
- the last case is almost correct: it has the correct number of characters, but print() still does something wrong. I got this working by using the stream patch, but I have another example for which the output is still not correct, see below. Is there any way around this?
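
For the first question, a minimal sketch of how the codec can be inspected, assuming a standard CPython 3 install. Text file objects expose an `encoding` attribute, and `locale.getpreferredencoding(False)` reports the locale's preferred encoding, which is what open() falls back to on the Python versions discussed here. The temporary file is a hypothetical stand-in for utf.txt.

```python
import locale
import os
import tempfile

# Hypothetical sample file standing in for utf.txt.
path = os.path.join(tempfile.mkdtemp(), "utf.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Привет\n")

# Locale-preferred encoding; open() uses it when no encoding= is given
# (on the Python versions discussed in this issue).
default_codec = locale.getpreferredencoding(False)

# A text file object reports the codec it was actually opened with.
with open(path) as f:
    implicit_codec = f.encoding

# Passing encoding= explicitly avoids depending on the console or locale.
with open(path, encoding="utf-8") as f:
    explicit_codec = f.encoding
    line = f.readline()

# Only ASCII is printed here, so the console code page does not matter.
print(default_codec, implicit_codec, explicit_codec, len(line))
```

This only covers reading the file. What print() does with the decoded text on the console is a separate matter.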

type utf2.txt
aαbβcγdδ

cat utf2.py
import streams
import codecs
streams.enable()
f = codecs.open('utf2.txt', encoding='utf-8', mode='r')
print(f.read(1))
print(f.read(1))
print(f.read(2))
print(f.read(4))

python utf2.py
a
α
bβc
γdδ
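
A possible explanation for the uneven chunks, sketched under the assumption that they come from the codecs reader and not from the console: according to the codecs documentation, the first argument of a StreamReader's read() is an approximate size in encoded bytes, while the number of characters is a separate `chars` argument. The built-in open() with `encoding=` counts characters directly. The sketch uses `codecs.getreader()` over a binary file as a stand-in for `codecs.open()`, and a temporary file as a stand-in for utf2.txt.

```python
import codecs
import os
import tempfile

# Hypothetical sample file standing in for utf2.txt.
path = os.path.join(tempfile.mkdtemp(), "utf2.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("aαbβcγdδ")

# Built-in open(): read(n) returns at most n characters.
with open(path, encoding="utf-8") as f:
    builtin_chunks = [f.read(1), f.read(1), f.read(2), f.read(4)]

# codecs StreamReader (the reader type behind codecs.open): ask for an
# exact number of characters with the chars= keyword, because the first
# positional argument is only a size hint.
with open(path, "rb") as raw:
    reader = codecs.getreader("utf-8")(raw)
    codecs_chunks = [
        reader.read(chars=1),
        reader.read(chars=1),
        reader.read(chars=2),
        reader.read(chars=4),
    ]

# ascii() keeps the output printable on any console code page.
print(ascii(builtin_chunks))
print(ascii(codecs_chunks))
```

Both lists should hold the same four chunks, split after 1, 1, 2 and 4 characters.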