[Python-ideas] TextIOWrapper callable encoding parameter
Rurpy rurpy at yahoo.com
Mon Jun 11 17:06:18 CEST 2012
As a followup, here are some timing data that seem to confirm a modest increase in speed from implementing the callable encoding parameter I proposed (although that would not be the main reason for wanting to do it). These are just for illustration. (Among many other reasons, _pyio benchmarks are not very useful.)
I read four short test files using four methods for determining each test file's encoding. Each test file is a simplified model of a Python source file: a coding declaration (always on the first line in our case, with no BOM present [*1]) followed by mixed English and Japanese text.
Method 0 (reopen0): Use the encoding callable I am proposing.
def reopen0 (fname):
    def hook (data,buf): return get_encoding (data)
    t = io.open (fname, encoding=hook)
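For readers who want to experiment without a patched io module: io.open in released Python versions accepts only a string (or None) for encoding, so the call above works only with the proposed change. The following is a rough emulation, not the proposed implementation. The helper name open_with_hook, the pattern CODING_RE, and the reading of the hook signature (data is the initial buffered bytes, buf is the binary stream) are all assumptions made for illustration.

```python
import io
import os
import re
import tempfile

# Hypothetical pattern, loosely modeled on the PEP 263 coding declaration.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

def hook(data, buf):
    """Guess an encoding name from the first line of `data`, or return None."""
    first_line = data.split(b"\n", 1)[0]
    mo = CODING_RE.search(first_line)
    return mo.group(1).decode("ascii") if mo else None

def open_with_hook(fname, hook, default="utf-8"):
    """Emulate the proposed io.open(fname, encoding=hook) (assumed semantics)."""
    buf = io.open(fname, "rb")      # a BufferedReader
    data = buf.peek(1)              # buffered bytes; the read position does not move
    enc = hook(data, buf) or default
    return io.TextIOWrapper(buf, encoding=enc)

def demo():
    """Write a small Shift JIS file, then read it back through the hook."""
    source = "# -*- coding: shift_jis -*-\nprint('日本語')\n"
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "test.sjis")
        with open(path, "wb") as f:
            f.write(source.encode("shift_jis"))
        with open_with_hook(path, hook) as t:
            return t.read()
```

Because peek() does not consume data, the file is opened once and never rewound, which is the property the callable parameter is meant to provide. The sketch assumes the declaration fits inside whatever peek() returns.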
Method 1 (reopen1): Open in binary to determine encoding, then rewrap in a TextIOWrapper with the correct encoding.
def reopen1 (fname):
    b = io.open (fname, 'rb')
    line = b.readline()
    enc = get_encoding (line)
    b.seek (0)
    t = io.TextIOWrapper (b, enc, line_buffering=True)
    t.mode = 'r'
Method 2 (reopen2): Open in binary to determine encoding, then reopen in text mode with correct encoding.
def reopen2 (fname):
    b = io.open (fname, 'rb')
    line = b.readline()
    enc = get_encoding (line)
    t = io.open (fname, encoding=enc)
Method 3 (reopen3): Open in text mode (latin1) to determine encoding, then reopen in text mode with correct encoding.
def reopen3 (fname):
    f = io.open (fname, encoding='latin1')
    line = f.readline()
    enc = get_encoding (line)
    t = io.open (fname, encoding=enc)
The same get_encoding() function is used in all methods [*1].
The input test data are all small files (because we want to measure encoding detection, not how fast read() runs). Each has a Python/Emacs coding declaration in the first line.
test.utf8 -- Tiny Python program with a coding declaration and a single print statement in a main() function that prints a short word (literal) in Japanese. Encoding is utf-8 (122 bytes).
test.sjis -- Identical to test.utf8 but sjis encoding (111 bytes).
test2.utf8 -- A Python coding declaration followed by approximately 50 long lines with mixed English and Japanese (4274 bytes).
test2.sjis -- Identical to test2.utf8 but sjis encoding (3401 bytes).
Results:
$ python3 bm.py test.utf8
test.utf8 / reopen0: total time (10000 reps) was 1.188323
test.utf8 / reopen1: total time (10000 reps) was 1.490757
test.utf8 / reopen2: total time (10000 reps) was 1.766081
test.utf8 / reopen3: total time (10000 reps) was 2.141996
$ python3 bm.py test.sjis
test.sjis / reopen0: total time (10000 reps) was 1.175914
test.sjis / reopen1: total time (10000 reps) was 1.471780
test.sjis / reopen2: total time (10000 reps) was 1.764444
test.sjis / reopen3: total time (10000 reps) was 2.122550
$ python3 bm.py test2.utf8
test2.utf8 / reopen0: total time (10000 reps) was 1.690255
test2.utf8 / reopen1: total time (10000 reps) was 1.996235
test2.utf8 / reopen2: total time (10000 reps) was 2.278798
test2.utf8 / reopen3: total time (10000 reps) was 2.727867
$ python3 bm.py test2.sjis
test2.sjis / reopen0: total time (10000 reps) was 1.841388
test2.sjis / reopen1: total time (10000 reps) was 2.147142
test2.sjis / reopen2: total time (10000 reps) was 2.426701
test2.sjis / reopen3: total time (10000 reps) was 2.873278
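To make the comparison easier to read, the totals can be expressed relative to reopen0. The short script below only does that arithmetic on the figures reported above; the names timings and slowdowns are mine and are not part of bm.py.

```python
# Total seconds for 10000 repetitions, copied from the results above.
# List order within each entry: reopen0, reopen1, reopen2, reopen3.
timings = {
    "test.utf8":  [1.188323, 1.490757, 1.766081, 2.141996],
    "test.sjis":  [1.175914, 1.471780, 1.764444, 2.122550],
    "test2.utf8": [1.690255, 1.996235, 2.278798, 2.727867],
    "test2.sjis": [1.841388, 2.147142, 2.426701, 2.873278],
}

def slowdowns(times):
    """Return each method's total time divided by the reopen0 time."""
    base = times[0]
    return [t / base for t in times]

for name, times in timings.items():
    ratios = ", ".join(f"{r:.2f}" for r in slowdowns(times))
    print(f"{name}: {ratios}")
```

In every run each alternative comes out slower than reopen0, and the ordering reopen0 < reopen1 < reopen2 < reopen3 is the same for all four files.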
Here is what happens when a test data file is piped into a program using the four methods above:
$ cat test.utf8 | python3 stdin.py reopen0
read 102 characters
$ cat test.utf8 | python3 stdin.py reopen1
got exception: [Errno 29] Illegal seek
$ cat test.utf8 | python3 stdin.py reopen2
read 0 characters
$ cat test.utf8 | python3 stdin.py reopen3
read 0 characters
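The piped case can be reproduced without a shell. The sketch below is not stdin.py (which is not shown here); it uses os.pipe() as a stand-in for piped standard input, and the helper pipe_reader and the sample payload are assumptions for illustration. It shows that a pipe cannot be rewound, and that peeking at the buffered bytes leaves the whole stream available to a TextIOWrapper.

```python
import io
import os

def pipe_reader(payload):
    """Return a binary reader over a pipe pre-filled with `payload` (hypothetical helper)."""
    r, w = os.pipe()
    os.write(w, payload)    # small payload, fits in the pipe buffer
    os.close(w)             # the reader sees EOF after the payload
    return io.open(r, "rb")

payload = "# -*- coding: utf-8 -*-\nprint('日本語')\n".encode("utf-8")

# 1. A pipe cannot be rewound, so the seek(0) step of reopen1 fails.
with pipe_reader(payload) as b:
    seekable = b.seekable()
    b.readline()            # consumes the first line for good
    try:
        b.seek(0)
        seek_failed = False
    except OSError:         # io.UnsupportedOperation is a subclass of OSError
        seek_failed = True

# 2. peek() inspects buffered bytes without consuming them, so the wrapper
#    created afterwards still sees the stream from its first byte.
b = pipe_reader(payload)
head = b.peek(1)
with io.TextIOWrapper(b, encoding="utf-8") as t:
    text = t.read()
```

The exact exception message depends on the Python version and on whether io or _pyio is used, which is why the sketch catches OSError rather than matching the text.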
[*1] Here is the get_encoding function used above. It is a toy, simplified reader for the Python source encoding line. Toy, in that it looks at only one line, doesn't consider a BOM, etc. Its purpose was to allow me to sanity-check the benefits of having a callable encoding parameter.
def get_encoding (line):
    if isinstance (line, bytes):
        nlpos = line.index(b'\n')
        mo = ENC_PATTERN_B.search (line, 0, nlpos)
        if not mo: return None
        enc = mo.group(1).decode ('latin1')
    else:
        nlpos = line.index('\n')
        mo = ENC_PATTERN_S.search (line, 0, nlpos)
        if not mo: return None
        enc = mo.group(1)
    return enc
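The two patterns ENC_PATTERN_B and ENC_PATTERN_S are not shown above. For anyone who wants to run the function, here is a self-contained variant with stand-in patterns loosely modeled on the PEP 263 declaration; the patterns are my assumption and not necessarily the ones used in the benchmark. This variant also uses find() instead of index(), so input without a newline is searched in full rather than raising ValueError.

```python
import re

# Stand-in patterns (assumed); the originals are not shown in the post.
ENC_PATTERN_S = re.compile(r"coding[:=]\s*([-\w.]+)")
ENC_PATTERN_B = re.compile(rb"coding[:=]\s*([-\w.]+)")

def get_encoding(line):
    """Return the encoding named on the first line of `line`, or None."""
    if isinstance(line, bytes):
        pattern, newline = ENC_PATTERN_B, b"\n"
    else:
        pattern, newline = ENC_PATTERN_S, "\n"
    # Limit the search to the first line; with no newline, search everything.
    nlpos = line.find(newline)
    if nlpos < 0:
        nlpos = len(line)
    mo = pattern.search(line, 0, nlpos)
    if not mo:
        return None
    enc = mo.group(1)
    # Encoding names are ASCII, so latin1 decoding of the bytes match is safe.
    return enc.decode("latin1") if isinstance(enc, bytes) else enc
```

Like the original, it accepts both bytes (methods 0 to 2) and str (method 3, where the file was first decoded as latin1).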