[Python-ideas] TextIOWrapper callable encoding parameter

Rurpy rurpy at yahoo.com
Mon Jun 11 17:06:18 CEST 2012

As a followup, here are some timing data that seem to confirm a modest increase in speed as a result of implementing the callable encoding parameter I proposed (although that would not be the main reason for wanting to do it.) These are just for illustration. (Among many other reasons, _pyio benchmarks are not very useful.)

I read four short test files using four methods for determining each test file's encoding. The test files are a simplified model of a Python coding declaration (always on the first line in our case, with no BOM present [*1]) followed by mixed English and Japanese text.

Method 0 (reopen0): Use the encoding callable I am proposing.

def reopen0 (fname):
    def hook (data,buf): return get_encoding (data)
    t = io.open (fname, encoding=hook)
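
Note that `encoding=hook` is the proposed behaviour; released versions of `io.open` expect a string there. For readers who want to experiment, here is a rough approximation using `BufferedReader.peek()`. The names `open_with_hook`, `CODING_RE` and the `default` parameter are hypothetical, and the hook only sees whatever the first buffer fill returns, so this is a sketch of the idea rather than the patch that was timed.

```python
import io
import re

# Hypothetical pattern, loosely modelled on a PEP 263 coding declaration.
CODING_RE = re.compile(rb'coding[:=]\s*([-\w.]+)')

def open_with_hook(fname, hook, default='utf-8'):
    """Approximate the proposed io.open(fname, encoding=hook)."""
    b = io.open(fname, 'rb')
    # peek() returns buffered bytes without advancing the position,
    # so no seek(0) is needed before wrapping the stream.
    data = b.peek(1)
    enc = hook(data, b) or default
    return io.TextIOWrapper(b, encoding=enc)

def hook(data, buf):
    # Look only at the first line of the initial buffer.
    first_line = data.split(b'\n', 1)[0]
    mo = CODING_RE.search(first_line)
    return mo.group(1).decode('ascii') if mo else None
```

Calling `open_with_hook(path, hook)` then returns a text stream decoded with whatever encoding the first line declares, falling back to `default` when the hook returns None.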

Method 1 (reopen1): Open in binary to determine encoding, then rewrap in a TextIOWrapper with the correct encoding.

def reopen1 (fname):
    b = io.open (fname, 'rb')
    line = b.readline()
    enc = get_encoding (line)
    b.seek (0)
    t = io.TextIOWrapper (b, enc, line_buffering=True)
    t.mode = 'r'

Method 2 (reopen2): Open in binary to determine encoding, then reopen in text mode with correct encoding.

def reopen2 (fname):
    b = io.open (fname, 'rb')
    line = b.readline()
    enc = get_encoding (line)
    t = io.open (fname, encoding=enc)

Method 3 (reopen3): Open in text mode (latin1) to determine encoding, then reopen in text mode with correct encoding.

def reopen3 (fname):
    f = io.open (fname, encoding='latin1')
    line = f.readline()
    enc = get_encoding (line)
    t = io.open (fname, encoding=enc)

The same get_encoding() function is used in all methods [*1].

The input test data are all small files (because we want to measure encoding detection, not how fast read() runs.) Each has a python/emacs coding declaration in the first line.

test.utf8 -- Tiny Python program with coding declaration and single print statement in main() function that prints a short word (literal) in Japanese. Encoding is utf-8 (122 bytes).
test.sjis -- Identical to test.utf8 but sjis encoding (111 bytes).
test2.utf8 -- A Python coding declaration followed by approximately 50 long lines with mixed English and Japanese (4274 bytes).
test2.sjis -- Identical to test2.utf8 but sjis encoding (3401 bytes).

Results:

$ python3 bm.py test.utf8
test.utf8 / reopen0: total time (10000 reps) was 1.188323
test.utf8 / reopen1: total time (10000 reps) was 1.490757
test.utf8 / reopen2: total time (10000 reps) was 1.766081
test.utf8 / reopen3: total time (10000 reps) was 2.141996

$ python3 bm.py test.sjis
test.sjis / reopen0: total time (10000 reps) was 1.175914
test.sjis / reopen1: total time (10000 reps) was 1.471780
test.sjis / reopen2: total time (10000 reps) was 1.764444
test.sjis / reopen3: total time (10000 reps) was 2.122550

$ python3 bm.py test2.utf8
test2.utf8 / reopen0: total time (10000 reps) was 1.690255
test2.utf8 / reopen1: total time (10000 reps) was 1.996235
test2.utf8 / reopen2: total time (10000 reps) was 2.278798
test2.utf8 / reopen3: total time (10000 reps) was 2.727867

$ python3 bm.py test2.sjis
test2.sjis / reopen0: total time (10000 reps) was 1.841388
test2.sjis / reopen1: total time (10000 reps) was 2.147142
test2.sjis / reopen2: total time (10000 reps) was 2.426701
test2.sjis / reopen3: total time (10000 reps) was 2.873278
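
To make the comparison easier to read, the timings can be expressed relative to reopen0. The snippet below only does that arithmetic on the numbers reported above; the helper name `slowdown_vs_reopen0` is mine. On these figures reopen1 comes out somewhere around 1.2 to 1.3 times the reopen0 time, and reopen3 roughly 1.5 to 1.8 times.

```python
# Timings in seconds per 10000 reps, copied from the results above.
# Each list is ordered reopen0, reopen1, reopen2, reopen3.
timings = {
    'test.utf8':  [1.188323, 1.490757, 1.766081, 2.141996],
    'test.sjis':  [1.175914, 1.471780, 1.764444, 2.122550],
    'test2.utf8': [1.690255, 1.996235, 2.278798, 2.727867],
    'test2.sjis': [1.841388, 2.147142, 2.426701, 2.873278],
}

def slowdown_vs_reopen0(times):
    """Return each method's time as a multiple of the reopen0 time."""
    base = times[0]
    return [round(t / base, 2) for t in times]

for name, times in timings.items():
    print(name, slowdown_vs_reopen0(times))
```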

Here is what happens when a test data file is piped into a program using the four methods above:

$ cat test.utf8 | python3 stdin.py reopen0
read 102 characters

$ cat test.utf8 | python3 stdin.py reopen1
got exception: [Errno 29] Illegal seek

$ cat test.utf8 | python3 stdin.py reopen2
read 0 characters

$ cat test.utf8 | python3 stdin.py reopen3
read 0 characters
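
The reopen1 failure can be reproduced without stdin.py (which is not shown here) by reading from an OS pipe, which is not seekable on POSIX systems. This is a minimal sketch; the function name `try_rewind_pipe` is hypothetical, and the exact exception subclass and message vary between Python versions, so it only checks for OSError.

```python
import io
import os

def try_rewind_pipe(payload):
    """Read one line from a pipe, then attempt the seek(0) that reopen1 does."""
    r_fd, w_fd = os.pipe()
    # Write a small payload and close the write end so the reader sees EOF.
    with io.open(w_fd, 'wb') as w:
        w.write(payload)
    with io.open(r_fd, 'rb') as b:
        first_line = b.readline()
        seekable = b.seekable()
        try:
            b.seek(0)
            error = None
        except OSError as exc:
            # A pipe cannot be rewound, so the first line cannot be re-read.
            error = exc
    return first_line, seekable, error
```

Once the first line has been consumed from a pipe it cannot be read again by rewinding, which is why any approach that relies on seek(0) or on reopening the input breaks down for piped data.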

[*1] Here is the get_encoding function used above. It is a toy, simplified Python source encoding line reader. Toy, in that it looks at only one line, doesn't consider a BOM, etc. Its purpose was to allow me to sanity check the benefits of having a callable encoding parameter.

def get_encoding (line):
    if isinstance (line, bytes):
        nlpos = line.index(b'\n')
        mo = ENC_PATTERN_B.search (line, 0, nlpos)
        if not mo: return None
        enc = mo.group(1).decode ('latin1')
    else:
        nlpos = line.index('\n')
        mo = ENC_PATTERN_S.search (line, 0, nlpos)
        if not mo: return None
        enc = mo.group(1)
    return enc
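
The definitions of ENC_PATTERN_B and ENC_PATTERN_S are not included above. The following is one plausible pair, assumed for illustration and loosely modelled on the PEP 263 coding declaration, together with the same `search (line, 0, nlpos)` call that get_encoding makes. The actual patterns used for the timings may have differed.

```python
import re

# Assumed definitions (not shown in the post): a bytes pattern and a str
# pattern that capture the encoding name after "coding:" or "coding=".
ENC_PATTERN_B = re.compile(rb'coding[:=]\s*([-\w.]+)')
ENC_PATTERN_S = re.compile(r'coding[:=]\s*([-\w.]+)')

# Bytes input, as seen by reopen0, reopen1 and reopen2.
line_b = b"# -*- coding: shift_jis -*-\n"
mo = ENC_PATTERN_B.search(line_b, 0, line_b.index(b'\n'))
enc_b = mo.group(1).decode('latin1')

# Text input, as seen by reopen3 after decoding with latin1.
line_s = "# -*- coding: utf-8 -*-\n"
mo = ENC_PATTERN_S.search(line_s, 0, line_s.index('\n'))
enc_s = mo.group(1)

print(enc_b, enc_s)
```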
