[Python-Dev] Deprecate codecs.open() and StreamWriter/StreamReader
Victor Stinner victor.stinner at haypocalc.com
Wed May 25 13:10:51 CEST 2011
On Wednesday, 25 May 2011 at 11:38 +0200, M.-A. Lemburg wrote:
> You are missing the point: we have StreamReader and StreamWriter APIs on codecs to allow each codecs to implement more efficient ways of encoding and decoding streams.
> Examples of such optimizations are reading the stream in chunks that can be decoded in one piece, or writing to the stream in a way that doesn't generate encoding state problems on the receiving end by ending transmission half-way through a shift block. ... We don't have many such specialized implementations in the stdlib, but this doesn't mean that there's no use for them. It just means that developers and users are simply unaware of the possibilities opened by these stateful stream APIs.
Does at least one codec implement such an optimization in its StreamReader or StreamWriter class? And can't we implement such optimizations in the incremental encoders and decoders (or in TextIOWrapper)?
I checked all the multibyte codecs (UTF and CJK codecs) and I don't see any such optimization. The UTF codecs handle the BOM, but contain nothing that looks like an optimization. The CJK codecs use multibytecodec, MultibyteStreamReader and MultibyteStreamWriter, which don't appear to be optimized. But maybe I missed something?
TextIOWrapper has an advanced buffering algorithm that prefetches (reads ahead) some bytes at each read to speed up small reads. Such an algorithm is difficult to implement, but it's done and it works.
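To make the question about incremental encoders and decoders concrete, here is a minimal sketch using only the standard codecs module. The helper names decode_in_chunks and encode_in_pieces are hypothetical and exist only for this illustration. The decoder side buffers incomplete multibyte sequences between chunks; the encoder side, with a stateful codec such as iso2022_jp, is asked to close its shift state only on the final piece.

```python
import codecs


def decode_in_chunks(data, encoding, chunk_size):
    """Decode bytes chunk by chunk with an incremental decoder.

    The decoder keeps any incomplete multibyte sequence in its own
    buffer, so a chunk boundary may fall in the middle of a character.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    for i in range(0, len(data), chunk_size):
        parts.append(decoder.decode(data[i:i + chunk_size]))
    # final=True tells the decoder that no more input will follow.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


def encode_in_pieces(pieces, encoding):
    """Encode a list of strings with one stateful incremental encoder.

    Only the last piece is passed with final=True, so a stateful codec
    gets to emit its closing sequence once, at the end of the stream.
    """
    encoder = codecs.getincrementalencoder(encoding)()
    out = []
    for n, piece in enumerate(pieces):
        last = (n == len(pieces) - 1)
        out.append(encoder.encode(piece, final=last))
    return b"".join(out)


text = "日本語 text"
# Two-byte chunks split the three-byte UTF-8 sequences on purpose.
assert decode_in_chunks(text.encode("utf-8"), "utf-8", 2) == text

encoded = encode_in_pieces(["日本", "語"], "iso2022_jp")
assert encoded.decode("iso2022_jp") == "日本語"
```

This only shows that the incremental API already carries state across calls; it says nothing about whether a given StreamReader or StreamWriter does better.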
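The effect of that readahead can be observed with a small sketch. CountingBytesIO and count_raw_reads are hypothetical names introduced here; the class simply counts how often the text layer asks the underlying byte stream for data while the caller reads one character at a time.

```python
import io


class CountingBytesIO(io.BytesIO):
    """BytesIO that counts how often the text layer asks it for bytes."""

    def __init__(self, data):
        super().__init__(data)
        self.read_calls = 0

    def read(self, size=-1):
        self.read_calls += 1
        return super().read(size)

    def read1(self, size=-1):
        self.read_calls += 1
        return super().read1(size)


def count_raw_reads(data, encoding="utf-8"):
    """Read `data` one character at a time through TextIOWrapper.

    Returns (characters_read, byte_level_read_calls).
    """
    raw = CountingBytesIO(data)
    text = io.TextIOWrapper(raw, encoding=encoding)
    chars = 0
    while text.read(1):
        chars += 1
    return chars, raw.read_calls


chars, calls = count_raw_reads(b"x" * 50000)
print("characters read:", chars, "byte-level read calls:", calls)
```

Because TextIOWrapper fetches and decodes a whole chunk at once, the number of byte-level reads should stay far below the number of read(1) calls.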
--
OK, let's stop talking about theoretical optimizations and run a benchmark comparing the codecs and io modules on reading files!
I tested Python 3.3 (70370:178d367c9733) compiled in release mode (gcc -O3) on a Pentium 4 @ 3 GHz with 2 GB of memory. I manually tuned the number of loops so that the fastest test takes at least one second. I only ran my benchmark once. See the attached bench.py file.
(1) Decode Objects/unicodeobject.c (317336 characters) from utf-8

test_io.readline():         89.6 ms
test_codecs.readline():   1272.8 ms
-> codecs 1320% slower than io

test_io.read(1):          1728.9 ms
test_codecs.read(1):     36395.0 ms
-> codecs 2005% slower than io

test_io.read(100):         460.7 ms
test_codecs.read(100):    3897.0 ms
-> codecs 746% slower than io

test_io.read(-1):         1911.7 ms
test_codecs.read(-1):     1740.7 ms
-> codecs 10% FASTER than io

(2) Decode README (6613 characters) from ascii

test_io.readline():        109.9 ms
test_codecs.readline():   1023.8 ms
-> codecs 832% slower than io

test_io.read(1):          1560.4 ms
test_codecs.read(1):     29402.6 ms
-> codecs 1784% slower than io

test_io.read(100):         866.9 ms
test_codecs.read(100):    3699.5 ms
-> codecs 327% slower than io

test_io.read(-1):         5140.2 ms
test_codecs.read(-1):     4817.9 ms
-> codecs 7% FASTER than io

(3) Decode Lib/test/cjkencodings/gb18030.txt (501 characters) from gb18030

test_io.readline():       1193.7 ms
test_codecs.readline():   1474.3 ms
-> codecs 24% slower than io

test_io.read(1):          3847.7 ms
test_codecs.read(1):     27103.9 ms
-> codecs 604% slower than io

test_io.read(100):       12839.5 ms
test_codecs.read(100):   13444.2 ms
-> codecs 5% slower than io

test_io.read(-1):         2183.3 ms
test_codecs.read(-1):     1906.1 ms
-> codecs 15% FASTER than io
The readahead code really does help read(1): io is roughly 7 to 21 times faster than codecs. It also helps a more common use case, readline(): io is between 1.2 and 14 times faster than codecs.
codecs is always faster (between 1.07 and 1.15 times faster than io) at reading the whole content of a file using read(-1). Maybe something should be optimized in TextIOWrapper.read() ;-) But that gain is minor compared to the gain on read(1) and readline()!
Please check my bench.py script and redo the benchmark on your own computer!
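The bench.py attachment is not preserved in the archive. For anyone who wants to repeat the comparison, the following is a minimal sketch of such a benchmark, not the original script: the function names read_with and compare, the sample file and the loop counts are all assumptions made for this illustration.

```python
import codecs
import io
import os
import tempfile
import timeit
import warnings


def read_with(opener, path, encoding, size):
    """Read a whole file through `opener` (io.open or codecs.open).

    size=None means readline(); any other value is passed to read().
    """
    with opener(path, "r", encoding=encoding) as f:
        if size is None:
            while f.readline():
                pass
        else:
            while f.read(size):
                pass


def compare(path, encoding, size, loops=5):
    """Return (io_seconds, codecs_seconds) for `loops` full reads."""
    with warnings.catch_warnings():
        # Recent Python versions may warn that codecs.open() is deprecated.
        warnings.simplefilter("ignore", DeprecationWarning)
        t_io = timeit.timeit(
            lambda: read_with(io.open, path, encoding, size), number=loops)
        t_codecs = timeit.timeit(
            lambda: read_with(codecs.open, path, encoding, size), number=loops)
    return t_io, t_codecs


if __name__ == "__main__":
    # Hypothetical sample data; point `path` at any real file instead.
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    with open(path, "w", encoding="utf-8") as f:
        f.write("some sample text, with accents: é è à\n" * 200)
    try:
        for label, size in [("readline()", None), ("read(1)", 1),
                            ("read(100)", 100), ("read(-1)", -1)]:
            t_io, t_codecs = compare(path, "utf-8", size, loops=3)
            print("%-11s io: %8.1f ms   codecs: %8.1f ms"
                  % (label, t_io * 1e3, t_codecs * 1e3))
    finally:
        os.remove(path)
```

Absolute timings will differ from the figures above, which came from a different machine, Python version and set of input files; only the relative ordering is worth comparing.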
Victor
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bench.py
Type: text/x-python
Size: 1867 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-dev/attachments/20110525/dadd9dd4/attachment.py>