msg149695
Author: STINNER Victor (vstinner)
Date: 2011-12-17 18:49
The iobench benchmarking tool showed that the UTF-8 encoder is slower in Python 3.3 than in Python 3.2. The performance depends on the characters of the input string:

* 8x faster (!) for a string of 50,000 ASCII characters
* 1.5x slower for a string of 50,000 UCS-1 characters
* 2.5x slower for a string of 50,000 UCS-2 characters

The bottleneck looks to be the PyUnicode_READ() macro:

* Python 3.2: s[i++]
* Python 3.3: PyUnicode_READ(kind, data, i++)

Because encoding a string to UTF-8 is a very common operation, performance matters. Antoine suggests having a different version of the function for each Unicode kind (1, 2, 4).
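To make the per-character cost concrete, here is a simplified sketch of what the two reads amount to. This is only an illustration of the PEP 393 read dispatch, not the exact macro from Include/unicodeobject.h:

    /* Python 3.2: code units have one fixed width, so the read is a plain
       array index: Py_UNICODE ch = s[i++];
       Python 3.3: PyUnicode_READ() has to select the width at run time.
       Simplified sketch of that dispatch (details differ from the real macro): */
    static Py_UCS4
    read_char(unsigned int kind, const void *data, Py_ssize_t index)
    {
        switch (kind) {
        case PyUnicode_1BYTE_KIND:
            return ((const Py_UCS1 *)data)[index];
        case PyUnicode_2BYTE_KIND:
            return ((const Py_UCS2 *)data)[index];
        default:  /* PyUnicode_4BYTE_KIND */
            return ((const Py_UCS4 *)data)[index];
        }
    }

Unless the compiler can hoist the kind check out of the loop, every character pays for that branch, which is what a per-kind specialization of the encoder would remove.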
|
|
msg149699
Author: Martin v. Löwis (loewis)
Date: 2011-12-17 19:50
Can you please provide your exact testing procedure? Standard iobench.py doesn't support testing separate ASCII, UCS-1 and UCS-2 data, so you must have used some other tool. Exact code, command-line parameters, a hardware description and timing results would be appreciated.

Looking at the encoder, I think the first thing to change is to reduce the over-allocation for UCS-1 and UCS-2 strings. This may or may not help the run time, but it should reduce memory consumption. I wonder whether making two passes over the string (one to compute the size, and a second one with an allocated result buffer) could improve the performance.

If there is further special-casing, I'd only special-case UCS-1. I doubt that the _READ() macro really is the bottleneck, and would rather expect that loop unrolling can help. Because of disallowed surrogates, unrolling is not practical for UCS-2.
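As background for the over-allocation point: a single-pass encoder has to size its output for the worst case and shrink it afterwards, and that worst case can be bounded per kind. The following is a sketch of the standard UTF-8 bounds, not code taken from any attached patch:

    /* Worst-case UTF-8 output size per input character, by PEP 393 kind:
       a UCS-1 code point (<= U+00FF) needs at most 2 bytes, a UCS-2 code
       point (<= U+FFFF) at most 3, and a UCS-4 code point at most 4.
       Sketch only; the real allocation strategy is in Objects/unicodeobject.c. */
    static Py_ssize_t
    utf8_max_size(int kind, Py_ssize_t length)
    {
        switch (kind) {
        case PyUnicode_1BYTE_KIND: return length * 2;
        case PyUnicode_2BYTE_KIND: return length * 3;
        default:                   return length * 4;
        }
    }

Allocating length * 4 regardless of kind is the over-allocation referred to above; with per-kind bounds, a pure Latin-1 string over-allocates by at most 2x before the final resize.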
|
|
msg149702
Author: STINNER Victor (vstinner)
Date: 2011-12-17 20:13
> Can you please provide your exact testing procedure?

Here it is:

$ cat bench.sh
echo -n "ASCII: "
./python -m timeit 'x="A"*50000' 'x.encode("utf-8")'
echo -n "UCS-1: "
./python -m timeit 'x="\xe9"*50000' 'x.encode("utf-8")'
echo -n "UCS-2: "
./python -m timeit 'x="\u20ac"*50000' 'x.encode("utf-8")'
echo -n "UCS-4: "
./python -m timeit 'x="\U0010FFFF"*50000' 'x.encode("utf-8")'

Python 3.2:

ASCII: 10000 loops, best of 3: 31.5 usec per loop
UCS-1: 10000 loops, best of 3: 62.2 usec per loop
UCS-2: 10000 loops, best of 3: 91.3 usec per loop
UCS-4: 1000 loops, best of 3: 267 usec per loop

Python 3.3:

ASCII: 100000 loops, best of 3: 3.56 usec per loop
UCS-1: 10000 loops, best of 3: 98.2 usec per loop
UCS-2: 1000 loops, best of 3: 201 usec per loop
UCS-4: 10000 loops, best of 3: 168 usec per loop

Comparison:

ASCII: Python 3.3 is 8.8x faster
UCS-1: Python 3.3 is 1.6x SLOWER
UCS-2: Python 3.3 is 2.2x SLOWER
UCS-4: Python 3.3 is 1.6x faster

iobench uses more realistic data.

> Standard iobench.py doesn't support testing for separate ASCII,
> UCS-1 and UCS-2 data, so you must have used some other tool.

According to Antoine, iobench is slower because of the UTF-8 encoder.

> hardware description

i7-2600 CPU @ 3.40GHz (8 cores) with 12 GB of RAM.

> I doubt that the _READ() macro really is the bottleneck

It is the only difference between Python 3.2 and 3.3, or did I miss something? The body of the loop is very small, so each instruction is important.
|
|
msg149703
Author: STINNER Victor (vstinner)
Date: 2011-12-17 20:24
Oh, Antoine told me that I missed the -s command-line argument to timeit:

$ cat bench.sh
echo -n "ASCII: "
./python -m timeit -s 'x="A"*50000' 'x.encode("utf-8")'
echo -n "UCS-1: "
./python -m timeit -s 'x="\xe9"*50000' 'x.encode("utf-8")'
echo -n "UCS-2: "
./python -m timeit -s 'x="\u20ac"*50000' 'x.encode("utf-8")'
echo -n "UCS-4: "
./python -m timeit -s 'x="\U0010FFFF"*50000' 'x.encode("utf-8")'

Python 3.2:

ASCII: 10000 loops, best of 3: 28.2 usec per loop
UCS-1: 10000 loops, best of 3: 59.1 usec per loop
UCS-2: 10000 loops, best of 3: 88.8 usec per loop
UCS-4: 1000 loops, best of 3: 254 usec per loop

Python 3.3:

ASCII: 1000000 loops, best of 3: 2.01 usec per loop
UCS-1: 10000 loops, best of 3: 95.8 usec per loop
UCS-2: 1000 loops, best of 3: 201 usec per loop
UCS-4: 10000 loops, best of 3: 151 usec per loop

The results look similar.
|
|
msg149705
Author: STINNER Victor (vstinner)
Date: 2011-12-17 21:19
Python 3.2 (narrow):

ASCII: 10000 loops, best of 3: 28.2 usec per loop
UCS-1: 10000 loops, best of 3: 59.1 usec per loop
UCS-2: 10000 loops, best of 3: 88.8 usec per loop
UCS-4: 1000 loops, best of 3: 254 usec per loop

Python 3.2 (wide):

ASCII: 10000 loops, best of 3: 28.5 usec per loop
UCS-1: 10000 loops, best of 3: 60.8 usec per loop
UCS-2: 10000 loops, best of 3: 114 usec per loop
UCS-4: 10000 loops, best of 3: 129 usec per loop

Python 3.3 (specialized UTF-8 encoder):

ASCII: 100000 loops, best of 3: 2 usec per loop
UCS-1: 10000 loops, best of 3: 45.4 usec per loop
UCS-2: 10000 loops, best of 3: 96.4 usec per loop
UCS-4: 10000 loops, best of 3: 140 usec per loop

The attached patch adds specialized UTF-8 encoders for UCS-1, UCS-2 and UCS-4.
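For illustration, the UCS-1 case of such a specialization could look roughly like the sketch below. This is a hedged reconstruction of the idea, not the attached patch: the name utf8_encode_ucs1 is made up, and the patch's buffer management and error handling are omitted. The point is that the inner loop indexes a fixed-width Py_UCS1 pointer directly instead of going through PyUnicode_READ():

    /* Sketch of a UCS-1 (Latin-1) specialized UTF-8 loop.  Every code
       point is <= U+00FF, so each character becomes 1 or 2 bytes and no
       surrogate check is needed.  Assumes "out" is large enough. */
    static Py_ssize_t
    utf8_encode_ucs1(const Py_UCS1 *in, Py_ssize_t len, unsigned char *out)
    {
        Py_ssize_t i;
        unsigned char *p = out;
        for (i = 0; i < len; i++) {
            Py_UCS1 ch = in[i];
            if (ch < 0x80) {
                *p++ = (unsigned char)ch;
            }
            else {
                *p++ = (unsigned char)(0xC0 | (ch >> 6));
                *p++ = (unsigned char)(0x80 | (ch & 0x3F));
            }
        }
        return p - out;  /* number of UTF-8 bytes written */
    }

The UCS-2 and UCS-4 loops would have the same shape, with the extra branches for 3- and 4-byte sequences and the surrogate check for UCS-2.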
|
|
msg149706
Author: STINNER Victor (vstinner)
Date: 2011-12-17 21:25
> 8x faster (!) for a string of 50,000 ASCII characters

Oooh, it's just faster because encoding ASCII to UTF-8 is now O(1): the ASCII data is shared with the UTF-8 data thanks to PEP 393!
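A sketch of why the ASCII case becomes so cheap (the real fast path lives in Objects/unicodeobject.c and differs in detail): for an ASCII string, the 1-byte representation stored by PEP 393 is already valid UTF-8, so the encoder can skip the character loop and copy the buffer straight into a bytes object.

    /* Simplified sketch of the ASCII fast path at the top of the encoder. */
    if (PyUnicode_IS_ASCII(unicode)) {
        /* ASCII is a subset of UTF-8: the stored 1-byte data can be
           copied verbatim into the result. */
        return PyBytes_FromStringAndSize(
            (const char *)PyUnicode_1BYTE_DATA(unicode),
            PyUnicode_GET_LENGTH(unicode));
    }

As Martin notes below, the copy into the bytes object still makes the operation O(n), but it avoids any per-character work.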
|
|
msg149747
Author: STINNER Victor (vstinner)
Date: 2011-12-18 12:20
Updated patch that also fixes the size of the small on-stack buffer, as suggested by Antoine.
|
|
msg149748
Author: STINNER Victor (vstinner)
Date: 2011-12-18 12:56
utf8_encoder_prescan.patch: precompute the size of the output to avoid a PyBytes_Resize() at exit. It is much slower:

ASCII: 100000 loops, best of 3: 2.06 usec per loop
UCS-1: 10000 loops, best of 3: 123 usec per loop
UCS-2: 10000 loops, best of 3: 171 usec per loop
UCS-4: 1000 loops, best of 3: 254 usec per loop
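For reference, the prescan is the two-pass idea from msg149699: compute the exact output size first, then fill an exactly-sized buffer. A hedged sketch of that first pass follows (error handling and surrogate checks omitted); since it reads every character twice, it is plausible that it loses to over-allocate-and-shrink, as the numbers above show.

    /* Sketch of a "prescan" pass: compute the exact UTF-8 size before
       allocating the output buffer. */
    static Py_ssize_t
    utf8_exact_size(int kind, const void *data, Py_ssize_t len)
    {
        Py_ssize_t i, size = 0;
        for (i = 0; i < len; i++) {
            Py_UCS4 ch = PyUnicode_READ(kind, data, i);
            if (ch < 0x80)          size += 1;
            else if (ch < 0x800)    size += 2;
            else if (ch < 0x10000)  size += 3;
            else                    size += 4;
        }
        return size;
    }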
|
|
msg149750
Author: STINNER Victor (vstinner)
Date: 2011-12-18 13:06
Patch version 3 fixes compiler warnings (it avoids declaring the variables used for the error handler, which are not needed for UCS-1).
|
|
msg149752
Author: Roundup Robot (python-dev)
Date: 2011-12-18 13:20
New changeset fbd797fc3809 by Victor Stinner in branch 'default':
Issue #13624: Write a specialized UTF-8 encoder to allow more optimization
http://hg.python.org/cpython/rev/fbd797fc3809
|
|
msg149799
Author: Martin v. Löwis (loewis)
Date: 2011-12-18 19:31
> Oooh, it's just faster because encoding ASCII to UTF-8 is now O(1)

It's actually still O(n): the UTF-8 data still needs to be copied into a bytes object.
|
|
msg149800
Author: STINNER Victor (vstinner)
Date: 2011-12-18 19:44
> It's actually still O(n): the UTF-8 data still needs to be copied
> into a bytes object.

Hum, correct, but a memory copy is much faster than having to encode each character to UTF-8.
|
|