Issue 14738: Amazingly faster UTF-8 decoding

process

Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, ezio.melotti, janssen, jcea, loewis, mark.dickinson, ned.deily, pitrou, python-dev, ronaldoussoren, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2012-05-06 18:00 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
decode_utf8_4.patch serhiy.storchaka,2012-05-06 18:00 review
decode_utf8_5.patch serhiy.storchaka,2012-05-06 22:11 review
Messages (15)
msg160103 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-06 18:00
I propose a complex patch that significantly speeds up UTF-8 decoding. The decoder is now faster than even the decoder in 3.2 (except in a few unrealistic pathological cases). The decoder code is also reduced and simplified (formerly, the decoding code was repeated in at least three places). As a side effect, ASCII decoding is now faster on some platforms ().

Related issues:
[] Faster utf-8 decoding
[] faster utf-8 decoding
[] Faster ascii decoding
[] Faster utf-16 decoder
[] Faster utf-32 decoder
[] Faster utf-8 decoding

Here are the benchmark results (numbers are speeds in MB/s).

On 32-bit Linux, AMD Athlon 64 X2 4600+ @ 2.4GHz:

                                    3.2           3.3(vanilla)  patched
utf-8  'A'*10000                    1199 (+69%)   1721 (+18%)   2032
utf-8  'A'*9999+'\x80'              1189 (+25%)   996 (+49%)    1488
utf-8  'A'*9999+'\u0100'            1192 (-25%)   887 (+1%)     894
utf-8  'A'*9999+'\u8000'            1178 (-24%)   888 (+0%)     890
utf-8  'A'*9999+'\U00010000'        1177 (-29%)   872 (-4%)     837
utf-8  '\x80'*10000                 220 (+74%)    172 (+122%)   382
utf-8  '\x80'+'A'*9999              1192 (+5%)    376 (+232%)   1250
utf-8  '\x80'*9999+'\u0100'         220 (+54%)    160 (+112%)   339
utf-8  '\x80'*9999+'\u8000'         220 (+54%)    160 (+112%)   339
utf-8  '\x80'*9999+'\U00010000'     221 (+49%)    176 (+88%)    330
utf-8  '\u0100'*10000               220 (+74%)    163 (+134%)   382
utf-8  '\u0100'+'A'*9999            1177 (+4%)    382 (+219%)   1220
utf-8  '\u0100'+'\x80'*9999         220 (+74%)    163 (+134%)   382
utf-8  '\u0100'*9999+'\u8000'       220 (+74%)    163 (+134%)   382
utf-8  '\u0100'*9999+'\U00010000'   220 (+50%)    180 (+83%)    330
utf-8  '\u8000'*10000               261 (+66%)    191 (+126%)   432
utf-8  '\u8000'+'A'*9999            1197 (+1%)    384 (+216%)   1212
utf-8  '\u8000'+'\x80'*9999         216 (+77%)    163 (+134%)   382
utf-8  '\u8000'+'\u0100'*9999       215 (+77%)    164 (+132%)   381
utf-8  '\u8000'*9999+'\U00010000'   261 (+46%)    201 (+89%)    380
utf-8  '\U00010000'*10000           248 (+44%)    198 (+80%)    357
utf-8  '\U00010000'+'A'*9999        1192 (-5%)    383 (+196%)   1135
utf-8  '\U00010000'+'\x80'*9999     220 (+73%)    180 (+111%)   380
utf-8  '\U00010000'+'\u0100'*9999   220 (+73%)    180 (+111%)   380
utf-8  '\U00010000'+'\u8000'*9999   261 (+54%)    201 (+100%)   403
ascii  'A'*10000                    233 (+971%)   1876 (+33%)   2496

On 32-bit Linux, Intel Atom N570 @ 1.66GHz:

                                    3.2           3.3(vanilla)  patched
utf-8  'A'*10000                    345 (+81%)    596 (+5%)     623
utf-8  'A'*9999+'\x80'              335 (+41%)    303 (+56%)    474
utf-8  'A'*9999+'\u0100'            336 (-23%)    123 (+110%)   258
utf-8  'A'*9999+'\u8000'            337 (-24%)    123 (+108%)   256
utf-8  'A'*9999+'\U00010000'        336 (-24%)    261 (-3%)     254
utf-8  '\x80'*10000                 88 (+66%)     65 (+125%)    146
utf-8  '\x80'+'A'*9999              334 (+8%)     124 (+190%)   360
utf-8  '\x80'*9999+'\u0100'         88 (+43%)     65 (+94%)     126
utf-8  '\x80'*9999+'\u8000'         88 (+43%)     65 (+94%)     126
utf-8  '\x80'*9999+'\U00010000'     89 (+40%)     65 (+92%)     125
utf-8  '\u0100'*10000               88 (+85%)     65 (+151%)    163
utf-8  '\u0100'+'A'*9999            336 (+2%)     77 (+345%)    343
utf-8  '\u0100'+'\x80'*9999         88 (+86%)     65 (+152%)    164
utf-8  '\u0100'*9999+'\u8000'       88 (+86%)     65 (+152%)    164
utf-8  '\u0100'*9999+'\U00010000'   88 (+57%)     65 (+112%)    138
utf-8  '\u8000'*10000               98 (+79%)     69 (+154%)    175
utf-8  '\u8000'+'A'*9999            339 (+3%)     77 (+353%)    349
utf-8  '\u8000'+'\x80'*9999         89 (+84%)     66 (+148%)    164
utf-8  '\u8000'+'\u0100'*9999       88 (+86%)     65 (+152%)    164
utf-8  '\u8000'*9999+'\U00010000'   98 (+58%)     69 (+125%)    155
utf-8  '\U00010000'*10000           104 (+46%)    79 (+92%)     152
utf-8  '\U00010000'+'A'*9999        339 (-5%)     124 (+160%)   323
utf-8  '\U00010000'+'\x80'*9999     88 (+84%)     68 (+138%)    162
utf-8  '\U00010000'+'\u0100'*9999   88 (+83%)     68 (+137%)    161
utf-8  '\U00010000'+'\u8000'*9999   98 (+63%)     72 (+122%)    160
ascii  'A'*10000                    132 (+499%)   758 (+4%)     791
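For readers who want a rough feel for how figures like these can be obtained, here is a minimal sketch of a decoding throughput measurement. It is not the benchmark tool used for this issue: the helper name `decode_speed_mb_s`, the loop counts, and the choice of 1 MB = 10**6 bytes are all assumptions made for illustration.

```python
import time


def decode_speed_mb_s(text, encoding="utf-8", repeat=5, loops=200):
    """Return the best observed decoding speed in MB/s.

    Hypothetical helper; 1 MB is taken as 10**6 bytes (an assumption).
    """
    data = text.encode(encoding)
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for _ in range(loops):
            data.decode(encoding)
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, time.perf_counter() - start)
    return len(data) * loops / best / 1e6


# A few inputs shaped like the rows in the tables above.
cases = {
    "'A'*10000": "A" * 10000,
    "'\\x80'*10000": "\x80" * 10000,
    "'\\u8000'+'A'*9999": "\u8000" + "A" * 9999,
    "'\\U00010000'*10000": "\U00010000" * 10000,
}

for label, text in cases.items():
    print(f"utf-8  {label:<28}{decode_speed_mb_s(text):8.0f}")
```

Absolute numbers depend heavily on the CPU, the build options, and the Python version, so only comparisons made on the same machine are meaningful.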
msg160107 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-06 20:01
64-bit Linux, Intel Core i5 2500K:

                                    3.2           3.3           patched
utf-8  'A'*10000                    2550 (+198%)  6828 (+11%)   7607
utf-8  'A'*9999+'\x80'              2501 (+118%)  2415 (+126%)  5456
utf-8  'A'*9999+'\u0100'            2501 (-20%)   2297 (-13%)   1996
utf-8  'A'*9999+'\u8000'            2494 (-14%)   2291 (-7%)    2133
utf-8  'A'*9999+'\U00010000'        2494 (-11%)   2293 (-3%)    2219
utf-8  '\x80'*10000                 422 (+135%)   517 (+92%)    991
utf-8  '\x80'+'A'*9999              2513 (+12%)   860 (+228%)   2820
utf-8  '\x80'*9999+'\u0100'         426 (+102%)   525 (+64%)    862
utf-8  '\x80'*9999+'\u8000'         426 (+104%)   538 (+62%)    871
utf-8  '\x80'*9999+'\U00010000'     428 (+105%)   523 (+68%)    878
utf-8  '\u0100'*10000               425 (+140%)   517 (+97%)    1019
utf-8  '\u0100'+'A'*9999            2488 (+2%)    820 (+211%)   2549
utf-8  '\u0100'+'\x80'*9999         426 (+139%)   517 (+97%)    1019
utf-8  '\u0100'*9999+'\u8000'       426 (+139%)   529 (+93%)    1019
utf-8  '\u0100'*9999+'\U00010000'   426 (+106%)   509 (+72%)    876
utf-8  '\u8000'*10000               573 (+28%)    490 (+50%)    733
utf-8  '\u8000'+'A'*9999            2500 (+1%)    822 (+208%)   2528
utf-8  '\u8000'+'\x80'*9999         426 (+139%)   530 (+92%)    1018
utf-8  '\u8000'+'\u0100'*9999       428 (+138%)   509 (+100%)   1018
utf-8  '\u8000'*9999+'\U00010000'   573 (+17%)    447 (+51%)    673
utf-8  '\U00010000'*10000           562 (+24%)    552 (+26%)    696
utf-8  '\U00010000'+'A'*9999        2512 (+3%)    939 (+175%)   2584
utf-8  '\U00010000'+'\x80'*9999     423 (+140%)   553 (+84%)    1017
utf-8  '\U00010000'+'\u0100'*9999   426 (+139%)   549 (+85%)    1017
utf-8  '\U00010000'+'\u8000'*9999   572 (+18%)    479 (+41%)    674
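The thread does not spell out what the percentages in parentheses mean. Judging from the arithmetic, they appear to be the change of the patched build relative to the speed in that column. The sketch below checks this reading against a few of the figures above; the function name `patched_gain_percent` is made up for illustration.

```python
def patched_gain_percent(baseline_mb_s, patched_mb_s):
    """Relative change of the patched speed versus a baseline, in whole percent."""
    return round((patched_mb_s / baseline_mb_s - 1) * 100)


# Core i5, utf-8 'A'*10000: 3.2 = 2550, 3.3 = 6828, patched = 7607
print(patched_gain_percent(2550, 7607))  # versus 3.2
print(patched_gain_percent(6828, 7607))  # versus 3.3

# Core i5, utf-8 'A'*9999+'\u0100': a case where the patched build is slower
print(patched_gain_percent(2501, 1996))
```

Under this reading, a negative value in the 3.2 column means the patched 3.3 build is still slower than 3.2 for that input.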
msg160110 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-06 21:48
Thank you, Antoine. Finally Intel Core is defeated! If someone wants to repeat the tests, see the benchmark tools in .
msg160112 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-06 22:11
The patch is updated in accordance with Antoine's cosmetic comments.
msg160305 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-09 16:50
There's a Mac-specific portion in the patch, it would be nice if someone could check that it works.
msg160306 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-09 18:05
It would be good if someone checked that command-line arguments, including invalid UTF-8, work on Macs. The difficulty is that you need to check on Macs with both 16-bit and 32-bit wchar_t.
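As background for the concern about invalid UTF-8 in command-line arguments: Python 3 decodes such arguments with the `surrogateescape` error handler (PEP 383), so undecodable bytes survive as lone surrogates and can be encoded back unchanged. The sketch below illustrates that behaviour at the Python level only; it does not exercise the Mac-specific C code path touched by the patch, and the sample bytes are made up.

```python
# A byte string that is not valid UTF-8: 0xE9 starts a multi-byte
# sequence but nothing follows it.
raw_arg = b"caf\xe9"

# Strict decoding rejects it.
try:
    raw_arg.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte 0xNN to U+DCNN.
decoded = raw_arg.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes.
roundtrip = decoded.encode("utf-8", "surrogateescape")

print(strict_ok, ascii(decoded), roundtrip == raw_arg)
```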
msg160307 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-09 18:32
Issue4388 is related to this Mac-specific portion of the patch.
msg160308 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-09 18:41
> It would be good if someone checked on Macs work with command line
> arguments, including non-valid utf8. The difficulty is that you need
> to check on both Macs with 16-bit and with 32-bit wchar_t.

Actually, it should be enough to run the test suite, since we should have tests for this. As for different wchar_t widths, that's the kind of thing we can leave to the buildbots (assuming our OS X buildbots come back alive some day :-)).
msg160309 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-09 19:29
I hacked the code (commented out "#if __APPLE__" in Objects/unicodeobject.c and Modules/python.c) to enable this branch on Linux and ran the test (test_cmd_line) with the C locale. It passed. Then I broke the decoder and ran the test again to check that it reports an error. I can now confirm that the code works correctly on a platform with a 32-bit wchar_t.
msg160311 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012-05-09 20:13
> Actually, it should be enough to run the test suite, since we should
> have tests for this.

I just ran the test suite ("python -m test") on OS X 10.6.8 with 'decode_utf8_5.patch' applied. (64-bit --with-pydebug build of Python.) No test failures.

test header:

== CPython 3.3.0a3+ (default:840cb46d0395+, May 9 2012, 20:55:18) [GCC 4.2.1 (Apple Inc. build 5664)]
== Darwin-10.8.0-i386-64bit little-endian
== /Users/mdickinson/Python/cpython/build/test_python_39794

Fragment of configure output relevant to wchar looked like this:

checking wchar.h usability... yes
checking wchar.h presence... yes
checking for wchar.h... yes
checking size of wchar_t... 4
checking for UCS-4 tcl... no
checking whether wchar_t is signed... yes
no usable wchar_t found
msg160312 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-09 20:18
> The difficulty is that you need to check on both Macs
> with 16-bit and with 32-bit wchar_t.

I don't think that the size of wchar_t is configurable: it should always be 32 bits on Mac OS X.
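For anyone wanting to confirm the wchar_t width on a given platform without reading configure output, a quick check from Python is possible with `ctypes`. This is only a convenience sketch and was not part of the testing done for this issue.

```python
import ctypes

# Size of the platform's C wchar_t, in bytes and bits.
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)
wchar_bits = wchar_bytes * 8

print(f"wchar_t is {wchar_bits} bits wide on this platform")
```

A result of 32 bits would match the "checking size of wchar_t... 4" line reported by configure.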
msg160346 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-05-10 14:38
New changeset e08c3791f035 by Antoine Pitrou in branch 'default': Issue #14738: Speed-up UTF-8 decoding on non-ASCII data. Patch by Serhiy Storchaka. http://hg.python.org/cpython/rev/e08c3791f035
msg160347 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-10 14:38
The patch is now committed. Well done and thanks for your contribution.
msg160447 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-11 19:45
Thanks to Martin for the review, which allowed me to make a quality patch, and for encouraging further research. Thanks to Antoine for the review, benchmarks, commit, and the original optimization, which served as the basis for my patch.
msg160462 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-05-12 07:09
If the commit makes Python 3.3 faster than Python 3.2, it is an optimisation that should be documented in the What's New in Python 3.3 document.
History
Date User Action Args
2022-04-11 14:57:29 admin set github: 58943
2012-05-12 07:09:09 vstinner set messages: +
2012-05-11 21:58:22 pitrou link issue14419 superseder
2012-05-11 21:58:22 pitrou unlink issue14419 dependencies
2012-05-11 21:58:14 pitrou link issue14419 dependencies
2012-05-11 19:45:44 serhiy.storchaka set messages: +
2012-05-10 14:38:47 pitrou set status: open -> closed; resolution: fixed; messages: +; stage: patch review -> resolved
2012-05-10 14:38:11 python-dev set nosy: + python-dev; messages: +
2012-05-09 20:18:21 vstinner set messages: +
2012-05-09 20:13:57 mark.dickinson set nosy: + mark.dickinson; messages: +
2012-05-09 19:29:53 serhiy.storchaka set messages: +
2012-05-09 18:41:36 pitrou set nosy: + janssen
2012-05-09 18:41:16 pitrou set messages: +
2012-05-09 18:32:09 serhiy.storchaka set messages: +
2012-05-09 18:05:08 serhiy.storchaka set messages: +
2012-05-09 16:50:50 pitrou set nosy: + ronaldoussoren, ned.deily; messages: +
2012-05-06 22:11:07 serhiy.storchaka set files: + decode_utf8_5.patch; messages: +
2012-05-06 21:48:10 serhiy.storchaka set messages: +
2012-05-06 20:01:02 pitrou set messages: +
2012-05-06 18:30:06 ezio.melotti set nosy: + ezio.melotti; components: + Unicode; stage: patch review
2012-05-06 18:00:54 serhiy.storchaka create