Issue 14738: Amazingly faster UTF-8 decoding

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	Arfrever, ezio.melotti, janssen, jcea, loewis, mark.dickinson, ned.deily, pitrou, python-dev, ronaldoussoren, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	patch

Created on 2012-05-06 18:00 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
decode_utf8_4.patch	serhiy.storchaka,2012-05-06 18:00	review
decode_utf8_5.patch	serhiy.storchaka,2012-05-06 22:11	review

Messages (15)
msg160103 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-06 18:00
I propose a complex patch that significantly speeds up UTF-8 decoding. The decoder is now faster than even the decoder in 3.2 (except in a few unrealistic pathological cases). The decoder code has also been reduced and simplified (previously the decoding code was repeated in at least three places). As a side effect, ASCII decoding is now faster on some platforms ().

Related issues:
[] Faster utf-8 decoding
[] faster utf-8 decoding
[] Faster ascii decoding
[] Faster utf-16 decoder
[] Faster utf-32 decoder
[] Faster utf-8 decoding

Here are the benchmark results (numbers are speeds in MB/s; the percentage next to each number is the change of the patched build relative to it).

On 32-bit Linux, AMD Athlon 64 X2 4600+ @ 2.4GHz:

                                   3.2           3.3(vanilla)  patched
utf-8  'A'*10000                   1199 (+69%)   1721 (+18%)   2032
utf-8  'A'*9999+'\x80'             1189 (+25%)   996 (+49%)    1488
utf-8  'A'*9999+'\u0100'           1192 (-25%)   887 (+1%)     894
utf-8  'A'*9999+'\u8000'           1178 (-24%)   888 (+0%)     890
utf-8  'A'*9999+'\U00010000'       1177 (-29%)   872 (-4%)     837
utf-8  '\x80'*10000                220 (+74%)    172 (+122%)   382
utf-8  '\x80'+'A'*9999             1192 (+5%)    376 (+232%)   1250
utf-8  '\x80'*9999+'\u0100'        220 (+54%)    160 (+112%)   339
utf-8  '\x80'*9999+'\u8000'        220 (+54%)    160 (+112%)   339
utf-8  '\x80'*9999+'\U00010000'    221 (+49%)    176 (+88%)    330
utf-8  '\u0100'*10000              220 (+74%)    163 (+134%)   382
utf-8  '\u0100'+'A'*9999           1177 (+4%)    382 (+219%)   1220
utf-8  '\u0100'+'\x80'*9999        220 (+74%)    163 (+134%)   382
utf-8  '\u0100'*9999+'\u8000'      220 (+74%)    163 (+134%)   382
utf-8  '\u0100'*9999+'\U00010000'  220 (+50%)    180 (+83%)    330
utf-8  '\u8000'*10000              261 (+66%)    191 (+126%)   432
utf-8  '\u8000'+'A'*9999           1197 (+1%)    384 (+216%)   1212
utf-8  '\u8000'+'\x80'*9999        216 (+77%)    163 (+134%)   382
utf-8  '\u8000'+'\u0100'*9999      215 (+77%)    164 (+132%)   381
utf-8  '\u8000'*9999+'\U00010000'  261 (+46%)    201 (+89%)    380
utf-8  '\U00010000'*10000          248 (+44%)    198 (+80%)    357
utf-8  '\U00010000'+'A'*9999       1192 (-5%)    383 (+196%)   1135
utf-8  '\U00010000'+'\x80'*9999    220 (+73%)    180 (+111%)   380
utf-8  '\U00010000'+'\u0100'*9999  220 (+73%)    180 (+111%)   380
utf-8  '\U00010000'+'\u8000'*9999  261 (+54%)    201 (+100%)   403
ascii  'A'*10000                   233 (+971%)   1876 (+33%)   2496

On 32-bit Linux, Intel Atom N570 @ 1.66GHz:

                                   3.2           3.3(vanilla)  patched
utf-8  'A'*10000                   345 (+81%)    596 (+5%)     623
utf-8  'A'*9999+'\x80'             335 (+41%)    303 (+56%)    474
utf-8  'A'*9999+'\u0100'           336 (-23%)    123 (+110%)   258
utf-8  'A'*9999+'\u8000'           337 (-24%)    123 (+108%)   256
utf-8  'A'*9999+'\U00010000'       336 (-24%)    261 (-3%)     254
utf-8  '\x80'*10000                88 (+66%)     65 (+125%)    146
utf-8  '\x80'+'A'*9999             334 (+8%)     124 (+190%)   360
utf-8  '\x80'*9999+'\u0100'        88 (+43%)     65 (+94%)     126
utf-8  '\x80'*9999+'\u8000'        88 (+43%)     65 (+94%)     126
utf-8  '\x80'*9999+'\U00010000'    89 (+40%)     65 (+92%)     125
utf-8  '\u0100'*10000              88 (+85%)     65 (+151%)    163
utf-8  '\u0100'+'A'*9999           336 (+2%)     77 (+345%)    343
utf-8  '\u0100'+'\x80'*9999        88 (+86%)     65 (+152%)    164
utf-8  '\u0100'*9999+'\u8000'      88 (+86%)     65 (+152%)    164
utf-8  '\u0100'*9999+'\U00010000'  88 (+57%)     65 (+112%)    138
utf-8  '\u8000'*10000              98 (+79%)     69 (+154%)    175
utf-8  '\u8000'+'A'*9999           339 (+3%)     77 (+353%)    349
utf-8  '\u8000'+'\x80'*9999        89 (+84%)     66 (+148%)    164
utf-8  '\u8000'+'\u0100'*9999      88 (+86%)     65 (+152%)    164
utf-8  '\u8000'*9999+'\U00010000'  98 (+58%)     69 (+125%)    155
utf-8  '\U00010000'*10000          104 (+46%)    79 (+92%)     152
utf-8  '\U00010000'+'A'*9999       339 (-5%)     124 (+160%)   323
utf-8  '\U00010000'+'\x80'*9999    88 (+84%)     68 (+138%)    162
utf-8  '\U00010000'+'\u0100'*9999  88 (+83%)     68 (+137%)    161
utf-8  '\U00010000'+'\u8000'*9999  98 (+63%)     72 (+122%)    160
ascii  'A'*10000                   132 (+499%)   758 (+4%)     791
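For readers who want comparable numbers on their own machine, here is a minimal sketch of how a decoding speed in MB/s could be measured. It is not the benchmark tool used for the tables in this message: the function name `decode_speed`, the `repeat` and `number` defaults, and the choice of 1 MB = 10**6 bytes are all assumptions made for illustration.

```python
import timeit


def decode_speed(text, encoding="utf-8", repeat=3, number=100):
    """Return the best observed decoding speed in MB/s (1 MB = 10**6 bytes).

    Hypothetical helper for illustration; not the tool used in this issue.
    """
    data = text.encode(encoding)
    # timeit.repeat returns one total time per run of `number` calls;
    # the minimum is the run least disturbed by other system activity.
    timings = timeit.repeat(lambda: data.decode(encoding),
                            repeat=repeat, number=number)
    return len(data) * number / min(timings) / 1e6


if __name__ == "__main__":
    # A few test strings in the same spirit as the benchmark rows.
    cases = [
        ("utf-8", "A" * 10000),
        ("utf-8", "A" * 9999 + "\x80"),
        ("utf-8", "\u8000" * 10000),
        ("utf-8", "\U00010000" + "A" * 9999),
        ("ascii", "A" * 10000),
    ]
    for encoding, text in cases:
        print(encoding, len(text), "chars:",
              round(decode_speed(text, encoding)), "MB/s")
```

The lambda adds a small constant call overhead to every measurement, so absolute figures from this sketch will be somewhat lower than those of a C-level or carefully tuned harness; relative comparisons between builds remain meaningful.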
msg160107 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-05-06 20:01
64-bit Linux, Intel Core i5 2500K:

                                   3.2           3.3           patched
utf-8  'A'*10000                   2550 (+198%)  6828 (+11%)   7607
utf-8  'A'*9999+'\x80'             2501 (+118%)  2415 (+126%)  5456
utf-8  'A'*9999+'\u0100'           2501 (-20%)   2297 (-13%)   1996
utf-8  'A'*9999+'\u8000'           2494 (-14%)   2291 (-7%)    2133
utf-8  'A'*9999+'\U00010000'       2494 (-11%)   2293 (-3%)    2219
utf-8  '\x80'*10000                422 (+135%)   517 (+92%)    991
utf-8  '\x80'+'A'*9999             2513 (+12%)   860 (+228%)   2820
utf-8  '\x80'*9999+'\u0100'        426 (+102%)   525 (+64%)    862
utf-8  '\x80'*9999+'\u8000'        426 (+104%)   538 (+62%)    871
utf-8  '\x80'*9999+'\U00010000'    428 (+105%)   523 (+68%)    878
utf-8  '\u0100'*10000              425 (+140%)   517 (+97%)    1019
utf-8  '\u0100'+'A'*9999           2488 (+2%)    820 (+211%)   2549
utf-8  '\u0100'+'\x80'*9999        426 (+139%)   517 (+97%)    1019
utf-8  '\u0100'*9999+'\u8000'      426 (+139%)   529 (+93%)    1019
utf-8  '\u0100'*9999+'\U00010000'  426 (+106%)   509 (+72%)    876
utf-8  '\u8000'*10000              573 (+28%)    490 (+50%)    733
utf-8  '\u8000'+'A'*9999           2500 (+1%)    822 (+208%)   2528
utf-8  '\u8000'+'\x80'*9999        426 (+139%)   530 (+92%)    1018
utf-8  '\u8000'+'\u0100'*9999      428 (+138%)   509 (+100%)   1018
utf-8  '\u8000'*9999+'\U00010000'  573 (+17%)    447 (+51%)    673
utf-8  '\U00010000'*10000          562 (+24%)    552 (+26%)    696
utf-8  '\U00010000'+'A'*9999       2512 (+3%)    939 (+175%)   2584
utf-8  '\U00010000'+'\x80'*9999    423 (+140%)   553 (+84%)    1017
utf-8  '\U00010000'+'\u0100'*9999  426 (+139%)   549 (+85%)    1017
utf-8  '\U00010000'+'\u8000'*9999  572 (+18%)    479 (+41%)    674
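The percentages in these results appear to be the relative change of the patched speed over the speed in the same column, which can be checked against the figures with a few lines of Python. The helper name `speedup_percent` is hypothetical, and rounding to the nearest integer is an assumption about how the original tool formatted its output.

```python
def speedup_percent(old, new):
    """Relative change of `new` over `old`, as a rounded percentage.

    Hypothetical helper; rounding to nearest is an assumption.
    """
    return round((new / old - 1) * 100)


# Figures taken from the Intel Core i5 2500K results above.
print(speedup_percent(2550, 7607))  # 3.2 vs patched, 'A'*10000
print(speedup_percent(6828, 7607))  # 3.3 vs patched, 'A'*10000
print(speedup_percent(2501, 1996))  # 3.2 vs patched, 'A'*9999+'\u0100'
```

A negative value marks a case where the patched decoder is slower than the build in that column, as in the third call.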
msg160110 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-06 21:48
Thank you, Antoine. Finally Intel Core is defeated! If someone wants to repeat the tests, see the benchmark tools in .
msg160112 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-06 22:11
The patch has been updated in accordance with Antoine's cosmetic comments.
msg160305 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-05-09 16:50
There's a Mac-specific portion in the patch; it would be nice if someone could check that it works.
msg160306 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-09 18:05
It would be good if someone checked on Macs that command-line arguments work, including invalid UTF-8. The difficulty is that you need to check on Macs with both 16-bit and 32-bit wchar_t.
msg160307 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-09 18:32
Issue4388 is related to this Mac-specific portion of the patch.
msg160308 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-05-09 18:41
> It would be good if someone checked on Macs work with command line
> arguments, including non-valid utf8. The difficulty is that you need
> to check on both Macs with 16-bit and with 32-bit wchar_t.

Actually, it should be enough to run the test suite, since we should have tests for this. As for different wchar_t widths, that's the kind of thing we can leave to the buildbots (assuming our OS X buildbots come back alive some day :-)).
msg160309 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-09 19:29
I hacked the code (commented out "#if __APPLE__" in Objects/unicodeobject.c and Modules/python.c) to enable this branch on Linux and ran the test (test_cmd_line) with the C locale. It passed. Then I broke the decoder and ran the test again to confirm that it fails. I can now confirm that the code works correctly on a platform with a 32-bit wchar_t.
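As background on what "invalid UTF-8 in command-line arguments" involves: CPython generally maps undecodable bytes in such arguments to lone surrogates using the surrogateescape error handler (PEP 383). The sketch below only illustrates that error handler at the Python level; it does not exercise the C code path touched by the patch, and the sample byte string is made up for illustration.

```python
# A byte string that is not valid UTF-8: 0xFF can never appear in UTF-8.
raw = b"abc\xff"

# Strict decoding rejects it.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate
# U+DCNN, so the original bytes can be recovered later.
text = raw.decode("utf-8", "surrogateescape")
round_trip = text.encode("utf-8", "surrogateescape")

print(strict_ok)         # False
print(ascii(text))       # 'abc\udcff'
print(round_trip == raw)  # True
```

A decoder that mishandled this mapping would break the round trip, which is the kind of failure a test such as test_cmd_line is meant to catch.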
msg160311 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2012-05-09 20:13
> Actually, it should be enough to run the test suite, since we should
> have tests for this.

I just ran the test suite ("python -m test") on OS X 10.6.8 with 'decode_utf8_5.patch' applied. (64-bit --with-pydebug build of Python.) No test failures.

test header:

== CPython 3.3.0a3+ (default:840cb46d0395+, May 9 2012, 20:55:18) [GCC 4.2.1 (Apple Inc. build 5664)]
== Darwin-10.8.0-i386-64bit little-endian
== /Users/mdickinson/Python/cpython/build/test_python_39794

Fragment of configure output relevant to wchar looked like this:

checking wchar.h usability... yes
checking wchar.h presence... yes
checking for wchar.h... yes
checking size of wchar_t... 4
checking for UCS-4 tcl... no
checking whether wchar_t is signed... yes
no usable wchar_t found
msg160312 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-09 20:18
> The difficulty is that you need to check on both Macs
> with 16-bit and with 32-bit wchar_t.

I don't think that the size of wchar_t is configurable: it should always be 32 bits on Mac OS X.
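Anyone unsure about the wchar_t width on a given platform can query it from Python without rebuilding anything. This is a small sketch using ctypes; the variable name `wchar_bits` is chosen here for illustration, and the result reflects the C compiler and platform Python was built with.

```python
import ctypes

# ctypes.c_wchar corresponds to the platform's C wchar_t type,
# so its size in bytes is sizeof(wchar_t).
wchar_bits = ctypes.sizeof(ctypes.c_wchar) * 8

print("wchar_t is", wchar_bits, "bits wide")
```

On typical Linux and Mac OS X builds this reports 32 bits, while Windows uses a 16-bit wchar_t.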
msg160346 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-05-10 14:38
New changeset e08c3791f035 by Antoine Pitrou in branch 'default': Issue #14738: Speed-up UTF-8 decoding on non-ASCII data. Patch by Serhiy Storchaka. http://hg.python.org/cpython/rev/e08c3791f035
msg160347 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-05-10 14:38
The patch is now committed. Well done and thanks for your contribution.
msg160447 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-05-11 19:45
Thanks, Martin, for the review, which allowed me to produce a quality patch, and for encouraging further research. Thanks, Antoine, for the review, benchmarks, and commit, and for the original optimization, which served as the basis for my patch.
msg160462 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-05-12 07:09
If the commit makes Python 3.3 faster than Python 3.2, it is an optimisation that should be documented in the What's New in Python 3.3 document.

History
Date	User	Action	Args
2022-04-11 14:57:29	admin	set	github: 58943
2012-05-12 07:09:09	vstinner	set	messages: +
2012-05-11 21:58:22	pitrou	link	issue14419 superseder
2012-05-11 21:58:22	pitrou	unlink	issue14419 dependencies
2012-05-11 21:58:14	pitrou	link	issue14419 dependencies
2012-05-11 19:45:44	serhiy.storchaka	set	messages: +
2012-05-10 14:38:47	pitrou	set	status: open -> closed; resolution: fixed; messages: +; stage: patch review -> resolved
2012-05-10 14:38:11	python-dev	set	nosy: + python-dev; messages: +
2012-05-09 20:18:21	vstinner	set	messages: +
2012-05-09 20:13:57	mark.dickinson	set	nosy: + mark.dickinson; messages: +
2012-05-09 19:29:53	serhiy.storchaka	set	messages: +
2012-05-09 18:41:36	pitrou	set	nosy: + janssen
2012-05-09 18:41:16	pitrou	set	messages: +
2012-05-09 18:32:09	serhiy.storchaka	set	messages: +
2012-05-09 18:05:08	serhiy.storchaka	set	messages: +
2012-05-09 16:50:50	pitrou	set	nosy: + ronaldoussoren, ned.deily; messages: +
2012-05-06 22:11:07	serhiy.storchaka	set	files: + decode_utf8_5.patch; messages: +
2012-05-06 21:48:10	serhiy.storchaka	set	messages: +
2012-05-06 20:01:02	pitrou	set	messages: +
2012-05-06 18:30:06	ezio.melotti	set	nosy: + ezio.melotti; components: + Unicode; stage: patch review
2012-05-06 18:00:54	serhiy.storchaka	create