Issue 15026: Faster UTF-16 encoding

Created on 2012-06-07 13:56 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
encode-utf16.patch	serhiy.storchaka,2012-06-07 13:56	review
encode-utf16-2.patch	serhiy.storchaka,2012-06-15 19:35	review

Messages (11)
msg162473 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-06-07 13:56
Here is a companion patch that speeds up UTF-16 encoding several times over. In addition, it fixes an unsafe integer overflow check.

Here are the benchmark results. See the benchmark tools in the https://bitbucket.org/storchaka/cpython-stuff repository.

On 32-bit Linux, AMD Athlon 64 X2 4600+ @ 2.4GHz:

Py2.7         Py3.2         Py3.3         patched
457 (+575%)   458 (+573%)   1077 (+186%)  3083     encode utf-16le  'A'*10000
457 (+579%)   493 (+529%)   1084 (+186%)  3102     encode utf-16le  '\x80'*10000
489 (+534%)   458 (+577%)   1081 (+187%)  3102     encode utf-16le  '\x80'+'A'*9999
457 (+1261%)  493 (+1161%)  1116 (+457%)  6219     encode utf-16le  '\u0100'*10000
489 (+1266%)  458 (+1358%)  1126 (+493%)  6678     encode utf-16le  '\u0100'+'A'*9999
489 (+1263%)  458 (+1355%)  1129 (+490%)  6666     encode utf-16le  '\u0100'+'\x80'*9999
457 (+1240%)  493 (+1142%)  1118 (+448%)  6125     encode utf-16le  '\u8000'*10000
489 (+1271%)  458 (+1363%)  1127 (+495%)  6702     encode utf-16le  '\u8000'+'A'*9999
489 (+1271%)  458 (+1364%)  1129 (+494%)  6705     encode utf-16le  '\u8000'+'\x80'*9999
489 (+1135%)  458 (+1218%)  1136 (+432%)  6038     encode utf-16le  '\u8000'+'\u0100'*9999
498 (+128%)   505 (+125%)   630 (+80%)    1137     encode utf-16le  '\U00010000'*10000
489 (+35%)    458 (+44%)    360 (+83%)    659      encode utf-16le  '\U00010000'+'A'*9999
489 (+35%)    458 (+44%)    359 (+84%)    660      encode utf-16le  '\U00010000'+'\x80'*9999
489 (+36%)    458 (+45%)    361 (+84%)    663      encode utf-16le  '\U00010000'+'\u0100'*9999
489 (+36%)    458 (+45%)    361 (+84%)    663      encode utf-16le  '\U00010000'+'\u8000'*9999
447 (+507%)   493 (+450%)   1086 (+150%)  2712     encode utf-16be  'A'*10000
447 (+513%)   493 (+456%)   1080 (+154%)  2739     encode utf-16be  '\x80'*10000
489 (+458%)   458 (+496%)   1079 (+153%)  2729     encode utf-16be  '\x80'+'A'*9999
447 (+498%)   494 (+441%)   1118 (+139%)  2672     encode utf-16be  '\u0100'*10000
489 (+464%)   458 (+502%)   1128 (+144%)  2756     encode utf-16be  '\u0100'+'A'*9999
489 (+463%)   458 (+502%)   1131 (+144%)  2755     encode utf-16be  '\u0100'+'\x80'*9999
447 (+500%)   493 (+444%)   1119 (+139%)  2680     encode utf-16be  '\u8000'*10000
489 (+463%)   458 (+502%)   1126 (+145%)  2755     encode utf-16be  '\u8000'+'A'*9999
489 (+464%)   458 (+502%)   1129 (+144%)  2757     encode utf-16be  '\u8000'+'\x80'*9999
489 (+479%)   458 (+518%)   1137 (+149%)  2829     encode utf-16be  '\u8000'+'\u0100'*9999
499 (+102%)   506 (+99%)    630 (+60%)    1009     encode utf-16be  '\U00010000'*10000
489 (+6%)     458 (+13%)    360 (+44%)    519      encode utf-16be  '\U00010000'+'A'*9999
489 (+6%)     458 (+13%)    359 (+44%)    518      encode utf-16be  '\U00010000'+'\x80'*9999
489 (+6%)     458 (+13%)    361 (+44%)    519      encode utf-16be  '\U00010000'+'\u0100'*9999
489 (+6%)     458 (+13%)    361 (+44%)    519      encode utf-16be  '\U00010000'+'\u8000'*9999
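For readers who want to try comparable measurements locally, here is a minimal timing sketch built on the standard `timeit` module. It is not the benchmark tool from the repository linked above; the helper name `bench_encode` and the repeat counts are assumptions chosen for illustration, and it reports time per call rather than the figures used in the table.

```python
import timeit


def bench_encode(text, encoding, repeat=5, number=1000):
    """Return the best observed time in seconds for one text.encode(encoding) call.

    Hypothetical helper for illustration; not part of the patch or its tools.
    """
    timer = timeit.Timer(lambda: text.encode(encoding))
    # The minimum over several runs is the least disturbed by other processes.
    return min(timer.repeat(repeat=repeat, number=number)) / number


# One representative string per code point range used in the benchmark rows.
cases = ["A" * 10000, "\x80" * 10000, "\u0100" * 10000, "\U00010000" * 10000]

for text in cases:
    for encoding in ("utf-16le", "utf-16be"):
        seconds = bench_encode(text, encoding)
        label = f"{ascii(text[:1])}*{len(text)}"
        print(f"{seconds * 1e6:10.2f} us  encode {encoding}  {label}")
```

Absolute numbers depend heavily on the machine and the interpreter build, so only comparisons made on the same system are meaningful.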
msg162701 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-06-13 09:37
Here are results under 64-bit Linux on a Core i5-2500K:

3.3           patched
3327 (+360%)  15304    encode utf-16le  'A'*10000
3314 (+335%)  14413    encode utf-16le  '\x80'*10000
3315 (+578%)  22472    encode utf-16le  '\x80'+'A'*9999
2390 (+668%)  18345    encode utf-16le  '\u0100'*10000
2390 (+668%)  18364    encode utf-16le  '\u0100'+'A'*9999
2324 (+684%)  18219    encode utf-16le  '\u0100'+'\x80'*9999
2385 (+664%)  18227    encode utf-16le  '\u8000'*10000
2390 (+669%)  18383    encode utf-16le  '\u8000'+'A'*9999
2390 (+663%)  18232    encode utf-16le  '\u8000'+'\x80'*9999
2385 (+601%)  16708    encode utf-16le  '\u8000'+'\u0100'*9999
1601 (-4%)    1542     encode utf-16le  '\U00010000'*10000
1209 (+20%)   1448     encode utf-16le  '\U00010000'+'A'*9999
1210 (+20%)   1447     encode utf-16le  '\U00010000'+'\x80'*9999
1209 (+20%)   1446     encode utf-16le  '\U00010000'+'\u0100'*9999
1209 (+20%)   1446     encode utf-16le  '\U00010000'+'\u8000'*9999
3237 (+562%)  21422    encode utf-16be  'A'*10000
3294 (+500%)  19779    encode utf-16be  '\x80'*10000
3290 (+357%)  15036    encode utf-16be  '\x80'+'A'*9999
2382 (+209%)  7354     encode utf-16be  '\u0100'*10000
2381 (+208%)  7342     encode utf-16be  '\u0100'+'A'*9999
2377 (+209%)  7347     encode utf-16be  '\u0100'+'\x80'*9999
2382 (+207%)  7317     encode utf-16be  '\u8000'*10000
2381 (+208%)  7343     encode utf-16be  '\u8000'+'A'*9999
2376 (+209%)  7343     encode utf-16be  '\u8000'+'\x80'*9999
2377 (+206%)  7281     encode utf-16be  '\u8000'+'\u0100'*9999
1598 (-42%)   930      encode utf-16be  '\U00010000'*10000
1208 (+19%)   1436     encode utf-16be  '\U00010000'+'A'*9999
1208 (+19%)   1436     encode utf-16be  '\U00010000'+'\x80'*9999
1205 (+19%)   1434     encode utf-16be  '\U00010000'+'\u0100'*9999
1205 (+19%)   1433     encode utf-16be  '\U00010000'+'\u8000'*9999
msg162822 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-06-14 20:29
Thank you, Antoine.

> 3327 (+360%) 15304 encode utf-16le 'A'*10000
> 3314 (+335%) 14413 encode utf-16le '\x80'*10000
> 3290 (+357%) 15036 encode utf-16be '\x80'+'A'*9999

These must be fluctuations (of 30-40%). The same code is used for all UCS1 strings.

> 1598 (-42%) 930 encode utf-16be '\U00010000'*10000

This is most likely a fluctuation too. The code for non-BMP characters differs from the code for other characters in a UCS4 string, but a 1.5x difference is unlikely. Is this result reproducible?

On 32-bit Linux, Intel Atom N570 @ 1.66GHz:

Py2.7         Py3.2         Py3.3         patched
273 (+229%)   274 (+227%)   333 (+169%)   897      encode utf-16le  'A'*10000
274 (+226%)   275 (+225%)   334 (+168%)   894      encode utf-16le  '\x80'*10000
274 (+231%)   275 (+230%)   334 (+172%)   908      encode utf-16le  '\x80'+'A'*9999
273 (+752%)   275 (+746%)   276 (+743%)   2326     encode utf-16le  '\u0100'*10000
274 (+695%)   275 (+692%)   276 (+689%)   2177     encode utf-16le  '\u0100'+'A'*9999
274 (+739%)   275 (+736%)   276 (+733%)   2300     encode utf-16le  '\u0100'+'\x80'*9999
274 (+739%)   275 (+736%)   276 (+733%)   2298     encode utf-16le  '\u8000'*10000
274 (+697%)   274 (+697%)   276 (+691%)   2184     encode utf-16le  '\u8000'+'A'*9999
274 (+741%)   274 (+741%)   277 (+731%)   2303     encode utf-16le  '\u8000'+'\x80'*9999
274 (+770%)   275 (+767%)   276 (+764%)   2384     encode utf-16le  '\u8000'+'\u0100'*9999
279 (+51%)    279 (+51%)    217 (+94%)    422      encode utf-16le  '\U00010000'*10000
274 (+6%)     274 (+6%)     162 (+79%)    290      encode utf-16le  '\U00010000'+'A'*9999
274 (+6%)     274 (+6%)     162 (+79%)    290      encode utf-16le  '\U00010000'+'\x80'*9999
273 (+5%)     275 (+5%)     162 (+78%)    288      encode utf-16le  '\U00010000'+'\u0100'*9999
274 (+5%)     275 (+5%)     162 (+78%)    288      encode utf-16le  '\U00010000'+'\u8000'*9999
274 (+152%)   275 (+151%)   334 (+107%)   690      encode utf-16be  'A'*10000
274 (+154%)   275 (+153%)   334 (+109%)   697      encode utf-16be  '\x80'*10000
274 (+152%)   275 (+151%)   333 (+108%)   691      encode utf-16be  '\x80'+'A'*9999
274 (+146%)   275 (+145%)   276 (+145%)   675      encode utf-16be  '\u0100'*10000
274 (+146%)   275 (+145%)   276 (+145%)   675      encode utf-16be  '\u0100'+'A'*9999
274 (+145%)   275 (+144%)   276 (+143%)   671      encode utf-16be  '\u0100'+'\x80'*9999
274 (+145%)   275 (+144%)   276 (+143%)   672      encode utf-16be  '\u8000'*10000
275 (+147%)   275 (+147%)   276 (+146%)   680      encode utf-16be  '\u8000'+'A'*9999
274 (+146%)   275 (+145%)   276 (+144%)   674      encode utf-16be  '\u8000'+'\x80'*9999
275 (+143%)   275 (+143%)   276 (+142%)   667      encode utf-16be  '\u8000'+'\u0100'*9999
279 (+26%)    279 (+26%)    217 (+62%)    351      encode utf-16be  '\U00010000'*10000
274 (-2%)     275 (-3%)     162 (+65%)    268      encode utf-16be  '\U00010000'+'A'*9999
274 (-2%)     275 (-3%)     162 (+65%)    268      encode utf-16be  '\U00010000'+'\x80'*9999
274 (-4%)     275 (-4%)     162 (+63%)    264      encode utf-16be  '\U00010000'+'\u0100'*9999
274 (-3%)     275 (-4%)     162 (+64%)    265      encode utf-16be  '\U00010000'+'\u8000'*9999
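The remark about non-BMP characters can be illustrated with a short sketch. UTF-16 encodes a BMP character as one 16-bit code unit, while a character above U+FFFF needs a surrogate pair, which is why an encoder has to treat such characters separately. The helper names `utf16le_units` and `surrogate_pair` are assumptions made for this example; they do not come from the patch.

```python
def utf16le_units(ch):
    """Return the UTF-16 code units of a string as a list of integers."""
    data = ch.encode("utf-16le")
    # Each code unit is two bytes, stored least significant byte first.
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its (high, low) surrogates."""
    offset = code_point - 0x10000
    # Top 10 bits go to the high surrogate, low 10 bits to the low surrogate.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


for ch in ("A", "\x80", "\u0100", "\u8000", "\U00010000"):
    units = utf16le_units(ch)
    print(ascii(ch), [hex(u) for u in units])

# The codec's output for a non-BMP character matches the manual computation.
assert tuple(utf16le_units("\U00010000")) == surrogate_pair(0x10000)
```

Every character in the first four test strings maps to exactly one code unit, so only the `'\U00010000'` rows exercise the surrogate path.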
msg162924 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-06-15 17:34
Serhiy, the tests crash here in debug mode:

$ ./python -m test -v test_unicode
== CPython 3.3.0a4+ (default:b17c8005e08a+, Jun 15 2012, 19:28:56) [GCC 4.5.2]
== Linux-2.6.38.8-desktop-10.mga-x86_64-with-mandrake-1-Official little-endian
== /home/antoine/cpython/default/build/test_python_2567
Testing with flags: sys.flags(debug=0, inspect=0, interactive=0, optimize=0, dont_write_bytecode=0, no_user_site=0, no_site=0, ignore_environment=0, verbose=0, bytes_warning=0, quiet=0, hash_randomization=1)
[1/1] test_unicode
test_formatter_field_name_split (test.test_unicode.StringModuleTest) ... ok
test_formatter_parser (test.test_unicode.StringModuleTest) ... ok
test___contains__ (test.test_unicode.UnicodeTest) ... ok
test_additional_rsplit (test.test_unicode.UnicodeTest) ... ok
test_additional_split (test.test_unicode.UnicodeTest) ... ok
test_ascii (test.test_unicode.UnicodeTest) ... ok
test_aswidechar (test.test_unicode.UnicodeTest) ... ok
test_aswidecharstring (test.test_unicode.UnicodeTest) ... ok
test_bug1001011 (test.test_unicode.UnicodeTest) ... ok
test_bytes_comparison (test.test_unicode.UnicodeTest) ... ok
test_capitalize (test.test_unicode.UnicodeTest) ... ok
test_casefold (test.test_unicode.UnicodeTest) ... ok
test_center (test.test_unicode.UnicodeTest) ... ok
test_codecs (test.test_unicode.UnicodeTest) ... python: Objects/unicodeobject.c:5401: _PyUnicode_EncodeUTF16: Assertion `(Py_uintptr_t)(((((((((PyObject)(v))->ob_type))->tp_flags & ((1L<<27))) != 0)) ? (void) (0) : __assert_fail ("((((((PyObject)(v))->ob_type))->tp_flags & ((1L<<27))) != 0)", "Objects/unicodeobject.c", 5401, __PRETTY_FUNCTION__)), (((PyBytesObject *)(v))->ob_sval)) & 1 == 0' failed.
Fatal Python error: Aborted

Current thread 0x00007faa4980e700:
  File "/home/antoine/cpython/default/Lib/test/test_unicode.py", line 1443 in test_codecs
  File "/home/antoine/cpython/default/Lib/unittest/case.py", line 385 in _executeTestPart
  File "/home/antoine/cpython/default/Lib/unittest/case.py", line 440 in run
  File "/home/antoine/cpython/default/Lib/unittest/case.py", line 492 in __call__
  File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 105 in run
  File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 67 in __call__
  File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 105 in run
  File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 67 in __call__
  File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 105 in run
  File "/home/antoine/cpython/default/Lib/unittest/suite.py", line 67 in __call__
  File "/home/antoine/cpython/default/Lib/unittest/runner.py", line 168 in run
  File "/home/antoine/cpython/default/Lib/test/support.py", line 1383 in _run_suite
  File "/home/antoine/cpython/default/Lib/test/support.py", line 1417 in run_unittest
  File "/home/antoine/cpython/default/Lib/test/test_unicode.py", line 1954 in test_main
  File "/home/antoine/cpython/default/Lib/test/regrtest.py", line 1237 in runtest_inner
  File "/home/antoine/cpython/default/Lib/test/regrtest.py", line 918 in runtest
  File "/home/antoine/cpython/default/Lib/test/regrtest.py", line 710 in main
  File "/home/antoine/cpython/default/Lib/test/__main__.py", line 13 in
  File "/home/antoine/cpython/default/Lib/runpy.py", line 75 in _run_code
  File "/home/antoine/cpython/default/Lib/runpy.py", line 162 in _run_module_as_main
Abandon
msg162929 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-06-15 19:35
> Serhiy, the tests crash here in debug mode:

My fault. It's an operator precedence issue in the assert expression. GCC warns about it:

Objects/unicodeobject.c: In function ‘_PyUnicode_EncodeUTF16’:
Objects/unicodeobject.c:5401: warning: suggest parentheses around comparison in operand of ‘&’

Here is a fixed patch.
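To see why the assertion fired, recall that in C `==` binds more tightly than `&`, so `x & 1 == 0` is parsed as `x & (1 == 0)`, which is always zero. The sketch below spells out both groupings in Python. Python's own precedence differs from C's here (its `&` binds more tightly than comparisons), so the parentheses are written explicitly; the function names are hypothetical and only model the expression, not the actual code in the patch.

```python
def alignment_check_as_c_parses_it(address):
    """Model of C's reading of `address & 1 == 0`: the comparison runs first."""
    # In C, (1 == 0) evaluates to 0, so the whole expression is address & 0.
    return address & int(1 == 0)


def alignment_check_as_intended(address):
    """The intended test: is the address 2-byte aligned (lowest bit clear)?"""
    return (address & 1) == 0


for address in (0x1000, 0x1001):
    print(
        hex(address),
        "as parsed by C:", bool(alignment_check_as_c_parses_it(address)),
        "as intended:", alignment_check_as_intended(address),
    )
```

Under the C grouping the condition is false for every address, aligned or not, which matches an assertion that fails in debug builds even when the buffer is correctly aligned.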
msg162930 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-06-15 20:18
New changeset acca141fda80 by Antoine Pitrou in branch 'default': Issue #15026: utf-16 encoding is now significantly faster (up to 10x). http://hg.python.org/cpython/rev/acca141fda80
msg162931 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-06-15 20:19
Thank you for the quick turnaround! The patch is now pushed in 3.3.
msg162933 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-06-15 20:21
It would be nice to mention the improvement in the What's New in Python 3.3 doc (Optimizations section).
msg162934 - (view)	Author: Roundup Robot (python-dev)	Date: 2012-06-15 20:25
New changeset 35667fc5f785 by Antoine Pitrou in branch 'default': Mention the UTF-16 encoding speedup in the whatsnew (issue #15026). http://hg.python.org/cpython/rev/35667fc5f785
msg162960 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-06-16 08:43
Thank you for pushing. :-) Are you interested in a faster UTF-32 codec?
msg162961 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-06-16 09:03
> Thank you for pushing. :-) Are you interested in a faster UTF-32 codec?

Not much :) I know you posted issues on that, but I think UTF-32 is quite low priority.

History
Date	User	Action	Args
2022-04-11 14:57:31	admin	set	github: 59231
2012-06-16 09:03:30	pitrou	set	messages: +
2012-06-16 08:43:11	serhiy.storchaka	set	messages: +
2012-06-15 20:25:25	python-dev	set	messages: +
2012-06-15 20:21:43	vstinner	set	messages: +
2012-06-15 20:19:14	pitrou	set	status: open -> closed; resolution: fixed; messages: +; stage: resolved
2012-06-15 20:18:32	python-dev	set	nosy: + python-dev; messages: +
2012-06-15 19:35:12	serhiy.storchaka	set	files: + encode-utf16-2.patch; messages: +
2012-06-15 17:34:47	pitrou	set	messages: +
2012-06-14 20:29:52	serhiy.storchaka	set	messages: +
2012-06-13 09:37:49	pitrou	set	messages: +
2012-06-07 13:56:13	serhiy.storchaka	create