Issue 14654: Faster utf-8 decoding


Created on 2012-04-23 21:04 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name            Uploaded                            Description
decode_utf8.patch    serhiy.storchaka, 2012-04-23 21:04
utf8-signed.diff     serhiy.storchaka, 2012-04-24 08:37  Non-recommended hacking
decode_utf8_2.patch  serhiy.storchaka, 2012-04-24 17:46
decode_utf8_3.patch  serhiy.storchaka, 2012-04-24 17:46
Messages (19)
msg159080 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-23 21:04
The utf-8 decoder is already well optimized. I propose a patch, which accelerates the utf-8 decoder for some of the frequent cases even more (+10-30%). In particular, for 2-bites non-latin1 codes will get about +30%. This is not the final result of optimization. It may be possible to optimize the decoding of the ascii and mostly-ascii text (up to the speed of memcpy), decoding of text with occasional errors, reduce code duplication. But I'm not sure of the success. Related issues: [] Faster utf-8 decoding [] faster utf-8 decoding [] Faster ascii decoding [] Faster utf-16 decoder [] Faster utf-32 decoder
msg159082 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-23 21:06
Here are the results of benchmarking (numbers in MB/s).

On 32-bit Linux, AMD Athlon 64 X2 4600+ @ 2.4GHz:

                                        Py2.7          Py3.2          Py3.3      patch
utf-8  'A'*10000                   191 (+790%)   1170  (+45%)   1664   (+2%)    1700
utf-8  '\x80'*10000                187   (+4%)    219  (-11%)    172  (+13%)     194
utf-8  '\x80'+'A'*9999             191  (+98%)   1152  (-67%)    376   (+1%)     378
utf-8  '\u0100'*10000              188  (+15%)    221   (-2%)    164  (+32%)     217
utf-8  '\u0100'+'A'*9999           191 (+103%)   1150  (-66%)    382   (+1%)     387
utf-8  '\u0100'+'\x80'*9999        188  (+15%)    221   (-2%)    164  (+32%)     217
utf-8  '\u8000'*10000              244  (-12%)    263  (-18%)    191  (+13%)     215
utf-8  '\u8000'+'A'*9999           191 (+102%)   1174  (-67%)    382   (+1%)     386
utf-8  '\u8000'+'\x80'*9999        188  (+15%)    216   (+0%)    164  (+32%)     217
utf-8  '\u8000'+'\u0100'*9999      188  (+15%)    216   (+0%)    164  (+32%)     217
utf-8  '\U00010000'*10000          251  (-15%)    248  (-14%)    199   (+7%)     213
utf-8  '\U00010000'+'A'*9999       191  (+97%)   1173  (-68%)    372   (+1%)     376
utf-8  '\U00010000'+'\x80'*9999    188  (+21%)    221   (+3%)    180  (+26%)     227
utf-8  '\U00010000'+'\u0100'*9999  188  (+21%)    221   (+3%)    180  (+26%)     227
utf-8  '\U00010000'+'\u8000'*9999  244   (-9%)    263  (-16%)    201  (+10%)     221

On 32-bit Linux, Intel Atom N570 @ 1.66GHz:

                                        Py2.7          Py3.2          Py3.3      patch
utf-8  'A'*10000                   117 (+414%)    349  (+72%)    597   (+1%)     601
utf-8  '\x80'*10000                 86   (-5%)     89   (-8%)     67  (+22%)      82
utf-8  '\x80'+'A'*9999             117   (+6%)    340  (-64%)    126   (-2%)     124
utf-8  '\u0100'*10000               86   (-2%)     89   (-6%)     66  (+27%)      84
utf-8  '\u0100'+'A'*9999           117   (+5%)    339  (-64%)     78  (+58%)     123
utf-8  '\u0100'+'\x80'*9999         86   (-2%)     89   (-6%)     66  (+27%)      84
utf-8  '\u8000'*10000              109  (-26%)     98  (-17%)     71  (+14%)      81
utf-8  '\u8000'+'A'*9999           116   (+7%)    339  (-63%)     78  (+59%)     124
utf-8  '\u8000'+'\x80'*9999         86   (-3%)     89   (-7%)     66  (+26%)      83
utf-8  '\u8000'+'\u0100'*9999       86   (-3%)     89   (-7%)     66  (+26%)      83
utf-8  '\U00010000'*10000          106  (-14%)    105  (-13%)     81  (+12%)      91
utf-8  '\U00010000'+'A'*9999       116  (+12%)    338  (-62%)    127   (+2%)     130
utf-8  '\U00010000'+'\x80'*9999     86   (+6%)     88   (+3%)     69  (+32%)      91
utf-8  '\U00010000'+'\u0100'*9999   86   (+6%)     88   (+3%)     69  (+32%)      91
utf-8  '\U00010000'+'\u8000'*9999  109  (-24%)     98  (-15%)     74  (+12%)      83

The results are mixed (a gain everywhere, but of varying size). I would like to see the results for 64-bit platforms. For scripts see .
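The benchmark scripts themselves are not attached to this chunk; a minimal harness in the same spirit (the `decode_mbps` helper is my own name, and the numbers it prints are MB/s as in the tables above) could look like:

```python
import timeit

def decode_mbps(text: str, number: int = 100, repeat: int = 5) -> float:
    """Decoding throughput in MB/s for one test string, best of `repeat` runs."""
    data = text.encode("utf-8")
    best = min(timeit.repeat(lambda: data.decode("utf-8"),
                             number=number, repeat=repeat))
    return len(data) * number / best / 2**20

# a few of the shapes exercised in the tables above
for sample in ("A" * 10000, "\x80" * 10000, "\u0100" + "A" * 9999):
    print("utf-8  %-20r  %6.0f MB/s" % (sample[:3] + "...", decode_mbps(sample)))
```

Taking the minimum over several repeats is the usual way to suppress scheduler noise when comparing decoder variants.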
msg159085 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-23 21:23
64-bit Linux, Intel Core i5-2500K CPU @ 3.30GHz:

                                   vanilla 3.3    patched
utf-8  'A'*10000                  6668   (+7%)      7145
utf-8  'A'*9999+'\x80'            2358   (+3%)      2418
utf-8  'A'*9999+'\u0100'          2306   (+0%)      2311
utf-8  'A'*9999+'\u8000'          2299   (+0%)      2309
utf-8  'A'*9999+'\U00010000'      2373   (-4%)      2278
utf-8  '\x80'*10000                366  (+53%)       559
utf-8  '\x80'+'A'*9999             859   (+1%)       868
utf-8  '\x80'*9999+'\u0100'        529   (+5%)       558
utf-8  '\x80'*9999+'\u8000'        529   (+5%)       558
utf-8  '\x80'*9999+'\U00010000'    529   (+5%)       558
utf-8  '\u0100'*10000              520   (+6%)       549
utf-8  '\u0100'+'A'*9999           822   (+0%)       823
utf-8  '\u0100'+'\x80'*9999        519   (+6%)       549
utf-8  '\u0100'*9999+'\u8000'      520   (+6%)       549
utf-8  '\u0100'*9999+'\U00010000'  520   (+6%)       549
utf-8  '\u8000'*10000              470   (+4%)       491
utf-8  '\u8000'+'A'*9999           822   (+0%)       822
utf-8  '\u8000'+'\x80'*9999        509   (+8%)       549
utf-8  '\u8000'+'\u0100'*9999      509   (+8%)       549
utf-8  '\u8000'*9999+'\U00010000'  470   (-4%)       451
utf-8  '\U00010000'*10000          483   (-6%)       453
utf-8  '\U00010000'+'A'*9999       938   (-1%)       926
utf-8  '\U00010000'+'\x80'*9999    561   (+6%)       595
utf-8  '\U00010000'+'\u0100'*9999  561   (+6%)       595
utf-8  '\U00010000'+'\u8000'*9999  503   (-4%)       482
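The huge throughput of the pure-ASCII case relative to the rest comes from an ASCII fast path that inspects a machine word at a time rather than byte by byte. A toy Python model of that check (the `ascii_prefix_len` helper is illustrative only, not CPython's actual C code):

```python
ASCII_MASK = 0x8080808080808080  # high bit of each byte in a 64-bit word

def ascii_prefix_len(data: bytes) -> int:
    """Length of the leading pure-ASCII run, scanning 8 bytes at a time."""
    i, n = 0, len(data)
    while i + 8 <= n:
        word = int.from_bytes(data[i:i + 8], "little")
        if word & ASCII_MASK:   # some byte in this word has its high bit set
            break
        i += 8
    while i < n and data[i] < 0x80:  # finish the tail byte by byte
        i += 1
    return i
```

One masked comparison clears 8 input bytes at once, which is why an ASCII-only string decodes several times faster than one containing even a single non-ASCII character.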
msg159086 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-04-23 21:33
> 64-bit Linux, Intel Core i5-2500K CPU @ 3.30GHz: (...)

Hum, the patch doesn't look very interesting if it only optimizes one specific case:

> utf-8     '\x80'*10000                    366 (+53%)    559
msg159124 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-24 08:27
Thank you, Antoine. It is an interesting result that the case which sped up only a little on 32 bits is greatly accelerated on 64 bits. It was a pathology that 2-byte sequences decoding to UCS1 were 1.5x slower than 2-byte sequences decoding to UCS2. Interestingly, are the small accelerations in the other cases random deviations or a consequential effect? The difference for ASCII-only text looks strange, since that branch is not affected by the patch (except as a consequence of global optimization). The slowdown in decoding 4-byte data is expected.

Here is a patch which uses a risky trick with signed numbers. For me it shows an acceleration of a few percent compared with the previous patch. But I cannot recommend it; it looks too hackish for such a small improvement. It will not work on exotic platforms where signed numbers are not implemented as two's complement (but Python does not support such platforms).
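For readers curious what the signed trick buys: reinterpreted as a signed char, the UTF-8 continuation bytes 0x80..0xBF are exactly the values below -0x40, so the usual two-sided range check collapses into a single comparison. A small Python model of the equivalence (the helper names are mine, not from the patch):

```python
def as_signed(b: int) -> int:
    """Reinterpret an unsigned byte 0..255 as a two's-complement signed char."""
    return b - 256 if b >= 0x80 else b

def is_continuation(b: int) -> bool:
    """One signed comparison replaces the range check 0x80 <= b <= 0xBF."""
    return as_signed(b) < -0x40

# the two classifications agree for every possible byte value
assert all(is_continuation(b) == (0x80 <= b <= 0xBF) for b in range(256))
```

This is exactly the portability caveat above: the single-comparison form is only valid where signed chars are two's complement.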
msg159132 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-04-24 11:22
I'm -1 on using signed char in the implementation. If this gives any advantage, it's because the compiler is not able to generate as efficient code for unsigned char as it does for signed char. So the performance results may again change if you switch compilers, or use the next compiler version. The code should do what is *logically* correct; IMO, UTF-8 is really a sequence of unsigned bytes, conceptually. So if you want to demonstrate any performance improvements, you need to do so with unsigned chars.
msg159134 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-24 11:53
> I'm -1 on using signed char in the implementation.

I completely agree with you, for these and for other unmentioned reasons. That is why I did not release this patch yesterday, and did not propose it for acceptance. I showed it just out of curiosity: is the effect stronger on a 64-bit platform? Although this technique will not be accepted, it sets the bar for what can be achieved (if it's worth it).
msg159174 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-24 17:46
Here are two new patches. The first one addresses Martin's wishes about comments. The second also drops the optimization for ASCII. On the Intel Atom the last patch wipes out the speed-up for some cases (mostly-ASCII text with UCS2 data):

                            vanilla        patch1       patch3
utf-8  'A'*9999+'\u0100'  124  (+8%)   288  (-53%)    134
utf-8  'A'*9999+'\u8000'  124  (+8%)   291  (-54%)    134
utf-8  '\u0100'+'A'*9999   78  (+5%)   123  (-33%)     82
utf-8  '\u8000'+'A'*9999   78  (+5%)   124  (-34%)     82

On the AMD Athlon there is no noticeable effect.
msg159907 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-04 07:08
> title: More fast utf-8 decoding -> Faster utf-8 decoding Éric, there is already an issue (#4868) with this title.
msg159908 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-04 07:24
There is nothing wrong with two issues having the same title. Of course, it would be best if the title reflected the *actual* defect or change, such as "specialize UTF-8 decoding by character width", or some such. In any case, the title change is desirable since the original title was ungrammatical. If you wanted to point out that this really is a further, escalated improvement, then "Even faster utf-8 decoding", "amazingly faster UTF-8 decoding", or "unbelievably faster utf-8 decoding" could have worked :-)
msg159910 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-04 07:59
Thank you, Martin, this is what I had in mind. Lost in translation. ;)
msg159944 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-04 16:49
64-bit Linux, Intel Core i5-2500K CPU @ 3.30GHz:

                                   vanilla 3.3      patch 2     patch 3
utf-8  'A'*10000                  6931   (+3%)   7115  (+0%)      7117
utf-8  'A'*9999+'\x80'            2347   (+1%)   2410  (-2%)      2360
utf-8  'A'*9999+'\u0100'          2279   (+1%)   2282  (+1%)      2310
utf-8  'A'*9999+'\u8000'          2264   (+2%)   2275  (+1%)      2300
utf-8  'A'*9999+'\U00010000'      2351   (+0%)   2283  (+3%)      2359
utf-8  '\x80'*10000                516   (+8%)    558  (+0%)       559
utf-8  '\x80'+'A'*9999             859   (+0%)    868  (-1%)       860
utf-8  '\x80'*9999+'\u0100'        526   (+6%)    558  (+0%)       558
utf-8  '\x80'*9999+'\u8000'        535   (+4%)    558  (+0%)       558
utf-8  '\x80'*9999+'\U00010000'    525   (+6%)    559  (-0%)       558
utf-8  '\u0100'*10000              517   (+6%)    548  (+0%)       548
utf-8  '\u0100'+'A'*9999           818   (+0%)    820  (+0%)       821
utf-8  '\u0100'+'\x80'*9999        517   (+6%)    548  (+0%)       548
utf-8  '\u0100'*9999+'\u8000'      525   (+4%)    548  (+0%)       548
utf-8  '\u0100'*9999+'\U00010000'  517   (+6%)    549  (+0%)       549
utf-8  '\u8000'*10000              490   (-8%)    433  (+4%)       451
utf-8  '\u8000'+'A'*9999           818   (+0%)    819  (+0%)       821
utf-8  '\u8000'+'\x80'*9999        529   (+4%)    548  (+0%)       548
utf-8  '\u8000'+'\u0100'*9999      529   (+4%)    548  (+0%)       548
utf-8  '\u8000'*9999+'\U00010000'  470   (-4%)    451  (+0%)       451
utf-8  '\U00010000'*10000          554  (-18%)    427  (+6%)       453
utf-8  '\U00010000'+'A'*9999       938   (+0%)    927  (+2%)       941
utf-8  '\U00010000'+'\x80'*9999    572   (+4%)    595  (+0%)       595
utf-8  '\U00010000'+'\u0100'*9999  571   (+4%)    595  (+0%)       595
utf-8  '\U00010000'+'\u8000'*9999  503   (-4%)    481  (+0%)       482
msg160011 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-05 17:50
Well, it seems 64-bit processors are smart enough not to feel the need for this optimization. On 32-bit platforms I see a noticeable increase in speed. I am now working on a more advanced optimization, which currently shows a gain of +20-60% compared with the previous patches, but I hope to increase the gain to +50-100%.
msg160020 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-05 19:02
I'll be closing this issue at this point. Serhiy: I don't think the bug tracker should be used to evolve work in progress (except when responding to reviews received). Use a Mercurial clone for that instead. By posting a patch here, you are requesting that it be reviewed and considered - please understand that you consume a lot of people's time by such a posting. At this point, it appears that you don't intend to submit any of these patches for inclusion into Python. If you ever do want to contribute something in this area, please create a new issue.
msg160025 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-05 19:19
> I'll be closing this issue at this point. Serhiy: I don't think the > bug tracker should be used to evolve work in progress (except when > responding to reviews received). Use a Mercurial clone for that > instead. By posting a patch here, you are requesting that it be > reviewed and considered - please understand that you consume a lot of > people's time by such a posting. That's not very nice. If Serhiy wants feedback on his work, he definitely has to post *somewhere*. The bug tracker sounds like a reasonable place (certainly more reasonable than python-dev).
msg160043 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-05 22:06
> That's not very nice. If Serhiy wants feedback on his work, he > definitely has to post *somewhere*. The bug tracker sounds like a > reasonable place (certainly more reasonable than python-dev). I completely disagree (and I really tried to be nice). It is my utmost belief that the tracker must not be used for work-in-progress. For any open issue, numerous people review the issue, and even if they spend only a few minutes, this easily adds up to a lot of wasted time if there isn't anything to be done about an issue. OTOH, discussing it on python-dev indeed seems more appropriate: even though the readership is larger, people know that they can safely skip over messages that clearly don't need their attention. So if Serhiy posts a message titled "UTF-8 performance", people will hit the delete button very quickly if they are not interested. However, it would really be best in this case if Serhiy takes a step back, and analyzes the performance of the current decoder carefully, then proposes a patch which undoubtedly improves the performance and is meanwhile also maintainable. He may come to the conclusion that further improvement isn't really possible or reasonable, in which case it would be good if he posted his findings to python-dev.
msg160082 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2012-05-06 13:37
I understand Martin's point, but I think 95% of the issues in the bug tracker are "work in progress", mine included. Maybe the issue is that Serhiy hasn't made a concrete proposal to be tested & integrated. It seems to be more of an exploratory work. I am on the nosy list because I am interested in this work.
msg160102 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-06 17:51
Martin, sorry to have wasted your time. I understand that you are busy, so I was not too worried about not receiving feedback for ten days.

> At this point, it appears that you don't intend to submit any of these patches for inclusion into Python.

I'm at a loss. What gives that impression? I react quickly to comments, responding with new patches. I take your comments into account, and when I disagree, I back up my opinion with benchmarks. I submit only clean and well-tested code. I provide tools for benchmarking. What am I doing wrong? Maybe my bad English has been misunderstood?
msg160185 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-08 05:02
See for advanced optimization.
History
Date User Action Args
2022-04-11 14:57:29 admin set github: 58859
2012-05-08 05:02:12 serhiy.storchaka set messages: +
2012-05-08 03:29:09 eric.araujo set nosy: + eric.araujo
2012-05-06 17:51:16 serhiy.storchaka set messages: +
2012-05-06 13:37:27 jcea set messages: +
2012-05-05 22:06:18 loewis set messages: +
2012-05-05 19:19:54 pitrou set messages: +
2012-05-05 19:02:13 loewis set status: open -> closed; resolution: rejected; messages: +
2012-05-05 17:50:47 serhiy.storchaka set messages: +
2012-05-04 16:49:54 pitrou set messages: +
2012-05-04 07:59:42 serhiy.storchaka set messages: +
2012-05-04 07:24:16 loewis set messages: +
2012-05-04 07:08:26 serhiy.storchaka set messages: +
2012-05-04 02:21:22 eric.araujo set title: More fast utf-8 decoding -> Faster utf-8 decoding
2012-04-24 17:46:16 serhiy.storchaka set files: + decode_utf8_2.patch, decode_utf8_3.patch; messages: +
2012-04-24 11:53:59 serhiy.storchaka set messages: +
2012-04-24 11:22:06 loewis set nosy: + loewis; messages: +
2012-04-24 08:37:51 serhiy.storchaka set files: + utf8-signed.diff
2012-04-24 08:27:32 serhiy.storchaka set messages: +
2012-04-24 04:52:26 Arfrever set nosy: + Arfrever
2012-04-23 21:59:38 jcea set nosy: + jcea
2012-04-23 21:33:12 vstinner set messages: +
2012-04-23 21:23:13 pitrou set messages: +
2012-04-23 21:06:57 serhiy.storchaka set messages: +
2012-04-23 21:04:06 serhiy.storchaka create