msg279419
Author: Xiang Zhang (xiang.zhang)
Date: 2016-10-25 16:16
Currently the UTF-7 encoder uses an aggressive memory allocation strategy: it assumes the worst case of 8 output bytes per character. We can tighten that worst case: for 1-byte and 2-byte unicodes it is 3*n + 2, and for 4-byte unicodes it is 6*n + 2.

There are two cases to consider. First, if every character needs to be encoded, the result length is upper_round(2.67*n) + 2 <= 3*n + 2. Second, if encoded and unencoded characters alternate one by one, the result is 3*n (< 3*n + 2) for even n and exactly 3*n + 2 for odd n.

This doesn't help much for short strings, but it speeds up encoding of long ones.

Without patch:

[bin]$ ./python3 -m perf timeit -s 's = "abc"*10' 's.encode("utf7")'
....................
Median +- std dev: 2.79 us +- 0.09 us

[bin]$ ./python3 -m perf timeit -s 's = "abc"*100' 's.encode("utf7")'
....................
Median +- std dev: 4.55 us +- 0.13 us

[bin]$ ./python3 -m perf timeit -s 's = "abc"*1000' 's.encode("utf7")'
....................
Median +- std dev: 14.0 us +- 0.4 us

[bin]$ ./python3 -m perf timeit -s 's = "abc"*10000' 's.encode("utf7")'
....................
Median +- std dev: 178 us +- 1 us

With patch:

[bin]$ ./python3 -m perf timeit -s 's = "abc"*10' 's.encode("utf7")'
....................
Median +- std dev: 2.87 us +- 0.09 us

[bin]$ ./python3 -m perf timeit -s 's = "abc"*100' 's.encode("utf7")'
....................
Median +- std dev: 4.50 us +- 0.23 us

[bin]$ ./python3 -m perf timeit -s 's = "abc"*1000' 's.encode("utf7")'
....................
Median +- std dev: 13.3 us +- 0.4 us

[bin]$ ./python3 -m perf timeit -s 's = "abc"*10000' 's.encode("utf7")'
....................
Median +- std dev: 102 us +- 1 us

The patch also removes a check: base64bits can only be non-zero when inShift is non-zero.
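
As a quick sanity check of the arithmetic above, here is a standalone Python sketch (it is not the C patch itself; the helper name utf7_upper_bound is made up for illustration):

    def utf7_upper_bound(s):
        # In a base64 shift run, a BMP character contributes 16 bits of UTF-16,
        # i.e. at most ceil(16/6) = 3 output bytes; a non-BMP character is a
        # surrogate pair (32 bits), i.e. at most ceil(32/6) = 6 output bytes.
        # Add 2 bytes for the '+' ... '-' shift markers.
        per_char = 3 if max(map(ord, s), default=0) <= 0xFFFF else 6
        return per_char * len(s) + 2

    # All-direct, all-encoded, alternating, and non-BMP inputs stay within the bound.
    for s in ["abc" * 100, "\u4e2d" * 99, "a\u4e2d" * 50, "\U0001f600" * 40]:
        assert len(s.encode("utf7")) <= utf7_upper_bound(s)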
|
|
msg279550
Author: Xiang Zhang (xiang.zhang)
Date: 2016-10-27 17:16
v2 uses _PyBytesWriter so we can use an on-stack buffer for short strings.
|
|
msg279556
Author: Serhiy Storchaka (serhiy.storchaka)
Date: 2016-10-27 18:03
The performance of the UTF-7 codec is not important. Unlike the other UTF-* encodings, this is not a standard Unicode encoding. It is used in a minority of applications and is unlikely to be a bottleneck. It is more in the line of idna and punycode than of UTF-8 and UTF-32. Actually I'm going to propose replacing it with a Python implementation.

This encoder was omitted from the _PyBytesWriter-using optimizations on purpose. The patch complicates the implementation. Since the codec is rarely used, some bugs lived in it for a long time. Any change risks adding a new long-lived bug.
|
|
msg279574
Author: Xiang Zhang (xiang.zhang)
Date: 2016-10-28 03:33
Actually the patch is not meant to speed up the encoder but to improve the memory allocation strategy, i.e. make the memory upper bound tighter. The speedup is just a nice side effect.

> It is more in the line of idna and punycode than of UTF-8 and UTF-32.

Agreed.

> This encoder was omitted from the _PyBytesWriter-using optimizations on purpose. The patch complicates the implementation.

Not much. I think the implementation still remains simple.
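
To illustrate the tighter upper bound, here is a rough Python sketch (not part of the patch; the old estimate of 8 bytes per character is taken from msg279419 above):

    s = "abc" * 10000                    # same ASCII-heavy input as the benchmarks
    actual = len(s.encode("utf7"))       # what the codec really produces
    old_bound = 8 * len(s)               # old allocation: worst case of 8 bytes per char
    new_bound = 3 * len(s) + 2           # tightened bound for 1- and 2-byte kinds
    print(actual, new_bound, old_bound)  # actual <= new_bound <= old_bound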
|
|
msg281134
Author: STINNER Victor (vstinner)
Date: 2016-11-18 15:56
Serhiy Storchaka: "The performance of the UTF-7 codec is not important."

Right.

"Actually I'm going to propose replacing it with a Python implementation."

Oh. Sadly, PyUnicode_DecodeUTF7() is part of the stable ABI. Do you want to call the Python codec from the C function for backward compatibility? I dislike UTF-7 because it's complex, but it's not as optimized as the UTF-8 codec, so the code remains not too big and so not too expensive to maintain.

"This encoder was omitted from the _PyBytesWriter-using optimizations on purpose."

Ah? I don't recall that. When I wrote _PyBytesWriter, I skipped UTF-7 because I don't know this codec well and I preferred to keep the code unchanged to avoid bugs :-)

"The patch complicates the implementation."

Hum, I have to disagree. For me, the patched code is no more complex than the current code. The main change is that it adds code checking the kind to better estimate the output length. It's not hard to understand the link between the Unicode kind and the max_char_size.

I vote +1 on this patch because I consider that it makes the code simpler, not because it makes the codec faster (I don't really care about UTF-7 codec performance). But again (as in issue #28398), it's up to you, Serhiy: I'm also ok with leaving the code unchanged if you are against the patch.
|
|
msg281143
Author: Serhiy Storchaka (serhiy.storchaka)
Date: 2016-11-18 16:40
I fixed many long-lived bugs in the UTF-7 codec in the past, and I remember that we fixed bugs introduced by using _PyUnicodeWriter or _PyBytesWriter many months after changing the code. Since the UTF-7 codec is rarely used, there is a risk of introducing a new long-lived bug. You should peruse not just the code near the changed lines, but the whole codec.

I'm not strongly against this patch. If you, Victor, take responsibility for it, I leave it to you.
|
|
msg281144
Author: STINNER Victor (vstinner)
Date: 2016-11-18 16:44
Oh no, now I'm afraid of breaking something :-D I no longer trust our test suite for the UTF-7 codec, so I'm closing the issue :-) Sorry Xiang, but as we said, this rarely used codec is not important enough to justify the optimization.
|
|
msg281145
Author: STINNER Victor (vstinner)
Date: 2016-11-18 16:45
> I remember that we fixed bugs introduced by using _PyUnicodeWriter or _PyBytesWriter many months after changing the code.

Yeah, now I recall it (vaguely); that's why I closed the bug :-)
|
|