Issue 13136: speed-up conversion between unicode widths

Created on 2011-10-08 22:18 by pitrou, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name                    Uploaded
unroll_convert_bytes.patch   pitrou, 2011-10-08 22:18
unroll.c                     loewis, 2011-10-09 03:06
Messages (8)
msg145192 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-08 22:18

This patch speeds up _PyUnicode_CONVERT_BYTES by unrolling its loop.

Example micro-benchmark:

./python -m timeit -s "a='x'*10000;b='\u0102'*1000;c='\U00100000'" "a+b+c"

-> before:
100000 loops, best of 3: 14.9 usec per loop
-> after:
100000 loops, best of 3: 9.19 usec per loop
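For readers without the patch handy, here is a minimal sketch of the idea (not the actual unroll_convert_bytes.patch; the function and type names are made up for illustration): widening 1-byte code points to 2-byte code points, first with a plain loop roughly in the spirit of the original macro, then with a manually 4x-unrolled loop that reduces per-iteration overhead.

    /* Sketch only -- not the code from unroll_convert_bytes.patch. */
    #include <stddef.h>
    #include <stdint.h>

    /* Straightforward widening copy. */
    static void
    convert_plain(const uint8_t *in, uint16_t *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = in[i];
    }

    /* Manually unrolled variant: four elements per iteration, then a
       short tail loop for the remaining 0-3 elements. */
    static void
    convert_unrolled(const uint8_t *in, uint16_t *out, size_t n)
    {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            out[i]     = in[i];
            out[i + 1] = in[i + 1];
            out[i + 2] = in[i + 2];
            out[i + 3] = in[i + 3];
        }
        for (; i < n; i++)
            out[i] = in[i];
    }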
msg145193 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-10-08 22:29

Antoine Pitrou wrote:
>
> New submission from Antoine Pitrou <pitrou@free.fr>:
>
> This patch speeds up _PyUnicode_CONVERT_BYTES by unrolling its loop.
>
> Example micro-benchmark:
>
> ./python -m timeit -s "a='x'*10000;b='\u0102'*1000;c='\U00100000'" "a+b+c"
>
> -> before:
> 100000 loops, best of 3: 14.9 usec per loop
> -> after:
> 100000 loops, best of 3: 9.19 usec per loop

Before going further with this, I'd suggest you have a look at your compiler settings. Such optimizations are normally performed by the compiler and don't need to be implemented in C, making maintenance harder. The fact that Windows doesn't exhibit the same performance difference suggests that the optimizer is not using the same level or feature set as on Linux. MSVC is at least as good at optimizing code as gcc, often better.

I tested using memchr() when writing those "naive" loops. It turned out that using memchr() was slower than using the direct loops. memchr() is inlined by the compiler just like the direct loop and the generated code for the direct version is often easier to optimize for the compiler than the memchr() one, since it receives more knowledge about the used data types.
msg145194 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-08 22:34

> Before going further with this, I'd suggest you have a look at your
> compiler settings.

They are set by the configure script:

gcc -pthread -c -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -I./Include -DPy_BUILD_CORE -o Objects/unicodeobject.o Objects/unicodeobject.c

> Such optimizations are normally performed by the compiler and don't need
> to be implemented in C, making maintenance harder.

The fact that the glibc includes such optimization (in much more sophisticated form) suggests to me that many compilers don't perform these optimizations automatically.

> I tested using memchr() when writing those "naive" loops.

memchr() is mentioned in another issue, #13134.

> memchr() is inlined by the compiler just like the direct loop

I don't think so. If you look at the glibc's memchr() implementation, it's a sophisticated routine, not a trivial loop. Perhaps you're thinking about memcpy().

> and the generated code for the direct version is often easier to optimize
> for the compiler than the memchr() one, since it receives more knowledge
> about the used data types.

?? Data types are fixed in the memchr() definition, there's no knowledge to be gained by inlining.
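To illustrate why a glibc-style memchr() is more than a trivial byte loop, here is a simplified sketch of the word-at-a-time technique such implementations use (this is not glibc's actual code, just an illustration of the general approach):

    /* Simplified word-at-a-time scan in the spirit of glibc's memchr()
       (illustrative sketch, not glibc's implementation). */
    #include <stddef.h>
    #include <stdint.h>

    static const void *
    memchr_sketch(const void *s, int c, size_t n)
    {
        const unsigned char *p = s;
        const unsigned char target = (unsigned char)c;

        /* Byte loop until the pointer is word-aligned. */
        while (n > 0 && ((uintptr_t)p & (sizeof(uint64_t) - 1)) != 0) {
            if (*p == target)
                return p;
            p++;
            n--;
        }

        /* Scan eight bytes at a time; the bit trick flags any byte
           equal to the target. */
        const uint64_t pattern = target * UINT64_C(0x0101010101010101);
        while (n >= sizeof(uint64_t)) {
            uint64_t word = *(const uint64_t *)(const void *)p ^ pattern;
            if ((word - UINT64_C(0x0101010101010101)) & ~word
                & UINT64_C(0x8080808080808080))
                break;  /* a matching byte is somewhere in this word */
            p += sizeof(uint64_t);
            n -= sizeof(uint64_t);
        }

        /* Locate the match (or finish the tail) byte by byte. */
        while (n > 0) {
            if (*p == target)
                return p;
            p++;
            n--;
        }
        return NULL;
    }

The alignment handling and the per-word bit test are what make such a routine "sophisticated" rather than a trivial loop.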
msg145200 - Author: Meador Inge (meador.inge) * (Python committer) Date: 2011-10-09 00:49

On Sat, Oct 8, 2011 at 5:34 PM, Antoine Pitrou <report@bugs.python.org> wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
>> Before going further with this, I'd suggest you have a look at your
>> compiler settings.
>
> They are set by the configure script:
>
> gcc -pthread -c -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall
> -Wstrict-prototypes -I. -I./Include -DPy_BUILD_CORE -o
> Objects/unicodeobject.o Objects/unicodeobject.c
>
>> Such optimizations are normally performed by the
>> compiler and don't need to be implemented in C, making maintenance
>> harder.
>
> The fact that the glibc includes such optimization (in much more
> sophisticated form) suggests to me that many compilers don't perform
> these optimizations automatically.

I agree. This is more of an optimized runtime library problem than a code optimization problem.

>> I tested using memchr() when writing those "naive" loops.
>
> memchr() is mentioned in another issue, #13134.

Yeah, this conversation is really more relevant to #13134, but I will reply to these here anyway ...

>> memchr() is inlined by the compiler just like the direct loop
>
> I don't think so. If you look at the glibc's memchr() implementation,
> it's a sophisticated routine, not a trivial loop. Perhaps you're
> thinking about memcpy().

Without link-time optimization enabled, I doubt the toolchain can "inline" 'memchr' in the traditional sense (i.e. inserting the body of the routine inline). Even if it could, the inline heuristics would most likely choose not to. I don't think we use LTO with GCC, but I think we might with VC++.

GCC does something else, called builtin folding, which is more likely. For example, 'memchr("bca", 'c', 3)' gets replaced with instructions to compute a pointer index into "bca". This won't happen in this case because all of the 'memchr' arguments are variable.

>> and the generated code for the direct version is often easier to optimize
>> for the compiler than the memchr() one, since it receives more knowledge
>> about the used data types.
>
> ?? Data types are fixed in the memchr() definition, there's no knowledge
> to be gained by inlining.

I think what Marc-Andre is alluding to is that the first parameter of 'memchr' is 'void *', which could (in theory) limit optimization opportunities, whereas if it knew that the data being searched is a 'char *' or something, it could take advantage of that.
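A hypothetical example of the builtin-folding point (not code from CPython or from the attached patch): when every argument is a compile-time constant, GCC can typically fold the memchr() call away; with run-time arguments, a real library call remains.

    /* Hypothetical illustration of GCC folding a constant memchr() call. */
    #include <string.h>

    const char *folded(void)
    {
        /* All arguments are compile-time constants: GCC typically folds
           this to a pointer into the string literal, emitting no call. */
        return memchr("bca", 'c', 3);
    }

    const char *not_folded(const char *buf, int ch, size_t len)
    {
        /* Variable arguments: a real call to memchr() is emitted. */
        return memchr(buf, ch, len);
    }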
msg145205 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-10-09 03:06

Marc-Andre: gcc will normally not unroll loops, unless -funroll-loops is given on the command line. Then, it will unroll many loops, and do so with 8 iterations per outer loop. This typically causes significant code bloat, which is why unrolling is normally disabled and left to the programmer.

For those who want to experiment with this, I attach a C file with just the code in question. Compile this with your favorite compiler settings, and see what the compiler generates.

clang, on an x64 system, compiles the original loop into

    LBB0_2:                                 ## =>This Inner Loop Header: Depth=1
            movzbl  (%rdi), %eax
            movw    %ax, (%rdx)
            incq    %rdi
            addq    $2, %rdx
            decq    %rsi
            jne     LBB0_2

and the unrolled loop into

    LBB1_2:                                 ## %.lr.ph6
                                            ## =>This Inner Loop Header: Depth=1
            movzbl  (%rdi,%rcx), %r8d
            movw    %r8w, (%rdx)
            movzbl  1(%rdi,%rcx), %r8d
            movw    %r8w, 2(%rdx)
            movzbl  2(%rdi,%rcx), %r8d
            movw    %r8w, 4(%rdx)
            movzbl  3(%rdi,%rcx), %r8d
            movw    %r8w, 6(%rdx)
            addq    $8, %rdx
            addq    $4, %rcx
            cmpq    %rax, %rcx
            jl      LBB1_2
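As a side note (not something raised in this thread), newer GCC releases (8 and later) allow requesting unrolling for a single loop instead of enabling -funroll-loops for the whole build; a hypothetical sketch:

    /* Hypothetical sketch: per-loop unrolling request, GCC 8+ only.
       This pragma is not used in CPython or in the patch above. */
    void
    widen(const unsigned char *in, unsigned short *out, long n)
    {
    #pragma GCC unroll 4
        for (long i = 0; i < n; i++)
            out[i] = in[i];
    }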
msg145252 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-10-09 11:33

Antoine Pitrou wrote:
>> I tested using memchr() when writing those "naive" loops.
>
> memchr() is mentioned in another issue, #13134.

Looks like I posted the comment to the wrong ticket.
msg145360 - Author: Roundup Robot (python-dev) (Python triager) Date: 2011-10-11 19:07

New changeset 5b077c962a16 by Antoine Pitrou in branch 'default':
Issue #13136: speed up conversion between different character widths.
http://hg.python.org/cpython/rev/5b077c962a16
msg145361 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-11 19:08
Patch committed. It is of course still not as fast as memcpy, but it's a small step towards improving performance.
History
Date User Action Args
2022-04-11 14:57:22 admin set github: 57345
2011-10-11 19:08:08 pitrou set status: open -> closed; resolution: fixed; messages: +; stage: patch review -> resolved
2011-10-11 19:07:13 python-dev set nosy: + python-dev; messages: +
2011-10-09 11:33:30 lemburg set messages: +
2011-10-09 03:06:10 loewis set files: + unroll.c; nosy: + loewis; messages: +
2011-10-09 00:49:51 meador.inge set nosy: + meador.inge; messages: +
2011-10-08 22:34:50 pitrou set messages: +
2011-10-08 22:29:12 lemburg set nosy: + lemburg; messages: +
2011-10-08 22:18:30 pitrou create