Issue 13134: speed up finding of one-character strings

Created on 2011-10-08 20:29 by pitrou, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name       Uploaded                  Description
memchr.patch    pitrou, 2011-10-08 20:29
memchr2.patch   pitrou, 2011-10-08 20:55
Messages (11)
msg145186 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-08 20:29
In stringlib/fastsearch, instead of using our own loops, we can use memchr() (and, if available, memrchr()) when looking for one-character strings. memchr() is generally quite optimized; on this Linux/glibc machine, I get speedups of 5x-10x:

./python -m timeit -s "large='a'*10000+'b'" "large.find('b')"
-> before: 10.5 usec per loop
-> after: 0.889 usec per loop

./python -m timeit -s "large='b'+'a'*10000" "large.rfind('b')"
-> before: 7.06 usec per loop
-> after: 1.94 usec per loop
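For illustration, a minimal C sketch of the approach (hypothetical helper names; the actual patch hooks this into the fastsearch machinery in Objects/stringlib/fastsearch.h, and CPython's configure defines HAVE_MEMRCHR where the GNU memrchr() extension is available):

    /* Compile with _GNU_SOURCE defined to get the memrchr() prototype. */
    #include <string.h>
    #include <stddef.h>

    /* Forward search: delegate to memchr(), which glibc implements with
       word-at-a-time and vectorized scanning instead of a byte loop. */
    static ptrdiff_t
    find_char_1byte(const unsigned char *s, size_t n, unsigned char ch)
    {
        const unsigned char *p = memchr(s, ch, n);
        return p != NULL ? p - s : -1;
    }

    /* Reverse search: memrchr() is a GNU extension, so fall back to a
       plain backwards loop where it is missing. */
    static ptrdiff_t
    rfind_char_1byte(const unsigned char *s, size_t n, unsigned char ch)
    {
    #ifdef HAVE_MEMRCHR
        const unsigned char *p = memrchr(s, ch, n);
        return p != NULL ? p - s : -1;
    #else
        while (n > 0) {
            if (s[--n] == ch)
                return (ptrdiff_t)n;
        }
        return -1;
    #endif
    }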
msg145187 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-08 20:39
With MSVC there seems to be no difference, meaning Windows probably has a naïve memchr() implementation.
msg145189 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-08 20:55
New patch with a couple of tests.
msg145251 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-10-09 11:23
[Posted the reply to the right ticket; see the other ticket for the original post]

Antoine Pitrou wrote:
>
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
>> Before going further with this, I'd suggest you have a look at your
>> compiler settings.
>
> They are set by the configure script:
>
> gcc -pthread -c -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall
> -Wstrict-prototypes -I. -I./Include -DPy_BUILD_CORE -o
> Objects/unicodeobject.o Objects/unicodeobject.c

Which gcc version are you using? Is it possible that you have -fno-builtin enabled?

>> Such optimizations are normally performed by the
>> compiler and don't need to be implemented in C, making maintenance
>> harder.
>
> The fact that the glibc includes such optimization (in much more
> sophisticated form) suggests to me that many compilers don't perform
> these optimizations automatically.

When using gcc, the glibc functions are usually not used at all, since gcc comes with a (rather large) set of builtins which are inlined directly, if you have optimizations enabled and inlining is found to be more efficient than calling the glibc function:

http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

glibc includes the optimized versions since it has to implement the C library (obviously) and for cases where inlining does not happen.

>> I tested using memchr() when writing those "naive" loops.
>
> memchr() is mentioned in another issue, #13134.
>
>> memchr()
>> is inlined by the compiler just like the direct loop
>
> I don't think so. If you look at the glibc's memchr() implementation,
> it's a sophisticated routine, not a trivial loop. Perhaps you're
> thinking about memcpy().

See http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html and the assembler output. If it's not inlined, then something must be preventing this and it would be good to find out why.

>> and the generated
>> code for the direct version is often easier to optimize for the compiler
>> than the memchr() one, since it receives more knowledge about the used
>> data types.
>
> ?? Data types are fixed in the memchr() definition, there's no knowledge
> to be gained by inlining.

There is: the compiler will have alignment information available and can also benefit from using registers instead of the stack, knowledge about processor cache lines, etc. Such information is lost when calling a function. The function call itself will also create some overhead.

BTW: You should not only test the optimization with long strings, but also with short ones (e.g. 2-15 chars), which is a much more common case in practice.
msg145279 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-09 21:21
> >> Before going further with this, I'd suggest you have a look at your
> >> compiler settings.
> >
> > They are set by the configure script:
> >
> > gcc -pthread -c -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall
> > -Wstrict-prototypes -I. -I./Include -DPy_BUILD_CORE -o
> > Objects/unicodeobject.o Objects/unicodeobject.c
>
> Which gcc version are you using ?

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-mageia-linux-gnu/4.5.2/lto-wrapper
Target: x86_64-mageia-linux-gnu
Configured with: ../configure --prefix=/usr --libexecdir=/usr/lib --with-slibdir=/lib64 --with-bugurl=http://bugs.mageia.org/ --mandir=/usr/share/man --infodir=/usr/share/info --enable-checking=release --enable-languages=c,c++,ada,fortran,objc,obj-c++,java --build=x86_64-mageia-linux-gnu --host=x86_64-mageia-linux-gnu --with-cpu=generic --with-system-zlib --enable-threads=posix --enable-shared --enable-objc-gc --enable-long-long --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --enable-java-awt=gtk --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-gtk-cairo --disable-libjava-multilib --enable-ssp --disable-libssp --disable-werror --with-ppl --with-cloog --with-python-dir=/lib/python2.7/site-packages --enable-lto
Thread model: posix
gcc version 4.5.2 (GCC)

> Is it possible that you have -fno-builtin enabled ?

Why would it be enabled? It is not on the command line.

> >> Such optimizations are normally performed by the
> >> compiler and don't need to be implemented in C, making maintenance
> >> harder.
> >
> > The fact that the glibc includes such optimization (in much more
> > sophisticated form) suggests to me that many compilers don't perform
> > these optimizations automatically.
>
> When using gcc, the glibc functions are usually not used at all,
> since gcc comes with a (rather large) set of builtins which are
> inlined directly, if you have optimizations enabled and inlining
> is found to be more efficient than calling the glibc function:

What would that change? Whether the optimized memchr() comes from gcc or the glibc is not relevant here.

> There is: the compiler will have alignment information available and
> can also benefit from using registers instead of the stack, knowledge
> about processor cache lines, etc. Such information is lost when calling
> a function. The function call itself will also create some overhead.
>
> BTW: You should not only test the optimization with long strings, but also
> with short ones (e.g. 2-15 chars) - which is a much more common case
> in practice.

With very short strings, the runtimes tend to be identical, which is understandable.
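One way to check the inlining claim empirically (a quick experiment, not part of the patch): compile a trivial wrapper and inspect the generated assembly.

    /* memchr_probe.c -- build with: gcc -O3 -S memchr_probe.c
       If the resulting memchr_probe.s contains a "call memchr"
       instruction, gcc emitted a library call; open-coded scanning
       instructions would indicate the builtin was expanded inline. */
    #include <string.h>

    const char *
    first_b(const char *buf, size_t n)
    {
        return memchr(buf, 'b', n);
    }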
msg145357 - Author: Roundup Robot (python-dev) (Python triager) Date: 2011-10-11 18:44
New changeset af4a89f4f466 by Antoine Pitrou in branch 'default':
Issue #13134: optimize finding single-character strings using memchr
http://hg.python.org/cpython/rev/af4a89f4f466
msg145359 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-11 18:45
I went ahead and committed.
msg145367 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-10-11 21:23
changeset:   72874:bfd3fcfb02f3
user:        Victor Stinner <victor.stinner@haypocalc.com>
date:        Tue Oct 11 23:22:22 2011 +0200
files:       Objects/stringlib/asciilib.h Objects/stringlib/fastsearch.h Objects/stringlib/stringdefs.h Objects/stringlib/ucs1lib.h Objects/stri
description:
Fix fastsearch for UCS2 and UCS4

* If needle is 0, try (p[0] >> 16) & 0xff for UCS4
* Disable fastsearch_memchr_1char() if needle is zero for UCS2 and UCS4
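The problem this changeset guards against can be sketched for UCS2 as follows (hypothetical function name, little-endian layout assumed; the real logic lives in fastsearch_memchr_1char() in Objects/stringlib/fastsearch.h). memchr() scans individual bytes, so a two-byte code unit is located via its low byte and every hit must then be verified; when that low byte is zero, memchr() stops inside nearly every code unit of typical text, so the fast path is skipped:

    #include <string.h>
    #include <stdint.h>
    #include <stddef.h>

    typedef uint16_t Py_UCS2;   /* stand-in for CPython's typedef */

    /* Returns the index of ch in s[0..n-1], or -1 if absent.  Assumes
       (ch & 0xff) != 0: since e.g. 'a' == 0x0061 is stored little-endian
       as bytes 61 00, a zero search byte would match the high byte of
       almost every ASCII-range unit, and the caller should use the plain
       per-unit loop instead. */
    static ptrdiff_t
    ucs2_find_char(const Py_UCS2 *s, size_t n, Py_UCS2 ch)
    {
        const unsigned char *start = (const unsigned char *)s;
        const unsigned char *end = start + n * sizeof(Py_UCS2);
        unsigned char low = (unsigned char)(ch & 0xff);
        const unsigned char *p = start;

        while (p < end && (p = memchr(p, low, end - p)) != NULL) {
            size_t byte_off = p - start;
            /* Accept only hits aligned on a code-unit boundary whose
               full 16-bit value matches; otherwise skip this byte. */
            if (byte_off % sizeof(Py_UCS2) == 0
                && s[byte_off / sizeof(Py_UCS2)] == ch)
                return (ptrdiff_t)(byte_off / sizeof(Py_UCS2));
            p++;
        }
        return -1;
    }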
msg145451 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-10-13 12:22
I think the >1 character sizes are overly complex in this patch, and still memchr isn't typically used for them. So I suggest to simplify the code and restrict it to 1-byte chars only.
msg145454 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-10-13 14:21
> I think the >1 character sizes are overly complex in this patch, and
> still memchr isn't typically used for them. So I suggest to simplify
> the code and restrict it to 1-byte chars only.

I would rather propose to simplify the needle heuristic and only use it when the lower byte is non-zero. A properly optimized memchr() (as in the glibc / gcc) is definitely faster than our naïve loop.
msg145467 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-10-13 16:05
> I would rather propose to simplify the needle heuristic and only use it
> when the lower byte is non-zero. A properly optimized memchr() (as in
> the glibc / gcc) is definitely faster than our naïve loop.

That would be fine as well. Not sure if a heuristic would be needed in this case at all: it's probably uncommon to search for a single character whose lower half is 0 (most likely you are then searching for the null character, and not, say, LATIN CAPITAL LETTER A WITH DOUBLE GRAVE). In any case, I still think that the heuristic (if any) needs to be explained better, and needs some justification in the first place.
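To make the excluded case concrete (an illustrative check, not code from the patch): the zero-low-byte condition matches exactly one code point out of every 256.

    #include <stdint.h>

    /* The fast path is skipped for code points whose low byte is zero:
       U+0000 (NUL), U+0100 (LATIN CAPITAL LETTER A WITH MACRON),
       U+0200 (LATIN CAPITAL LETTER A WITH DOUBLE GRAVE), ... -- one
       code point per block of 256, hence rarely searched for. */
    static int
    memchr_fast_path_ok(uint32_t ch)
    {
        return (ch & 0xff) != 0;
    }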
History
Date User Action Args
2022-04-11 14:57:22 admin set github: 57343
2011-10-13 16:05:32 loewis set messages: +
2011-10-13 14:21:12 pitrou set messages: +
2011-10-13 12:22:59 loewis set messages: +
2011-10-11 21:23:26 vstinner set messages: +
2011-10-11 18:45:51 pitrou set status: open -> closed, resolution: fixed, messages: +, stage: resolved
2011-10-11 18:44:28 python-dev set nosy: + python-dev, messages: +
2011-10-09 21:21:56 pitrou set messages: +
2011-10-09 11:23:03 lemburg set nosy: + lemburg, messages: +
2011-10-08 20:55:11 pitrou set files: + memchr2.patch, messages: +
2011-10-08 20:39:45 pitrou set messages: +
2011-10-08 20:29:12 pitrou set files: + memchr.patch, keywords: + patch
2011-10-08 20:29:03 pitrou create