Issue 17628: str==str: compare the first character before calling memcmp() (original) (raw)

Issue17628

Created on 2013-04-03 21:29 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
unicode_eq.patch vstinner,2013-04-03 21:29 review
benchmark vstinner,2013-04-03 21:33
bench_unicode_eq.py vstinner,2013-04-03 21:58
benchmark2 vstinner,2013-04-03 22:10
dont_compare_first_last.patch vstinner,2013-04-09 22:13 review
Messages (23)
msg185956 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-03 21:29
In Python 3.4, str==str is implemented by calling memcmp(). unicode_eq() function, used by dict and set types, checks the first byte before calling memcmp(). bytes==bytes uses the same check. Py_UNICODE_MATCH macro checks the first *and* last character before calling memcmp() since this commit: --- changeset: 38242:0de9a789de39 branch: legacy-trunk user: Fredrik Lundh <fredrik@pythonware.com> date: Tue May 23 10:10:57 2006 +0000 files: Include/unicodeobject.h description: needforspeed: check first *and* last character before doing a full memcmp --- Attached patch changes str==str to check the first and last character before calling memcmp(). It might reduce the overhead of a C function call, but it is much faster when comparing two different strings of the same length with a common prefix (but a different suffix). The patch merges also unicode_compare_eq() and unicode_eq() to use the same code for str, dict and set. We may use the same optimization on byte strings. See also #16321.
msg185957 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-03 21:33
According to my benchmark, performances are almost the same with the patch. The major difference is on comparing two strings longer than 10 characters, of the same length, with a common prefix but a different suffix. See attached benchmark for the result.
msg185965 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-03 21:58
Attach the benchmark script.
msg185970 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-03 22:10
benchmark2: Results on a slower computer. Comparing equal strings is much faster with the patch. Example: equal, 'A', 1000000 | 945 us (*) 1.25 ms (+32%) I don't understand why the patch makes the comparaison much slower, since most time is supposed to be spend in memcmp()? Is it because I starts at the second character to compare strings, instead of the first character? Memory alignment issue?
msg186007 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-04 06:21
> I don't understand why the patch makes the comparaison much slower, > since most time is supposed to be spend in memcmp()? Because reading the last character evicts useful data from the CPU cache, just before memcmp() reads it again from memory? In other words, I'm not convinced this is a useful heuristic.
msg186012 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-04 08:33
> In other words, I'm not convinced this is a useful heuristic. Me neither, but we should use the same optimization strategy for all functions. If we don't compare first and/or last character for str==str, we should do the same for bytes==bytes and Py_UNICODE_MATCH. str==str performances depends on the compiler and the libc. So performances may be very different on Windows, I will try to run the benchmark on Windows. GCC has also a known performance issue on memcmp. Its builtin memcmp implementation is slower than glibc >= 2.10, especiall glibc >= 2.13. http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052 "GCC can't beat glibc if function call overhead is low." 2013/4/4 Antoine Pitrou <report@bugs.python.org>: > > Antoine Pitrou added the comment: > >> I don't understand why the patch makes the comparaison much slower, >> since most time is supposed to be spend in memcmp()? > > Because reading the last character evicts useful data from the CPU cache, just before memcmp() reads it again from memory? > > In other words, I'm not convinced this is a useful heuristic. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue17628> > _______________________________________
msg186016 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-04-04 09:09
On 04.04.2013 10:33, STINNER Victor wrote: >>> I don't understand why the patch makes the comparaison much slower, >>> since most time is supposed to be spend in memcmp()? >> >> Because reading the last character evicts useful data from the CPU cache, just before memcmp() reads it again from memory? >> >> In other words, I'm not convinced this is a useful heuristic. Same here. The heuristic may work for short strings that easily fit into the CPU cache, but as soon as you use it on longer strings, this will result in much slower comparisons. Whether this results in a speedup or not also depends a lot on the domain of where you need to run comparisons, e.g. if you have run the heuristic on Python's special method names (such as "__init__") it won't give you any benefit. OTOH, it's easy to construct strings that benefit a lot from it :-) Something that typically works well in practice is to inline the comparison of the first few characters and then call memcmp() on the remaining ones. This avoids cache corruption and safes a few cycles setup costs for memcmp() for short strings.
msg186017 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-04 09:21
By the way, my initial concern was to merge unicode_compare_eq() and unicode_eq() functions. unicode_compare_eq() only uses memcmp(), whereas unicode_eq() checks if the first byte is different before calling memcmp(). So the question is to decide which one is faster ;-)
msg186019 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-04-04 09:30
On 04.04.2013 11:21, STINNER Victor wrote: > > STINNER Victor added the comment: > > By the way, my initial concern was to merge unicode_compare_eq() and > unicode_eq() functions. > > unicode_compare_eq() only uses memcmp(), whereas unicode_eq() checks > if the first byte is different before calling memcmp(). So the > question is to decide which one is faster ;-) The second one was faster a couple of years ago. Things may have changed since then (better compilers, CPUs, etc.). Perhaps you could run a benchmark with increasing sizes of strings, one set with mismatches in the last character, the other with matching strings.
msg186044 - (view) Author: Eric Snow (eric.snow) * (Python committer) Date: 2013-04-04 17:00
> Marc-Andre Lemburg added the comment: > Same here. The heuristic may work for short strings that easily fit > into the CPU cache, but as soon as you use it on longer strings, > this will result in much slower comparisons. When testing both, would it help to test the end of the string before the beginning? I'd expect that be more likely to leave the beginning in the cache for any subsequent memcmp() call.
msg186045 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-04-04 17:30
On 04.04.2013 19:00, Eric Snow wrote: > > Eric Snow added the comment: > >> Marc-Andre Lemburg added the comment: >> Same here. The heuristic may work for short strings that easily fit >> into the CPU cache, but as soon as you use it on longer strings, >> this will result in much slower comparisons. > > When testing both, would it help to test the end of the string before the beginning? I'd expect that be more likely to leave the beginning in the cache for any subsequent memcmp() call. Again: this depends a lot on what strings you are dealing with. If you are comparing strings that only vary in the first few characters, testing the last character first would not be ideal :-) Given that CPUs are optimized to read ahead in memory, it's always better to avoid jumping around too much when accessing memory. http://en.wikipedia.org/wiki/CPU_cache http://en.wikipedia.org/wiki/Locality_of_reference http://lwn.net/Articles/252125/ Ideally, you want to stay within a cache line, typically 64 bytes.
msg186231 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-04-07 17:31
Note that unicode_eq() always called after identity check and hash check. I.e. identity check in Victor's patch is redundant and unicode_eq() called only for strings which have the same hash. The probability to have the same first byte and be equal is a great. unicode_compare_eq() and unicode_eq() are designed for different purposes. They cannot be just merged. As the optimization for unicode_eq(), I would have suggested a checking of first machine words instead of first bytes, but this trick is too dirty.
msg186458 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-09 22:13
Because most people agree that checking first and/or last byte/character is not a good idea (may be slower), here is a new patch removing code checking first/last byte or character in bytes_richcompare() and unicode_eq(). It removes the usage of the "register" keyword, I read that modern compilers generate worse code when this keyword is used. Register allocators of modern compilers known better than you which variables should use a register.
msg186464 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-04-09 22:24
> Because most people agree that checking first and/or last > byte/character is not a good idea (may be slower), here is a new patch > removing code checking first/last byte or character in > bytes_richcompare() and unicode_eq(). You misunderstood. Checking the first byte is ok, it's checking the last byte which can misfire. > It removes the usage of the "register" keyword, I read that modern > compilers generate worse code when this keyword is used. Register > allocators of modern compilers known better than you which variables > should use a register. I think you should open a separate issue to remove all instances of the "register" keyword in the code base. It has become useless, any modern compiler will allocate registers by itself, happily ignoring any "register" hint.
msg186733 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-04-13 15:30
Some crazy ideas. Try something like this: #define BLOCK unsigned long if (size >= sizeof(BLOCK)) { if (*(BLOCK*)data1 != *(BLOCK*)data2) return 0; return (memcmp((unsigned char*)data1 + sizeof(BLOCK), (unsigned char*)data2 + sizeof(BLOCK), size) == 0); } if (*(unsigned char*)data1 != *(unsigned char*)data2) return 0; return (memcmp(data1, data2, size) == 0); Or may be unroll memcmp for small size: switch (size) { #if SIZEOF_LONG == 8 case 7: ... #endif case 3: ... case 2: if (((unsigned char*)data1)[1] != ((unsigned char*)data2)[1]) return 0; case 1: if (((unsigned char*)data1)[0] != ((unsigned char*)data2)[0]) return 0; case 0: return 1; default: // case for size >= sizeof(BLOCK) }
msg186779 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-13 18:21
You must check that data is aligned. Did you run a benchmark?
msg195234 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-15 05:33
The dont-compare-first-last patch looks about right. The "if (len == 0) return 1;" shortcut perhaps should be taken out. It makes the common case pay (if only slightly) for the rare case (which of course, never gets predicted). This whole code block gets inlined in the very tight inner loops of the set/dict lookkey functions -- it should be as thin as possible. One other thing to take a look at is the "PyUnicode_GET_LENGTH(a) * PyUnicode_KIND(a)" expression. The disassembly shows an imulq instruction rather than the usual scaled LEA computation. I don't know if this can be sped-up by a predictable branch for the common case.
msg195236 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-08-15 07:24
Perhaps it worth manually inline unicode_eq() in these tight inner loops. Then we can move "PyUnicode_GET_LENGTH(a) * PyUnicode_KIND(a)" out of the loop.
msg195240 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-08-15 09:22
In issue #18719, Raymond modified Python 2.7, but he didn't touch the following macro: #define Py_UNICODE_MATCH(string, offset, substring) \ ((*((string)->str + (offset)) == *((substring)->str)) && \ ((*((string)->str + (offset) + (substring)->length-1) == *((substring)->str + (substring)->length-1))) && \ !memcmp((string)->str + (offset), (substring)->str, (substring)->length*sizeof(Py_UNICODE))) It was said that looking for the last character before calling memcmp() is inefficient for the CPU cache. This macro should also be modified.
msg195241 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-08-15 09:28
It's also bad for memory read performance if the string is rather long. The memory controller performs best when code reads memory sequential. The talk http://www.youtube.com/watch?v=MC1EKLQ2Wmg about mythbusting modern hardware sums it up real nice.
msg195419 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-16 21:08
FWIW, there are two distinct issues. As everyone has noted here, accessing memory in non-sequential order is a performance killer. The other issue (the one I was working on) is that early-out first-char or last-char tests are a waste (almost never executed) if we already know that the string hashes match (i.e. in the string equality calls from the dict and set lookup routines). If the hashes match, the odds are 2**64 to 1 if favor of the strings being equal.
msg196690 - (view) Author: Martin Mokrejs (mmokrejs) Date: 2013-09-01 00:04
Regarding benchmarking and code performance inspection, maybe you want to try on your linux box: perf top perf stat /usr/bin/python mytest.py http://perf.wiki.kernel.org/
msg238417 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-03-18 11:06
The benefit of any change is unclear. I'm not more interested to work on such optimization, so I just close the issue.
History
Date User Action Args
2022-04-11 14:57:43 admin set github: 61828
2015-03-18 11:06:35 vstinner set status: open -> closedresolution: out of datemessages: +
2013-09-01 00:04:15 mmokrejs set nosy: + mmokrejsmessages: +
2013-08-16 21:08:42 rhettinger set messages: +
2013-08-15 09:28:36 christian.heimes set messages: +
2013-08-15 09:22:11 vstinner set messages: +
2013-08-15 07:24:22 serhiy.storchaka set messages: +
2013-08-15 05:33:49 rhettinger set nosy: + rhettingermessages: +
2013-07-05 23:25:52 christian.heimes set nosy: + christian.heimes
2013-06-05 00:35:47 vstinner set title: str==str: compare the first and last character before calling memcmp() -> str==str: compare the first character before calling memcmp()
2013-04-13 18:21:31 vstinner set messages: +
2013-04-13 15:30:12 serhiy.storchaka set messages: +
2013-04-09 22:24:10 pitrou set messages: +
2013-04-09 22:13:49 vstinner set files: + dont_compare_first_last.patchmessages: +
2013-04-07 17:31:57 serhiy.storchaka set messages: +
2013-04-04 17:30:09 lemburg set messages: +
2013-04-04 17:00:13 eric.snow set nosy: + eric.snowmessages: +
2013-04-04 09:30:40 lemburg set messages: +
2013-04-04 09:21:43 vstinner set messages: +
2013-04-04 09:09:36 lemburg set nosy: + lemburgmessages: +
2013-04-04 08:33:11 vstinner set messages: +
2013-04-04 06:21:07 pitrou set messages: +
2013-04-03 22:10:33 vstinner set files: + benchmark2messages: +
2013-04-03 21:58:19 vstinner set files: + bench_unicode_eq.pymessages: +
2013-04-03 21:33:03 vstinner set files: + benchmarkmessages: +
2013-04-03 21:29:12 vstinner create