msg185956
Author: STINNER Victor (vstinner)
Date: 2013-04-03 21:29
In Python 3.4, str==str is implemented by calling memcmp(). The unicode_eq() function, used by the dict and set types, checks the first byte before calling memcmp(); bytes==bytes uses the same check. The Py_UNICODE_MATCH macro has checked the first *and* last character before calling memcmp() since this commit:

---
changeset:   38242:0de9a789de39
branch:      legacy-trunk
user:        Fredrik Lundh <fredrik@pythonware.com>
date:        Tue May 23 10:10:57 2006 +0000
files:       Include/unicodeobject.h
description: needforspeed: check first *and* last character before doing a full memcmp
---

The attached patch changes str==str to check the first and last character before calling memcmp(). It might reduce the overhead of the C function call, and it is much faster when comparing two different strings of the same length that share a common prefix but differ in the suffix.

The patch also merges unicode_compare_eq() and unicode_eq(), so that str, dict and set use the same code.

We may use the same optimization on byte strings. See also #16321.
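A minimal sketch of the first-and-last check described above (a hypothetical helper, not the actual patch; assumes both buffers hold len bytes):

    #include <string.h>

    static int
    first_last_then_memcmp(const unsigned char *a, const unsigned char *b,
                           size_t len)
    {
        if (len == 0)
            return 1;
        /* cheap early-outs before paying for the memcmp() call */
        if (a[0] != b[0] || a[len - 1] != b[len - 1])
            return 0;
        return memcmp(a, b, len) == 0;
    }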

msg185957
Author: STINNER Victor (vstinner)
Date: 2013-04-03 21:33
According to my benchmark, performance is almost the same with the patch. The main difference appears when comparing two strings longer than 10 characters, of the same length, with a common prefix but a different suffix. See the attached benchmark results.

msg185965
Author: STINNER Victor (vstinner)
Date: 2013-04-03 21:58
Attaching the benchmark script.

msg185970
Author: STINNER Victor (vstinner)
Date: 2013-04-03 22:10
benchmark2: results on a slower computer. Comparing equal strings is much slower with the patch. Example:

equal, 'A', 1000000 | 945 us (*) | 1.25 ms (+32%)

I don't understand why the patch makes the comparison much slower, since most of the time is supposed to be spent in memcmp(). Is it because the comparison starts at the second character instead of the first? A memory alignment issue?

msg186007
Author: Antoine Pitrou (pitrou)
Date: 2013-04-04 06:21
> I don't understand why the patch makes the comparison much slower,
> since most of the time is supposed to be spent in memcmp().

Because reading the last character evicts useful data from the CPU cache, just before memcmp() reads it again from memory?

In other words, I'm not convinced this is a useful heuristic.

msg186012
Author: STINNER Victor (vstinner)
Date: 2013-04-04 08:33
> In other words, I'm not convinced this is a useful heuristic.

Me neither, but we should use the same optimization strategy for all functions. If we don't compare the first and/or last character for str==str, we should do the same for bytes==bytes and Py_UNICODE_MATCH.

str==str performance depends on the compiler and the libc, so performance may be very different on Windows; I will try to run the benchmark there.

GCC also has a known performance issue with memcmp: its builtin implementation is slower than the one in glibc >= 2.10, especially glibc >= 2.13. See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052: "GCC can't beat glibc if function call overhead is low."

msg186016
Author: Marc-Andre Lemburg (lemburg)
Date: 2013-04-04 09:09
On 04.04.2013 10:33, STINNER Victor wrote:
>>> I don't understand why the patch makes the comparison much slower,
>>> since most of the time is supposed to be spent in memcmp().
>>
>> Because reading the last character evicts useful data from the CPU cache, just before memcmp() reads it again from memory?
>>
>> In other words, I'm not convinced this is a useful heuristic.

Same here. The heuristic may work for short strings that easily fit into the CPU cache, but as soon as you use it on longer strings, it will result in much slower comparisons.

Whether it results in a speedup also depends a lot on the domain where you run the comparisons: e.g. if you run the heuristic on Python's special method names (such as "__init__"), it won't give you any benefit. OTOH, it's easy to construct strings that benefit a lot from it :-)

Something that typically works well in practice is to inline the comparison of the first few characters and then call memcmp() on the remaining ones. This avoids the cache eviction and saves a few cycles of setup cost for memcmp() on short strings.
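A minimal sketch of that inlined-prefix idea (a hypothetical helper; the cutoff of 4 bytes is an assumption, not a measured value):

    #include <string.h>

    static int
    prefix_then_memcmp(const unsigned char *a, const unsigned char *b,
                       size_t len)
    {
        size_t i, head = len < 4 ? len : 4;

        /* Compare the first few bytes inline, strictly left to right,
           so memory is only ever touched sequentially. */
        for (i = 0; i < head; i++) {
            if (a[i] != b[i])
                return 0;   /* mismatch found without calling memcmp() */
        }
        /* memcmp() with length 0 is a no-op, so short strings are safe. */
        return memcmp(a + head, b + head, len - head) == 0;
    }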

msg186017
Author: STINNER Victor (vstinner)
Date: 2013-04-04 09:21
By the way, my initial concern was to merge the unicode_compare_eq() and unicode_eq() functions. unicode_compare_eq() only uses memcmp(), whereas unicode_eq() checks whether the first byte differs before calling memcmp(). So the question is which one is faster ;-)
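Schematically, the two strategies look like this (a sketch, not the actual CPython functions; data1/data2/size stand for each string's canonical representation and its length in bytes):

    #include <string.h>

    /* unicode_compare_eq() style: plain memcmp() */
    static int
    eq_plain(const void *data1, const void *data2, size_t size)
    {
        return memcmp(data1, data2, size) == 0;
    }

    /* unicode_eq() style: check the first byte before calling memcmp() */
    static int
    eq_first_byte(const void *data1, const void *data2, size_t size)
    {
        if (size > 0
            && *(const unsigned char *)data1 != *(const unsigned char *)data2)
            return 0;
        return memcmp(data1, data2, size) == 0;
    }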

msg186019
Author: Marc-Andre Lemburg (lemburg)
Date: 2013-04-04 09:30
On 04.04.2013 11:21, STINNER Victor wrote:
> By the way, my initial concern was to merge the unicode_compare_eq()
> and unicode_eq() functions. unicode_compare_eq() only uses memcmp(),
> whereas unicode_eq() checks whether the first byte differs before
> calling memcmp(). So the question is which one is faster ;-)

The second one was faster a couple of years ago. Things may have changed since then (better compilers, CPUs, etc.).

Perhaps you could run a benchmark with increasing string sizes: one set with mismatches in the last character, the other with matching strings.

msg186044
Author: Eric Snow (eric.snow)
Date: 2013-04-04 17:00
> Marc-Andre Lemburg added the comment:
> Same here. The heuristic may work for short strings that easily fit
> into the CPU cache, but as soon as you use it on longer strings,
> it will result in much slower comparisons.

When testing both, would it help to test the end of the string before the beginning? I'd expect that to be more likely to leave the beginning in the cache for any subsequent memcmp() call.

msg186045
Author: Marc-Andre Lemburg (lemburg)
Date: 2013-04-04 17:30
On 04.04.2013 19:00, Eric Snow wrote:
> When testing both, would it help to test the end of the string before
> the beginning? I'd expect that to be more likely to leave the
> beginning in the cache for any subsequent memcmp() call.

Again: this depends a lot on what strings you are dealing with. If you are comparing strings that only vary in the first few characters, testing the last character first would not be ideal :-)

Given that CPUs are optimized to read ahead in memory, it's always better to avoid jumping around too much when accessing memory:

http://en.wikipedia.org/wiki/CPU_cache
http://en.wikipedia.org/wiki/Locality_of_reference
http://lwn.net/Articles/252125/

Ideally, you want to stay within a cache line, typically 64 bytes.

msg186231
Author: Serhiy Storchaka (serhiy.storchaka)
Date: 2013-04-07 17:31
Note that unicode_eq() is always called after an identity check and a hash check, i.e. the identity check in Victor's patch is redundant, and unicode_eq() is only called for strings which have the same hash. For such strings, the probability that the first bytes match, and that the strings are equal, is very high.

unicode_compare_eq() and unicode_eq() are designed for different purposes; they cannot simply be merged.

As an optimization for unicode_eq(), I would have suggested checking the first machine word instead of the first byte, but this trick is too dirty.

msg186458
Author: STINNER Victor (vstinner)
Date: 2013-04-09 22:13
Since most people agree that checking the first and/or last byte/character is not a good idea (it may be slower), here is a new patch removing the code that checks the first/last byte or character in bytes_richcompare() and unicode_eq().

The patch also removes the usage of the "register" keyword: I have read that modern compilers generate worse code when this keyword is used. The register allocators of modern compilers know better than you which variables should live in a register.

msg186464
Author: Antoine Pitrou (pitrou)
Date: 2013-04-09 22:24
> Since most people agree that checking the first and/or last
> byte/character is not a good idea (it may be slower), here is a new
> patch removing the code that checks the first/last byte or character
> in bytes_richcompare() and unicode_eq().

You misunderstood. Checking the first byte is ok; it's checking the last byte which can misfire.

> The patch also removes the usage of the "register" keyword: I have
> read that modern compilers generate worse code when this keyword is
> used.

I think you should open a separate issue to remove all instances of the "register" keyword in the code base. It has become useless: any modern compiler will allocate registers by itself, happily ignoring any "register" hint.

msg186733
Author: Serhiy Storchaka (serhiy.storchaka)
Date: 2013-04-13 15:30
Some crazy ideas. Try something like this:

#define BLOCK unsigned long

    if (size >= sizeof(BLOCK)) {
        /* compare the first machine word before calling memcmp() */
        if (*(BLOCK*)data1 != *(BLOCK*)data2)
            return 0;
        return (memcmp((unsigned char*)data1 + sizeof(BLOCK),
                       (unsigned char*)data2 + sizeof(BLOCK),
                       size - sizeof(BLOCK)) == 0);
    }
    if (*(unsigned char*)data1 != *(unsigned char*)data2)
        return 0;
    return (memcmp(data1, data2, size) == 0);

Or maybe unroll memcmp for small sizes:

    switch (size) {
#if SIZEOF_LONG == 8
    case 7:
        ...
#endif
    case 3:
        ...
    case 2:
        if (((unsigned char*)data1)[1] != ((unsigned char*)data2)[1])
            return 0;
        /* fall through */
    case 1:
        if (((unsigned char*)data1)[0] != ((unsigned char*)data2)[0])
            return 0;
        /* fall through */
    case 0:
        return 1;
    default:
        // case for size >= sizeof(BLOCK)
        ...
    }

msg186779
Author: STINNER Victor (vstinner)
Date: 2013-04-13 18:21
You must check that data is aligned. Did you run a benchmark?

msg195234
Author: Raymond Hettinger (rhettinger)
Date: 2013-08-15 05:33
The dont-compare-first-last patch looks about right.

The "if (len == 0) return 1;" shortcut should perhaps be taken out. It makes the common case pay (if only slightly) for the rare case (which, of course, never gets predicted). This whole code block gets inlined in the very tight inner loops of the set/dict lookkey functions, so it should be as thin as possible.

One other thing to take a look at is the "PyUnicode_GET_LENGTH(a) * PyUnicode_KIND(a)" expression. The disassembly shows an imulq instruction rather than the usual scaled LEA computation. I don't know if this can be sped up by a predictable branch for the common case.
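A sketch of that predictable-branch idea (hypothetical and unmeasured; "kind" mirrors PyUnicode_KIND(), which is always 1, 2 or 4):

    #include <stddef.h>

    static size_t
    byte_length(size_t len, unsigned int kind)
    {
        /* Branch on the common 1-byte (Latin-1) case so the general
           multiply is only paid for the 2- and 4-byte kinds. */
        if (kind == 1)
            return len;
        return len * kind;
    }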

msg195236
Author: Serhiy Storchaka (serhiy.storchaka)
Date: 2013-08-15 07:24
Perhaps it's worth manually inlining unicode_eq() in these tight inner loops. Then we could move "PyUnicode_GET_LENGTH(a) * PyUnicode_KIND(a)" out of the loop.

msg195240
Author: STINNER Victor (vstinner)
Date: 2013-08-15 09:22
In issue #18719, Raymond modified Python 2.7, but he didn't touch the following macro:

#define Py_UNICODE_MATCH(string, offset, substring) \
    ((*((string)->str + (offset)) == *((substring)->str)) && \
     ((*((string)->str + (offset) + (substring)->length-1) == *((substring)->str + (substring)->length-1))) && \
     !memcmp((string)->str + (offset), (substring)->str, (substring)->length*sizeof(Py_UNICODE)))

It was said that checking the last character before calling memcmp() is inefficient for the CPU cache, so this macro should be modified as well.

msg195241
Author: Christian Heimes (christian.heimes)
Date: 2013-08-15 09:28
It's also bad for memory read performance if the string is rather long: the memory controller performs best when code reads memory sequentially. The talk about mythbusting modern hardware at http://www.youtube.com/watch?v=MC1EKLQ2Wmg sums it up really nicely.

msg195419
Author: Raymond Hettinger (rhettinger)
Date: 2013-08-16 21:08
FWIW, there are two distinct issues. As everyone has noted here, accessing memory in non-sequential order is a performance killer. The other issue (the one I was working on) is that early-out first-char or last-char tests are a waste (almost never executed) if we already know that the string hashes match, i.e. in the string equality calls from the dict and set lookup routines. If the hashes match, the odds are 2**64 to 1 in favor of the strings being equal.

msg196690
Author: Martin Mokrejs (mmokrejs)
Date: 2013-09-01 00:04
Regarding benchmarking and code performance inspection, maybe you want to try these on your Linux box:

perf top
perf stat /usr/bin/python mytest.py

http://perf.wiki.kernel.org/

msg238417
Author: STINNER Victor (vstinner)
Date: 2015-03-18 11:06
The benefit of any change is unclear, and I am no longer interested in working on this optimization, so I am closing the issue.