Issue 25106: Hash computation speedup for {buffer, string, unicode}object (original) (raw)

Created on 2015-09-14 07:29 by alecsandru.patrascu, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name              Uploaded                               Description
hash8-v01.patch        alecsandru.patrascu, 2015-09-14 07:29  Initial patch for Python 2.7
hash8-v02.patch        alecsandru.patrascu, 2015-09-14 10:55
hash8-v03-gps01.patch  gregory.p.smith, 2015-09-14 21:53
Messages (21)
msg250639 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2015-09-14 07:29
Hi All,

This is Alecsandru, from the Server Scripting Languages Optimization team at Intel Corporation. I would like to submit a patch that improves the performance of the hash computation code in stringobject, bufferobject and unicodeobject. As can be seen from the attached sample performance results from the Grand Unified Python Benchmark, speedups of up to 40% were observed. Furthermore, we see a 5-7% performance improvement on OpenStack/Swift, where most of the code is in Python 2.7.

Attached is the patch that modifies the Object/stringobject.c, Object/bufferobject.c and Object/unicodeobject.c files. We built and tested this patch for Python 2.7 on our Linux machines (CentOS 7/Ubuntu Server 14.04, Intel Xeon Haswell/Broadwell with 18/8 cores).

Steps to apply the patch:
1. hg clone https://hg.python.org/cpython cpython
2. cd cpython
3. hg update 2.7
4. Copy hash8.patch to the current directory
5. hg import --no-commit hash8.patch
6. ./configure
7. make

In the following, please find our sample performance results, measured on a Xeon Haswell machine.

Hardware (HW):      Intel Xeon (Haswell), 18 cores
BIOS settings:      Intel Turbo Boost Technology: false; Hyper-Threading: false
Operating System:   Ubuntu 14.04.3 LTS trusty
OS configuration:   CPU frequency fixed at 2.0 GHz by
                      echo 2000000 > /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
                      echo 2000000 > /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
                    Address Space Layout Randomization (ASLR) disabled (to reduce run-to-run variation) by
                      echo 0 > /proc/sys/kernel/randomize_va_space
GCC version:        gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04)
Benchmark:          Grand Unified Python Benchmark (GUPB)
GUPB source:        https://hg.python.org/benchmarks/

Python 2.7 results (Python source: hg clone https://hg.python.org/cpython cpython; hg update 2.7):

Benchmark            Speedup (%)
unpack_sequence      40.32733766
chaos                24.84002537
chameleon            23.01392651
silent_logging       22.27202911
django               20.83842317
etree_process        20.46968294
nqueens              20.34234985
pathlib              19.63445919
pidigits             19.34722148
etree_generate       19.25836634
pybench              19.06895825
django_v2            18.06073108
etree_iterparse      17.3797149
fannkuch             17.08120879
pickle_list          16.60363602
raytrace             16.0316265
slowpickle           15.86611184
pickle_dict          15.30447114
call_simple          14.42909032
richards             14.2949594
simple_logging       13.6522626
etree_parse          13.38113097
json_dump_v2         12.26588885
float                11.88164311
mako                 11.20606516
spectral_norm        11.04356684
hg_startup           10.57686164
mako_v2              10.37912648
slowunpickle         10.24030714
go                   10.03567319
meteor_contest       9.956231435
normal_startup       9.607401586
formatted_logging    9.601244811
html5lib             9.082603748
2to3                 8.741557816
html5lib_warmup      8.268150981
nbody                7.507012306
regex_compile        7.153922724
bzr_startup          7.140244739
telco                6.869411927
slowspitfire         5.746323922
tornado_http         5.24360121
rietveld             3.865704876
regex_v8             3.777622219
hexiom2              3.586305282
json_dump            3.477551682
spambayes            3.183991854
fastunpickle         2.971645347
fastpickle           0.673086656
regex_effbot         0.127946837
json_load            0.023727176

Thank you,
Alecsandru
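For context, the loop being unrolled is the classic multiply-XOR byte loop that 2.7 uses for str/buffer hashing. A minimal, simplified sketch of that scalar loop follows (written from memory for illustration; the real code lives in Objects/stringobject.c and additionally mixes in _Py_HashSecret when hash randomization is enabled):

```c
#include <stddef.h>

/* Simplified sketch of the scalar hash loop the patch targets.
 * Hash randomization and the unicode variant are omitted. */
static long
bytes_hash_scalar(const unsigned char *p, size_t len)
{
    if (len == 0)
        return 0;
    long x = *p << 7;                 /* seed from the first byte */
    for (size_t i = 0; i < len; i++)
        x = (1000003L * x) ^ p[i];    /* one multiply-XOR step per byte */
    x ^= (long)len;                   /* fold in the length */
    if (x == -1)
        x = -2;                       /* -1 is reserved for error returns */
    return x;
}
```

Because every iteration depends on the previous value of x, the per-byte loop overhead (increment, compare, branch) is a noticeable fraction of the work, which is what unrolling tries to amortize.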
msg250640 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-09-14 08:06
Python 2.7 is closed for adding new features. Python 3.4+ uses different code for calculating hashes.
msg250642 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-09-14 08:23
> Python 2.7 is closed for adding new features. Python 3.4+ uses different code for calculating hashes.

The code doesn't modify the hash function. It's a common loop-unrolling optimization technique. Since the speedup can be seen on real-world benchmarks, I think this optimization is worth taking.
msg250643 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-09-14 08:41
Hmm. Is Duff's Device a valid construct for CPython? It'd shorten this a lot...
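For reference, Duff's Device folds the remainder handling into the unrolled loop by switching into the middle of a do/while. A hypothetical sketch of what that would look like for this hash loop (illustration only, assuming len > 0; this is not what any of the attached patches do):

```c
#include <stddef.h>

/* Hypothetical Duff's-Device version of the multiply-XOR loop.
 * Assumes len > 0; shown only to illustrate the construct. */
static long
bytes_hash_duff(const unsigned char *p, size_t len)
{
    long x = *p << 7;
    size_t n = (len + 7) / 8;            /* number of do/while passes */

    switch (len % 8) {
    case 0: do { x = (1000003L * x) ^ *p++;
    case 7:      x = (1000003L * x) ^ *p++;
    case 6:      x = (1000003L * x) ^ *p++;
    case 5:      x = (1000003L * x) ^ *p++;
    case 4:      x = (1000003L * x) ^ *p++;
    case 3:      x = (1000003L * x) ^ *p++;
    case 2:      x = (1000003L * x) ^ *p++;
    case 1:      x = (1000003L * x) ^ *p++;
            } while (--n > 0);
    }

    x ^= (long)len;
    if (x == -1)
        x = -2;
    return x;
}
```

It is compact, but jumping into the middle of a loop tends to confuse both readers and some optimizers, which is presumably why the question of whether it is acceptable in CPython comes up.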
msg250644 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-09-14 08:46
Or at the very least, can fall-through be used in the switch block, so there aren't 7+6+5+4+3+2+1 copies of the same line? -- not a C performance expert --
msg250645 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-09-14 09:00
> Or at the very least, can fall-through be used in the switch block, so there
> aren't 7+6+5+4+3+2+1 copies of the same line?

That is how _Py_HashBytes is implemented in 3.4+.
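The shape being discussed is roughly an eight-way unrolled main loop followed by a fall-through switch that handles the trailing bytes with one line per byte. A sketch of the general pattern (assuming len > 0; not the exact code of hash8-v02 or of _Py_HashBytes in 3.4+):

```c
#include <stddef.h>

/* Sketch of an unrolled main loop plus a fall-through switch for the
 * remainder; assumes len > 0. */
static long
bytes_hash_unrolled(const unsigned char *p, size_t len)
{
    long x = *p << 7;
    size_t i = 0;

    /* Main loop: eight multiply-XOR steps per iteration. */
    for (; i + 8 <= len; i += 8) {
        x = (1000003L * x) ^ p[i];
        x = (1000003L * x) ^ p[i + 1];
        x = (1000003L * x) ^ p[i + 2];
        x = (1000003L * x) ^ p[i + 3];
        x = (1000003L * x) ^ p[i + 4];
        x = (1000003L * x) ^ p[i + 5];
        x = (1000003L * x) ^ p[i + 6];
        x = (1000003L * x) ^ p[i + 7];
    }

    /* Remainder: each case falls through to the next, so exactly
     * (len - i) trailing bytes get hashed. */
    switch (len - i) {
    case 7: x = (1000003L * x) ^ p[i++];  /* fall through */
    case 6: x = (1000003L * x) ^ p[i++];  /* fall through */
    case 5: x = (1000003L * x) ^ p[i++];  /* fall through */
    case 4: x = (1000003L * x) ^ p[i++];  /* fall through */
    case 3: x = (1000003L * x) ^ p[i++];  /* fall through */
    case 2: x = (1000003L * x) ^ p[i++];  /* fall through */
    case 1: x = (1000003L * x) ^ p[i++];  /* fall through */
    default: break;
    }

    x ^= (long)len;
    if (x == -1)
        x = -2;
    return x;
}
```

For any input this produces the same hash value as the plain scalar loop; only the amount of per-byte loop overhead changes.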
msg250646 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2015-09-14 09:16
Yes, it can certainly be implemented in less space, as Serhiy and Chris rightly pointed out. I submitted the v01 version like that just to make it clear that I don't want to implement a new hash function, only to unroll the loop in the existing one. I will submit a new version based on these observations.
msg250654 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2015-09-14 10:55
I've submitted a more compact version of the previous code (hash8-v02.patch).
msg250685 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2015-09-14 17:05
Is this applicable to Python 3 at all? From Serhiy's comment I think this might be a backport of a technique already used there, but it's not entirely clear as Alecsandru didn't say why there wasn't a 3.6 patch.
msg250687 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2015-09-14 17:25
This patch is not applicable to Python 3, which already has a better hash function with the same unrolling technique applied. As Brett observed, it is a backport of an optimization technique used in Python 3, applied to the existing hash function in 2.7.
msg250689 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2015-09-14 17:30
If this is applying something we already did in Python 3 and it doesn't break compatibility, then this should fall under the same sort of backport exemption that Guido granted for the eval loop performance enhancements that Intel did a couple of months back.
msg250691 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2015-09-14 17:39
Yes, it doesn't break compatibility with existing applications. To make sure, we tested it against various scenarios and workloads.
msg250692 - (view) Author: Chris Angelico (Rosuav) * Date: 2015-09-14 17:43
+1 for anything that makes Python faster with provably no semantic changes.
msg250695 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-09-14 18:01
The code for str and buffer can be shared as _Py_HashBytes() (as in 3.4+). Also, please add comments in the switch block, as in 3.4+. Should we ask Guido to grant a backport exemption for this optimization?
msg250698 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2015-09-14 18:14
Assuming Benjamin doesn't object, I don't think we need to explicitly ask Guido. He made his feelings quite clear about allowing backports of performance improvements that already exist in Python 3 and have been battle-tested. Having said that, I hope Intel is also looking at Python 3 improvements and not solely backporting stuff to Python 2.7. =)
msg250704 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2015-09-14 21:53
Attaching an updated patch. FWIW, in my own quick tests I'm not seeing a performance difference on the benchmark suite, but I am not running it on an isolated, quiet, locked-frequency machine. It is still a good idea regardless: _some_ compilers really benefit from the manual unrolling, and any that already unroll things themselves should not care.
msg250724 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2015-09-15 06:03
Testing this on a host with a fixed frequency and most background tasks disabled, I cannot reproduce the speedups you list. I see no significant change on most of the benchmarks, a 6% slowdown on json_load, and a 1-4% slowdown on regex_v8 (not sure if those are within the noise). I suspect your comparison included other patches or was not built the same way.
msg250834 - (view) Author: Maciej Fijalkowski (fijall) * (Python committer) Date: 2015-09-16 10:01
I find these numbers really hard to believe. They would mean that over 40% of the django templates benchmark is string hashing (assuming a 2x speedup of the hashing itself), which sounds unbelievable. In fact, in PyPy I never found string hashing to be significant, even though I would expect it to be more of a bottleneck there, since it's almost never optimized away. What made you think string hashing is a good target for optimization?
msg250858 - (view) Author: Alecsandru Patrascu (alecsandru.patrascu) * Date: 2015-09-16 17:38
Please hold this patch for a while, while I analyze the received feedback.
msg255428 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-26 17:29
Ping.
msg255431 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2015-11-26 17:50
Reopen this if you find a reproducible way for this patch to prove useful.
History
Date User Action Args
2022-04-11 14:58:21 admin set github: 69293
2015-11-26 17:50:52 gregory.p.smith set status: open -> closed; resolution: rejected; messages: +
2015-11-26 17:29:05 serhiy.storchaka set status: pending -> open; messages: +
2015-09-19 23:04:48 brett.cannon set status: open -> pending
2015-09-19 22:15:20 JelleZijlstra set status: pending -> open; nosy: + JelleZijlstra
2015-09-16 17:38:49 brett.cannon set status: open -> pending
2015-09-16 17:38:26 alecsandru.patrascu set messages: +
2015-09-16 17:02:40 skip.montanaro set nosy: + skip.montanaro
2015-09-16 10:01:56 fijall set nosy: + fijall; messages: +
2015-09-15 06:03:20 gregory.p.smith set messages: +
2015-09-14 21:53:50 gregory.p.smith set files: + hash8-v03-gps01.patch; messages: +
2015-09-14 18:14:17 brett.cannon set messages: +
2015-09-14 18:01:02 serhiy.storchaka set nosy: + benjamin.peterson; messages: +
2015-09-14 17:43:57 Rosuav set messages: +
2015-09-14 17:39:10 alecsandru.patrascu set messages: +
2015-09-14 17:30:19 brett.cannon set messages: +
2015-09-14 17:25:48 alecsandru.patrascu set messages: +
2015-09-14 17:05:07 brett.cannon set nosy: + brett.cannon; messages: +
2015-09-14 15:12:23 gregory.p.smith set nosy: + gregory.p.smith
2015-09-14 10:55:54 alecsandru.patrascu set messages: +
2015-09-14 10:55:04 alecsandru.patrascu set files: + hash8-v02.patch
2015-09-14 09:16:55 alecsandru.patrascu set messages: +
2015-09-14 09:00:43 serhiy.storchaka set messages: +; title: Hash computation speedup for {buffer,string,unicode}object -> Hash computation speedup for {buffer, string, unicode}object
2015-09-14 08:46:07 Rosuav set messages: +
2015-09-14 08:41:13 Rosuav set nosy: + Rosuav; messages: +
2015-09-14 08:23:19 vstinner set type: enhancement -> performance; messages: +; nosy: + vstinner
2015-09-14 08:06:57 serhiy.storchaka set nosy: + serhiy.storchaka; messages: +
2015-09-14 07:29:28 alecsandru.patrascu create