msg335714 - (view) |
Author: Raymond Hettinger (rhettinger) *  |
Date: 2019-02-16 20:15 |
Benchmark show what writes to class variables are anomalously slow. class A(object): pass A.x = 1 # This write is 3 to 5 times slower than other writes. FWIW, the same operation for old-style classes in Python 2.7 was several times faster. We should investigate to understand why the writes are so slow. There might be a good reason or there might be an opportunity for optimization. ------------------------------------------------- $ python3.8 Tools/scripts/var_access_benchmark.py Variable and attribute read access: 4.3 ns read_local 4.6 ns read_nonlocal 14.5 ns read_global 19.0 ns read_builtin 18.4 ns read_classvar_from_class 16.2 ns read_classvar_from_instance 24.7 ns read_instancevar 19.7 ns read_instancevar_slots 19.5 ns read_namedtuple 26.4 ns read_boundmethod Variable and attribute write access: 4.4 ns write_local 5.1 ns write_nonlocal 18.2 ns write_global 103.9 ns write_classvar <== Outlier 35.4 ns write_instancevar 25.6 ns write_instancevar_slots |
|
|
msg335733 - (view) |
Author: Pablo Galindo Salgado (pablogsal) *  |
Date: 2019-02-16 23:34 |
It seems 50% of the overhead (50ns) is due to two reasons: - 30-40% is due to the call to update_slot(type, name) after the item is set in the class dictionary. - 70-60% is due to all the extra work from _PyObject_GenericSetAttrWithDict until it ends calling _PyObjectDict_SetItem with the slot. If in the code of typeobject.c:type_setattro you change: res = _PyObject_GenericSetAttrWithDict((PyObject *)type, name, value, NULL); by PyObject* dictptr = _PyObject_GetDictPtr(type); res = _PyObjectDict_SetItem(type, dictptr, name, value); and delete the update_slot(type, name) call afterwards, the times are reduced to 50ns. |
|
|
msg335747 - (view) |
Author: Yasser Alshalaan (Yasser Alshalaan) * |
Date: 2019-02-17 02:53 |
can you include your python 2.7 runs? for me it looks similar |
|
|
msg335763 - (view) |
Author: Raymond Hettinger (rhettinger) *  |
Date: 2019-02-17 08:50 |
> can you include your python 2.7 runs? for me it looks similar It will give similar results unless you switch to old-style classes (edit out the inheritance from object). class A: pass A.x = 1 --------------------------------------- $ python2.7 var_access_benchmark.py Variable and attribute read access: 6.7 ns read_local 10.9 ns read_global 18.9 ns read_builtin 24.4 ns read_classvar_from_class 30.2 ns read_classvar_from_instance 22.7 ns read_instancevar 25.5 ns read_instancevar_slots 99.3 ns read_namedtuple 40.9 ns read_boundmethod Variable and attribute write access: 8.2 ns write_local 18.7 ns write_global 32.9 ns write_classvar <== Not formerly an outlier 32.5 ns write_instancevar 31.2 ns write_instancevar_slots |
|
|
msg335776 - (view) |
Author: Stefan Behnel (scoder) *  |
Date: 2019-02-17 14:36 |
It turns out that "update_slot()" is always called, even when we are not updating a slot name (which is always a special dunder-name). The linear search for names in "update_slots()" is a huge waste of time here, and short-circuiting out of it when the name does not start with "_" cuts the overall update time by 50%. I pushed a PR. Another improvement would be a sub-linear algorithm for searching the slot name, but that's a bigger change. |
|
|
msg335794 - (view) |
Author: Pablo Galindo Salgado (pablogsal) *  |
Date: 2019-02-18 00:59 |
This are the timings that I am measuring with PR 11907: Variable and attribute read access: 5.7 ns read_local 5.9 ns read_nonlocal 16.2 ns read_global 24.5 ns read_builtin 20.9 ns read_classvar_from_class 20.0 ns read_classvar_from_instance 29.7 ns read_instancevar 24.7 ns read_instancevar_slots 22.9 ns read_namedtuple 36.8 ns read_boundmethod Variable and attribute write access: 6.9 ns write_local 6.9 ns write_nonlocal 26.7 ns write_global 65.4 ns write_classvar <----- Down from 120.6 ns 49.5 ns write_instancevar 34.5 ns write_instancevar_slots |
|
|
msg335795 - (view) |
Author: Raymond Hettinger (rhettinger) *  |
Date: 2019-02-18 01:32 |
Wow, I didn't expect to get an immediate win this of this magnitude :-) |
|
|
msg335992 - (view) |
Author: Neil Schemenauer (nascheme) *  |
Date: 2019-02-19 18:25 |
Some profiling using 'perf'. This is for cpython 63fa1cfece4912110ce3a0ff11fb3ade3ff5e756. children self [...] + 97.27% 0.00% run_mod (inlined) + 88.53% 6.33% PyObject_SetAttr + 79.34% 6.80% type_setattro + 43.92% 43.92% update_slot + 26.87% 5.84% _PyObject_GenericSetAttrWithDict + 14.62% 6.52% insertdict + 8.80% 8.80% lookdict_unicode_nodummy + 6.57% 0.00% _Py_DECREF (inlined) + 5.19% 5.19% PyUnicode_InternInPlace + 3.57% 3.57% _PyObjectDict_SetItem + 3.38% 3.38% _PyType_Lookup + 1.71% 0.00% _Py_INCREF (inlined) + 1.42% 0.00% _Py_XDECREF (inlined) [...] After applying PR 11907: children self [...] + 94.76% 0.00% run_mod (inlined) + 75.22% 6.77% PyObject_SetAttr + 64.65% 13.08% type_setattro + 47.51% 11.96% _PyObject_GenericSetAttrWithDict + 22.99% 13.95% insertdict + 10.10% 10.10% lookdict_unicode_nodummy + 9.47% 9.47% PyUnicode_InternInPlace + 7.10% 7.10% _PyObjectDict_SetItem + 7.02% 7.02% _PyType_Lookup + 6.52% 0.00% _Py_DECREF (inlined) + 3.17% 0.00% _Py_INCREF (inlined) + 3.16% 0.00% _Py_XDECREF (inlined) + 2.59% 0.00% PyDict_SetItem (inlined) + 1.50% 0.00% is_dunder_name (inlined) [...] The PyUnicode_InternInPlace() can mostly be eliminated by testing PyUnicode_CHECK_INTERNED() first (we already have called PyUnicode_Check() on it). That only gives a 7% speedup on my machine though. The is_dunder_name() is a much bigger optimization. |
|
|
msg335995 - (view) |
Author: Neil Schemenauer (nascheme) *  |
Date: 2019-02-19 18:35 |
BTW, 'perf report [...]' has a really neat annotated assembly view. Scroll to the function you are interested in and press 'a'. Press 't' to toggle the time units (left side numbers). I'm attaching a screenshot of the disassembly of the lookdict function. The time units are sample accounts. Each count is a point where the profiler woke up on that specific instruction. |
|
|
msg336031 - (view) |
Author: Raymond Hettinger (rhettinger) *  |
Date: 2019-02-20 00:45 |
Thanks Neil. The tooling does indeed look nice. I added your PyUnicode_InternInPlace() suggestion to the PR. At this point, the PR looks ready-to-go unless any of you think we've missed some low-hanging fruit. |
|
|
msg336129 - (view) |
Author: Raymond Hettinger (rhettinger) *  |
Date: 2019-02-20 17:29 |
New changeset d8b9e1fc2e45d2bc3f4a9737c375f2adb8a8c7de by Raymond Hettinger (Stefan Behnel) in branch 'master': bpo-36012: Avoid linear slot search for non-dunder methods (GH-11907) https://github.com/python/cpython/commit/d8b9e1fc2e45d2bc3f4a9737c375f2adb8a8c7de |
|
|