Issue 10227: Improve performance of MemoryView slicing (original) (raw)

Created on 2010-10-29 08:50 by kristjan.jonsson, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (34)

msg119872 - (view)

Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer)

Date: 2010-10-29 08:50

In a recent email exchange on python-dev, Antoine Pitrou mentioned that slicing memoryview objects (lazy slices) wasn't necessarily very efficient when dealing with short slices. The data he posted was:

$ ./python -m timeit -s "x = b'x'*10000" "x[:100]" 10000000 loops, best of 3: 0.134 usec per loop $ ./python -m timeit -s "x = memoryview(b'x'*10000)" "x[:100]" 10000000 loops, best of 3: 0.151 usec per loop

Actually, this is not a fair comparison. A more realistic alternative to the memoryview is the bytearray, a mutable buffer. My local tests gave these numbers:

python.exe -m timeit -n 10000000 -s "x = ((b'x'*10000))" "x[:100]" 10000000 loops, best of 3: 0.14 usec per loop

python.exe -m timeit -n 10000000 -s "x = (bytearray(b'x'*10000))" "x[:100]" 10000000 loops, best of 3: 0.215 usec per loop

python.exe -m timeit -n 10000000 -s "x = memoryview(bytearray(b'x'*10000))" "x[:100]" 10000000 loops, best of 3: 0.163 usec per loop

In this case, lazy slicing is indeed faster than greedy slicing. However, I was intrigued by how much these cases differ. Why was slicing bytes objects so much faster? Each should just result in the generation of a single object.

It turns out that the slicing operation for strings (and sequences is very streamlined in the core. To address this to some extent I provide a patch with three main components:

There is now a single object cache of slice objects. These are generated by the core when slicing and immediately released. Reusing them if possible is very beneficial.
The PySlice_GetIndicesEx couldn't be optimized because of aliasing. Fixing that function sped it up considerably.
Creating a new api to create a memory view from a base memory view and a slice is much faster. The old way would do two copies of a Py_buffer with adverse effects on cache performance.

Applying this patch provides the following figures: python.exe -m timeit -n 10000000 -s "x = ((b'x'*10000))" "x[:100]" 10000000 loops, best of 3: 0.125 usec per loop

python.exe -m timeit -n 10000000 -s "x = (bytearray(b'x'*10000))" "x[:100]" 10000000 loops, best of 3: 0.202 usec per loop

python.exe -m timeit -n 10000000 -s "x = memoryview(bytearray(b'x'*10000))" "x[:100]" 10000000 loops, best of 3: 0.138 usec per loop

in memoryobject.c there was a comment stating that there should be an API for this. Now there is, only internal.

msg119874 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2010-10-29 09:14

You forgot to attach your patch.

msg119875 - (view)

Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer)

Date: 2010-10-29 09:50

Oh dear. Here it is.

msg119876 - (view)

Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer)

Date: 2010-10-29 10:04

But then, perhaps implementing the sequence protocol for memoryviews might be more efficient still.

msg119878 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2010-10-29 10:12

The sequence protocol (if I'm not confused) only work with a PyObject ** array.

msg119880 - (view)

Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer)

Date: 2010-10-29 10:20

As an additional point: the PyMemoryObject has a "base" member that I think is redundant. the "view.obj" should be sufficient.

msg119881 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2010-10-29 10:23

As an additional point: the PyMemoryObject has a "base" member that I think is redundant. the "view.obj" should be sufficient.

Yes, that's what I think as well.

msg119884 - (view)

Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer)

Date: 2010-10-29 10:48

In 2.x, strings are sliced using PySequence_GetSlice(). ceval.c in 3.0 is different, there is no apply_slice there (despite comments to that effect). I'd have to take another look with the profiler to figure out how bytes slicing in 3.0 works, but I suspect that it is somehow fasttracked passed the creation of slice objects, etc.

msg119885 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2010-10-29 10:50

I'd have to take another look with the profiler to figure out how bytes slicing in 3.0 works, but I suspect that it is somehow fasttracked passed the creation of slice objects, etc.

I don't think it is fasttracked at all. Even plain indexing is not fasttracked either.

msg119886 - (view)

Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer)

Date: 2010-10-29 10:57

Well then, its back to the profiler for 3.2. I did all of the profiling with 2.7 for practical reasons (it was the only version I had available at the time) and then ported the change to 3.2 today. But obviously there are different rules in 3.2 :)

msg120113 - (view)

Author: Stefan Behnel (scoder) * (Python committer)

Date: 2010-11-01 09:18

I find it a lot easier to appreciate patches that implement a single change than those that mix different changes. There are three different things in your patch, which I would like to see in at least three different commits. I'd be happy if you could separate the changes into more readable feature patches. That makes it easier to accept them.

I'm generally happy about the slice changes, but you will have to benchmark the equivalent changes in Py3.2 to prove that they are similarly worth applying there.

msg120115 - (view)

Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer)

Date: 2010-11-01 09:51

The benchmarks are from 3.2 Also, I'll do a more relevant profiling session for 3.2. This patch is based on profiling results from 2.7 so there might be more relevant optimization cases in 3.2

msg120117 - (view)

Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer)

Date: 2010-11-01 09:52

In case I'm not clear enough: The patch is for 3.2, the benchmarks are 3.2, but it was created based on 2.7 results, which may not fully apply for 3.2

msg127695 - (view)

Author: Stefan Behnel (scoder) * (Python committer)

Date: 2011-02-01 12:42

I've extracted and fixed the part of this patch that implements the slice object cache. In particular, PySlice_Fini() was incorrectly implemented. This patch applies cleanly for me against the latest py3k branch.

msg127733 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2011-02-02 14:50

Any benchmark numbers for the slice cache? Also, is the call to PyObject_INIT necessary?

msg127746 - (view)

Author: Stefan Behnel (scoder) * (Python committer)

Date: 2011-02-02 18:48

Any benchmark numbers for the slice cache?

I ran the list tests in pybench and got this:

Test minimum run-time average run-time this other diff this other diff

ListSlicing:    66ms    67ms   -2.2%    67ms    68ms   -2.7%
 SmallLists:    61ms    64ms   -4.5%    61ms    65ms   -5.6%

Totals: 127ms 131ms -3.3% 128ms 133ms -4.1%

Repeating this gave me anything between 1.5% and 3.5% in total, with >2% for the small lists benchmark (which is the expected best case as slicing large lists obviously dominates the slice object creation).

IMHO, even 2% would be pretty good for such a small change.

Also, is the call to PyObject_INIT necessary?

In any case, the ref-count needs to be re-initialised to 1. A call to _Py_NewReference() would be enough, though, following the example in listobject.c. So you can replace

     PyObject_INIT(obj, &PySlice_Type);

     _Py_NewReference((PyObject *)obj);

in the patch. New patch attached.

msg127747 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2011-02-02 18:54

I ran the list tests in pybench and got this:

Test minimum run-time average run-time this other diff this other diff
ListSlicing:    66ms    67ms   -2.2%    67ms    68ms   -2.7%
 SmallLists:    61ms    64ms   -4.5%    61ms    65ms   -5.6%
Totals: 127ms 131ms -3.3% 128ms 133ms -4.1%

Repeating this gave me anything between 1.5% and 3.5% in total, with

2% for the small lists benchmark (which is the expected best case as slicing large lists obviously dominates the slice object creation).

IMHO, even 2% would be pretty good for such a small change.

Well, 3% on such micro-benchmarks (and, I assume, 0% on the rest) is generally considered very small. On the other hand, I agree the patch itself is quite simple.

by
     _Py_NewReference((PyObject *)obj);
in the patch. New patch attached.

Don't you also need a _Py_ForgetReference() at the other end? Or have I missed it?

msg127751 - (view)

Author: Stefan Behnel (scoder) * (Python committer)

Date: 2011-02-02 19:11

There's a "PyObject_Del(obj)" in all code paths.

msg127790 - (view)

Author: Stefan Behnel (scoder) * (Python committer)

Date: 2011-02-03 12:34

Here are some real micro benchmarks (note that the pybench benchmarks actually do lots of other stuff besides slicing):

base line:

$ ./python -m timeit -s 'l = list(range(100)); s=slice(None)' 'l[s]' 1000000 loops, best of 3: 0.464 usec per loop $ ./python -m timeit -s 'l = list(range(10)); s=slice(None)' 'l[s]' 10000000 loops, best of 3: 0.149 usec per loop $ ./python -m timeit -s 'l = list(range(10)); s=slice(None,1)' 'l[s]' 10000000 loops, best of 3: 0.135 usec per loop

patched:

$ ./python -m timeit -s 'l = list(range(100))' 'l[:1]' 10000000 loops, best of 3: 0.158 usec per loop $ ./python -m timeit -s 'l = list(range(100))' 'l[:]' 1000000 loops, best of 3: 0.49 usec per loop $ ./python -m timeit -s 'l = list(range(100))' 'l[1:]' 1000000 loops, best of 3: 0.487 usec per loop $ ./python -m timeit -s 'l = list(range(100))' 'l[1:3]' 10000000 loops, best of 3: 0.184 usec per loop

$ ./python -m timeit -s 'l = list(range(10))' 'l[:]' 10000000 loops, best of 3: 0.185 usec per loop $ ./python -m timeit -s 'l = list(range(10))' 'l[1:]' 10000000 loops, best of 3: 0.181 usec per loop

original:

$ ./python -m timeit -s 'l = list(range(100))' 'l[:1]' 10000000 loops, best of 3: 0.171 usec per loop $ ./python -m timeit -s 'l = list(range(100))' 'l[:]' 1000000 loops, best of 3: 0.499 usec per loop $ ./python -m timeit -s 'l = list(range(100))' 'l[1:]' 1000000 loops, best of 3: 0.509 usec per loop $ ./python -m timeit -s 'l = list(range(100))' 'l[1:3]' 10000000 loops, best of 3: 0.198 usec per loop

$ ./python -m timeit -s 'l = list(range(10))' 'l[:]' 10000000 loops, best of 3: 0.188 usec per loop $ ./python -m timeit -s 'l = list(range(10))' 'l[1:]' 1000000 loops, best of 3: 0.196 usec per loop

So the maximum impact seems to be 8% for very short slices (<10) and it quickly goes down for longer slices where the copy impact clearly dominates. There's still some 2% for 100 items, though.

I find it interesting that the base line is way below the other timings. That makes me think it's actually worth caching constant slice instances, as CPython already does for tuples. Cython also caches both now. I would expect that constant slices like [:], [1:] or [:-1] are extremely common. As you can see above, caching them could speed up slicing by up to 30% for short lists, and still some 7% for a list of length 100.

Stefan

msg127791 - (view)

Author: Stefan Behnel (scoder) * (Python committer)

Date: 2011-02-03 13:22

Here's another base line test: slicing an empty list

patched:

$ ./python -m timeit -s 'l = []' 'l[:]' 10000000 loops, best of 3: 0.0847 usec per loop

original:

$ ./python -m timeit -s 'l = []' 'l[:]' 10000000 loops, best of 3: 0.0977 usec per loop

That's about 13% less overhead.

msg127792 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2011-02-03 13:32

I find it interesting that the base line is way below the other timings. That makes me think it's actually worth caching constant slice instances, as CPython already does for tuples.

Indeed. I have never touched it, but I suppose it needs an upgrade of the marshal format to support slices. (of course, this will not help for other common cases such as l[x:x+2]).

msg127795 - (view)

Author: Stefan Behnel (scoder) * (Python committer)

Date: 2011-02-03 15:32

of course, this will not help for other common cases such as l[x:x+2]

... which is exactly what this slice caching patch is there for. ;-)

msg127796 - (view)

Author: Stefan Behnel (scoder) * (Python committer)

Date: 2011-02-03 15:50

A quick test against the py3k stdlib:

find -name ".py" | while read file; do egrep '[[-0-9]:[-0-9]*]' "$file"; done | wc -l

This finds 2096 lines in 393 files.

msg127805 - (view)

Author: Stefan Behnel (scoder) * (Python committer)

Date: 2011-02-03 19:26

Created follow-up issue 11107 for caching constant slice objects.

msg143728 - (view)

Author: Stefan Krah (skrah) * (Python committer)

Date: 2011-09-08 14:55

Kristján, could you check out the new implementation over at #10181? I have trouble reproducing a big speed difference between bytearray and memoryview (Linux, 64-bit). Here are the timings I get for the current and the new version:

Slicing

./python -m timeit -n 10000000 -s "x = bytearray(b'x'*10000)" "x[:100]"
./python -m timeit -n 10000000 -s "x = memoryview(bytearray(b'x'*10000))" "x[:100]"
cpython: 0.137 usec pep-3118: 0.138 usec
cpython: 0.132 usec pep-3118: 0.132 usec

Slicing with overhead for multidimensional capabilities:

./python -m timeit -n 10000000 -s "import _testbuffer; x = _testbuffer.ndarray([ord('x') for _ in range(10000)], shape=[10000])" "x[:100]"
./python -m timeit -n 10000000 -s "import numpy; x = numpy.ndarray(buffer=bytearray(b'x'*10000), shape=[10000], dtype='B')" "x[:100]"
_testbuffer.c: 0.198 usec
numpy: 0.415 usec Slice assignment

./python -m timeit -n 10000000 -s "x = bytearray(b'x'*10000)" "x[5:10] = x[7:12]"
./python -m timeit -n 10000000 -s "x = memoryview(bytearray(b'x'*10000))" "x[5:10] = x[7:12]"
cpython: 0.242 usec pep-3118: 0.240 usec
cpython: 0.282 usec pep-3118: 0.287 usec

Slice assignment, overhead for multidimensional capabilities

./python -m timeit -n 10000000 -s "import _testbuffer; x = _testbuffer.ndarray([ord('x') for _ in range(10000)], shape=[10000], flags=_testbuffer.ND_WRITABLE)" "x[5:10] = x[7:12]"
./python -m timeit -n 10000000 -s "import numpy; x = numpy.ndarray(buffer=bytearray(b'x'*10000), shape=[10000], dtype='B')" "x[5:10] = x[7:12]"

_testbuffer.c: 0.469 usec numpy: 1.37 usec

tolist

./python -m timeit -n 10000 -s "import array; x = array.array('B', b'x'*10000)" "x.tolist()"
./python -m timeit -n 10000 -s "x = memoryview(bytearray(b'x'*10000))" "x.tolist()"
cpython, array: 104.0 usec
pep-3118, memoryview: 90.5 usec

tolist, struct module overhead

./python -m timeit -n 10000 -s "import _testbuffer; x = _testbuffer.ndarray([ord('x') for _ in range(10000)], shape=[10000])" "x.tolist()"
./python -m timeit -n 10000 -s "import numpy; x = numpy.ndarray(buffer=bytearray(b'x'*10000), shape=[10000], dtype='B')" "x.tolist()"

_testbuffer.c: 1.38 msec (yes, that's microseconds!) numpy: 104 usec

msg143731 - (view)

Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer)

Date: 2011-09-08 15:26

I'm afraid I had put this matter far out of my head :) Seeing the amount of discussion on that other defect (stuff I had already come across and scrathced my head over) I think there is a lot of catching up that I'd need to do and I am unable to give this any priority at the moment. My original patch sought to even out the slicing performance difference between bytes and bytearray. bytes objects were very streamlined while other were not.

python.exe -m timeit -n 10000000 -s "x = ((b'x'*10000))" "x[:100]" 10000000 loops, best of 3: 0.125 usec per loop

python.exe -m timeit -n 10000000 -s "x = (bytearray(b'x'*10000))" "x[:100]" 10000000 loops, best of 3: 0.202 usec per loop

Did you take a look at this at all?

msg143732 - (view)

Author: Stefan Krah (skrah) * (Python committer)

Date: 2011-09-08 16:03

I see. I thought this was mainly about memoryview performance, so I did not specifically look at bytearray. The poor performance seems to be Windows specific:

C:\Users\stefan\hg\pep-3118\PCbuild>amd64\python.exe -m timeit -n 10000000 -s "x = ((b'x'*10000))" "x[:100]" 10000000 loops, best of 3: 0.118 usec per loop

C:\Users\stefan\hg\pep-3118\PCbuild>amd64\python.exe -m timeit -n 10000000 -s "x = (bytearray(b'x'*10000))" "x[:100]" 10000000 loops, best of 3: 0.191 usec per loop

C:\Users\stefan\hg\pep-3118\PCbuild>amd64\python.exe -m timeit -n 10000000 -s "x = memoryview(bytearray(b'x'*10000))" "x[:100]" 10000000 loops, best of 3: 0.146 usec per loop

Linux:

bytes: 10.9 usec bytearray: 0.14 usec memoryview: 0.14 usec

msg143733 - (view)

Author: Stefan Krah (skrah) * (Python committer)

Date: 2011-09-08 16:28

With Stefan Behnel's slice-object-cache.patch, I get this (PEP-3118 branch):

Linux: bytes: 0.097 usec bytearray: 0.127 usec memoryview: 0.12 usec Windows: bytes: 0.11 usec bytearray: 0,184 usec memoryview: 0.139 usec

On Linux, that's quite a nice speedup.

msg147902 - (view)

Author: Stefan Behnel (scoder) * (Python committer)

Date: 2011-11-18 17:05

Updated single slice caching patch for latest Py3.3 hg tip.

msg147914 - (view)

Author: Roundup Robot (python-dev) (Python triager)

Date: 2011-11-18 19:23

New changeset fa2f8dd077e0 by Antoine Pitrou in branch 'default': Issue #10227: Add an allocation cache for a single slice object. http://hg.python.org/cpython/rev/fa2f8dd077e0

msg147915 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2011-11-18 19:24

Thanks Stefan. I'm leaving the issue open since the original topic is a bit different.

msg153083 - (view)

Author: Stefan Krah (skrah) * (Python committer)

Date: 2012-02-11 00:31

Kristján, I ran the benchmarks from http://bugs.python.org/issue10227#msg143731 in the current cpython and pep-3118 repos. In both cases the differences between Linux and Windows are far less pronounced than they used to be. All benchmarks were run with the x64 builds.

I also ran the profile guided optimization build for Visual Studio. The results are equal to (or better than) the non-pgo gcc results. In my experience Visual Studio relies heavily on PGO for x64 builds. The default optimizer is just not as good as gcc's.

If you can reproduce similar results, I think we can close this issue.

./python -m timeit -n 10000000 -s "x = ((b'x'*10000))" "x[:100]"

linux-cpython (4244e4348362): 0.102 usec linux-pep-3118 (memoryview:534f6bbe5422): 0.098 usec

windows-cpython: 0.109 usec windows-pep-3118: 0.112 usec usec windows-pep-3118-pgo: 0.103 usec

./python -m timeit -n 10000000 -s "x = (bytearray(b'x'*10000))" "x[:100]"

linux-cpython (4244e4348362): 0.107 usec linux-pep-3118 (memoryview:534f6bbe5422): 0.109 usec

windows-cpython: 0.127 usec windows-pep-3118: 0.128 usec windows-pep-3118-pgo: 0.106 usec

./python -m timeit -n 10000000 -s "x = memoryview(bytearray(b'x'*10000))" "x[:100]"

linux-cpython (4244e4348362): 0.127 usec linux-pep-3118 (memoryview:534f6bbe5422): 0.12 usec

windows-cpython: 0.145 usec windows-pep-3118: 0.14 usec windows-pep-3118-pgo: 0.0984 usec

msg153277 - (view)

Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer)

Date: 2012-02-13 15:31

Sure. Flagging this as fixed. Can´t close it until 10181 is closed due to some dependency thing. (perhaps someone else knows what to do?)

msg153279 - (view)

Author: Stefan Krah (skrah) * (Python committer)

Date: 2012-02-13 15:39

Great. I removed the dependency since it's fixed in both cpython and pep-3118.

History

Date

User

Action

Args

2022-04-11 14:57:08

admin

set

github: 54436

2012-02-13 15:39:53

skrah

set

status: open -> closed

dependencies: - Problems with Py_buffer management in memoryobject.c (and elsewhere?)
messages: +
stage: resolved

2012-02-13 15:31:20

kristjan.jonsson

set

resolution: fixed
messages: +

2012-02-11 00:31:30

skrah

set

messages: +

2011-11-18 19:24:27

pitrou

set

messages: +

2011-11-18 19:23:02

python-dev

set

nosy: + python-dev
messages: +

2011-11-18 17:05:09

scoder

set

files: + slice-object-cache.patch

messages: +

2011-09-08 16:28:12

skrah

set

messages: +

2011-09-08 16:03:32

skrah

set

messages: +

2011-09-08 15:26:58

kristjan.jonsson

set

messages: +

2011-09-08 14:57:25

skrah

set

dependencies: + Problems with Py_buffer management in memoryobject.c (and elsewhere?)

2011-09-08 14:55:20

skrah

set

nosy: + skrah
messages: +

2011-06-25 09:47:11

mark.dickinson

set

assignee: mark.dickinson ->

2011-02-03 19:26:34

scoder

set

nosy:mark.dickinson, pitrou, scoder, kristjan.jonsson
messages: +

2011-02-03 15:50:45

scoder

set

nosy:mark.dickinson, pitrou, scoder, kristjan.jonsson
messages: +

2011-02-03 15:32:14

scoder

set

nosy:mark.dickinson, pitrou, scoder, kristjan.jonsson
messages: +

2011-02-03 13:32:50

pitrou

set

nosy:mark.dickinson, pitrou, scoder, kristjan.jonsson
messages: +

2011-02-03 13:22:01

scoder

set

nosy:mark.dickinson, pitrou, scoder, kristjan.jonsson
messages: +

2011-02-03 12:34:06

scoder

set

nosy:mark.dickinson, pitrou, scoder, kristjan.jonsson
messages: +

2011-02-02 19:11:55

scoder

set

nosy:mark.dickinson, pitrou, scoder, kristjan.jonsson
messages: +

2011-02-02 18:54:15

pitrou

set

nosy:mark.dickinson, pitrou, scoder, kristjan.jonsson
messages: +

2011-02-02 18:48:34

scoder

set

files: + slice-object-cache.patch
nosy:mark.dickinson, pitrou, scoder, kristjan.jonsson
messages: +

2011-02-02 14:50:48

pitrou

set

nosy:mark.dickinson, pitrou, scoder, kristjan.jonsson
messages: +

2011-02-01 12:42:37

scoder

set

files: + slice-object-cache.patch
nosy:mark.dickinson, pitrou, scoder, kristjan.jonsson
messages: +

2011-01-03 20:14:25

pitrou

set

assignee: mark.dickinson

nosy: + mark.dickinson
versions: + Python 3.3, - Python 3.2

2010-11-01 09:52:05

kristjan.jonsson

set

messages: +

2010-11-01 09:51:03

kristjan.jonsson

set

messages: +

2010-11-01 09🔞16

scoder

set

nosy: + scoder
messages: +

2010-10-29 10:57:22

kristjan.jonsson

set

messages: +

2010-10-29 10:50:43

pitrou

set

messages: +

2010-10-29 10:48:37

kristjan.jonsson

set

messages: +

2010-10-29 10:23:51

pitrou

set

messages: +

2010-10-29 10:20:06

kristjan.jonsson

set

messages: +

2010-10-29 10:12:51

pitrou

set

messages: +

2010-10-29 10:04:35

kristjan.jonsson

set

messages: +

2010-10-29 09:50:30

kristjan.jonsson

set

files: + memoryobj.patch

messages: +

2010-10-29 09:14:30

pitrou

set

messages: +

2010-10-29 08:50:27

kristjan.jonsson

create