Issue 2459: speedup for / while / if with better bytecode (original) (raw)

Created on 2008-03-22 22:25 by pitrou, last changed 2022-04-11 14:56 by admin.

Messages (31)

msg64343 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-03-22 22:28

This is a preliminary patch to speedup for and while loops (it will also be able to speedup list comps and gen comps, if I get to do it). The patch works by putting the loop condition test at the end of loop, which allows removing one JUMP_ABSOLUTE byte code out of the critical path.

For this two new opcodes are introduced:

Some micro-benchmarks:

./python -m timeit "for x in xrange(10000): pass" Before: 1000 loops, best of 3: 782 usec per loop After: 1000 loops, best of 3: 412 usec per loop

./python -m timeit "x=100" "while x: x -= 1" Before: 10000 loops, best of 3: 22.1 usec per loop After: 100000 loops, best of 3: 16.6 usec per loop

./python Tools/pybench/pybench.py -t ForLoops Before: 365ms per round After: 234ms per round

Also, pystone gets 5% faster (from 43300 to 45800).

Now for the less shiny things:

  1. I'm having problems modifying the pure Python compiler module. For some reasons it seems to have stricter requirements than compile.c itself does (!); I get some assertion error in compiler.pyassem.FlowGraph.fixupOrderHonorNext as soon as a loop gets non-trivial.

  2. Line numbers probably need to be fixed. The lnotab format may even have to be adapted in order to accomodate non-monotonically increasing line numbers.

Is there some interest in this patch? If yes, I'd like to have your input on the two bullet points above :)

msg64359 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-03-23 14:18

This new patch includes surgery to the compiler package (especially flowgraph trickery) in order to make it work with the new opcodes. I think my changes are sane but since the package seems basically untested, unmaintained and almost unused, there may be some glitches. However, test_compiler passes.

(test_dis will need to be updated for the new opcodes, not a big deal)

msg64364 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-03-23 14:44

Removed latest patch, it was half-baked.

msg64367 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-03-23 15:51

This new patch should be ok. The block ordering algorithm in compiler.pyassem looks entirely clean now, to the extent that the previous "fixup" hacks have been disabled.

Attaching loops3.py.

msg64383 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-03-23 23:02

loops4.patch adds a mechanism to avoid blocking signal catching in empty loops (such as "for x in it: pass" or "while x: pass"). Much of the speedup is still retained.

./python -m timeit "for x in xrange(10000): pass" Before: 1000 loops, best of 3: 737 usec per loop After: 1000 loops, best of 3: 438 usec per loop

./python -m timeit "x=100" "while x: x -= 1" Before: 10000 loops, best of 3: 21.7 usec per loop After: 100000 loops, best of 3: 16.6 usec per loop

./python Tools/pybench/pybench.py -t ForLoops Before: 364ms per round After: 242ms per round

msg64506 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-03-25 19:19

By the way, the compiler package fix has been isolated and cleaned up as part of #2472. Maybe it would be nice to start with that one.

msg64537 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-03-26 02:09

This new patch also updates the code generation for list comprehensions. Here are some micro-benchmarks:

./python -m timeit -s "l = range(100)" "[x for x in l]" Before: 100000 loops, best of 3: 14.9 usec per loop After: 100000 loops, best of 3: 12.5 usec per loop

./python -m timeit -s "l = range(100)" "[x for x in l if x]" Before: 10000 loops, best of 3: 24.1 usec per loop After: 100000 loops, best of 3: 18.8 usec per loop

./python -m timeit -s "l = range(100)" "[x for x in l if not x]" Before: 100000 loops, best of 3: 15.5 usec per loop After: 100000 loops, best of 3: 11.9 usec per loop

Please note that this patch is orthogonal with Neal's patch in #2183, so the two combined should be quite interesting for the list comprehensions bytecode.

msg64572 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-03-26 23:09

This new patch completes the bytecode modifications. For/while loops as well as list comprehensions and generator expressions are a bit faster now. Also, as a side effect I've introduced a speed improvement for "if" statements and expressions...

Some micro-benchmarks (completing the ones already given above):

./python Tools/pybench/pybench.py -t IfThenElse Before: 167ms per round After: 136ms per round

./python -m timeit -s "y=range(100)" "sum(x for x in y)" Before: 10000 loops, best of 3: 20.4 usec per loop After: 100000 loops, best of 3: 17.9 usec per loop

./python -m timeit -s "y=range(100)" "sum(x for x in y if x)" Before: 10000 loops, best of 3: 28.5 usec per loop After: 10000 loops, best of 3: 23.3 usec per loop

./python -m timeit -s "y=range(100)" "sum(x for x in y if not x)" Before: 100000 loops, best of 3: 16.4 usec per loop After: 100000 loops, best of 3: 12.1 usec per loop

./python -m timeit -s "x,y,z=1,2,3" "x if y else z" Before: 1000000 loops, best of 3: 0.218 usec per loop After: 10000000 loops, best of 3: 0.159 usec per loop

A couple of tests seem to be failing in obscure ways in the test suite, I'll try to examine them. Most of the test suite runs fine though.

msg64573 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-03-26 23:43

Ok, the fix for the bizarre failures was really simple. Now the only failing tests are in test_trace (because it makes assumptions regarding the bytecode that aren't true anymore, I'll have to adapt the tests).

msg64577 - (view)

Author: Neal Norwitz (nnorwitz) * (Python committer)

Date: 2008-03-27 05:04

Antoine, I hope to look at this patch eventually. Unfortunately, there are too many build/test problems that need to be resolved before the next release. If you can help out with those, I will be able to review this patch sooner.

msg64599 - (view)

Author: Armin Rigo (arigo) * (Python committer)

Date: 2008-03-27 16:21

Can you see if this simpler patch also gives speed-ups? (predict_loop.diff)

msg64603 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-03-27 16:56

Armin, your patch gives a speed-up for "for" loops and comprehensions, although a bit less. Also, it doesn't speed up "while" loops and "if" statements at all. For some reasons it also appears to make pystone a bit slower. Here are some micro-benchmarks:

./python -m timeit "for x in xrange(10000): pass" Before: 1000 loops, best of 3: 758 usec per loop After: 1000 loops, best of 3: 483 usec per loop

./python -m timeit "x=100" "while x: x -= 1" Before: 10000 loops, best of 3: 21.8 usec per loop After: 10000 loops, best of 3: 21.6 usec per loop

./python -m timeit -s "l = range(100)" "[x for x in l]" Before: 100000 loops, best of 3: 14.9 usec per loop After: 100000 loops, best of 3: 13.3 usec per loop

./python -m timeit -s "l = range(100)" "[x for x in l if x]" Before: 10000 loops, best of 3: 23.9 usec per loop After: 10000 loops, best of 3: 22.3 usec per loop

./python -m timeit -s "l = range(100)" "[x for x in l if not x]" Before: 100000 loops, best of 3: 15.8 usec per loop After: 100000 loops, best of 3: 13.9 usec per loop

./python Tools/pybench/pybench.py -t IfThenElse Before: 164ms per round After: 166ms per round

msg66027 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-04-30 21:46

Finally I had to slightly change the lnotab format to have the right tracing semantics: the change is that line number increments are now signed bytes rather than unsigned.

Still, there is a small change in tracing behaviour (see test_trace.py): the for statement is traced twice in the first loop iteration, that's because the iterator object is retrieved (GET_ITER) at the beginning of the loop while the next iterator value is fetched (FOR_ITER) at the end of the loop. I don't believe this is very painful.

All in all, the whole test suite now passes fine. The performance numbers are the same as before.

msg68687 - (view)

Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) *

Date: 2008-06-24 13:21

A pointer to previous (minor) research:

http://groups.google.com/group/comp.lang.python/browse_frm/thread/72505e3cb6d9cb1a/e486759f06ec4ee5

esp. after Terry Reedy's post

msg80591 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2009-01-26 21:09

Reminder, make sure we can still break out of a "while 1: pass".

msg80593 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-01-26 21:18

Reminder, make sure we can still break out of a "while 1: pass".

Yes, the patch takes care of that.

msg80598 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-01-26 23:08

The patches don't apply cleanly anymore, I'll regenerate a new one.

msg80995 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-02-02 20:36

Here is an updated patch against trunk. To be honest I don't think it has a lot of future:

msg81029 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2009-02-03 07:09

I would like to see this go forward. It looks promising.

msg81958 - (view)

Author: Collin Winter (collinwinter) * (Python committer)

Date: 2009-02-13 18:13

I don't see the changes to the lnotab format being a roadblock; just mention it in NEWS. Likewise, the pure-Python compiler package shouldn't be a high priority; your changes to that package look good enough.

I'm seeing encouraging speed-ups out of this (with gcc 4.3.1 x86_64, compiling Python as 64-bit): Django templates (render a 150x150 table 100 times): Min: 0.595 -> 0.589: 0.94% faster Avg: 0.599 -> 0.591: 1.30% faster

Spitfire templates (render a 1000x1000 table 100 times): Min: 0.751 -> 0.729: 2.98% faster Avg: 0.753 -> 0.730: 3.09% faster

None of the apps I've benchmarked are negatively impacted. I only have two minor comments. Please commit this.

Review comments:

msg81962 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-02-13 18:37

Hello Collin,

Thanks for taking a look.

I don't see the changes to the lnotab format being a roadblock; just mention it in NEWS. Likewise, the pure-Python compiler package shouldn't be a high priority; your changes to that package look good enough.

Well, I have good news: the fixes to the pure Python compiler have been accepted and committed by Neil Schemenauer in r69373.

I'm seeing encouraging speed-ups out of this (with gcc 4.3.1 x86_64, compiling Python as 64-bit): Django templates (render a 150x150 table 100 times): Min: 0.595 -> 0.589: 0.94% faster Avg: 0.599 -> 0.591: 1.30% faster

Spitfire templates (render a 1000x1000 table 100 times): Min: 0.751 -> 0.729: 2.98% faster Avg: 0.753 -> 0.730: 3.09% faster

Not a tremendous speedup but not totally insignificant either. (I see you like Spitfire :-))

None of the apps I've benchmarked are negatively impacted. I only have two minor comments. Please commit this.

Before committing I want to know what to do with the new jump opcodes, with respect to the alternative proposal I've made in #4715. Ideally, I should fold the #4715 patch back into the present patch, since I think the #4715 approach is more thought out.

msg81976 - (view)

Author: Collin Winter (collinwinter) * (Python committer)

Date: 2009-02-13 23:23

On Fri, Feb 13, 2009 at 10:37 AM, Antoine Pitrou <report@bugs.python.org> wrote:

Antoine Pitrou <pitrou@free.fr> added the comment:

Hello Collin,

Thanks for taking a look.

I don't see the changes to the lnotab format being a roadblock; just mention it in NEWS. Likewise, the pure-Python compiler package shouldn't be a high priority; your changes to that package look good enough.

Well, I have good news: the fixes to the pure Python compiler have been accepted and committed by Neil Schemenauer in r69373.

Yeah, I saw that. Fantastic.

I'm seeing encouraging speed-ups out of this (with gcc 4.3.1 x86_64, compiling Python as 64-bit): Django templates (render a 150x150 table 100 times): Min: 0.595 -> 0.589: 0.94% faster Avg: 0.599 -> 0.591: 1.30% faster

Spitfire templates (render a 1000x1000 table 100 times): Min: 0.751 -> 0.729: 2.98% faster Avg: 0.753 -> 0.730: 3.09% faster

Not a tremendous speedup but not totally insignificant either. (I see you like Spitfire :-))

Well, Spitfire and Django represent very different ways of implementing a template system, so I like to measure both when doing application benchmarks. Template systems tend to be heavy CPU consumers for webapps, which is why I include them.

None of the apps I've benchmarked are negatively impacted. I only have two minor comments. Please commit this.

Before committing I want to know what to do with the new jump opcodes, with respect to the alternative proposal I've made in #4715. Ideally, I should fold the #4715 patch back into the present patch, since I think the #4715 approach is more thought out.

That sounds good, especially since Jeffrey and I have already reviewed #4715.

msg82556 - (view)

Author: Collin Winter (collinwinter) * (Python committer)

Date: 2009-02-21 00:30

On Fri, Feb 13, 2009 at 3:23 PM, Collin Winter <collinw@gmail.com> wrote:

On Fri, Feb 13, 2009 at 10:37 AM, Antoine Pitrou <report@bugs.python.org> wrote:

Before committing I want to know what to do with the new jump opcodes, with respect to the alternative proposal I've made in #4715. Ideally, I should fold the #4715 patch back into the present patch, since I think the #4715 approach is more thought out.

That sounds good, especially since Jeffrey and I have already reviewed #4715.

If you don't have the bandwidth to integrate 4715 into this patch, I can do that, if you want.

Collin

msg82557 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-02-21 01:37

If you don't have the bandwidth to integrate 4715 into this patch, I can do that, if you want.

Collin, that would be very nice from you. You could also apply Jeffrey's naming suggestions in #4715.

Thanks!

Antoine.

msg83003 - (view)

Author: Jeffrey Yasskin (jyasskin) * (Python committer)

Date: 2009-03-02 06:11

I've updated for_iter.patch to the latest trunk, merging in issue 4715. I also changed tracing a bit so that the first line of a loop doesn't get traced twice in the first iteration, and added to test_dis to check that decreasing line numbers work there.

Review at http://codereview.appspot.com/20103 if you like.

Performance:

32-bit gcc-4.3 Intel Core2: 2to3: 25.09 -> 24.63: 1.87% faster

Django: Min: 0.614 -> 0.609: 0.86% faster Avg: 0.616 -> 0.635: 3.09% slower

Pickle: (cPickle) Min: 1.647 -> 1.651: 0.21% slower Avg: 1.650 -> 1.656: 0.39% slower

PyBench: Min: 5341 -> 5273: 1.29% faster Avg: 5450 -> 5397: 0.98% faster

SlowPickle: (pickle) Min: 1.384 -> 1.341: 3.13% faster Avg: 1.385 -> 1.343: 3.08% faster

Spitfire: Min: 0.773 -> 0.690: 11.97% faster Avg: 0.776 -> 0.695: 11.62% faster Spitfire re-run: Min: 0.740 -> 0.693: 6.81% faster Avg: 0.744 -> 0.695: 6.93% faster

SlowUnpickle: (pickle) Min: 0.645 -> 0.668: 3.37% slower Avg: 0.646 -> 0.670: 3.59% slower SlowUnpickle re-run: Min: 0.645 -> 0.660: 2.31% slower Avg: 0.645 -> 0.661: 2.32% slower

Unpickle: (cPickle) Min: 1.015 -> 1.006: 0.89% faster Avg: 1.021 -> 1.009: 1.16% faster

64-bit gcc-4.3 Intel Core2 2to3: 22.31 -> 21.97: 1.57% faster

Django: Min: 0.577 -> 0.564: 2.29% faster Avg: 0.579 -> 0.566: 2.20% faster

Pickle: Min: 1.162 -> 1.178: 1.35% slower Avg: 1.166 -> 1.183: 1.37% slower

PyBench: Min: 4498 -> 4193: 7.27% faster Avg: 4586 -> 4276: 7.25% faster

SlowPickle: Min: 1.212 -> 1.133: 6.92% faster Avg: 1.213 -> 1.135: 6.92% faster

Spitfire: Min: 0.631 -> 0.617: 2.32% faster Avg: 0.632 -> 0.621: 1.75% faster

SlowUnpickle: Min: 0.575 -> 0.551: 4.31% faster Avg: 0.576 -> 0.553: 4.14% faster

Unpickle: Min: 0.708 -> 0.722: 1.88% slower Avg: 0.745 -> 0.736: 1.20% faster

msg83012 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-03-02 11:52

I've updated for_iter.patch to the latest trunk, merging in issue 4715. I also changed tracing a bit so that the first line of a loop doesn't get traced twice in the first iteration, and added to test_dis to check that decreasing line numbers work there.

Thanks a lot!

By the way, why do you bench cPickle? Does your test call Python code significantly?

Overall, the results look positive although not overwhelming.

msg83022 - (view)

Author: Jeffrey Yasskin (jyasskin) * (Python committer)

Date: 2009-03-02 16:24

No particular reason for cPickle. It sometimes shows when we've caused problems by increasing the code size, and shows the size of any random effects that the compiler causes by moving code around. Could you double-check the patch to see if I did anything dumb?

msg83088 - (view)

Author: Jeffrey Yasskin (jyasskin) * (Python committer)

Date: 2009-03-03 19:45

Hold off on reviewing this. There's one bug around the peepholer not turning itself off when line numbers skip by more than 127, and another around the traceback generator still assuming line numbers are unsigned. I'll post another patch when those are fixed and tested. I may not get to it for a day or two, so someone else is welcome to update the patch instead. :)

msg106883 - (view)

Author: Dirkjan Ochtman (djc) * (Python committer)

Date: 2010-06-02 14:17

Is this still relevant / will it get some love in the future?

msg192575 - (view)

Author: Christian Heimes (christian.heimes) * (Python committer)

Date: 2013-07-07 16:13

Is this enhancement still relevant?

msg221249 - (view)

Author: Mark Lawrence (BreamoreBoy) *

Date: 2014-06-22 12:04

As a lot of work has gone into this it saddens me to see it languishing. Surely if Python performance is to be improved the bytecode for conditionals and loops is one of the places if not the place to do it? Are there any names missing from the nosy list that ought to be there?

History

Date

User

Action

Args

2022-04-11 14:56:32

admin

set

github: 46711

2014-06-22 12:04:07

BreamoreBoy

set

nosy: + BreamoreBoy
messages: +

2013-07-07 16:13:01

christian.heimes

set

status: open -> languishing
versions: + Python 3.3, Python 3.4, - Python 3.1
nosy: + christian.heimes

messages: +

2010-09-02 08:58:17

giampaolo.rodola

set

nosy: + giampaolo.rodola

2010-09-01 21:36:01

stutzbach

set

nosy: + stutzbach

2010-06-02 14:17:02

djc

set

nosy: + djc
messages: +

2009-03-03 19:45:30

jyasskin

set

assignee: pitrou -> jyasskin
messages: +

2009-03-02 16:24:41

jyasskin

set

messages: +

2009-03-02 11:52:52

pitrou

set

messages: +

2009-03-02 06:11:51

jyasskin

set

files: + trunk-opt-loop.patch
assignee: pitrou
messages: +

2009-02-21 01:37:59

pitrou

set

messages: +

2009-02-21 00:30:58

collinwinter

set

messages: +

2009-02-13 23:23:10

collinwinter

set

messages: +

2009-02-13 18:37:24

pitrou

set

messages: +

2009-02-13 18:13:58

collinwinter

set

messages: +

2009-02-03 08:59:39

arigo

set

nosy: - arigo

2009-02-03 07:09:31

rhettinger

set

messages: +

2009-02-02 20:37:02

pitrou

set

files: + for_iter.patch
messages: +

2009-01-26 23:09:02

pitrou

set

files: - loops3.patch

2009-01-26 23:08:59

pitrou

set

files: - loops4.patch

2009-01-26 23:08:55

pitrou

set

files: - loops5.patch

2009-01-26 23:08:52

pitrou

set

files: - loops6.patch

2009-01-26 23:08:48

pitrou

set

files: - loops7.patch

2009-01-26 23:08:44

pitrou

set

files: - loops8.patch

2009-01-26 23:08:37

pitrou

set

messages: +

2009-01-26 21🔞40

pitrou

set

messages: +

2009-01-26 21:09:21

rhettinger

set

nosy: + rhettinger
messages: +
versions: + Python 3.1, Python 2.7, - Python 2.6

2009-01-26 18:59:00

collinwinter

set

nosy: + collinwinter

2008-06-24 13:21:16

tzot

set

nosy: + tzot
messages: +

2008-04-30 21:46:31

pitrou

set

files: + loops8.patch
messages: +
title: speedup loops with better bytecode -> speedup for / while / if with better bytecode

2008-04-21 16:13:34

jcea

set

nosy: + jcea

2008-03-28 04:22:33

jyasskin

set

nosy: + jyasskin

2008-03-27 20:32:36

lauromoura

set

nosy: + lauromoura

2008-03-27 16:56:14

pitrou

set

messages: +

2008-03-27 16:21:48

arigo

set

files: + predict_loop.diff
nosy: + arigo
messages: +

2008-03-27 07:03:19

phsilva

set

nosy: + phsilva

2008-03-27 05:04:32

nnorwitz

set

nosy: + nnorwitz
messages: +

2008-03-26 23:43:06

pitrou

set

files: + loops7.patch
messages: +

2008-03-26 23:09:31

pitrou

set

files: + loops6.patch
messages: +

2008-03-26 02:09:51

pitrou

set

files: + loops5.patch
messages: +

2008-03-25 19:19:25

pitrou

set

messages: +

2008-03-25 18:21:20

fdrake

set

messages: -

2008-03-25 18:21:12

fdrake

set

messages: -

2008-03-25 17:59:21

gregory.p.smith

set

nosy: + gregory.p.smith

2008-03-23 23:02:41

pitrou

set

files: + loops4.patch
messages: +

2008-03-23 22:07:37

pitrou

set

files: - loops2.patch

2008-03-23 15:51:03

pitrou

set

files: + loops3.patch
messages: +

2008-03-23 14:44:38

pitrou

set

messages: +

2008-03-23 14:44:14

pitrou

set

files: - loops3.patch

2008-03-23 14🔞12

pitrou

set

files: + loops3.patch
messages: +

2008-03-22 23:36:41

pitrou

set

files: - loops.patch

2008-03-22 23:36:36

pitrou

set

files: + loops2.patch

2008-03-22 22:45:04

pitrou

set

messages: +

2008-03-22 22:28:28

pitrou

set

messages: +

2008-03-22 22:25:04

pitrou

create