Issue 604716: faster [None]*n or []*n
Created on 2002-09-04 20:20 by gregsmith, last changed 2022-04-10 16:05 by admin. This issue is now closed.
Messages (14)
msg12265 - Author: Gregory Smith (gregsmith) | Date: 2002-09-04 20:20
This is a suggestion to speed up multiplying a list by an integer, in the case where the length of the list is 0 or 1. Currently there is a lot of overhead in the loop for these cases. The current loop (in list_repeat, in listobject.c):

    p = np->ob_item;
    for (i = 0; i < n; i++) {
        for (j = 0; j < a->ob_size; j++) {
            *p = a->ob_item[j];
            Py_INCREF(*p);
            p++;
        }
    }
    return (PyObject *) np;

Suggested (where Py_INCREF_N(o,n) bumps the refcount of 'o' by n):

    p = np->ob_item;
    if (a->ob_size <= 1) {
        if (a->ob_size) {    /* is 1 */
            PyObject *a0 = a->ob_item[0];
            for (i = 0; i < n; i++) {
                *p++ = a0;
            }
            Py_INCREF_N(a0, n);
        }
    } else {
        for (i = 0; i < n; i++) {
            for (j = 0; j < a->ob_size; j++) {
                *p = a->ob_item[j];
                Py_INCREF(*p);
                p++;
            }
        }
    }
    return (PyObject *) np;

You could also do special cases for len=2, etc. (with *p++ = a0; *p++ = a1; inside), but len=1 is a common case, with things like [None]*1000 used to create lists. The best approach in general is to check whether the list being replicated is sufficiently smaller than the replication count; if so, use a different set of loops nested in the other order, so that the outer loop runs less often. With the inverted loops, you make 'a->ob_size' calls to Py_INCREF_N instead of 'a->ob_size*n' calls to Py_INCREF. Exactly the same comments apply to tuplerepeat(). If there is any interest, I will code this and test it.
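The equivalence that makes Py_INCREF_N safe here can be checked from Python with sys.getrefcount: [x]*n must raise x's refcount by exactly n, whether that is done with n individual Py_INCREF calls (the old inner loop) or one bump of n. A small CPython-specific illustration:

```python
import sys

a = object()
before = sys.getrefcount(a)

# [a] * n ends up incrementing a's refcount by n, regardless of
# whether the C level does n increments or a single += n.
n = 1000
lst = [a] * n
after = sys.getrefcount(a)

assert after - before == n              # one reference per new slot
assert all(x is a for x in lst)         # every slot aliases the same object
```

This is why the special case only needs one refcount operation for the whole result list.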
msg12266 - Author: Raymond Hettinger (rhettinger) | Date: 2002-09-05 01:45
Here's a faster approach that doesn't involve a special case:

    for (i = 0; i < n; i++)
        memcpy(p + i*n, p, size);
    for (j = 0; j < a->ob_size; j++) {
        Py_INCREF_N(*p, n);
        p++;
    }
msg12267 - Author: Martin v. Löwis (loewis) | Date: 2002-09-06 09:53
Can you report what speed-up you get, for what example?
msg12268 - Author: Raymond Hettinger (rhettinger) | Date: 2002-09-06 18:38
There was a typo in my code fragment. It should have read:

    memcpy(p + i*n, a->ob_item, a->ob_size * sizeof(PyObject *));

The memcpy speeds up cases where there are many items in the list. It is actually slower for a list of length one, which should still be special-cased as the OP suggested. Also, his idea for adding Py_INCREF_N is great. It can be used throughout Python. Since Py_INCREF has lots of macro magic for updating debug statistics, the OP's idea may have to be implemented as a function. I think there is interest in the OP's ideas and that he should go ahead and develop a tested patch along with timing statistics.
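At the Python level, this block-copy-per-repetition approach amounts to filling a preallocated result with one slice assignment per copy, with the destination offset advancing by the source length each time (note that the destination stride must be the source length, not n). A hypothetical sketch of the idea:

```python
def repeat_by_blocks(src, n):
    """Replicate src n times by copying it block by block,
    analogous to one memcpy per repetition in C."""
    m = len(src)
    out = [None] * (m * n)              # preallocate, like PyList_New(size)
    for i in range(n):
        out[i*m:(i+1)*m] = src          # one block copy per repetition
    return out

assert repeat_by_blocks([1, 2, 3], 4) == [1, 2, 3] * 4
assert repeat_by_blocks([], 100000) == []
```

As the message notes, in C each block copy has fixed per-call overhead, so this only wins when the source list has many items.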
msg12269 - Author: Gregory Smith (gregsmith) | Date: 2002-09-06 18:41
[sound effects: can opener, worms...] ;-) Use of memcpy/memset is a whole other issue; see below.

What you've got there doesn't help for the case I mention. If you do []*100000, you get 100000 calls to memcpy with size = 0, and if you do [None]*100000, 100000 calls to memcpy with size = 4; this could easily wind up being even slower. The point is to reduce the number of times the inner loop quits and the outer loop runs; the actual code in the inner loop will run the same number of times regardless (although *p = regvar is faster than *p = ptr->a[j]); it's the loop overhead (instruction queue flushes, etc.) that makes a difference here. Bear in mind too that 'j < a->ob_size' will load a->ob_size on each loop if the compiler can't be sure you didn't write it via *p [yes, I've spent too much time looking at C compiler output; real-time DSP applications, mostly]. Many compilers do intrinsic memcpy so that there's no call, but since the length of the copy is not fixed, code still needs to be generated to test the length, and you wind up being no better than the original for short source lists.

Regarding the use of memcpy/memset in general: these could be applied in a lot of cases (making a list slice, filling a new list with nulls, etc.), and on some machines they make a big difference to the speed of the operation. Compared to the time spent later working on that list, it may not show up much in the end. Having said that, I've done some Python apps which process, say, 20 Mbytes of binary floating-point data in about 12 seconds by avoiding any Python looping (using Numeric and image file read/writes). In this kind of thing, a faster slice copy can make a difference. I'm assuming the programmer would rather not have the marginal speedup of these routines if it poses even a small possibility of creating portability issues on some weird machine, and this is a pretty wise policy on a project of this nature.

I suspect any machine where memcpy doesn't work in this context would also fail to work with Python's polymorphism and general memory management, but whatever. A really good way to make a lot of copies of a small list is to make the first copy, then do a single memcpy(p, p+size, (n-1)*size*sizeof(*p)) which 'rolls' the same memory over and over. No loops (except within memcpy, and the assumption is that that one is as good as it gets). But this may be even less portable than the other uses of memcpy, because it makes assumptions about how memcpy handles overlapped transfers. E.g., machines with a lot of registers could do a bunch of reads and a bunch of writes to help out the cache, thus breaking this for small input lists. And yes, I'll report some speedup results when I have some.
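The overlap hazard of the 'rolling' memcpy can be avoided while keeping the low copy count by doubling: each pass copies the already-filled prefix into fresh space, so source and destination never overlap, and only O(log n) block copies are needed. A sketch of that idea at the Python level (a C version would use memcpy on disjoint halves):

```python
def repeat_by_doubling(src, n):
    """Build src * n with O(log n) block copies instead of n,
    each copy reading only the already-filled prefix (no overlap)."""
    m = len(src)
    total = m * n
    out = [None] * total
    out[:m] = src
    filled = m                  # number of slots filled so far
    while filled < total:
        chunk = min(filled, total - filled)
        out[filled:filled + chunk] = out[:chunk]    # disjoint src/dst
        filled += chunk
    return out

assert repeat_by_doubling([1, 2], 5) == [1, 2] * 5
assert repeat_by_doubling([7], 1000) == [7] * 1000
assert repeat_by_doubling([], 100) == []
```

Because every copy reads a region strictly before the region it writes, this does not depend on any particular overlapped-memcpy behavior.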
msg12270 - Author: Raymond Hettinger (rhettinger) | Date: 2002-09-06 19:16
Yes, I looked at the all-at-once memcpy approach but found that MSVC++'s memcpy() was too smart and would not use the buffer overlap as intended. This is a dead end. Using memcpy() in a loop as I suggested will work fine. Neal implemented a number of micro-optimizations using this and/or memchr(). You've already hit on the common case and best speed-up by special-casing length one and by Py_INCREF_N. Go for it.
msg12271 - Author: Gregory Smith (gregsmith) | Date: 2002-09-06 23:10
Using this test code, and changing the 1000000, I have found that it adds an additional 40 ns for each additional copy of the None. This doesn't change noticeably when I add the 'improvement'. If I change it to []*whatever, then the old implementation needs 5 ns extra for each non-copy (and the new one isn't sensitive). This is telling me that most of the time in the [None]*lots case is in the PyList_New() to get the list. So I went and made a variant of PyList_New which doesn't clear the list, and found that [None]*1000000 takes 22 msec instead of the 40 it used to take. This doesn't make too much sense looking at the code; maybe cache effects. Time to go home now...

    def timeit():
        from time import clock
        t0 = clock()
        tin = 0
        niter = 50
        for i in range(niter):
            ta, p, tb = clock(), [None]*1000000, clock()
            tin += tb - ta
            p = 0  # deallocate list
        t0 = clock() - t0
        print "avg. time = ", tin/niter
        print "avg. overhead = ", t0/niter

    timeit()
msg12272 - Author: Martin v. Löwis (loewis) | Date: 2002-09-11 16:43
Please don't invoke clock() twice per iteration. The standard timing procedure is:

    iterations = [None] * niter
    start = clock()
    for i in iterations:
        actions_to_time
    total = clock() - start

That way, you don't measure the time it takes to create the range object. Also, don't trust any measurements that have a total time of less than 10 s (running it for a minute is even better). With this algorithm, please re-report the two totals and the number of iterations. Run it several times in a row to see how results vary.
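This batch-timing pattern (time the whole loop once, not each iteration) is what the timeit module referenced at the end of this issue automates. A sketch using it, with the statement and iteration count chosen arbitrarily for illustration:

```python
import timeit

# Time the whole batch of iterations at once, as suggested above,
# rather than calling the clock inside the loop.
niter = 200
total = timeit.timeit("p = [None] * 100000", number=niter)
print("avg. time per [None]*100000: %.6f s" % (total / niter))
```

Running it several times and comparing the totals shows how much the measurement varies from run to run.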
msg12273 - Author: Gregory Smith (gregsmith) | Date: 2002-09-12 21:13
OK, I'll try that when I have time. The problem is, you can't do [None]*1000000 over and over again for 10 seconds unless you also destroy the lists; there isn't enough RAM, and I was trying not to count the dispose time, which of course is also dependent on the list length. I was calling clock() immediately before and after the operation, and accumulating the differences, thus excluding the rest of the loop with the list delete. Any uncertainty in clock() should in principle average out over a lot of reps. For what it's worth, the results were reasonably repeatable from run to run (maybe 5%) and very linear with respect to changing the '*1000000' factor. I know I'm effectively timing one call to clock() as part of the timed code, but that won't change with the list size and can thus be factored out.
msg12274 - Author: Raymond Hettinger (rhettinger) | Date: 2002-10-04 20:16
It's starting to look as if any sort of optimization effort will take more human time to create, test, and review than it could ever save in computer time. Can this be closed?
msg12275 - Author: Gregory Smith (gregsmith) | Date: 2002-10-12 15:12
Sure, the loop inversion is not a big deal. There are two slightly larger issues, though:

(1) O(n) time for []*n seems worth fixing; I noticed Guido was just in there adding a size check, and bumping into the [..]*0 case (described in #622091 as []*5), so another little edit won't hurt, right? I suggest changing

    size = a->ob_size * n;
    if (n && size/n != a->ob_size)
        return PyErr_NoMemory();

to

    size = a->ob_size * n;
    if (size == 0)
        return PyList_New(0);
    if (size/n != a->ob_size)
        return PyErr_NoMemory();

which is hopefully correct by inductive inspection :-)

(2) If there is any merit seen in introducing Py_INCREF_N to object.h for future use, this seems as good a place as any to break ground.
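The point of the reordering is that the overflow test divides by n, so the size == 0 case (which includes n == 0) must be peeled off first. A Python sketch of the check logic, with 32-bit masking only to mimic C int wraparound (the helper names are illustrative, not from the patch):

```python
INT_MAX = 2**31 - 1

def wrap32(x):
    """Emulate 32-bit two's-complement C int wraparound."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x > INT_MAX else x

def checked_repeat_size(ob_size, n):
    """Result length for list repeat, with the checks in the
    suggested order: empty case first, then the overflow test
    (which divides by n and so needs n != 0)."""
    if n < 0:
        n = 0
    size = wrap32(ob_size * n)
    if size == 0:
        # []*n, a*0, and n == 0 all land here, skipping the division.
        # (Caveat: a product wrapping to exactly 0 would also land here.)
        return 0
    if size // n != ob_size:    # overflow detection, as in the suggestion
        raise MemoryError
    return size

assert checked_repeat_size(0, 1000000) == 0
assert checked_repeat_size(3, 0) == 0
assert checked_repeat_size(3, 4) == 12
```

An overflowing product such as ob_size = 3, n = 0x60000000 wraps to a value whose quotient by n no longer equals ob_size, so the check raises MemoryError.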
msg12276 - Author: Michael Stone (mbrierst) | Date: 2003-02-06 01:04
I went ahead and coded up a patch based on the discussion below. It implements Py_INCREF_N and Py_DECREF_N (just for symmetry; not sure if anyone needs it) and puts the new INCREF to use in list_repeat. Also, list_repeat makes special cases of 0's. I tested with Python like that described below:

    from time import clock
    niter = 10
    iterations = [None] * niter
    start, end = 0, 7
    for length in range(start, end):
        l = [None]*length
        if length == 0:
            multiplier = 10000000
        else:
            multiplier = 1000000/length
        start = clock()
        for i in iterations:
            lmult = l*multiplier
        total = clock() - start
        print "for list length %d, time %e" % (length, total)

I tested the patch on different list sizes, 0-7, 200-207, 1000-1007, and found about an 8% speed increase in all cases (except the 0 case, of course: 100% speedup). I was surprised to find the debug build didn't differ in improvement from the patch. I also tried implementing the inner loop as memcpy, but found the results only started to improve noticeably at list sizes of 1000 or so, and there was a sizable performance hit of more than 20% at small sizes. This might be worth including for very large lists as a separate case, but I think it's pretty rare to repeat a list that size. Let me know what you think of it or what changes you want. (I can't attach it in SourceForge so I'm pasting it in):

    Index: Objects/listobject.c
    ===================================================================
    RCS file: /cvsroot/python/python/dist/src/Objects/listobject.c,v
    retrieving revision 2.145
    diff -c -r2.145 listobject.c
    *** Objects/listobject.c    2 Jan 2003 20:51:08 -0000    2.145
    --- Objects/listobject.c    6 Feb 2003 00:48:46 -0000
    ***************
    *** 415,436 ****
      list_repeat(PyListObject *a, int n)
      {
          int i, j;
    !     int size;
          PyListObject *np;
    !     PyObject **p;
          if (n < 0)
              n = 0;
    !     size = a->ob_size * n;
    !     if (n && size/n != a->ob_size)
              return PyErr_NoMemory();
          np = (PyListObject *) PyList_New(size);
          if (np == NULL)
              return NULL;
          p = np->ob_item;
          for (i = 0; i < n; i++) {
    !         for (j = 0; j < a->ob_size; j++) {
    !             *p = a->ob_item[j];
    !             Py_INCREF(*p);
                  p++;
              }
          }
    --- 415,441 ----
      list_repeat(PyListObject *a, int n)
      {
          int i, j;
    !     int size, ob_size;
          PyListObject *np;
    !     PyObject **p, **ob_item;
          if (n < 0)
              n = 0;
    !     ob_size = a->ob_size;
    !     if (n == 0 || ob_size == 0)
    !         return PyList_New(0);
    !     size = ob_size * n;
    !     if (size/n != ob_size)
              return PyErr_NoMemory();
          np = (PyListObject *) PyList_New(size);
          if (np == NULL)
              return NULL;
          p = np->ob_item;
    +     ob_item = a->ob_item;
    +     for (i = 0; i < ob_size; i++)
    +         Py_INCREF_N(ob_item[i], n);
          for (i = 0; i < n; i++) {
    !         for (j = 0; j < ob_size; j++) {
    !             *p = ob_item[j];
                  p++;
              }
          }
    Index: Include/object.h
    ===================================================================
    RCS file: /cvsroot/python/python/dist/src/Include/object.h,v
    retrieving revision 2.112
    diff -c -r2.112 object.h
    *** Include/object.h    17 Nov 2002 17:52:44 -0000    2.112
    --- Include/object.h    6 Feb 2003 00:48:51 -0000
    ***************
    *** 526,532 ****
    --- 526,534 ----
      PyAPI_FUNC(void) _Py_NegativeRefcount(const char *fname,
                  int lineno, PyObject *op);
      #define _Py_INC_REFTOTAL        _Py_RefTotal++
    + #define _Py_INC_REFTOTAL_N(N)   _Py_RefTotal += (N)
      #define _Py_DEC_REFTOTAL        _Py_RefTotal--
    + #define _Py_DEC_REFTOTAL_N(N)   _Py_RefTotal -= (N)
      #define _Py_REF_DEBUG_COMMA     ,
      #define _Py_CHECK_REFCNT(OP) \
      { if ((OP)->ob_refcnt < 0) \
    ***************
    *** 535,541 ****
    --- 537,545 ----
      }
      #else
      #define _Py_INC_REFTOTAL
    + #define _Py_INC_REFTOTAL_N(N)
      #define _Py_DEC_REFTOTAL
    + #define _Py_DEC_REFTOTAL_N(N)
      #define _Py_REF_DEBUG_COMMA
      #define _Py_CHECK_REFCNT(OP) /* a semicolon */;
      #endif /* Py_REF_DEBUG */
    ***************
    *** 580,591 ****
    --- 584,607 ----
          _Py_INC_REFTOTAL _Py_REF_DEBUG_COMMA \
          (op)->ob_refcnt++)
    + #define Py_INCREF_N(op, N) ( \
    +     _Py_INC_REFTOTAL_N(N) _Py_REF_DEBUG_COMMA \
    +     (op)->ob_refcnt += (N))
    +
      #define Py_DECREF(op) \
          if (_Py_DEC_REFTOTAL _Py_REF_DEBUG_COMMA \
              --(op)->ob_refcnt != 0) \
              _Py_CHECK_REFCNT(op) \
          else \
              _Py_Dealloc((PyObject *)(op))
    +
    + #define Py_DECREF_N(op, N) \
    +     if (_Py_DEC_REFTOTAL_N(N) _Py_REF_DEBUG_COMMA \
    +         ((op)->ob_refcnt -= (N)) != 0) \
    +         _Py_CHECK_REFCNT(op) \
    +     else \
    +         _Py_Dealloc((PyObject *)(op))
    +
      /* Macros to use in case the object pointer may be NULL: */
      #define Py_XINCREF(op)  if ((op) == NULL) ; else Py_INCREF(op)
msg12277 - Author: Brett Cannon (brett.cannon) | Date: 2003-05-21 05:15
Are people still interested in this?
msg12278 - Author: Raymond Hettinger (rhettinger) | Date: 2003-05-21 05:59
Wrapping this one up by applying the simplest version of the patch that gets a measurable improvement (from 255 usec to 224 usec, measured by python timeit.py "[None] * 10000"). See Objects/listobject.c 2.153.
History

| Date | User | Action | Args |
|---|---|---|---|
| 2022-04-10 16:05:39 | admin | set | github: 37135 |
| 2002-09-04 20:20:47 | gregsmith | create | |