msg66369
Author: Jean Brouwers (MrJean1)
Date: 2008-05-07 19:53
The attached patch bltmodule2.c.diff is a different implementation of the fast summation code, for your consideration. It uses three separate sums to add ints, floats and other objects. All ints are accumulated into a C long, which is reset after an overflow. All floats are added into a C double, without mixing in ints. Other objects are handled as before. This version has also been tested with Python 2.5.2 and passes the existing regression tests. /Jean Brouwers
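The three-sum strategy described here (implemented in C in the patch) can be sketched in Python. This is a hypothetical illustration of the idea, not a transcription of bltmodule2.c.diff, and it covers only the int/float fast path; the real patch falls back to a generic PyNumber_Add loop for other object types:

```python
def split_sum(seq, start=0):
    """Hypothetical sketch of the patch's strategy, not the C code itself:
    keep ints and floats in separate accumulators and combine them once
    at the end.  Only ints and floats are handled in this sketch."""
    i_sum = 0          # stands in for the C long accumulator
    f_sum = 0.0        # stands in for the C double accumulator
    saw_float = False
    for item in seq:
        if type(item) is int:
            i_sum += item
        elif type(item) is float:
            f_sum += item
            saw_float = True
        else:
            # the real patch switches to the generic object loop here
            raise TypeError("sketch handles only ints and floats")
    # one combining add at the very end, regardless of item order
    total = f_sum + i_sum if saw_float else i_sum
    # the 0 int default for 'start' is never created or added
    return start + total if start else total
```

Note that the single `f_sum + i_sum` combination at the end is exactly where this strategy can diverge from left-to-right summation, which becomes the crux of the discussion below.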
|
|
msg66378
Author: Antoine Pitrou (pitrou)
Date: 2008-05-07 22:09
What is the benefit of this new implementation? If it is faster, can you give us any performance numbers (for example, using timeit)?
|
|
msg66382
Author: Jean Brouwers (MrJean1)
Date: 2008-05-08 00:21
I did not compare performance (yet). If there is a particular test case, I am happy to measure it. I also made sure that the results and errors (!) for non-numeric items are the same as for the original Slow Summation case.

The version I submitted is perhaps only slightly faster for most all-int and all-float cases, since it never creates and adds the 0 int default for the optional 'start' argument. For mixed int and float cases it will certainly be faster. Here are a few (extreme) examples:

1) Assume a long list consisting of one float followed by all ints. The existing code switches to the float loop at the very first float encountered and then uses C double adds (plus the FPE protection overhead) for all subsequent ints. My version uses C long adds for all ints, one C double add for the float, plus one C double add to combine the long and double results.

2) Consider a list of just 2 items, one float and one int (and no 'start' argument). The current code first creates a result object for the 'start' default 0. Then it enters the int loop (since the result, 0, is an int) and gets the first item, the float. Since that item is not an int, it is added to the current result using PyNumber_Add. Next, it falls into the float loop, since the result is now a float. It then gets the int item and adds it as a C double to the C double in f_result. The final C double f_result is returned as a float. That is quite costly for just one float and one int, and maybe worse than Slow Summation.

3) My version does not create a 0 int for the missing 'start'; that is only created when absolutely needed at some later time. In addition, all int items are added into the C long i_sum and all float items into the C double f_sum, regardless of the order in which they occur in the sequence. At the end of the iteration loop, the C long i_sum is added to the C double f_sum once, and a single float object is created and returned. That is very efficient and without any unnecessary overhead.

In other words, for sequences with mixed int and float objects the difference can be significant, depending on where the very first float item is located in the sequence and how many ints follow. However, for sequences with items of the same type, all int or all float, there will be little or no difference. /Jean Brouwers

PS) There are two additional things to consider for the float case:
- overflow (ERANGE) and domain (EDOM) errors in C double adds
- using a compensating summation for floats, like Kahan's method
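The compensating summation mentioned in the PS can be sketched in Python. This is the textbook form of Kahan's method, shown here only to illustrate the suggestion, not anything in the patch:

```python
def kahan_sum(values):
    """Kahan's compensated summation: carry a correction term so that
    low-order bits lost in each addition are fed back into the next one."""
    total = 0.0
    c = 0.0                  # running compensation for lost low-order bits
    for x in values:
        y = x - c            # apply the correction from the previous step
        t = total + y        # big + small: low-order bits of y may be lost
        c = (t - total) - y  # recover exactly what was just lost
        total = t
    return total

# 0.1 is not exactly representable in binary floating point, so naive
# summation of many 0.1's accumulates rounding error; the compensated
# sum stays at least as close to the true value.
naive = sum([0.1] * 10)
compensated = kahan_sum([0.1] * 10)
```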
|
|
msg66387
Author: Jean Brouwers (MrJean1)
Date: 2008-05-08 01:44
Here is one simple comparison between 3 different builtin_sum's, all measured on the same machine, MacOS X 10.4.11 (Intel Duo):

1) Python 2.5.1 ActiveState with the standard Slow Summation:
% python2.5.1 -mtimeit --setup="x=[1.0e12, 7]*1000000" "sum(x)"
10 loops, best of 3: 177 msec per loop

2) Python 2.6a2 (built from source) with the existing Fast Summation:
% python2.6a2 -mtimeit --setup="x=[1.0e12, 7]*1000000" "sum(x)"
10 loops, best of 3: 52.5 msec per loop

3) Python 2.5.2 (built from source) with the submitted Fast Summation:
% python2.5.2 -mtimeit --setup="x=[1.0e12, 7]*1000000" "sum(x)"
10 loops, best of 3: 48.2 msec per loop
|
|
msg66389
Author: Raymond Hettinger (rhettinger)
Date: 2008-05-08 02:12
Will take a look at this one but am inclined to reject it. The timing benefits are marginal at best and only help in the atypical case of mixed floats and ints. The patch changes the type and ordering of the intermediate sums, resulting in different answers than the standard sum(). In contrast, the existing fast sum() was carefully designed to mimic the operations of the original sum(), so it always produces the same results.
|
|
msg66409
Author: Jean Brouwers (MrJean1)
Date: 2008-05-08 07:13
The one example given may not be convincing; other ones may be. What is it in the submitted version that is not carefully designed, or that may not produce the same results?
|
|
msg66410
Author: Raymond Hettinger (rhettinger)
Date: 2008-05-08 07:45
The approach of using separate accumulations is intrinsically flawed if you want the same result as a regular sum(). Here's a small dataset that shows why you can't accumulate separate sums by type:

>>> d = [1000000000000000, -1000000000000000.0, .0000000000000001, .0000000000000001]
>>> sum(d)
2e-16
>>> d[0] + sum(d[1:])
0.0

Closing this one. The approach is doomed, and the possible benefits over the current approach are microscopic at best and only apply to unusual corner cases. I also prefer the current approach because it is easily extended to LongLongs and it is easy to show that the code is correct (at least with respect to producing the same result as a regular sum()).
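The order dependence in this dataset can be replayed in plain Python. The following sketch (variable names are mine) contrasts left-to-right summation with the type-segregated accumulation the patch uses:

```python
# Why type-segregated accumulation can disagree with builtin sum():
# left-to-right order lets the huge int and float cancel before the tiny
# floats are added, while segregation absorbs the tiny floats into the
# huge float first, where they vanish below its precision.
d = [10**15, -1e15, 1e-16, 1e-16]   # same data as the session above

# Left-to-right, as builtin sum() does it:
ltr = 0
for x in d:
    ltr = ltr + x                   # huge values cancel at the second add
# ltr is 2e-16: the tiny values survive

# Segregated by type, as the patch accumulates:
i_sum = sum(v for v in d if type(v) is int)    # 10**15
f_sum = sum(v for v in d if type(v) is float)  # -1e15; 1e-16 terms lost
seg = i_sum + f_sum
# seg is 0.0: the tiny values are gone
```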
|
|