msg70309
Author: Fredrik Johansson (fredrikj)
Date: 2008-07-27 01:11

A few weeks ago, I blogged about taking advantage of Karatsuba multiplication and Newton's method to divide big integers quickly (some of you may have read it, as it was posted to Daily Python-URL among other places):

http://fredrik-j.blogspot.com/2008/07/making-division-in-python-faster.html

To summarize, this method blows the builtin division out of the water already at ~(2000 digits)/(1000 digits). The primary catch is that the result of the Newton division may be slightly wrong (typically 1 ulp). However, a single extra multiplication and a subtraction at the end allows one to compute a remainder, and since the remainder must satisfy 0 <= r < q, the error is easily corrected. From a quick test, the cost of the extra multiplication seems to move the break-even point with the builtin division up to around 5000/2500 digits.

A pure Python implementation of divmod, with error correction based on the remainder, can be found in this file:

http://www.dd.chalmers.se/~frejohl/code/libarith/libarith.py

(See the function idivmod.)

Of particular note is that fast divmod gives a fast way to do radix conversion, by recursively splitting the number in half. The function numeral (see same .py file) does this, e.g.:

    >>> from time import clock
    >>> a = 2**1257787-1
    >>> t1=clock(); s1=str(a); t2=clock(); t2-t1
    109.08065159119386
    >>> t1=clock(); s2=numeral(a); t2=clock(); t2-t1
    7.1394886780303182
    >>> s1 == s2
    True

(This recursive algorithm, by the way, is actually faster than str() even with the slow builtin divmod.)

Would there be any interest in porting these algorithms to C and using them in the standard Python long implementation? There are likely some problems that I have overlooked. A critical review will be most welcome.
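To make the error-correction idea concrete, here is a minimal, self-contained sketch (the name newton_divmod_sketch, the cutoffs and the fixed iteration count are made up; this is not the idivmod code from libarith.py, and for simplicity every Newton step runs at full precision, whereas the real method grows the working precision step by step):

    def newton_divmod_sketch(p, q):
        # divmod(p, q) for p >= 0, q > 0 via a Newton-iteration reciprocal.
        if p < q:
            return 0, p
        k = p.bit_length()
        two_k = 2 * k                          # fixed-point scale: 2**(2*k)
        s = 1 << (two_k - q.bit_length())      # ~1 correct bit of 2**(2*k) // q
        for _ in range(two_k.bit_length() + 1):
            # Newton step s <- s*(2 - s*q/2**(2*k)); correct bits roughly double.
            s = (s * ((1 << (two_k + 1)) - s * q)) >> two_k
        a = (p * s) >> two_k                   # trial quotient, at most a few ulps low
        d, r = divmod(p - a * q, q)            # one extra multiply, then repair the error
        return a + d, r

For example, newton_divmod_sketch(7**4000, 5**2000) should agree with divmod(7**4000, 5**2000); the final divmod on the small residue plays the role of the error correction described above, and doubles as a safety net if the reciprocal did not converge well.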
|
|
msg70313
Author: Mark Dickinson (mark.dickinson)
Date: 2008-07-27 06:34

> Would there be any interest in porting these algorithms to C and using
> them in the standard Python long implementation?

Yes, definitely! Though it depends a little bit how much complication is involved. A subquadratic algorithm for converting strings to longs might also fit well here.

Now I'm just waiting for you to propose an implementation of integer square root :-).
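The string-to-long direction can use the same divide-and-conquer shape as the numeral function above; a rough sketch (the name str_to_int_sketch and the cutoff are invented, and a real implementation would cache the powers of ten rather than recompute them):

    def str_to_int_sketch(s):
        # Parse a nonnegative decimal string by splitting it in half;
        # subquadratic as long as the big multiplication is (Karatsuba).
        if len(s) <= 18:
            return int(s)                    # small pieces: builtin int() is fine
        m = len(s) // 2
        # value = (high digits) * 10**m + (low m digits)
        return str_to_int_sketch(s[:-m]) * 10**m + str_to_int_sketch(s[-m:])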
|
|
msg70316
Author: Fredrik Johansson (fredrikj)
Date: 2008-07-27 07:07

> Yes, definitely! Though it depends a little bit how much complication
> is involved.

Not that much, I think, the pure Python version being just a few lines of code (plus a few more lines of boilerplate). But I'm not too familiar with converting Python to C myself, so I may underestimate the difficulties. It might get a little more complicated if you want to minimize temporary allocations, for example.

> Now I'm just waiting for you to propose an implementation of integer
> square root :-).

Yes, there is also code for fast (O(M(n))) integer square roots in libarith.py. This function would be much less useful overall as a builtin, though.
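For reference, the Newton iteration behind such an integer square root looks roughly like the sketch below (made-up name, not the libarith.py code; the O(M(n)) version additionally grows its working precision at each step instead of iterating at full width):

    def isqrt_sketch(n):
        # Floor square root of a nonnegative integer by Newton's method.
        if n == 0:
            return 0
        # Start from a power of two with about half the bits of n (an overestimate).
        x = 1 << ((n.bit_length() + 1) // 2)
        while True:
            y = (x + n // x) // 2        # Newton step for f(x) = x**2 - n
            if y >= x:                   # stopped decreasing: x is floor(sqrt(n))
                return x
            x = y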
|
|
msg70325
Author: Fredrik Johansson (fredrikj)
Date: 2008-07-27 16:04
For your convenience, I have split the division and numeral code off to a standalone .py file which I'm attaching here. I also updated the remainder logic to use the default divmod instead of repeated subtraction, which ensures worst-case performance similar to the builtin divmod in the very unlikely case that div_newton would return junk. |
|
|
msg70326
Author: Fredrik Johansson (fredrikj)
Date: 2008-07-27 16:17

And here are some timings on my laptop. Both int->str and balanced division become faster somewhere between 1000 and 2000 digits. This is after editing the division benchmark to use random numbers instead of exact powers of ten (interestingly causing a bit of slowdown). String conversion might be a little slower for lengths that aren't close to powers of two.
|
|
msg70702
Author: Mark Dickinson (mark.dickinson)
Date: 2008-08-04 15:51

There's also the recursive division algorithm due to Burnikel and Ziegler; this might be worth a look. I think it's the same asymptotic complexity (a constant times the Karatsuba multiplication complexity), but it may turn out to be faster for one reason or another. I had a Python implementation of this somewhere; I'll see if I can dig it out.

Assigning this to me so that it doesn't get lost or forgotten; but note that I don't intend to do anything about it before 2.6/3.0 final. If anyone else wants to take it off my hands before then, feel free.
|
|
msg70704
Author: Mark Dickinson (mark.dickinson)
Date: 2008-08-04 15:58
See also issue 1746088. |
|
|
msg70719
Author: Mark Dickinson (mark.dickinson)
Date: 2008-08-04 20:38

Here's a pure Python implementation of the Burnikel and Ziegler recursive division algorithm. I've no idea whether it's faster or slower than Newton, but it might be worth a look. It depends heavily on bit operations, which ought to be much faster when coded in C. (Some of the shifts would be completely unnecessary---replaced by changes in indexing instead.)

The original paper describing the algorithm is available here:
http://cr.yp.to/bib/1998/burnikel.ps
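For readers who don't want to open the attachment, the recursive structure looks roughly like the sketch below. This is a simplified illustration of the Burnikel and Ziegler idea with invented names and an arbitrary cutoff, not the attached file: a division by an n-bit divisor whose quotient fits in n bits is reduced to two divisions with half-sized divisors plus multiplications, and the top-level divmod peels off one base-2**n digit of the quotient at a time.

    _CUTOFF = 64    # below this many bits, just use the builtin division

    def _div2n1n(a, b, n):
        # Divide a by b, where b has exactly n bits and 0 <= a < 2**n * b,
        # so the quotient fits in n bits.  Returns (a // b, a % b).
        if n <= _CUTOFF:
            return divmod(a, b)
        if n & 1:                        # make n even by scaling a and b by 2
            q, r = _div2n1n(a << 1, b << 1, n + 1)
            return q, r >> 1
        half = n >> 1
        mask = (1 << half) - 1
        b1, b2 = b >> half, b & mask     # b = b1*2**half + b2, b1 has `half` bits
        q1, r = _div3n2n(a >> n, (a >> half) & mask, b, b1, b2, half)
        q2, r = _div3n2n(r, a & mask, b, b1, b2, half)
        return (q1 << half) | q2, r

    def _div3n2n(a12, a3, b, b1, b2, n):
        # Divide a12*2**n + a3 by b = b1*2**n + b2; quotient known to fit in n bits.
        if a12 >> n == b1:
            q, r = (1 << n) - 1, a12 - (b1 << n) + b1   # estimate saturates
        else:
            q, r = _div2n1n(a12, b1, n)                  # estimate from the top halves
        r = ((r << n) | a3) - q * b2
        while r < 0:                                     # estimate was too big; fix up
            q -= 1
            r += b
        return q, r

    def bz_divmod_sketch(a, b):
        # divmod(a, b) for a >= 0, b > 0: schoolbook division in base 2**n,
        # with each quotient digit produced by the recursion above.
        n = b.bit_length()
        if a < b:
            return 0, a
        chunks = []
        while a:
            chunks.append(a & ((1 << n) - 1))
            a >>= n
        q, r = 0, 0
        for chunk in reversed(chunks):                   # most significant first
            q_digit, r = _div2n1n((r << n) | chunk, b, n)
            q = (q << n) | q_digit
        return q, r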
|
|
msg70720
Author: Mark Dickinson (mark.dickinson)
Date: 2008-08-04 20:38
Oops. Wrong file. Here's the right one. |
|
|
msg70776
Author: Fredrik Johansson (fredrikj)
Date: 2008-08-06 08:51

Indeed, that seems to be much faster. In the meantime, do you mind if I steal the code? :-)
|
|
msg71010
Author: Mark Dickinson (mark.dickinson)
Date: 2008-08-11 10:40

> Indeed, that seems to be much faster. In the meantime, do you mind if I
> steal the code? :-)

Not at all! I guess constant factors could well appear and/or disappear when recoding in C; it might well be worth trying to implement both methods to see which is faster.

There are some semi-obvious tricks available that might speed up the recursive version. For one thing, all shifts should be in multiples of 15 bits (because longs are stored in base 2**15). For another, it ought to be possible to avoid the single-bit normalization shifts every time the number of bits ('n') is odd---one can do a single shift at the beginning of the calculation instead.
|
|
msg72772
Author: Pernici Mario (pernici)
Date: 2008-09-08 09:39

I have translated into C the algorithm for long division by Burnikel and Ziegler (BZ), using the Python version fast_div.py and the suggestions by Mark.

Here is a benchmark for divmod(p, q), with p = 7**np, q = 5**nq, and digits = q_digits = p_digits/2; the time taken by Python division is normalized to 1. Tests on Debian, BZ with Python-2.6b3:

              Pentium 1.60GHz     Athlon XP 2600+     Athlon 64 3800+
    digits    BZ     fast_div     BZ     fast_div     BZ     fast_div
       500    1.01   1.27         1.00   1.18         1.00   1.21
       700    0.88   1.22         0.76   1.08         0.81   1.14
      1000    0.82   1.17         0.72   1.04         0.76   1.10
      2000    0.66   0.85         0.55   0.73         0.59   0.78
      4000    0.51   0.62         0.43   0.52         0.45   0.56
     10000    0.32   0.38         0.31   0.37         0.33   0.39
     20000    0.24   0.25         0.23   0.25         0.25   0.27
    100000    0.14   0.14         0.11   0.11         0.12   0.12

BZ starts being faster than Python division around 2000 bits, apart from two cases: for q with less than 4000 bits and p much smaller than q**2, x_divrem is faster than divmod_pos; for p very much larger than q**2, x_divrem becomes again faster. I put a bound in long_divrem to fall back to x_divrem in those cases, based on tests on my laptop Pentium 1.60GHz.

The treatment of exceptions is very rudimentary; I use a simple macro STUB_EXC to return NULL, but without releasing memory. Help for doing this properly is welcome.

Please find attached the diff to the longobject.c file in Python-2.6b3.

Mario
|
|
msg73698
Author: Mark Dickinson (mark.dickinson)
Date: 2008-09-24 09:55
Thanks for the patch, Mario! I'll give a look when I get the chance. |
|
|
msg74679
Author: Pernici Mario (pernici)
Date: 2008-10-13 10:14

In this patch I added to Mark's patch in issue 3944 the code from the previous patch, modified to release memory in case of exceptions.

Benchmark for division on Athlon 64 3800+ with respect to Python 3.0:
  (1) Python with patch 30bit.patch
  (2) Python with this patch

Up to 1000 digits (1) and (2) perform in the same way.

    digits    (1)    (2)
      2000     4x     5x
      4000     4x     7x
     10000     4x    10x
     20000     4x    13x
    100000     4x    27x
|
|
msg84105
Author: Mark Dickinson (mark.dickinson)
Date: 2009-03-24 18:32

The longobject2.diff patch probably doesn't apply cleanly any more. Anyone interested in updating it?

I think this patch looks like a promising beginning, but there's still quite a lot of work to do. The main concerns at the moment are:

(1) the huge numbers of Py_DECREFs and Py_XDECREFs. Some refactoring might help here ("goto" isn't all bad: it's fairly common to use lots of "goto error" statements in Python's source code).

(2) I suspect that many of the operations could be turned into in-place operations on digit vectors, thus saving lots of object allocations and deallocations. This should also help out with (1).

I'm not yet 100% sold on getting subquadratic division into Python---it remains to be seen how much complexity it adds. If it makes it easy to implement subquadratic integer <-> string conversions that would be a big plus.
|
|
msg84702
Author: Pernici Mario (pernici)
Date: 2009-03-30 22:59

In this patch there is an implementation of the algorithm to convert numbers to strings by recursively splitting the number in half, adapted from Fredrik's div.py.
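The core of that conversion algorithm, stripped of the fast divmod it relies on, is roughly the following (a sketch with an invented name and cutoff, not the code in the patch or in div.py; a real implementation would also cache the 10**m powers):

    def numeral_sketch(n):
        # Decimal string of a nonnegative integer by recursive halving:
        # write n = q * 10**m + r and convert q and r separately.
        if n < 10**18:
            return str(n)                    # small numbers: builtin str() is fine
        # m is roughly half the number of decimal digits of n (log10(2) ~ 0.30103)
        m = max(1, int(n.bit_length() * 0.30103) // 2)
        q, r = divmod(n, 10**m)              # this divmod is where fast division pays off
        return numeral_sketch(q) + numeral_sketch(r).rjust(m, "0")

With the builtin quadratic divmod this is already faster than str() for large inputs, as noted in msg70309; with subquadratic division the conversion benefits directly, which is what the benchmark below shows.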
|
|
msg84704
Author: Pernici Mario (pernici)
Date: 2009-03-30 23:04

In this second patch, applied on top of the above patch, the recursive division algorithm by Burnikel and Ziegler (BZ) from longobject2.diff is added unmodified, to see the effect of the subquadratic algorithm; there is still a lot of work to be done on it, as Mark pointed out.

Here is a benchmark on hp pavilion Q8200 2.33GHz: a = 7**n; compute str(a) N times.

          n        N    unpatched    patch1    patch2
        100   100000      0.25        0.25      0.25
        200   100000      0.63        0.64      0.63
        300   100000      1.19        1.22      1.23
        400   100000      1.87        1.74      1.75
        500    10000      0.27        0.24      0.24
       1000    10000      0.98        0.59      0.60
       2000     1000      0.36        0.16      0.17
       5000     1000      2.17        0.65      0.66
      10000      100      0.86        0.20      0.19
      20000      100      3.42        0.70      0.55
      50000       10      2.13        0.37      0.24
     100000        1      0.85        0.13      0.08
     500000        1     21           2.9       0.94
    1000000        1     85          11         2.8
    2000000        1    339          44         8.4

str(n) in the first patch uses a quadratic algorithm, but asymptotically it is almost 8 times faster than the unpatched version; in the second patch the subquadratic algorithm starts to show its advantage above 10000 digits.
|
|
msg103028
Author: Mark Dickinson (mark.dickinson)
Date: 2010-04-13 09:27
Unassigning myself for now. The most recent patch still needs work before it can be considered for inclusion. |
|
|
msg190407
Author: Eric V. Smith (eric.smith)
Date: 2013-05-31 15:05
See also , in particular http://mail.python.org/pipermail/pypy-dev/2013-May/011433.html. |
|
|
msg221270
Author: Mark Lawrence (BreamoreBoy)
Date: 2014-06-22 16:12

As another issue has been closed as a duplicate of this one, should this be moved from "languishing" back to "open"?
|
|
msg390940
Author: Carl Friedrich Bolz-Tereick (Carl.Friedrich.Bolz)
Date: 2021-04-13 08:55

FWIW, we have implemented a faster algorithm for divmod for big numbers using Mark's fast_div.py in PyPy. In particular, this speeds up str(long) for large numbers significantly (e.g. calling str on the result of math.factorial(2**17) is now 15x faster than CPython ;-)).
|
|
msg390966
Author: Mark Dickinson (mark.dickinson)
Date: 2021-04-13 13:55

> we have implemented a faster algorithm for divmod for big numbers using
> Mark's fast_div.py in PyPy

Urk! I hope your implementation included a healthy dose of validation, testing and cleanup. :-) (I have no doubts that it did.) I'd consider fast_div.py to be proof-of-concept code rather than something suitable for immediate use.
|
|
msg390968
Author: Mark Dickinson (mark.dickinson)
Date: 2021-04-13 14:57

FWIW, I think we should close this; the same comments as in issue 26256 apply. See the later messages in that discussion.
|
|
msg390971
Author: Carl Friedrich Bolz-Tereick (Carl.Friedrich.Bolz)
Date: 2021-04-13 15:29

Yes, that sounds fair. In PyPy we improve things occasionally if somebody feels like working on it, but in general competing against GMP is a fool's errand.
|
|
msg390972
Author: STINNER Victor (vstinner)
Date: 2021-04-13 15:47

Python has reasonable efficiency up to 1000 decimal digits (according to the benchmark), which is already really great! Operations on more than 1000 decimal digits are uncommon; IMO you should use a dedicated library like GMP for that.

I would prefer to keep the CPython code simple enough to maintain. Objects/longobject.c is already 5744 lines of C code. longformat_BZ.diff adds around 700 lines of code, but it only speeds up str(int) starting at around 2000 decimal digits.

Yeah, we could do better for huge integers, but I would prefer not to :-) Python has a limited number of volunteers working on it, and it's not specialized in numbers. We should keep the maintenance burden at an acceptable level ;-)
|
|
msg412121
Author: Tim Peters (tim.peters)
Date: 2022-01-30 01:41

Ha! This will never die. More discussion in bpo-46558. Ya, I already closed it, but don't want to. I opened it to begin with to record an int->str method that doesn't use division, so it didn't really belong on this report.

What if we _didn't_ convert these things to C? That's the primary question the new report gets around to asking. These things are far easier to write, understand, and maintain in Python, and converting to C is generally only improving a constant factor, or _maybe_ removing a log(n) factor.

When we're looking at massive O() improvements, squeezing those out is of comparatively little value.
|
|