Issue 4258: Use 30-bit digits instead of 15-bit digits for Python integers. (original) (raw)

Created on 2008-11-04 10:45 by mark.dickinson, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (78)

msg75491 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2008-11-04 10:45

Here's an experimental patch, against the py3k branch, that makes Python represent its long integers internally in base 230 instead of base 215, on platforms that have 32-bit and 64-bit integers available.

On platforms for which autoconf is unable to find both 32-bit and 64-bit integers, the representation falls back to the usual one.

See also issue 1814 (GMP for longs), and the discussion at

http://mail.python.org/pipermail/python-dev/2008-November/083315.html

(note particularly Tim Peter's message at:

http://mail.python.org/pipermail/python-dev/2008-November/083355.html )

msg75492 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2008-11-04 11:04

Note that to avoid "bad marshal data" errors, you'll probably need to do a 'make distclean' before rebuilding with this patch.

msg75494 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2008-11-04 15:28

Here's an updated patch, with the following changes from the original:

make the size of a digit (both the conceptual size in bits and actual size in bytes) available to Python via a new structseq sys.int_info. This information is useful for the sys.getsizeof tests.
fix a missing cast in long_hash
better fast path for 1-by-1 multiplication that doesn't go via PyLong_FromLongLong.

msg75512 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-11-05 00:29

Note that to avoid "bad marshal data" errors, you'll probably need to do a 'make distclean' before rebuilding with this patch.

I saw that you choosed to use the base 2^30 for marshal. For a better portability (be able to use .pyc generated without your patch), you may keep the base 2^15. I implemented that in my GMP patch (manual conversion from/to base 2^15).

If we change the marshal format of long, the magic number should be different (we might use a tag like the "full unicode" tag used in Python3 magic number) and/or the bytecode (actual bytecode is 'l'). The base should be independent of the implementation, like Python does with text: UTF-8 for files and UCS-4 in memory. We may use the base 2^8 (256) or another power or 2^8 (2^16, 2^32, 2^64?). The base 256 sounds interresting because any CPU is able to process 8 bits digits.

Cons: Use a different bases makes Python slower for loading/writing from/to .pyc.

msg75513 - (view)

Author: Gregory P. Smith (gregory.p.smith) * (Python committer)

Date: 2008-11-05 00:57

oh yay, thanks. it looks like you did approximately what i had started working on testing a while back but have gone much further and added autoconf magic to try and determine when which size should be used. good.

(i haven't reviewed your autoconf stuff yet)

As for marhsalled data and pyc compatibility, yes that is important to consider.

We should probably base the decision on which digit size to use internally on benchmarks, not just if the platform can support 64bit ints. Many archs support 64bit numbers as a native C type but require multiple instructions or are very slow when doing it.

(embedded arm, mips or ppc come to mind as obvious things to test that on)

msg75518 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-11-05 08:55

As for marhsalled data and pyc compatibility, yes that is important to consider.

The problem is also that with the 30-bit digit patch, some Python will use 15 bits whereas some other will use 30 bits. The base in marshal should be the same in both cases.

And why 30 bits and not 31 bits, or 63 bits, or 120 bits? We may use other bases in the future. That's why I prefer to use a common base like base 256. And GMP has functions (mpz_import) to load data in base 256, but it's more complicate to load data in base 2^15.

msg75520 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2008-11-05 10:53

[Victor Stinner]

I saw that you choosed to use the base 2^30 for marshal. For a better portability (be able to use .pyc generated without your patch), you may keep the base 2^15. I implemented that in my GMP patch (manual conversion from/to base 2^15).

Agreed. I'll fix this so that the .pyc format is unchanged. Thanks!

And why 30 bits and not 31 bits, or 63 bits, or 120 bits?

Mostly laziness: the change from 15 to 30 bits turned out to be extraordinarily easy. Note that the longobject.c part of the patch does almost nothing except adding a bunch of casts here and there.

31 bits would involve rewriting the powering algorithm, which assumes that PyLong_SHIFT is divisible by 5. It would gain very little over 30 bits, and if Pernici Mario's optimizations are considered (issue 3944) multiplication would likely be slower with 31-bit digits than with 30-bit digits.

63 (or 62, or 60) bits is simply too large right now: you'd need access to a hardware 64 x 64 -> 128 bit multiply to make this worth it, and I'd guess there are many fewer platforms that make this easy, though I don't really know. I know it's possible on gcc/x86_64 by making use of the (undocumented) __uint128_t type. But even where this type is available, base 263 might still be slower than base 230. I've done some experiments with multiprecision decimal arithmetic in C that showed that even on a 64-bit machine, using base 109 (i.e. approx 30 bits) was significantly faster than using base 1018.

120 bits? Does GMP even go this far? I guess part of the attraction is that it's a multiple of 8...

The other obvious choices to consider would be 32 bits (or 16 bits, or 64 bits). This is possible, and may even be worth it, but it would be a much bigger effort; most of the algorithms would need to be rewritten. One problem is again the mismatch between C and assembler: detecting overflow when adding two 32-bit unsigned integers is trivial in x86 assembler, but it's not so obvious how to write portable C code that has a decent chance of being compiled optimally on a wide variety of compilers.

I guess my feeling is simply that the 15-bit to 30-bit change seems incredibly easy and cheap: very little code change, and hence low risk of accidental breakage. So if there are indeed significant efficiency benefits (still to be determined) then it looks like a clear win to me.

[Gregory Smith]

(i haven't reviewed your autoconf stuff yet)

The changes to configure and pyconfig.h are just from rerunning autoconf and autoheader; it's only configure.in that should need looking at.

msg75522 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-11-05 11:24

And why 30 bits and not 31 bits, or 63 bits, or 120 bits?

Mostly laziness (...)

It was an argument for changing the base used by the mashal :-)

31 bits would involve rewriting the powering algorithm, which assumes that PyLong_SHIFT is divisible by 5

Powering is an simple algorithm. If it was the division, it would be much harder :-) Python stores the sign of the number in the first digit. Because of that, we are limited to 15/30 bits. Storing the sign in the size (which size? no idea yet) would allows to use a bigger base (31 bits? 63 bits?).

One problem is again the mismatch between C and assembler: detecting overflow when adding two 32-bit unsigned integers is trivial in x86 assembler, but it's not so obvious how to write portable C code that has a decent chance of being compiled optimally on a wide variety of compilers.

I wrote an example to detect overflows in C on the mailing list. Copy of my email: ------------------------------- 8< ---------------------- About 31, 32, 63 or 64 bits: I guess that you want to avoid integer overflow. Intel has an "overflow" flag, changed for all instructions. For other CPUs, you can use emulate this flag. Example with the type int that I used in my GMP patch:

Add: int a, b, sum; sum = a + b; exact = ((a < 0) ^ (b < 0)) || ((a < 0) == (sum < 0));

Substract: int a, b, diff; diff = a + b; exact = ((a < 0) == (b < 0)) || ((a < 0) == (diff < 0));

Multiply: int a, b, product; if (a == 0 || b == 0) { product = 0; /* exact / } else if (a != INT_MIN || (b == 1)) { product = a * b; exact = (product / b) == a); } else { / INT_MIN * -1 = -INT_MIN: overflow */ }

Divide: int a, b, q; if (a != INT_MIN || b != -1) { q = a / b; /* exact / } else { / INT_MIN / -1 = -INT_MIN: overflow */ }

Checking overflow may costs more than using a smaller base. Only a benchmark can answer ;-) ------------------------------- 8< ----------------------

I guess my feeling is simply that the 15-bit to 30-bit change seems incredibly easy and cheap: very little code change, and hence low risk of accidental breakage.

Python has an amazing regression test suite! I used it to fix my GMP patch. We can experiment new bases using this suite.

Anyway, i love the idea of using 30 bits instead of 15! Most computer are now 32 or 64 bits! But it's safe to keep the 15 bits version to support older computers or buggy compilers.

I started to work with GIT. You may be interrested to work together on GIT. It's much easier to exchanges changeset and play with branches. I will try to publish my GIT tree somewhere.

msg75534 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2008-11-05 23:04

Following Victor's suggestion, here's an updated patch; same as before, except that marshal now uses base 2**15 for reading and writing, independently of whether PyLong_SHIFT is 15 or 30.

msg75536 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-11-05 23:13

Mark: would it be possible to keep the "2 digits" hack in PyLong_FromLong, especially with base 2^15? Eg. "#if PyLong_SHIFT == 15". The base 2^15 slow, so don't make it slower :-)

/* 2 digits */
if (!(ival >> 2*PyLong_SHIFT)) {
```
  v = _PyLong_New(2);
```
```
  if (v) {
```
```
     Py_SIZE(v) = 2*sign;
```

     v->ob_digit[0] = (digit)ival & PyLong_MASK;

     v->ob_digit[1] = ival >> PyLong_SHIFT;

```
  }
```
```
  return (PyObject*)v;
```
}

msg75539 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-11-06 01:35

marshal now uses base 2**15 for reading and writing

Yes, it uses base 215 but it's not the correct conversion to base 215. You convert each PyLong digit to base 2**15 but not the whole number. As a result, the format is different than the current mashal version.

msg75540 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-11-06 02:32

PyLong_FromLong() doesn't go into the 1 digit special case for negative number. You should use: /* Fast path for single-digits ints / if (!(abs_ival>>PyLong_SHIFT)) { v = _PyLong_New(1); if (v) { Py_SIZE(v) = sign; v->ob_digit[0] = abs_ival; } return (PyObject)v; }

msg75551 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2008-11-06 09:01

Yes, it uses base 215 but it's not the correct conversion to base 215. You convert each PyLong digit to base 2**15 but not the whole number.

I don't understand: yes, each base 230 digit is converted to a pair of base 215 digits, and if necessary (i.e., if the top 15 bits of the most significant base 2**30 digit are zero) the size is adjusted. How is this not converting the whole number?

As a result, the format is different than the current mashal version.

Can you give an example of an integer n such that marshal.dumps(n) gives you different results with and without the patch? As far as I can tell, I'm getting the same marshal results both with the unpatched version and with the patch applied.

msg75553 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2008-11-06 09:21

Other responses...

It was an argument for changing the base used by the mashal :-)

Ah. I think I'm with you now. You're saying that ideally, marshal shouldn't have to care about how Python stores its longs: it should just ask some function in longobject.c to provide an already-converted- to-base-256 representation. Is that right?

I also find the idea of making the marshal representation base 256 quite attractive. There are already functions in longobject.c that could be used for this: PyLong{From,As}ByteArray. And then you wouldn't need to touch marshal.c when swapping the GMP version of longobject.c in and out.

Python stores the sign of the number in the first digit. Because of that, we are limited to 15/30 bits.

No: the sign is stored in the size: if v is a PyLongObject then ABS(Py_SIZE(v)) gives the number of digits in the absolute value of the integer represented by v, and the sign of Py_SIZE(v) gives the sign of the integer.

would it be possible to keep the "2 digits" hack in PyLong_FromLong, especially with base 2^15? Eg. "#if PyLong_SHIFT == 15". The base 2^15 slow, so don't make it slower :-)

Why don't we leave this kind of micro-optimization out until we've got some benchmarks. (I'm also tempted to get rid of the long_mul fast path for now.)

PyLong_FromLong() doesn't go into the 1 digit special case for negative number.

Well spotted! Yes, this should be fixed. I have a nasty feeling that I was responsible for introducing this bug some time ago...

msg75559 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2008-11-06 11:17

Here's a pybench comparison, on OS X 10.5/Core 2 Duo/gcc 4.0.1 (32-bit non-debug build of the py3k branch). I got this by doing:

[create clean build of py3k branch] dickinsm$ ./python.exe Tools/pybench/pybench.py -f bench_unpatched [apply 30bit patch and rebuild] dickinsm$ ./python.exe Tools/pybench/pybench.py -c bench_unpatched

Highlights: SimpleLongArithmetic: around 10% faster. SimpleComplexArithmetic: around 16% slower! CompareFloatsIntegers: around 20% slower.

I'll investigate the slowdowns.

msg75560 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-11-06 11:30

I'll investigate the slowdowns

The problem may comes from int64_t on 32 bits CPU. 32x32 -> 64 may be emulated on your CPU and so it's slower. I improved your patch to make it faster, but I lost all my work because of a misuse of GIT... As I remember:

I fixed PyLong_FromLong() for small negative integer
I unrolled the loop in PyLong_FromLong(): the loop is at least called twice (the number has 2 digits or more)
I added special code for operations on two numbers of 1 digit (each) for long_add(), long_mul(), long_div(), long_bitwise(), etc.
and I don't remember the other changes...

Oh, I have an old patch. I will attach it to this message. About speed, it was:

unpatched: 20600..20900 pystones
your patch: 19900..20100 pystones
- my changes: 20200..20400 pytones

msg75561 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-11-06 12:47

I wrote a patch to compute stat about PyLong function calls.

make (use setup.py):

PyLong_FromLong: 168572 calls, min=( 0, ), avg=(1.4, ), max=( 3, ) long_bool: 48682 calls, min=( 0, ), avg=(0.2, ), max=( 2, ) long_add: 39527 calls, min=( 0, 0), avg=(0.9, 1.0), max=( 2, 3) long_compare: 39145 calls, min=( 0, 0), avg=(1.2, 1.1), max=( 3, 3) PyLong_AsLong: 33689 calls, min=( 0, ), avg=(0.9, ), max=( 45, ) long_sub: 13091 calls, min=( 0, 0), avg=(0.9, 0.8), max=( 1, 1) long_bitwise: 4636 calls, min=( 0, 0), avg=(0.8, 0.6), max=( 2, 2) long_hash: 1097 calls, min=( 0, ), avg=(0.9, ), max=( 3, ) long_mul: 221 calls, min=( 0, 0), avg=(0.8, 1.1), max=( 2, 2) long_invert: 204 calls, min=( 0, ), avg=(1.0, ), max=( 1, ) long_neg: 35 calls, min=( 1, ), avg=(1.0, ), max=( 1, ) long_format: 3 calls, min=( 0, ), avg=(0.7, ), max=( 1, ) long_mod: 3 calls, min=( 1, 1), avg=(1.0, 1.0), max=( 1, 1) long_pow: 1 calls, min=( 1, 1), avg=(1.0, 1.0), max=( 1, 1)

pystone:

PyLong_FromLong:1587652 calls, min=( 0, ), avg=(1.0, ), max=( 3, ) long_add: 902487 calls, min=( 0, 0), avg=(1.0, 1.0), max=( 2, 2) long_compare: 651165 calls, min=( 0, 0), avg=(1.0, 1.0), max=( 3, 3) PyLong_AsLong: 252476 calls, min=( 0, ), avg=(1.0, ), max=( 2, ) long_sub: 250032 calls, min=( 1, 0), avg=(1.0, 1.0), max=( 1, 1) long_bool: 102655 calls, min=( 0, ), avg=(0.5, ), max=( 1, ) long_mul: 100015 calls, min=( 0, 0), avg=(1.0, 1.0), max=( 1, 2) long_div: 50000 calls, min=( 1, 1), avg=(1.0, 1.0), max=( 1, 1) long_hash: 382 calls, min=( 0, ), avg=(1.1, ), max=( 2, ) long_bitwise: 117 calls, min=( 0, 0), avg=(1.0, 1.0), max=( 1, 2) long_format: 1 calls, min=( 2, ), avg=(2.0, ), max=( 2, )

min/avg/max are the integer digit count (minimum, average, maximum).

What can we learn from this numbers?

PyLong_FromLong(), long_add() and long_compare() are the 3 most common operations on integers.

Except PyLong_FromLong(), long_compare() and long_format(), arguments of the functions are mostly in range [-2^15; 2^15].

Biggest number is a number of 45 digits: maybe just one call to long_add(). Except this number/call, the biggest numbers have between 2 and 3 digits.

long_bool() is never called with number bigger than 2 digits.

long_sub() is never called with number bigger than 1 digit!

msg75562 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-11-06 13:24

And now the stat of Python patched with 30bit_longdigit3.patch.

min/avg/max are now the number of bits which gives better informations. "bigger" is the number of arguments which are bigger than 1 digit (not in range [-2^30; 2^30]).

make

_FromLong: 169734 calls, min=( 0, ), avg=(11.6, ), max=( 32, ) --> bigger=31086 long_bool: 48772 calls, min=( 0, ), avg=( 0.3, ), max=( 24, ) long_add: 39685 calls, min=( 0, 0), avg=( 6.5, 3.5), max=( 19, 32) --> bigger=1 long_compare: 39445 calls, min=( 0, 0), avg=( 9.3, 8.4), max=( 31, 33) --> bigger=10438 _AsLong: 33726 calls, min=( 0, ), avg=( 4.9, ), max=(1321, ) --> bigger=10 long_sub: 13285 calls, min=( 0, 0), avg=( 7.6, 5.6), max=( 13, 13) long_bitwise: 4690 calls, min=( 0, 0), avg=( 1.7, 1.9), max=( 16, 16) long_hash: 1097 calls, min=( 0, ), avg=( 8.1, ), max=( 33, ) --> bigger=4 long_mul: 236 calls, min=( 0, 0), avg=( 1.3, 5.4), max=( 17, 17) long_invert: 204 calls, min=( 0, ), avg=( 2.4, ), max=( 3, ) long_neg: 35 calls, min=( 1, ), avg=( 4.3, ), max=( 7, ) long_format: 3 calls, min=( 0, ), avg=( 2.0, ), max=( 4, ) long_mod: 3 calls, min=( 1, 2), avg=( 1.7, 2.0), max=( 2, 2) long_pow: 1 calls, min=( 2, 6), avg=( 2.0, 6.0), max=( 2, 6)

Notes about make:

PyLong_FromLong(), long_compare(), PyLong_AsLong() and long_hash() gets integers not in [-2^30; 2^30] which means that all other functions are only called with arguments of 1 digit!
PyLong_FromLong() gets ~30.000 (18%) integers of 32 bits
global average integer size is between 0.3 and 11.6 (~6.0 bits?)
There are 41.500 (12%) big integers on ~350.000 integers

pystone

_FromLong: 1504983 calls, min=( 0, ), avg=( 5.1, ), max=( 31, ) --> bigger=14 long_add: 902487 calls, min=( 0, 0), avg=( 3.9, 2.4), max=( 17, 17) long_compare: 651165 calls, min=( 0, 0), avg=( 1.7, 1.4), max=( 31, 31) --> bigger=27 _AsLong: 252477 calls, min=( 0, ), avg=( 4.6, ), max=( 16, ) long_sub: 250032 calls, min=( 1, 0), avg=( 4.0, 1.6), max=( 7, 7) long_bool: 102655 calls, min=( 0, ), avg=( 0.5, ), max=( 7, ) long_mul: 100015 calls, min=( 0, 0), avg=( 2.5, 2.0), max=( 4, 16) long_truediv: 50000 calls, min=( 4, 2), avg=( 4.0, 2.0), max=( 4, 2) long_hash: 382 calls, min=( 0, ), avg=( 8.1, ), max=( 28, ) long_bitwise: 117 calls, min=( 0, 0), avg=( 6.7, 6.6), max=( 15, 16) long_format: 1 calls, min=(16, ), avg=(16.0, ), max=( 16, )

Notes about pystone:

very very few numbers are bigger than one digit: 41 / ~4.000.000
global average integer size is between 0.5 and 6.7 (~3.0 bits?)
the biggest number has only 31 bits (see long_compare)

Short summary:

pystone doesn't use big integer (1 big integer for 100.000 integers) => don't use pystone!
the average integer size is around 3 or 6 bits, which means that most integers can be stored in 8 bits (-255..255) => we need to focus on the very small numbers => base 2^30 doesn't help for common Python code, it only helps programs using really big numbers (128 bits or more?)

msg75695 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2008-11-10 13:34

Here's a minor update to the patch, that does some extra cleanup:

don't include longintrepr.h in Objects/abstract.c or Objects/boolobject.c --- it's not needed.
fix several places in longobject.c where int should have been size_t or Py_ssize_t
remove some unnecessary forward declarations in longobject.c.
fix PyLong_FromLong for small negative integers

At some point I'll try to separate the pure bugfixes (missing casts, int vs Py_ssize_t, etc.) from the 15-bit to 30-bit conversion.

msg75718 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-11-11 01:58

Using 30bit_longdigit4.patch, I get this error: "Objects/longobject.c:700: erreur: "SIZE_T_MAX" undeclared (first use in this function)". You might use the type Py_ssize_t with PY_SSIZE_T_MAX. I used INT_MAX to compile the code.

msg75720 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-11-11 02:24

I like the idea of sys.int_info, but I would prefer a "base" attribute than "bits_per_digit". A base different than 2^n might be used (eg. a base like 10^n for fast conversion from/to string).

msg75746 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2008-11-11 16:30

Here's a version of the 15-bit to 30-bit patch that adds in a souped-up version of Mario Pernici's faster multiplication.

I did some testing of 100x100 digit and 1000x1000 digit multiplies. On 32-bit x86: 100 x 100 digits : around 2.5 times faster 1000 x 1000 digits : around 3 times faster.

On x86_64, I'm getting more spectacular results: 100 x 100 digits : around 5 times faster 1000 x 1000 digits: around 7 times faster!

The idea of the faster multiplication is quite simple: with 30-bit digits, one can fit a sum of 16 30-bit by 30-bit products in a uint64_t. This means that the inner loop for the basecase grade-school multiplication contains one fewer addition and no mask and shift.

[Victor, please don't delete the old longdigit4.patch!]

msg77769 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-12-14 00:28

Just tested the patch, here are some benchmarks:

./python -m timeit -s "a=100000000;b=777777" "a//b" -> 2.6: 0.253 usec per loop -> 3.1: 0.61 usec per loop -> 3.1 + patch: 0.331 usec per loop

./python -m timeit -s "a=100000000;b=777777" "a*b" -> 2.6: 0.431 usec per loop -> 3.1: 0.302 usec per loop -> 3.1 + patch: 0.225 usec per loop

./python -m timeit -s "a=100000000;b=777777" "a+b" -> 2.6: 0.173 usec per loop -> 3.1: 0.229 usec per loop -> 3.1 + patch: 0.217 usec per loop

But it seems there are some outliers:

./python -m timeit -s "a=1000000005+1;b=7777773" "a//b" -> 2.6: 1.13 usec per loop -> 3.1: 1.12 usec per loop -> 3.1 + patch: 1.2 usec per loop

msg77770 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-12-14 01:09

I wrote a small benchmark tool dedicated to integer operations (+ -

/ etc.): bench_int.py attached to . See also Message75715 and Message75719 for my benchmark results. Short sum up: 2^30 base helps a lot on 64 bits CPU (+26%) whereas the speedup is small (4%) on 32 bits CPU. But don't trust benchmarks :-p

msg82116 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-14 20:32

The most recent patch is out of date and no longer applies cleanly. I'm working on an update.

msg82250 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-16 16:47

Updated patch against py3k. I'm interested in getting this into the trunk as well, but py3k is more important (because all integers are long integers). It's also a little more complicated to do this for py3k (mostly because of all the small integer caching), so backporting to 2.7 is easier than trying to forward port a patch from 2.7 to 3.1.

Notes:

I've added a configure option --enable-big-digits (there are probably better names), enabled by default. So you can use --disable-big-digits to get the old 15-bit behaviour.
I think this patch should work on Windows; confirmation would be appreciated.
I've removed the fast multiplication code, in the interests of keeping the patch simple. If this patch goes in, we can concentrate on speeding up multiplication afterwards. For now, note that 30-bit digits give the potential for significant speedups in multiplication and division (see next item).
There's a nasty 'feature' in x_divmod: the multiplication in the innermost loop is digit-by-twodigits -> twodigits, when it should be digit-by-digit -> twodigits; this probably causes significant slowdown with 30-bit digits. This may explain Antoine's outlier. Again, if this patch goes in I'll work on fixing x_divmod.
Re: Victor's comment about a 'base' attribute: I tried this, but quickly discovered that we still need the 'bits_per_digit' for tests. I think that binaryness is so ingrained that it's not really worth worrying about the possibility of the base changing from a power of 2 to a power of 10. So in the end I left base out.
It did occur to me that NSMALLPOSINTS and NSMALLNEGINTS might usefully be exposed in sys.int_info, mostly for the purposes of testing. Thoughts?

msg82251 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-16 16:51

Forgot to mention: you'll need to rerun autoconf and autoheader after applying the patch and before doing ./configure

msg82318 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-17 11:30

Antoine, were your posted results on a 64-bit or a 32-bit system?

msg82319 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-02-17 11:34

Mark, I think it was 32-bit at the time.

msg82320 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-02-17 11:44

Now with the latest patch, and under a 64-bit system (the same one actually, but with a 64-bit distro):

pybench is roughly 2% slower
timeit -s "a=100000000;b=777777" "a//b"

before: 0.563 usec per loop
after: 0.226 usec per loop

timeit -s "a=100000000;b=777777" "a*b"

before: 0.179 usec per loop
after: 0.131 usec per loop

timeit -s "a=100000000;b=777777" "a+b"

before: 0.174 usec per loop
after: 0.134 usec per loop

msg82321 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-02-17 11:49

Actually, I think my previous results were in 64-bit mode already.

By the way, I don't think unconditionally using uint64_t is a good thing on 32-bit CPUs. uint64_t might be an emulated type, and operations will then be very slow. It would be better to switch based on sizeof(long) (for LP64 systems) or sizeof(void *) (for LLP64 systems).

msg82325 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-02-17 12:12

Actually, I still get a speedup on a 32-bit build. :)

msg82326 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-17 12:14

Here's a version of the patch that includes optimizations to basecase multiplication, and a streamlined x_divrem for faster division. With Victor's benchmark, I'm getting 43% speed increase on 64-bit Linux/Core 2 Duo.

Note: the base patch is stable and ready for review; in contrast, the optimizations are still in a state of flux, so the +optimizations patch is just there as an example of what might be possible.

About using uint64_t:

the 64-bit type isn't really used very much: its main role is as the result type of a 32-bit by 32-bit multiplication. So it might not matter too much if it's an emulated type; what's important is that the 32-bit by 32-bit multiply with 64-bit results is done in a single CPU instruction. I don't know how to test for this. Do you know of a mainstream system where this isn't true?

I'll test this tonight on 32-bit PPC and 32=bit Intel, and report back.

I don't care very much about trying to automatically do the right thing for small or embedded systems: they can use the --disable-big-digits configure option to turn 30-bit digits off.

Antoine, do you think we should be using 30-bit digits by default only on 64-bit machines? I guess I could go with that, if it can be manually overridden by the configure option.

msg82342 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-02-17 17:09

As I said, I actually see a speedup as well on a 32-bit build on a 64-bit CPU. So the current patch (30bit_longdigit13.patch) is fine.

msg82344 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-02-17 17:22

Some more benchmarks results (with 30bit_longdigit13.patch):

Victor's bench_int.py:

32-bit without patch: 1370.1 ms
32-bit with patch: 1197.8 ms (23% speedup)
64-bit without patch: 1357.6 ms
64-bit with patch: 981.6 ms (28% speedup)

calculating 2000 digits of pi (*):

32-bit without patch: 2.87 s.
32-bit with patch: 2.87 s. (0% speedup: ???)
64-bit without patch: 3.35 s.
64-bit with patch: 1.68 s. (50% speedup)

(*) using the following script adapted for py3k: http://shootout.alioth.debian.org/u64q/benchmark.php?test=pidigits&lang=python&id=1

msg82345 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-02-17 17:25

Here's the py3k version of pidigits.py.

msg82346 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-17 17:27

Thanks, Antoine. I've reworked the configure stuff anyway: the decision about what size digits to use should take place in pyport.h rather than Include/longintrepr.h. Updated patches will arrive shortly!

msg82347 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-17 17:36

Updated non-optimized patch. The only real change is that I've moved some of the configuration stuff around (so not worth re-benchmarking this one); I hope that I've now got the division of labour correct:

configure script simply parses the --enable-big-digits option and sets PYLONG_DIGIT_SIZE in pyconfig.h to 15 or 30, or leaves it undefined if that option wasn't given to configure
PC/pyconfig.h doesn't define PYLONG_DIGIT_SIZE
pyport.h chooses a suitable value for PYLONG_DIGIT_SIZE if it's not already defined
Include/longintrepr.h just follows orders: it expects PYLONG_DIGIT_SIZE to be defined already, and complains if PYLONG_DIGIT_SIZE=30 but the necessary integer types aren't available.

Thanks for all the benchmarking.

I'd probably better check on python-dev before pushing this in, since it's a new feature. I hope no-one wants a PEP. :-)

msg82348 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-02-17 17:47

The last patch (30bit_longdigit14.patch) is obviously missing some stuff, but other than that I think everything's fine and you could commit.

msg82350 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-17 17:55

Oops. Here's the correct patch.

msg82362 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2009-02-17 20:23

Has any conclusion been reached wrt. overhead of 30-bit multiplication on 32-bit systems? IIUC, the single-digit multiplication is equivalent to the C program

unsigned long long m(unsigned long long a, unsigned long b) { return a*b; }

(i.e. one digit is cast into two digits, and multiplied with the other one). gcc 4.3.3, on x86, compiles this into

    movl    12(%esp), %eax
    movl    8(%esp), %ecx
    imull   %eax, %ecx
    mull    4(%esp)
    leal    (%ecx,%edx), %edx
    ret

In pseudo-code, this is

    tmp = high_a * b;
    high_res:low_res = low_a * b;
    high_res += tmp

So it does use two multiply instructions (plus an add), since one argument got cast into 64 bits.

VS2008 compiles it into

push	eax
push	ecx
push	0
push	edx
call	__allmu

i.e. it widens both arguments to 64 bits, then calls a library routine.

msg82364 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-17 20:30

unsigned long long m(unsigned long long a, unsigned long b) { return a*b; }

I think that's doing a 32 x 64 -> 64 multiplication; what's being used is more like this:

unsigned long long m(unsigned long a, unsigned long b) { return (unsigned long long)a*b; }

which gcc -O3 compiles to:

pushl	%ebp
movl	%esp, %ebp
movl	12(%ebp), %eax
mull	8(%ebp)
leave
ret

msg82368 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-17 20:52

Patch uploaded to Rietveld (assuming that I did it right):

http://codereview.appspot.com/14105

msg82374 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-17 21:56

It looks as though Visual Studio 2008 does the 'right' thing, too, at least in some circumstances. Here's some assembler output (MSVC Express Edition, 32-bit Windows XP / Macbook Pro).

; 3 : unsigned long long mul(unsigned long x, unsigned long y) {

push	ebp
mov	ebp, esp

; 4 : return (unsigned long long)x * y;

mov	eax, DWORD PTR _x$[ebp]
mov	ecx, DWORD PTR _y$[ebp]
mul	ecx

; 5 : }

pop	ebp
ret	0

msg82375 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2009-02-17 22:00

Patch uploaded to Rietveld (assuming that I did it right): http://codereview.appspot.com/14105

Hehe, your configure's patch is too huge for Rietveld which displays a "MemoryError" :-) Bug reported at: http://code.google.com/p/rietveld/issues/detail?id=87

msg82385 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2009-02-17 23:00

Ok, let's try 30bit_longdigit14.patch:

patch -p0 < 30bit_longdigit14.patch autoconf && autoheader ./configure && make

I'm using two computers:

marge: Pentium4, 32 bits, 3 GHz (32 bits)
lisa: Core Quad (Q9300), 64 bits, 2.5 GHz (64 bits)

Both uses 32 bits digit (and 64 bits twodigits), py3k trunk and Linux.

My bench_int.py:
- 32-bit without patch: 1670.7 ms
- 32-bit with patch: 1547.8 ms (+7.4%)
- 64-bit without patch: 885.2 ms
- 64-bit with patch: 627.1 ms (+29.2%)
pidigits 2000 (I removed the calls to print): lowest result on 5 runs:
- 32-bit without patch: 2991.5 ms
- 32-bit with patch: 3445.4 ms (-15.2%) SLOWER!
- 64-bit without patch: 1949.9 ms
- 64-bit with patch: 973.0 ms (+50.1%)
pybench.py (minimum total)
- 32-bit without patch: 9209 ms
- 32-bit with patch:
- 64-bit without patch: 4430 ms
- 64-bit with patch: 4330 ms (=)

pybench details:

Test 32 bits (without,patch) | 64 bits (without,patch)

CompareFloatsIntegers: 293ms 325ms | 113ms 96ms CompareIntegers: 188ms 176ms | 129ms 98ms DictWithIntegerKeys: 117ms 119ms | 73ms 69ms SimpleIntFloatArithmetic: 192ms 204ms | 84ms 80ms SimpleIntegerArithmetic: 188ms 196ms | 84ms 80ms

On 64 bits, all integer related tests are faster. On 32 bits, some tests are slower.

Sum up: on 64 bits, your patch is between cool (30%) and awesome (50%) :-) On 32 bits, it's not a good idea to use 32 bits digit because it's a little bit slower.

=> I would suggest to use 2^30 base only if sizeof(long)>=8 (64 bits CPU).

Note: I already get similar result (2^30 is slower on 32 bits CPU) in older tests.

msg82386 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2009-02-17 23:02

New attachment: pidigits_noprint.py, hacked version of pidigits.py to remove the print and add the computation time in millisecond. Print was useless because we don't want to benchmark int->str conversion, especially when the integer is in [0; 9].

msg82404 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-18 03:32

Thanks very much for the timings, Victor.

Just out of interest, could you try the pydigits script with the +optimizations patch on 32-bit?

As mentioned above, there's a significant (for 30-bit digits) problem with x_divrem: the inner loop does a 32 x 64-bit multiply when it should be doing a 32 x 32-bit multiply (the variable q is declared as twodigits, but always fits into a digit). This is fixed in the +optimizations patch, and pi_digits is heavy on the divisions, so I wonder whether this might make a difference.

msg82407 - (view)

Author: Gregory P. Smith (gregory.p.smith) * (Python committer)

Date: 2009-02-18 05:13

I would suggest to use 2^30 base only if sizeof(long)>=8 (64 bits CPU).

Thats not the correct test. Test for an actual 64-bit build target. sizeof(long) and sizeof(long long) are not usefully related to that in any sort of cross platform manner. On windows, we'd define the flag for 15 vs 30 bit longs in the build configs for the various build targets. On every thing else (autoconf), we can use a configure test to check the same things that platform.architecture() checks to return '32bit' vs '64bit'.

msg82424 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-18 17:06

Reviewers: Martin v. Löwis,

http://codereview.appspot.com/14105/diff/1/11 File Doc/library/sys.rst (right):

http://codereview.appspot.com/14105/diff/1/11#newcode418 Line 418: A struct sequence that holds information about Python's Agreed. All that's important here is the attribute access.

http://codereview.appspot.com/14105/diff/1/6 File Include/longintrepr.h (right):

http://codereview.appspot.com/14105/diff/1/6#newcode24 Line 24: Furthermore, NSMALLNEGINTS and NSMALLPOSINTS should fit in a digit. */ On 2009/02/17 22:39:18, Martin v. Löwis wrote:

Merge the comments into a single on. There is no need to preserve the evolution of the code in the comment structure.

Done, along with a general rewrite of this set of comments.

http://codereview.appspot.com/14105/diff/1/9 File Objects/longobject.c (right):

http://codereview.appspot.com/14105/diff/1/9#newcode2872 Line 2872: /* XXX benchmark this! Is is worth keeping? */ On 2009/02/17 22:39:18, Martin v. Löwis wrote:

Why not PyLong_FromLongLong if available (no special case if not)?

Yes, PyLong_FromLongLong would make sense. If this is not available, we still need to make sure that CHECK_SMALL_INT gets called.

http://codereview.appspot.com/14105/diff/1/10 File PC/pyconfig.h (right):

http://codereview.appspot.com/14105/diff/1/10#newcode318 Line 318: #define PY_UINT64_T unsigned __int64 On 2009/02/17 22:39:18, Martin v. Löwis wrote:

I think this should use PY_LONG_LONG, to support MingW32; likewise, __int32 shouldn't be used, as it is MSC specific

Ok. I'll use PY_LONG_LONG for 64-bit, and try int and long for 32-bit.

http://codereview.appspot.com/14105/diff/1/2 File Python/marshal.c (right):

http://codereview.appspot.com/14105/diff/1/2#newcode160 Line 160: w_long((long)(Py_SIZE(ob) > 0 ? l : -l), p); On 2009/02/17 22:39:18, Martin v. Löwis wrote:

This needs to deal with overflow (sizeof(size_t) > sizeof(long))

Hmm. It looks as though there are many places in this file, particularly in w_object, that do "w_long((long)n, p), where n has type Py_ssize_t. Presumably all of these should be fixed.

http://codereview.appspot.com/14105/diff/1/2#newcode540 Line 540: if (n < -INT_MAX || n > INT_MAX) On 2009/02/17 22:39:18, Martin v. Löwis wrote:

I think this is obsolete now; longs can have up to ssize_t_max digits.

Agreed. Again, this needs to be fixed throughout marshal.c (many occurrences in r_object).

http://codereview.appspot.com/14105/diff/1/8 File configure.in (right):

http://codereview.appspot.com/14105/diff/1/8#newcode3132 Line 3132: # determine what size digit to use for Python's longs On 2009/02/17 22:39:18, Martin v. Löwis wrote:

I'm skeptical (-0) that we really need to have such a configure option.

I think it's potentially useful to be able to do --disable-big-digits on platforms where the compiler isn't smart enough to translate a 32-bit by 32-bit multiply into the appropriate CPU instruction, so that using 30-bit digits might hurt performance.

I've also found it handy during debugging and testing. But I guess I'm only +0.5 on the configure option; if others think that it's just unnecessary clutter then I'll remove it.

http://codereview.appspot.com/14105/diff/1/14 File pyconfig.h.in (left):

http://codereview.appspot.com/14105/diff/1/14#oldcode9 Line 9: #undef AC_APPLE_UNIVERSAL_BUILD On 2009/02/17 22:39:18, Martin v. Löwis wrote:

We should find out why this is gone.

Looks like an autoconf 2.63/autoconf 2.61 difference. Whoever previously ran autoconf and autoheader used 2.63; I used 2.61. (Which explains the huge configure diff as well.)

Description: This patchset makes it possible for Python to use base 230 instead of base 215 for its internal representation of arbitrary-precision integers.

The aim is both to improve performance of integer arithmetic, and to make possible some additional optimizations (not currently included in this patchset).

The patchset includes:

a new configure option --enable-big-digits
a new structseq sys.int_info giving information about the internal representation

See http://bugs.python.org/issue4258 for the related tracker discussion.

Please review this at http://codereview.appspot.com/14105

Affected files: M Doc/library/sys.rst M Include/longintrepr.h M Include/longobject.h M Include/pyport.h M Lib/test/test_long.py M Lib/test/test_sys.py M Objects/longobject.c M PC/pyconfig.h M Python/marshal.c M Python/sysmodule.c M configure M configure.in M pyconfig.h.in

msg82425 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-02-18 18:10

Le mercredi 18 février 2009 à 17:06 +0000, Mark Dickinson a écrit :

Looks like an autoconf 2.63/autoconf 2.61 difference. Whoever previously ran autoconf and autoheader used 2.63;

Sorry, that was me. autoconf seems unable to maintain reasonably similar output between two different versions...

msg82426 - (view)

Author: Gregory P. Smith (gregory.p.smith) * (Python committer)

Date: 2009-02-18 19:36

On 32-bit x86 (1.4Ghz Efficeon) using gcc 4.3.2-1ubuntu12 I see the following perf with pidigits_noprint 2000:

py3k: baseline longdigit14 longdigit13+optimizations 3709 ms 3664ms 4545ms

Those were from the best of five runs after a warmup loop while the system was idle. --enable-big-digits was passed to configure and sys.int_info confirmed it was 30bits on the longdigit versions.

baseline is entirely unpatched.

msg82428 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-18 20:08

Gregory, are you sure you didn't swap the 30-bit and 30-bit+opt results? On OS X/Core 2 Duo my timings are the other way around: 30bit is significantly slower than unpatched, 30bit+opt is a little faster than unpatched. Here are sample numbers:

Macintosh-3:py3k-30bit-opt dickinsm$ ./python.exe ../pidigits_noprint.py Time; 2181.3 ms Macintosh-3:py3k-30bit dickinsm$ ./python.exe ../pidigits_noprint.py 2000 Time; 2987.9 ms Macintosh-3:py3k dickinsm$ ./python.exe ../pidigits_noprint.py 2000 Time; 2216.2 ms

msg82429 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-18 20:34

And here are results from 64-bit builds on the same machine as above (OS X 10.5.6/Core 2 Duo, gcc 4.0.1 from Apple).

./python.exe ../pidigits_noprint.py 2000 gives the following timings:

30-bit digits: Time; 1245.9 ms 30-bit digits + optimizations: Time; 1184.4 ms unpatched py3k: Time; 2479.9 ms

msg82430 - (view)

Author: Gregory P. Smith (gregory.p.smith) * (Python committer)

Date: 2009-02-18 20:50

hmm yes, ignore my 13+optimize result. apparently that used 15bit digits despite --enable-big-digits on configure. attempting to fix that now and rerun.

msg82431 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-18 21:16

apparently that used 15bit digits despite --enable-big-digits on configure

Maybe configure didn't get updated properly? I'm almost sure I found a case where autoconf and autoheader just reused the stuff in the autom4te.cache directory even though configure.in had been changed. I've been doing: rm -fr autom4te.cache; autoconf; autoheader each time.

msg82432 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2009-02-18 21:27

On all other follow-ups I agree, so no further comments there.

http://codereview.appspot.com/14105/diff/1/2 File Python/marshal.c (right):

http://codereview.appspot.com/14105/diff/1/2#newcode160 Line 160: w_long((long)(Py_SIZE(ob) > 0 ? l : -l), p);

Presumably all of these should be fixed.

Ok, so I'd waive this for this patch; please do create a separate report.

http://codereview.appspot.com/14105

msg82433 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2009-02-18 21:33

Maybe configure didn't get updated properly? I'm almost sure I found a case where autoconf and autoheader just reused the stuff in the autom4te.cache directory even though configure.in had been changed. I've been doing: rm -fr autom4te.cache; autoconf; autoheader each time.

I think the docs say to run autoheader first, then autoconf.

msg82434 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-02-18 21:35

Maybe configure didn't get updated properly? I'm almost sure I found a case where autoconf and autoheader just reused the stuff in the autom4te.cache directory even though configure.in had been changed. I've been doing: rm -fr autom4te.cache; autoconf; autoheader each time.

Perhaps by doing "autoreconf" instead?

msg82444 - (view)

Author: Gregory P. Smith (gregory.p.smith) * (Python committer)

Date: 2009-02-18 23:45

new results after fixing my longdigit13 build to use 30 bits instead of 15 (the configure script in longdigit13+optimizations didn't work right, i had to manually add the #define to pyconfig.h)

py3k: baseline longdigit14 longdigit13+optimizations 3709 ms 3664 ms 3377 ms

thats much saner.

msg82445 - (view)

Author: Gregory P. Smith (gregory.p.smith) * (Python committer)

Date: 2009-02-19 01:03

attaching an updated pidigits benchmark script that does a warmup run before reporting the best result of 5.

msg82448 - (view)

Author: Gregory P. Smith (gregory.p.smith) * (Python committer)

Date: 2009-02-19 02:40

Here are the results from 32-bit x86 on core2 duo gcc 4.0.1 using pydigits_bestof.py 4000:

30-bit digits (14): 15719 ms 30-bit digits + optimizations (13+ops): 12490 ms unpatched py3k : 13289 ms

(again, i had to manually add #define PYLONG_DIGIT_SIZE 30 to pyconfig.h for the longdigit13+optimizations patch).

and pybench runs on the same builds vs unpatched:

30-bit digits (14): -1.4% (slightly slower) 30-bit digits + optimizations (13+ops): -0.2% (insignificant)

My votes:

Obviously use the optimized version (but fix the configure stuff).

+0 for enabling it by default on 32bit builds. +10 for enabling it by default on 64bit builds.

msg82458 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2009-02-19 06:33

Obviously use the optimized version (but fix the configure stuff).

Before such a version gets committed, I'd like to see it on Rietveld again.

msg82466 - (view)

Author: Pernici Mario (pernici)

Date: 2009-02-19 09:27

The attached patch uses mul1 in long_mul in the version patched with 30bit_longdigit13+optimizations.patch

Comparison between these two patches on hp pavilion Q8200 2.33GHz

pybench patch new patch SimpleIntegerArithmetic 89 85 other tests equal

pidigits_noprint 2000 998 947

msg82468 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-19 09:37

Before such a version gets committed, I'd like to see it on Rietveld again.

Sure. My original plan was to get the structural changes in first, and then worry about optimizations.

But now I think the x_divrem fix has to be considered a prerequisite for the 30-bit digits patch. So I'll clean up the x_divrem code, include it in the basic patch and upload to Rietveld. The other optimization (to multiplication) is more optional, potentially more controversial (it adds 60 or so lines of extra code), and should wait until after the commit and get a separate issue and review.

The x_divrem work actually simplifies the code (whilst not altering the underlying algorithm), as well as speeding it up. Still, it's changing a subtle core algorithm, so we need to be very sure that the new code's correct before it goes in. In particular, I want to add some tests to test_long that exercise the rare corner cases in the algorithm (which only become rarer when 30-bit digits are used).

I also want to understand Gregory's problems with configure before committing anything.

None of this is going to happen before the weekend, I'm afraid.

msg82469 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-19 09:42

http://codereview.appspot.com/14105/diff/1/2 File Python/marshal.c (right):

http://codereview.appspot.com/14105/diff/1/2#newcode160 Line 160: w_long((long)(Py_SIZE(ob) > 0 ? l : -l), p); On 2009/02/18 21:27:04, Martin v. Löwis wrote:

Ok, so I'd waive this for this patch; please do create a separate report.

Done.

http://bugs.python.org/issue5308

http://codereview.appspot.com/14105

msg82533 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-20 16:01

Here's an updated patch that includes the x_divrem fixes and optimizations. I've also updated the Rietveld issue with these fixes.

I think this is (modulo any requested changes) the version that I'd like to get into py3k. After that we can look at the possibility of optimizing the multiplication algorithm; the discussion for this should probably go back to issue 3944.

Here's a detailed list of the changes to x_divrem.

Most importantly, in the inner loop, we make sure that the multiplication is digit x digit -> twodigits; previously it was digit x twodigits -> twodigits, which is likely to expand to several instructions on a 32-bit machine. I suspect that this is the main cause of the slowdown that Victor was seeing.
Make all variables unsigned. This eliminates the need for Py_ARITHMETIC_RIGHT_SHIFT, and also removes the only essential use of stwodigits in the entire long object implementation.
Save an iteration of the outer loop when possible by comparing top digits.
Remove double tests in the main inner loop and correction loop.
Quotient digit correction step follows Knuth more closely, and uses fewer operations. The extra exit condition (r >= PyLong_BASE) will be true more than 50% of the time, and should be cheaper than the main test.
In quotient digit estimation step, remove unnecessary special case when vj == w->ob_digits[w_size-1]. Knuth only needs this special case because his base is the same as the wordsize.
There's no need to fill the eliminated digits of v with zero; instead, set Py_SIZE(v) at the end of the outer loop.
More comments; some extra asserts.

There are many other optimizations possible; I've tried only to include those that don't significantly complicate the code. An obvious remaining inefficiency is that on every iteration of the outer loop we divide by the top word of w; on many machines I suspect that it would be more efficient to precompute an inverse and multiply by that instead.

msg82538 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-02-20 16:31

Updated benchmarks results with 30bit_longdigit17.patch:

Victor's bench_int.py:

32-bit with patch: 1178.3 ms (24% speedup)
64-bit with patch: 990.8 ms (27% speedup)

Calculating 2000 digits of pi:

32-bit with patch: 2.16 s. (25% speedup)
64-bit with patch: 1.5 s. (55% speedup)

This is very good work.

msg82563 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-21 09:28

Adding Tim Peters to the nosy list, mainly to give him an opportunity to throw up his hands in horror at my rewrite of his (I'm guessing) implementation of division.

msg82599 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-22 12:19

It finally occurred to me that what might be killing 32-bit performance is the divisions, rather than the multiplications.

To test this, here's a version of 30bit_longdigit17.patch that replaces just two of the divisions in Objects/longsobject.c by the appropriate x86 divl assembler instruction. The result for pydigits is an astonishing 10% speedup!

Results of running python pydigits_bestof.py 2000 on 32-bit OS X 10.5.6/Core 2 Duo:

upatched py3k

Best Time; 2212.6 ms

30bit_longdigit17.patch

Best Time; 2283.9 ms (-3.1% relative to py3k)

30bit_longdigit17+asm.patch

Best Time; 2085.7 ms (+6.1% relative to py3k)

The problem is that (e.g., in the main loop of x_divrem) we're doing a 64-bit by 32-bit division, expecting a 32-bit quotient and a 32-bit remainder. From the analysis of the algorithm, we know that the quotient will always fit into 32 bits, so that e.g., on x86, a divl instruction is appropriate. But unless the compiler is extraordinarily clever it doesn't know this, so it produces an expensive library call to a function that probably involves multiple divisions and/or some branches, that produces the full 64-bit quotient.

On 32-bit PowerPC things are even worse, since there there isn't even a 64-by-32 bit divide instruction; only a 32-bit by 32-bit division.

So I could still be persuaded that 30-bit digits should only be enabled by default on 64-bit machines...

msg82669 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-02-24 18:18

Okay, let's abandon 30-bit digits on 32-bit machines: it's still unclear whether there's any real performance gain, and it's trivial to re-enable 30-bit digits by default later. I'm also going to abandon the optimizations for now; it'll be much easier to work on them once the base patch is in.

Here's a 'release-candidate' version of the patch: it's exactly the same as before, except:

I removed all x_divrem changes (though I've left the extra division tests in, since they're just as relevant to the old x_divrem). I'll open a separate issue for these optimizations.
enable 30-bit digits by default only on 64-bit platforms, where for the purposes of this patch a 64-bit platform is one with all the necessary integer types available (signed and unsigned 32- and 64-bit integer types) and SIZEOF_VOID_P >= 8.

I've also updated the version uploaded to Rietveld: the patchset there should exactly match 30bit_longdigit20.patch. See:

http://codereview.appspot.com/14105

Martin, thank you for all your help with reviewing the previous patch.
Is it okay with you to check this version in? It's the same as the version that you reviewed, except with some extra tests for correct division, and with 30-bit digits disabled by default on 32-bit platforms.

Gregory, if you have time, please could you double check that the configure stuff is working okay for you with this patch? "configure -- enable-big-digits" should produce 30-bit digits; "configure --disable- big-digits" should produce 15-bit digits, and a plain "configure" should give 15-bit or 30-bit depending on your platform.

Anyone else have any objections to this going in exactly as it is? I'd quite like to resolve this issue one way or the other and move on.

msg82792 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2009-02-26 23:52

1 comment and 1 question about 30bit_longdigit20.patch:

I love fixed size type: you use them when PYLONG_BITS_IN_DIGIT == 30 (eg. digit=PY_UINT32_T) but not when PYLONG_BITS_IN_DIGIT == 15 (eg. digit=unsigned short). Even if short is always 16 bits, I would prefer fixed size types in any case.
In pyport.h, you redefine PYLONG_BITS_IN_DIGIT if it's not set. Is it for the Windows world (which doesn't use configure script)?

I prefer the patch version 20 because it's much simplier than the other with the algorithm optimisations. The patch is already huge, so it's better to split it into smaller parts ;-)

msg83389 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-03-09 15:01

[Victor]

In pyport.h, you redefine PYLONG_BITS_IN_DIGIT if it's not set. Is it for the Windows world (which doesn't use configure script)?

Yes, but it's also for Unix: PYLONG_BITS_IN_DIGIT is only set when the -- enable-big-digits option is supplied to configure. So the code in pyport.h always gets used for a plain ./configure && make.

I love fixed size type: you use them when PYLONG_BITS_IN_DIGIT == 30 (eg. digit=PY_UINT32_T) but not when PYLONG_BITS_IN_DIGIT == 15 (eg. digit=unsigned short). Even if short is always 16 bits, I would prefer fixed size types in any case.

I agree with you to some extent, but I'd still prefer to leave the 15-bit definitions as they are, partly out of a desire not to make unnecessary changes. The 'unsigned short' type used for 15-bit digits is both theoretically sufficient and not wasteful in practice.

Barring further objections, I'm planning to get the 30bit_longdigit20.patch in later this week.

msg83467 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-03-11 13:02

I tried the patch on a 64-bit Linux system and it's ok.

msg83772 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-03-18 20:18

Committed 30bit_longdigit20.patch to py3k in r70460, with some minor variations (incorporate Nick Coghlan's improved error messages for marshal, fix some comment typos, add whatsnew and Misc/NEWS entries).

Thanks all for your feedback.

I might get around to backporting this to 2.7 one day...

msg83773 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-03-18 20:19

That should be r70459, of course.

msg83774 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-03-18 20:19

Great!

Committed 30bit_longdigit20.patch to py3k in r70460, with some minor variations (incorporate Nick Coghlan's improved error messages for marshal, fix some comment typos, add whatsnew and Misc/NEWS entries).

msg83865 - (view)

Author: Mark Dickinson (mark.dickinson) * (Python committer)

Date: 2009-03-20 16:08

Backported to trunk in r70479.

Mario, thanks for the long multiplication tweaks you submitted: could you possibly regenerate your Feb 19th patch against the trunk (or py3k, whichever you prefer) and attach it to issue 3944?