Issue 10044: small int optimization
Created on 2010-10-07 11:56 by pitrou, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Messages (38)
Author: Antoine Pitrou (pitrou) *
Date: 2010-10-07 11:56
This is an experimental patch to optimize some operations on small ints. pystone is 5-10% faster, pybench 2-3% faster, and here are some relevant benchmarks from unladen swallow:
nbody
Min: 0.345136 -> 0.317502: 1.09x faster Avg: 0.346827 -> 0.319561: 1.09x faster Significant (t=79.50) Stddev: 0.00140 -> 0.00198: 1.4084x larger
nqueens
Min: 0.339744 -> 0.313506: 1.08x faster Avg: 0.342630 -> 0.315380: 1.09x faster Significant (t=73.41) Stddev: 0.00218 -> 0.00146: 1.4931x smaller
If the principle gets accepted, we could experiment with further optimizations such as dedicated opcodes for addition-with-int-constant, subscripting-with-int-constant, etc.
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *
Date: 2010-10-07 13:34
Arithmetic with void* pointers is not allowed by the Microsoft compilers. char* should be used instead.
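Spelled out, the portable idiom looks like this (byte_offset is a hypothetical helper, not a CPython API): void * arithmetic is a GCC extension that MSVC rejects, so the pointer is cast to char * before the byte offset is applied.

```c
#include <stddef.h>

/* Hypothetical helper (not CPython API): void * arithmetic is a GCC
 * extension rejected by the Microsoft compilers, so cast to char *
 * before applying a byte offset. */
static void *byte_offset(void *base, size_t nbytes)
{
    return (char *)base + nbytes;
}
```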
Author: Daniel Stutzbach (stutzbach)
Date: 2010-10-07 14:43
How does performance change if you adjust NSMALLPOSINTS and NSMALLNEGINTS, but make no other changes?
Author: Mark Dickinson (mark.dickinson) *
Date: 2010-10-07 15:49
Looks like a nice idea, at first glance, though I haven't looked at the code in detail. I like the use of the macros to abstract away the long implementation details.
"INPLACE_SUBTRACT_end", not "INPLACE_SUBSTRACT_end", please!
Author: Mark Dickinson (mark.dickinson) *
Date: 2010-10-07 15:53
+#define _PyLong_IS_SMALL_INT(v) \
+    (((PyLongObject *)(v)) >= _PyLong_small_ints && \
+     ((PyLongObject *)(v)) < _PyLong_SMALL_INTS_END)
+/* These macros purposely avoid a cast to int, since it is most of the
+   time useless, and sometimes detrimental (because of truncation).
+   XXX _PyLong_AS_SMALL_INT might be slower if sizeof(PyLongObject) is
+   not a power of two.
+ */
Urk! This is nasty. :(
I don't think arbitrary comparisons of pointers give well-defined results, unless those pointers both happen to point into the same array. (Might be wrong; I don't have a copy of the C standard to hand.)
Author: Mark Dickinson (mark.dickinson) *
Date: 2010-10-07 16:13
Maybe we could consider adding an extra field to a PyLong giving its 'small_int' value for small values, and some flag value for non-small longs. An extra field wouldn't actually enlarge the size of a PyLong for small values---on a 64-bit machine, a value like 23L takes 28 bytes, for which 32 bytes will actually be allocated (since Python always allocates in multiples of 8 bytes, I believe).
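Mark's idea can be sketched on a toy struct; the names, layout, and flag value below are hypothetical, not the real PyLongObject. A spare field caches the value for small longs and holds a flag value otherwise, so the "is it small?" test never touches pointers.

```c
#include <stdint.h>

/* Toy sketch, not CPython's layout: a spare field caches the small
 * value, or holds NOT_SMALL for non-small longs. */
#define NOT_SMALL INT32_MIN

typedef struct {
    int32_t small_int;      /* cached value, or NOT_SMALL */
    uint32_t ob_digit[1];   /* magnitude digits, as in PyLongObject */
} flagged_long;

static int flag_is_small(const flagged_long *v)
{
    return v->small_int != NOT_SMALL;
}
```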
Author: Antoine Pitrou (pitrou) *
Date: 2010-10-07 16:23
Maybe we could consider adding an extra field to a PyLong giving its 'small_int' value for small values, and some flag value for non-small longs. An extra field wouldn't actually enlarge the size of a PyLong for small values---on a 64-bit machine, a value like 23L takes 28 bytes, for which 32 bytes will actually be allocated (since Python always allocates in multiples of 8 bytes, I believe).
I actually had a patch for that. It declared an ob_digit[2] array instead of ob_digit[1], and ob_digit[1] contained the small int. But the patch still used the pointer comparison approach for PyLong_IS_SMALL_INT, because I think it's faster (just two comparisons). So there didn't seem to be much point.
Also, the pointer addition trick for addition (see BINARY_ADD) is probably faster than the more intuitive method.
Author: Daniel Stutzbach (stutzbach)
Date: 2010-10-07 16:27
I don't think arbitrary comparisons of pointers give well-defined results, unless those pointers both happen to point into the same array. (Might be wrong; I don't have a copy of the C standard to hand.)
Technically arbitrary relational comparisons of pointers are undefined, but in practice Antoine's assumptions here are very modest. They boil down to:
v >= &array[0] && v < &array[array_len]
It is hard for me to imagine a system designed such that the expression could evaluate to true when v is not in the array.
I suppose a system could be designed where relational comparisons of unrelated data pointers causes a segmentation fault or similar, but that also seems unlikely to me.
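The check under discussion, spelled out on a stand-in array (sizes hypothetical): the comparisons are only well-defined by the C standard when v points into small_ints or one past its end; the argument here is that flat-memory platforms honor it regardless.

```c
#include <stddef.h>

#define NSMALL 262
static int small_ints[NSMALL];

/* True iff v points into small_ints.  Well-defined only when v points
 * into the array or one past its end; undefined (but "modest") for
 * arbitrary v, which is exactly the point under debate. */
static int in_small_range(const int *v)
{
    return v >= &small_ints[0] && v < &small_ints[NSMALL];
}
```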
Author: Antoine Pitrou (pitrou) *
Date: 2010-10-07 19:30
Technically arbitrary relational comparisons of pointers are undefined, but in practice Antoine's assumptions here are very modest. They boil down to:
v >= &array[0] && v < &array[array_len]
I can't say anything about the standard, but p > q looks like it should be the same as (p - q) > 0, which looks rather well-defined for pointers.
(At worst, we could cast to _Py_uintptr_t; hopefully the compiler wouldn't produce any different code.)
Author: Antoine Pitrou (pitrou) *
Date: 2010-10-07 20:29
How does performance change if you adjust NSMALLPOSINTS and NSMALLNEGINTS, but make no other changes?
It makes a very small difference (which is understandable since it doesn't cut down on code execution a lot).
Author: Mark Dickinson (mark.dickinson) *
Date: 2010-10-07 21:00
Technically arbitrary relational comparisons of pointers are undefined, but in practice Antoine's assumptions here are very modest.
I disagree: there's a very real practical danger here. Namely, optimizing compilers are free to assume that code doesn't involve any undefined behaviour and optimize accordingly. gcc for one is known to make extensive use of this freedom. I wouldn't be at all surprised to find some current or future version of gcc optimizing an (p >= start_of_array) check to 'always true', on the basis that it will always be true in legal code.
Please don't introduce undefined behaviour here---it's asking for trouble.
Author: Mark Dickinson (mark.dickinson) *
Date: 2010-10-07 21:04
I can't say anything about the standard, but p > q looks like it should be the same as (p - q) > 0
Yep.
which looks rather well-defined for pointers.
Nope. It's only well-defined for pointers pointing into the same array (or to one past the end of an array). Otherwise it's undefined behaviour.
See section 6.5.6, paragraph 9, of
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf
Author: Antoine Pitrou (pitrou) *
Date: 2010-10-07 21:12
Nope. It's only well-defined for pointers pointing into the same array (or to one past the end of an array). Otherwise it's undefined behaviour.
How can the compiler tell whether two pointers are "into the same array"? That sounds like an undecidable criterion.
Author: Mark Dickinson (mark.dickinson) *
Date: 2010-10-07 21:14
How can the compiler tell whether two pointers are "into the same array"? That sounds like an undecidable criterion.
It doesn't have to be able to tell---it's allowed to assume. :-)
Author: Antoine Pitrou (pitrou) *
Date: 2010-10-07 21:15
How can the compiler tell whether two pointers are "into the same array"? That sounds like an undecidable criterion.
It doesn't have to be able to tell---it's allowed to assume. :-)
That doesn't sound very clear or understandable.
Author: Eric V. Smith (eric.smith) *
Date: 2010-10-07 21:18
In the bad old days of 386 segment:offset memory architectures this was a problem. You could have overlapping segments but pointers inside an object were always in the same segment, so the segment selectors never had to be inspected. Pointers to different objects could indeed have the same offset and would compare equal.
We should follow the standard here: no comparisons between pointers to different arrays (basically).
Author: Mark Dickinson (mark.dickinson) *
Date: 2010-10-07 21:22
See the example above: suppose that a compiler is looking at a (p >= q) comparison of pointers. Suppose furthermore that in a particular case that compiler is smart enough to figure out that q is a pointer to the start of an array. Then the compiler is permitted to assume that p also points into the same array, since if it didn't then the code would introduce undefined behaviour. And since q is the start of the array, and p is (by assumption) a pointer into the same array, p >= q will automatically be true, so the compiler is free to replace the expression with the integer '1' (i.e., true).
gcc does similar things with checks like (x + 1 > x): if x is a (signed) int, then gcc can and will optimize (x + 1 > x) to 'true', on the basis that x + 1 can never overflow, because such overflow would be undefined behaviour and therefore can't occur in a valid C program.
Author: Antoine Pitrou (pitrou) *
Date: 2010-10-07 21:29
In the bad old days of 386 segment:offset memory architectures this was a problem. You could have overlapping segments but pointers inside an object were always in the same segment, so the segment selectors never had to be inspected. Pointers to different objects could indeed have the same offset and would compare equal.
AFAIK it caused lots of other problems much more annoying than just pointer comparisons, and programmers had to juggle various addressing modes (near, far, etc.) by hand.
ISTM that all C programs nowadays assume some kind of flat, paged memory model, CPython included; I don't think we have to fear some i386-like segmentation model.
We should follow the standard here: no comparisons between pointers to different arrays (basically).
Again, what is an "array" supposed to be? In memory there are no "arrays".
Let's say I have the following function in a module:
int ptr_compare(char *a, char *b) { return a > b; }
and another module, the following function
int foo() { static char x[5]; static char y[6]; return ptr_compare(x + 2, y + 3); }
How is the compiler supposed to know whether a and b belong to the same array when compiling ptr_compare?
Otherwise, we could cast to Py_uintptr_t.
Author: Mark Dickinson (mark.dickinson) *
Date: 2010-10-07 21:36
How is the compiler supposed to know whether a and b belong to the same array when compiling ptr_compare?
It doesn't need to know. So long as the compiler can guarantee that its code will produce correct results in the case that a and b do both point to the same array, that's enough. In other words, when producing code for ptr_compare, the compiler is allowed to assume that a and b point into the same array, and act accordingly.
Author: Antoine Pitrou (pitrou) *
Date: 2010-10-07 21:38
See the example above: suppose that a compiler is looking at a (p >= q) comparison of pointers. Suppose furthermore that in a particular case that compiler is smart enough to figure out that q is a pointer to the start of an array.
Which array? You can have arrays everywhere in memory, at any address, ending anywhere.
union {
    struct { char ch1; char arr1[2]; } a;
    struct { char arr2[2]; char ch2; } b;
    struct { char arr3[3]; } c;
};
Which is an array, and which is not? Is &ch1 an array? And arr1? And arr2 + 1? Why would the compiler choose one over another? And what about arrays defined in other modules?
Author: Antoine Pitrou (pitrou) *
Date: 2010-10-07 21:40
In other words, when producing code for ptr_compare, the compiler is allowed to assume that a and b point into the same array, and act accordingly.
But this assumption doesn't buy anything, right? That is, there is no shortcut for computing "p - q" or "p > q" when both point into the same "array".
Author: Antoine Pitrou (pitrou) *
Date: 2010-10-07 21:47
For the record, a Py_uintptr_t version works and has the same performance. Would you agree to it or is there still some menacing oddity from the i386 days lurking around?
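A sketch of what the Py_uintptr_t variant presumably looks like, again on a stand-in array with hypothetical sizes: both pointers are converted to integers first, so no relational operator is ever applied to pointers into unrelated objects. (As the next message notes, the integer round-trip itself is still only implementation-defined.)

```c
#include <stdint.h>

#define NSMALL 262
static int small_ints[NSMALL];

/* uintptr_t variant of the range check: the comparison is done on
 * integers, sidestepping the pointer-relational undefined behavior. */
static int in_small_range_u(const void *v)
{
    uintptr_t p = (uintptr_t)v;
    return p >= (uintptr_t)(const void *)small_ints &&
           p <  (uintptr_t)(const void *)(small_ints + NSMALL);
}
```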
Author: Mark Dickinson (mark.dickinson) *
Date: 2010-10-07 22:03
For the record, a Py_uintptr_t version works and has the same performance. Would you agree to it or is there still some menacing oddity from the i386 days lurking around?
Technically, it's still dodgy, as the gcc manual notes:
"When casting from pointer to integer and back again, the resulting pointer must reference the same object as the original pointer, otherwise the behavior is undefined. That is, one may not use integer arithmetic to avoid the undefined behavior of pointer arithmetic as proscribed in C99 6.5.6/8."
I can't see as much scope for problems with the uintptr_t version. But just because I can't anticipate the problems, it doesn't mean they don't exist.
It really would be better to avoid the undefined behaviour if at all possible.
Author: Xuanji Li (xuanji) *
Date: 2010-12-31 07:00
fwiw I've always found this helpful for undefined behavior: http://blog.regehr.org/archives/213 and, just as it says "x+1 > x" will be optimized to a nop, by the same logic "v >= &array[0] && v < &array[array_len]" will also be optimized to a nop.
Author: Meador Inge (meador.inge) *
Date: 2011-01-01 20:24
How is the compiler supposed to know whether a and b belong to the same array when compiling ptr_compare?
I agree with Mark, it doesn't need to know. However, many compilers [1,2] support whole program optimization and could in theory figure the address out using that technique.
[1] GCC -flto - http://gcc.gnu.org/onlinedocs/gcc-4.5.2/gcc/Optimize-Options.html#Optimize-Options [2] VC++ LTCG - http://msdn.microsoft.com/en-us/library/xbf3tbeh.aspx
Author: Steffen Daode Nurpmeso (sdaoden)
Date: 2011-05-03 10:45
Now me.
When casting from pointer to integer and back again, the resulting pointer must reference the same object as the original pointer, otherwise the behavior is undefined. That is, one may not use integer arithmetic to avoid the undefined behavior of pointer arithmetic as proscribed in C99 6.5.6/8.
Say - isn't that a joke about farts of armchair crouchers. All it says is that if you dereference garbage you get a crash. If you're concerned, use volatile, and go shoot the compiler programmer if she dares to optimize just anything. And on ARM, isn't the interrupt table at ((void*)(char[])0x0)?? La vie est une rose beside that. And all i need is atomic_compare_and_swap().
Author: STINNER Victor (vstinner) *
Date: 2012-09-22 00:41
If I understood correctly, the optimization proposed by Antoine was somehow rejected because _PyLong_IS_SMALL_INT() may be optimized "incorrectly".
If a compiler "miscompiles" this macro, can't we disable this optimization on that specific compiler, instead of missing an interesting optimization on all compilers? I bet that no compiler will do such an insane optimization on this test.
Where in the CPython source code do we use direct references to small_ints? The common case is to write PyLong_FromLong(0). Can a compiler analyze PyLong_FromLong() to detect that the result is part of the small_ints array? Or that it is not part of small_ints?
That corner case seems very unlikely to me.
--
I tested Antoine's patch with GCC 4.6 and CFLAGS="-O3 -flto" (-flto enables the standard link-time optimizer): the test suite passes. I don't see any magic optimization happening here, sorry.
--
"fwiw I've always found this helpful for undefined behavior: http://blog.regehr.org/archives/213 and, just as it says "x+1 > x" will be optimized to a nop, by the same logic "v >= &array[0] && v < &array[array_len]" will also be optimized to a nop."
"x+1 > x" and "v >= &array[0]" are not the same checks. In the first test, x is used in both parts. I don't understand.
Author: Mark Dickinson (mark.dickinson) *
Date: 2012-09-22 08:34
Victor: so you want to deliberately introduce undefined behaviour for the sake of a minor optimization, while crossing your fingers and hoping that that doesn't cause issues with any existing or future compilers?
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2012-09-22 09:16
Victor, the danger of this approach is that we can't verify how a particular compiler, with certain options, on a specific platform, compiles source code with undefined behavior. If such a test were possible, or if this aspect were guaranteed by the compiler vendor or could be checked in the ./configure script, then such an approach would have some merit.
Regardless of _PyLong_IS_SMALL_INT, the rest of the patch looks a little cumbersome.
Author: Martin v. Löwis (loewis) *
Date: 2012-09-22 11:11
"x+1 > x" and "v >= &array[0]" are not the same checks. In the first test, x is used in both parts. I don't understand.
The consequences of "undefined behavior" are really difficult to understand, it seems. A compiler which evaluates both relational expressions unconditionally to true would be conforming to the C language definition. The reason for that has been said several times before, but let me repeat them.
Mathematically, x+1 > x for any real number, so it seems natural to assume that this can be statically determined to be "true". However, what if x+1 overflows? Shouldn't it then evaluate to "false"? Not necessarily, if x is signed: the behavior of overflow is then undefined, and any result of the comparison would conform to the standard. The overflow may trap, the expression may evaluate to "false" or to "true", or your hard disk may get erased; all of these would be conforming to the C language definition.
For v >= &array[0], this is the same issue. If v points into the array, the expression is true. If v doesn't point into the array, the behavior is undefined, so the compiler may chose to give false, or to give true, or to actually compare the pointers as integers, or to erase your hard disk.
Of these possible behaviors, only two are plausible. For x+1 > x, the compiler may actually generate code to perform the addition and the comparison, or it may statically decide that the expression is always true. For v >= &array[0], the compiler may either generate code to perform the comparison, or statically decide that the expression is true. Either implementation would be correct; it's not that the compiler is optimizing "incorrectly".
In the specific case, a compiler might infer that _PyLong_IS_SMALL_INT is always true, which would give incorrect results in Python. However, it is the patch which is incorrect, not the compiler.
I recommend to close the issue as rejected.
Author: Larry Hastings (larry) *
Date: 2012-09-22 14:11
I must be missing something--because I thought Python already depended on this apparently-undefined behavior. The small-block object allocator in Objects/obmalloc.c determines whether a pointer belongs to a particular arena using exactly this trick. I quote from the gigantic comment at the top of that file:
Let B be the arena base address associated with the pool,
B = arenas[(POOL)->arenaindex].address. Then P belongs to
the arena if and only if
B <= P < B + ARENA_SIZE
Subtracting B throughout, this is true iff
0 <= P-B < ARENA_SIZE
This test is implemented as the following macro:
#define Py_ADDRESS_IN_RANGE(P, POOL) \
((arenaindex_temp = (POOL)->arenaindex) < maxarenas && \
(uptr)(P) - arenas[arenaindex_temp].address < (uptr)ARENA_SIZE && \
arenas[arenaindex_temp].address != 0)
Why is that permissible but _PyLong_IS_SMALL_INT is not?
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2012-09-22 16:00
Why is that permissible but _PyLong_IS_SMALL_INT is not?
Touché!
Author: Mark Dickinson (mark.dickinson) *
Date: 2012-09-22 16:30
Ah: nice catch, Larry.
I would say that the obmalloc case shouldn't be permissible; however, it's already there, and changing that would be an involved task that would also likely risk introducing new bugs. So I guess practicality beats purity on that one. I don't see that as an excuse for introducing new undefined behaviour though, especially for a small optimization.
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2012-09-22 16:51
I recommend to close the issue as rejected.
I think _PyLong_IS_SMALL_INT can be rewritten in a safe style, for example by checking several fields ((sdigit)(x)->ob_digit[0] < _MAX_SMALL_INT && Py_SIZE(x) <= 1) or by using a special flag. It is possible, however, that such a check will completely destroy the effect of the optimization. We need further research.
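Serhiy's field-based test can be sketched on a toy struct; MAX_SMALL and the layout below are hypothetical stand-ins for CPython's. It reads two fields instead of comparing pointers, so there is no undefined behavior, at the cost of two memory loads per check (which is his performance worry).

```c
#include <stddef.h>

#define MAX_SMALL 257   /* hypothetical small-int cutoff */

/* Toy stand-in for PyLongObject: ob_size counts digits (its sign
 * carries the number's sign), ob_digit holds the magnitude. */
typedef struct {
    ptrdiff_t ob_size;
    unsigned int ob_digit[1];
} toy_long;

/* Safe small-int test: inspect the fields, no pointer comparisons. */
static int is_small_int(const toy_long *v)
{
    return v->ob_size >= 0 && v->ob_size <= 1 &&
           v->ob_digit[0] < MAX_SMALL;
}
```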
Author: Martin v. Löwis (loewis) *
Date: 2012-09-22 22:42
Why is that permissible but _PyLong_IS_SMALL_INT is not?
It's permissible primarily because it is already there. There are many places in Python which invoke undefined behavior already (most notably wrt. signed integers); we should strive to eliminate them.
OTOH, obmalloc clearly makes a lot of assumptions on the hardware architecture (including the assumption that there is page-based virtual memory, etc). It's a memory allocator, that gives permission to make such assumptions. If it turns out that they break on some system, obmalloc cannot be used there - hence usage of obmalloc is still a compile-time option.
In addition, for the specific macros: it's easy to see that a compiler may optimize _PyLong_IS_SMALL_INT as true by simple static analysis. For Py_ADDRESS_IN_RANGE, the same static analysis is not possible, since the memory doesn't come from a declared array. It would require whole-program analysis to determine that .address always points to a memory block with ARENA_SIZE bytes. If a compiler manages to figure it out, obmalloc cannot be used on that system (and if that happens, I'll recommend to drop or revise obmalloc altogether, perhaps in favor of a system pool allocator).
Author: Martin v. Löwis (loewis) *
Date: 2012-09-22 22:44
Quoting Serhiy Storchaka <report@bugs.python.org>:
I recommend to close the issue as rejected.
I think _PyLong_IS_SMALL_INT can be rewritten in a safe style, for example by checking several fields ((sdigit)(x)->ob_digit[0] < _MAX_SMALL_INT && Py_SIZE(x) <= 1) or by using a special flag. It is possible, however, that such a check will completely destroy the effect of the optimization. We need further research.
Do we need to keep this issue open while this research is being carried out? This issue is already cluttered with the undefined-behavior discussion.
Author: Mark Dickinson (mark.dickinson) *
Date: 2013-03-31 12:49
Do we need to keep this issue open while this research is being carried out?
I'd say not.
Author: STINNER Victor (vstinner) *
Date: 2016-02-06 00:59
Issue #21955 is a spin-off of this issue.
History
Date
User
Action
Args
2022-04-11 14:57:07
admin
set
github: 54253
2016-02-06 00:59:30
vstinner
set
messages: +
2013-03-31 12:49:48
mark.dickinson
set
status: open -> closed
resolution: rejected
messages: +
2012-09-22 22:44:44
loewis
set
messages: +
2012-09-22 22:42:55
loewis
set
messages: +
2012-09-22 16:51:10
serhiy.storchaka
set
messages: +
2012-09-22 16:30:32
mark.dickinson
set
messages: +
2012-09-22 16:00:27
serhiy.storchaka
set
messages: +
2012-09-22 14:11:25
larry
set
nosy: + larry
messages: +
2012-09-22 11:11:11
loewis
set
messages: +
2012-09-22 09:16:42
serhiy.storchaka
set
messages: +
2012-09-22 08:34:12
mark.dickinson
set
messages: +
2012-09-22 00:42:09
vstinner
set
nosy: + loewis
2012-09-22 00:41:59
vstinner
set
messages: +
2012-09-17 16:13:21
serhiy.storchaka
set
nosy: + serhiy.storchaka
components: + Interpreter Core
versions: + Python 3.4, - Python 3.3
2012-09-17 15:23:07
vstinner
set
nosy: + vstinner
2011-07-24 01:16:38
asvetlov
set
nosy: + asvetlov
2011-05-03 10:45:52
sdaoden
set
nosy: + sdaoden
messages: +
2011-05-02 19:57:18
pitrou
set
versions: + Python 3.3, - Python 3.2
2011-01-01 20:24:38
meador.inge
set
nosy: + meador.inge
messages: +
2010-12-31 07:00:36
xuanji
set
nosy: + xuanji
messages: +
2010-10-07 22:03:46
mark.dickinson
set
messages: +
2010-10-07 21:47:50
pitrou
set
messages: +
2010-10-07 21:40:56
pitrou
set
messages: +
2010-10-07 21:38:43
pitrou
set
messages: +
2010-10-07 21:36:27
mark.dickinson
set
messages: +
2010-10-07 21:29:37
pitrou
set
messages: +
2010-10-07 21:22:02
mark.dickinson
set
messages: +
2010-10-07 21:18:56
eric.smith
set
messages: +
2010-10-07 21:15:30
pitrou
set
messages: +
2010-10-07 21:14:36
mark.dickinson
set
messages: +
2010-10-07 21:12:24
pitrou
set
messages: +
2010-10-07 21:04:01
mark.dickinson
set
messages: +
2010-10-07 21:00:36
mark.dickinson
set
messages: +
2010-10-07 20:29:20
pitrou
set
messages: +
2010-10-07 19:30:54
pitrou
set
messages: +
2010-10-07 19:23:31
Aahz
set
2010-10-07 17:12:32
eric.smith
set
nosy: + eric.smith
2010-10-07 16:27:17
stutzbach
set
messages: +
2010-10-07 16:23:46
pitrou
set
messages: +
2010-10-07 16:13:51
mark.dickinson
set
messages: +
2010-10-07 15:53:47
mark.dickinson
set
messages: +
2010-10-07 15:49:06
mark.dickinson
set
messages: +
2010-10-07 14:43:59
stutzbach
set
messages: +
2010-10-07 13:34:24
amaury.forgeotdarc
set
nosy: + amaury.forgeotdarc
messages: +
2010-10-07 11:56:27
pitrou
create