optimizing acmp in L-World (original) (raw)
John Rose john.r.rose at oracle.com
Mon Jan 29 20:42:49 UTC 2018
- Previous message (by thread): hg: valhalla/valhalla: 4 new changesets
- Next message (by thread): optimizing acmp in L-World
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Tobias, Roland, and I had a chat in Santa Clara last week about optimizing the acmp operation in the JVM. Here are my notes on it.
Background: The acmp operation, also known as reference comparison, is overloaded in L-World to handle values as well, with a specific semantics of returning false if either operand is a value (similar to NaN behavior for fcmp). The rationale for this is that acmp is reference comparison, even in L-World, and since values have no references, they can never be equal as references.
(Alternatively, it is as if both operands were converted to references before the acmp operation, which means that either operand, if it is a value, would be boxed to a reference value, which then, being a new reference, is automatically unequal to any other reference. This tricky account may be appealing in some circumstances.)
OK, so how do we make it fast? Ideally, we would like to strength-reduce the enhanced acmp instruction down to the original acmp instruction, which after all is just a single instruction (such as cmpq) on all platforms.
The first thing to notice, before going into details, is that in L-world all Q-values are buffered in the interpreter. This means that, at least for interpreter-generated values, all values can be treated as physical pointers, even if they are logically Q-values.
(Crib sheet: In L-world, all kinds of values except primitives are formally carried by L-values, under L-descriptors. True references can be referred to as R-values of R-type, reference type, or object class. New value types are referred to as Q-values of Q-type, value type, or value class. When you don't know what you have, you can say it's a U-type, which just like a Q-type or R-type is carried by an L-value under an L-descriptor.)
This means that the old acmp semantics are almost correct, in many cases. If you have two L-values and do a pointer comparison on their physical pointers, and if the pointers differ, you are done. If the physical pointers compare equal, then there is one more thing you have to do: Make sure that the L-value carried by both physical pointers is in fact a reference, an R-value.
So the logic looks like this, and the new semantics are in the second "if" statement:
bool new_acmp(ptr x, ptr y) { if (!old_acmp(x, y)) return false; // x != y => false if (is_value_ptr(x)) return false; // x.is_value => false return true; } bool old_acmp(ptr x, ptr y) { return x == y; // cmpq etc. }
Note that since acmp is symmetric, the JVM has the option to silently test either x or y for "is_value_ptr" if the two physical pointers compare equal.
This leads immediately to the first set of optimizations, which probably are enough to get us where we want to go. (There is a second set we can add later.) All the JVM has to do is prove or detect that one or the other physical pointer either is certainly a Q-value, or is certainly not a Q-value.
If either physical pointer is certainly a Q-value, then the new_acmp instruction is a constant false. If either physical pointer is certainly not a Q-value, then the new_acmp instruction strength reduces to the old_acmp instruction.
The strength reduction to false can be done in any place where the type of either operand is known to be a Q-type. The interpreter doesn't track types, and the acmp instruction isn't resolved so it can't be quickened to a type context, but the JIT has lots of type information. Thus, if either operand of acmp is statically known (or accurately speculated) to be a value type, then the acmp instruction constant folds to false.
Likewise, if either operand is known to be an R-type (true reference) the acmp may be strength reduced to a simple reference comparison. This also is likely to happen in the JIT. It also happens in the interpreter in the acmp_null instruction, where (obviously) the second operand is a reference (i.e., null).
(Note that the type Object is a U-type in L-world, so Object-to-Object comparisons such as are found in generic collection code do not benefit from this optimization. More later on that.)
The JIT can also track, in its profile, whether various instructions "see" only Q-values or only R-values. (The classic HotSpot already tracks "sightings" of null for similar purposes.) In that case, speculative narrowings (only Q-types or only R-types) can be used to short circuit new_acmp down to old_acmp.
This is useful because the speculation can sometimes be verified out of a loop body, where the actual acmp instruction is executed in a loop. Remember that the JIT can choose which operand of acmp to test for value-ness; if it chooses a loop invariant, and the invariant is always Q or always R, then the acmp inside the loop folds up, no matter what the dynamic value of the other operand.
For that matter, even if speculation fails, loop unswitching can get to a similar result, at the cost of two copies of the loop.
So, in many cases, contextual information can allow the JVM to fold up new_acmp to old_acmp or constant false.
In cases such folding fails, there is as simple trick which can make things easier. Recall that folding fails when both operands are statically U-types, neither operand folding to a Q-type or R-type.
In that case, the JVM must pick one operand or the other (it doesn't matter which) and perform the is_value_ptr operation on it, in such a way that the result of that operand affects the final result in the correct way. A naive implementation of new_acmp adds an extra test-and-branch to fold in the is_value_ptr test, but a clever might be able to avoid it.
Suppose there is a branch free technique for implementing is_value_ptr, such as extracting a bit from the physical pointer to the U-value, or (after a null check) extracting a bit from the physical pointer of the class metadata pointer in the value's header, or (after two indirections) extracting a bit from the structure of the class metadata. (Mr. Simm's prototype puts such a bit in the metadata pointer, in the object header.)
If such a bit can be obtained in a branch-free manner, then it can be used to perturb the result of the old_acmp in a useful manner. For example:
bool new_acmp(ptr x, ptr y) { if (x == null) return y == null; int bit = (x->class_in_header & Q_TYPE_BIT); ptr x1 = x | swar_broadcast_bit(bit != 0); return old_acmp(x1, y); } int swar_broadcast_bit(bool z) { return z ? -1 : 0; // use ALU ops }
The idea is to extract a bit which signals the presence of a Q-type (in either operand) and use it to "mess up" the equality comparison, using and, or, or xor as needed.
This perturbation technique has two costs: First, it requires a null check (unless the Q-bit is in the actual physical reference). Second, it requires an ALU operation or two to spread the perturbation. But these costs will probably be negligible, except (perhaps) in very tight loops in generic code.
Generic code loops performing frequent acmp operations on untyped (Object) operands will need to perform extra null checks and/or value/reference detections if they are not already present (by luck) in the loop context. In this case, loop unswitching may be useful.
There is one more contextual operation which may be helpful in such cases, if all else fails (and loop unswitching is not desirable). Honestly, I haven't seen any real-world code yet that needs it, but it is comforting to have as a fall-back. The idea is this: If the acmp is contextually followed by a call to Object.equals, in the usual Legacy Idiom For Equality (L.I.F.E.), then there is one more trick we can play. We'd like to force the new_acmp down to an old_acmp in these cases. Can we? Here's the choice:
bool dutiful_life(ptr x, ptr y) { if (new_acmp(x, y)) return true; return x->Object_equals(y); } bool clever_life(ptr x, ptr y) { if (old_acmp(x, y)) return true; return x->Object_equals(y); }
Note that the only semantic difference between old_acmp and new_acmp arises when both operands are the same value. (Work out the cases: In every other case, the simple pointer comparison detects the difference and produces the correct "false" answer.) Thus, in the L.I.F.E., the only case where old_acmp gives the wrong answer produces an immediate answer of "true", instead of following up with a call to x.equals(x) on the Q-value.
To put it in a nutshell, the only difference between dutiful_life and clever_life is that the former always calls Object::equals if either operand is a value, and the the latter calls Object::equals in a subset of those cases (when the physical pointers differ). When the physical pointers do not differ, there is no call to Object::equals. But clever_life always delivers the same answer as dutiful_life.
So is there a problem? Not much, although it requires us to treat Object::equals as an intrinsic with some predictable semantics. The crucial inside that allows us to adopt clever_life is that, if you call x.equals(x) on a non-null x, the result must always be true. Now, this is not enforced by the JVM, but it is clearly documented in the API specification for Object::equals, and lots of stuff would be already broken if that were ever false.
Thus, if we are willing to rule out implementations of Object::equals for which a value can ever compare not equal to itself, then we can substitute clever_life for dutiful_life. We also have to allow one more thing: That we can silently drop calls to equals (at least in the context of L.I.F.E.) when the JVM can prove that the contract of equals allows the JVM to predict the outcome. This means that if you put side effects (like print statements) into an equals method on a value type, you might not see the side effects, after the JVM has optimized things. This corner case may require further thought in order to straighten it out.
Note that L.I.F.E. is frequently found in generic code, where instead of operating on Object values it operates on type parameter types. (The JVM just sees Object, after erasure.) In that case, if the JVM were to specialize the generic code separately for some types (such as value types), along the lines of the template class proposals, then the L.I.F.E. code would fold up in the JIT using the rules given at the top of this note. The generic expression "a == b", where a or b was a value of a specialized type parameter T, would fold up to false in every specialization where T was a Q-type. Likewise, as noted above, if L.I.F.E. is applied to a true R-type (or specialized to such a type), then any new_acmp on that type can be short-circuited to the faster old_acmp.
All in all, it appears that the cost of the new acmp semantics can, for all practical purposes, be pushed down into the noise.
— John
- Previous message (by thread): hg: valhalla/valhalla: 4 new changesets
- Next message (by thread): optimizing acmp in L-World
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]