Optimize WithLower, WithUpper, Create, AsInt64, AsUInt64, AsDouble with ARM64 hardware intrinsics by kunalspathak · Pull Request #37139 · dotnet/runtime
Thanks @kunalspathak!

These look great. For the `.AsXX` variants you've only posted the `Vector64` variants, I assume the `Vector128` ones are the same except they use a `q` register?

As a general question, these `.AsXX` are reinterpret casts, right? i.e. the bits are just reinterpreted, so why is there any code generated at all? I'm guessing in this case it's because it doesn't track live ranges across BBs?
Also the `Create` variants have some unneeded moves:

```
0EB01E10 mov v16.8b, v16.8b
6E180630 mov v16.d[1], v17.d[0]
4EB01E00 mov v0.16b, v16.16b
```

The first one is a no-op, and the last one can be avoided by allocating to `v0`. But that's a general issue from the looks of the dumps.
An unrelated question to your change, but why does it allocate so much stack space?

```
A9BD7BFD stp fp, lr, [sp,#-48]!
```

seems to allocate 48 bytes but only stores 16. There's also a weird gap in between the stores:

```
FD0017A0 str d0, [fp,#40]
FD000FA1 str d1, [fp,#24]
```

Could it be thinking that internally all vectors are 16 bytes? If they were placed next to each other you could optimize these with `stp` and `ldp`; to be exact, your code in the second BB could be a single `ldr q0, ...` if the stores are ordered correctly, and you wouldn't need the inserts. (Though tracking the live ranges would fix all of this.)
```
@@ -937,13 +937,23 @@ static Vector128<ulong> SoftwareFallback(ulong e0, ulong e1)

/// A new <see cref="Vector128{UInt64}" /> initialized from <paramref name="lower" /> and <paramref name="upper" />.
public static unsafe Vector128<ulong> Create(Vector64<ulong> lower, Vector64<ulong> upper)
```
These methods should be aggressively inlined.
It might also be good to just treat them as intrinsic and to create the appropriate nodes in importation or lowering so these don't impact inlining of other methods due to increased IL size or additional locals introduced.
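As a hedged sketch (assuming the shapes of the .NET 5-era `System.Runtime.Intrinsics` and `Unsafe` APIs, not the exact corelib source), the pattern being suggested looks roughly like:

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public static class Vector128Sketch
{
    // Marked for aggressive inlining; in corelib the method is additionally
    // treated as a JIT intrinsic, so this body is only the software fallback.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<ulong> Create(Vector64<ulong> lower, Vector64<ulong> upper)
    {
        Vector128<ulong> result = Vector128<ulong>.Zero;
        ref Vector64<ulong> head = ref Unsafe.As<Vector128<ulong>, Vector64<ulong>>(ref result);
        head = lower;                    // low 64 bits
        Unsafe.Add(ref head, 1) = upper; // high 64 bits
        return result;
    }
}
```

When the JIT recognizes the call during importation, none of this body executes; on ARM64 it can instead emit a single element insert.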
I changed the implementation to import to the appropriate instructions. Curious, what is the preferable way to do these things, importation or lowering, and what are the advantages of each? One advantage I see of doing it early is that the nodes pass through other optimizations (if applicable).
> These look great. For the `.AsXX` variants you've only posted the `Vector64` variants, I assume the `Vector128` ones are the same except they use a `q` register?
That is actually the codegen for the fallback case (not relevant to ARM64 except for indirect calls, such as via a delegate) and is due to the fallback using `return Unsafe.As<Vector64<T>, Vector64<U>>(ref vector)`. In the intrinsic case these are fully elided during importation, so they don't even create nodes and can't generate additional code.
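To illustrate (a sketch of the fallback shape under the same `Unsafe` API assumption, not the exact corelib source), the non-intrinsic `AsXX` path is a pure bit reinterpretation:

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public static class AsSketch
{
    public static Vector64<long> AsInt64Fallback(Vector64<double> vector)
    {
        // No bits change; the same 64 bits are simply viewed as a
        // different element type, which is why the intrinsic path can
        // elide the cast entirely.
        return Unsafe.As<Vector64<double>, Vector64<long>>(ref vector);
    }
}
```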
Changes look good
Sorry @TamarChristinaArm , I forgot to reply.
> Thanks @kunalspathak!
>
> These look great. For the `.AsXX` variants you've only posted the `Vector64` variants, I assume the `Vector128` ones are the same except they use a `q` register?

Correct. We already optimize `Vector128.AsXX()`, while this PR optimizes `Vector64.AsXX()`.
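As a usage example (assuming the public `Vector64` extension methods from `System.Runtime.Intrinsics`), the casts this PR optimizes look like:

```csharp
using System.Runtime.Intrinsics;

Vector64<double> v = Vector64.Create(1.0);
Vector64<long> bits = v.AsInt64();    // same 64 bits, now typed as long
Vector64<ulong> ubits = v.AsUInt64(); // and as ulong
// On the intrinsic path these reinterpretations generate no ARM64 code at all.
```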
> As a general question, these `.AsXX` are reinterpret casts, right? i.e. the bits are just reinterpreted, so why is there any code generated at all? I'm guessing in this case it's because it doesn't track live ranges across BBs?

I believe so. There is an overall problem where arguments are pushed to the stack and retrieved; related: #35635. Perhaps @CarolEidt might know the exact reasons.
> Also the `Create` variants have some unneeded moves:
>
> ```
> 0EB01E10 mov v16.8b, v16.8b
> 6E180630 mov v16.d[1], v17.d[0]
> 4EB01E00 mov v0.16b, v16.16b
> ```
>
> The first one is a no-op, and the last one can be avoided by allocating to `v0`. But that's a general issue from the looks of the dumps.

This has changed in this PR; see my updated comments in the PR description.

```
4E083E20 umov x0, v17.d[0]
4E181C10 ins  v16.d[1], x0
4EB01E00 mov  v0.16b, v16.16b
```
> An unrelated question to your change, but why does it allocate so much stack space?
>
> ```
> A9BD7BFD stp fp, lr, [sp,#-48]!
> ```
>
> seems to allocate 48 bytes but only stores 16. There's also a weird gap in between the stores:
>
> ```
> FD0017A0 str d0, [fp,#40]
> FD000FA1 str d1, [fp,#24]
> ```
>
> Could it be thinking that internally all vectors are 16 bytes? If they were placed next to each other you could optimize these with `stp` and `ldp`; to be exact, your code in the second BB could be a single `ldr q0, ...` if the stores are ordered correctly, and you wouldn't need the inserts. (Though tracking the live ranges would fix all of this.)
Looks like `genTotalFrameSize` returns 48 in most cases. Again, @CarolEidt, can you confirm why there is a gap?
> Looks like `genTotalFrameSize` returns 48 in most cases. Again, @CarolEidt, can you confirm why there is a gap?
It certainly seems as if the assignment of frame locations is allocating 16 bytes for the 8 byte vectors. It would take some investigation to figure out why.
I think the problem is that `Compiler::getSIMDTypeAlignment`, which is called by `Compiler::lvaAllocLocalAndSetVirtualOffset`, always returns 16 for `TARGET_ARM64`.
Thanks @kunalspathak !
> This has changed in this PR; see my updated comments in the PR description.
>
> ```
> 4E083E20 umov x0, v17.d[0]
> 4E181C10 ins  v16.d[1], x0
> 4EB01E00 mov  v0.16b, v16.16b
> ```

Hmm, why is it moving between register files now, though? I would have expected the same `mov v16.d[1], v17.d[0]` as before.
> hmm why is it moving it between register files now though? I would have expected the same `mov v16.d[1], v17.d[0]` as before.

Yes, initially I was doing that optimization of generating `mov dstReg[index1], srcReg[index2]` in the JIT, but we decided to hold off on that for now and instead do it once we implement `InsertSelectedScalar` (hopefully sometime soon).
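For reference, `AdvSimd.Arm64.InsertSelectedScalar` maps directly to that `ins vD.d[i], vS.d[j]` form. A hedged sketch of how it could express the combine (the wrapper name is hypothetical, and the call only runs where `AdvSimd.Arm64.IsSupported`):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

public static class InsertSketch
{
    // ins v0.d[1], v1.d[0]: copy element 0 of 'upper' into element 1 of 'lowerWidened'.
    public static Vector128<ulong> Combine(Vector128<ulong> lowerWidened, Vector128<ulong> upper)
        => AdvSimd.Arm64.InsertSelectedScalar(lowerWidened, 1, upper, 0);
}
```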
ghost locked as resolved and limited conversation to collaborators
Labels
CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI