Optimize WithLower, WithUpper, Create, AsInt64, AsUInt64, AsDouble with ARM64 hardware intrinsics by kunalspathak · Pull Request #37139 · dotnet/runtime


Conversation


@TamarChristinaArm

Thanks @kunalspathak!

These look great, for the .AsXX variants you've only posted the Vector64 variants, I assume the Vector128 ones are the same except use a q register?

As a general question, these .AsXX are re-interpret casts, right? I.e. the bits are just re-interpreted, so why is any code generated at all? I'm guessing in this case it's because live ranges aren't tracked across basic blocks?

Also the Create variants have some unneeded moves

        0EB01E10          mov     v16.8b, v16.8b
        6E180630          mov     v16.d[1], v17.d[0]
        4EB01E00          mov     v0.16b, v16.16b

The first one is a no-op, and the last one could be avoided by allocating to v0 in the first place. But that looks like a general issue, judging from the dumps.

Unrelated question to your change, but why does it allocate so much stack space?

The prologue

        A9BD7BFD          stp     fp, lr, [sp,#-48]!

seems to allocate 48 bytes and only stores 16.

There's also a weird gap in between the stores

        FD0017A0          str     d0, [fp,#40]
        FD000FA1          str     d1, [fp,#24]

Could it be thinking that internally all vectors are 16 bytes?

If they were placed next to each other you could optimize these with stp and ldp; to be exact, your code in the second BB could be a single ldr q0, ... if the stores were ordered correctly, and you wouldn't need the inserts. (Though tracking the live ranges would fix all of this.)
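
For context, a minimal snippet (an illustration, not code from the PR) that exercises the Create path being discussed:

    using System;
    using System.Runtime.Intrinsics;

    class CreateDemo
    {
        static void Main()
        {
            // Builds a Vector128 from two Vector64 halves; this is the
            // Vector128.Create(lower, upper) path whose ARM64 codegen is
            // shown above.
            Vector64<ulong> lower = Vector64.Create(1UL);
            Vector64<ulong> upper = Vector64.Create(2UL);
            Vector128<ulong> v = Vector128.Create(lower, upper);
            Console.WriteLine(v.GetElement(1)); // prints 2
        }
    }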

@tannergooding

@@ -937,13 +937,23 @@ static Vector128<ulong> SoftwareFallback(ulong e0, ulong e1)
/// A new Vector128<ulong> initialized from lower and upper.
public static unsafe Vector128<ulong> Create(Vector64<ulong> lower, Vector64<ulong> upper)


These methods should be aggressively inlined.
It might also be good to just treat them as intrinsic and to create the appropriate nodes in importation or lowering so these don't impact inlining of other methods due to increased IL size or additional locals introduced.
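
A sketch of the shape being suggested (a hypothetical wrapper for illustration; the [Intrinsic] attribute used inside System.Private.CoreLib is internal, so only the inlining part is shown):

    using System.Runtime.CompilerServices;
    using System.Runtime.Intrinsics;

    static class VectorHelpers
    {
        // Hypothetical thin wrapper: AggressiveInlining keeps the wrapper's
        // IL size and locals from counting against callers' inlining budgets.
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static Vector128<ulong> CreateFromHalves(Vector64<ulong> lower, Vector64<ulong> upper)
            => Vector128.Create(lower, upper);
    }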


@kunalspathak

I changed the implementation to import these as the appropriate instructions. Curious: what is the preferable way to do these things, importation or lowering, and what are the advantages of each? One advantage I see in doing it early is that the nodes are passed through other optimizations (if applicable).

@tannergooding

These look great, for the .AsXX variants you've only posted the Vector64 variants, I assume the Vector128 ones are the same except use a q register?

That is actually the codegen for the fallback case (not relevant to ARM64 except for indirect calls, like via a delegate) and is due to the fallback using return Unsafe.As<Vector64<T>, Vector64<U>>(ref vector). The intrinsic case is that these are fully elided during importation, so they don't even create nodes and can't generate additional code.
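
A minimal sketch of that fallback shape (my illustration, not the exact library code):

    using System.Runtime.CompilerServices;
    using System.Runtime.Intrinsics;

    static class AsSketch
    {
        // The fallback is a pure bit reinterpretation via Unsafe.As: no
        // element values are converted, only the static type changes.
        public static Vector64<TTo> As<TFrom, TTo>(Vector64<TFrom> vector)
            where TFrom : struct
            where TTo : struct
            => Unsafe.As<Vector64<TFrom>, Vector64<TTo>>(ref vector);
    }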

@echesakov


Changes look good


@kunalspathak

Sorry @TamarChristinaArm, I forgot to reply.

Thanks @kunalspathak!

These look great, for the .AsXX variants you've only posted the Vector64 variants, I assume the Vector128 ones are the same except use a q register?

Correct. We already optimize Vector128.AsXX(); this PR optimizes Vector64.AsXX().
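
For example (an illustration, not code from the PR), the reinterpret-cast semantics at the source level:

    using System;
    using System.Runtime.Intrinsics;

    class AsDemo
    {
        static void Main()
        {
            // Vector64.AsUInt64 is one of the casts this PR makes free on
            // ARM64: the same 64 bits viewed as a different element type.
            Vector64<int> ints = Vector64.Create(1, 2);
            Vector64<ulong> bits = ints.AsUInt64();
            Console.WriteLine(bits.GetElement(0)); // 8589934593 = 0x0000000200000001
        }
    }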

As a general question, these .AsXX are re-interpret casts, right? I.e. the bits are just re-interpreted, so why is any code generated at all? I'm guessing in this case it's because live ranges aren't tracked across basic blocks?

I believe so. There is an overall problem where arguments are pushed to the stack and then retrieved. Related: #35635. Perhaps @CarolEidt might know the exact reasons.

Also the Create variants have some unneeded moves

       0EB01E10          mov     v16.8b, v16.8b
       6E180630          mov     v16.d[1], v17.d[0]
       4EB01E00          mov     v0.16b, v16.16b

The first one is a no-op, and the last one could be avoided by allocating to v0 in the first place. But that looks like a general issue, judging from the dumps.

This has changed in this PR. See my updated comments in the PR description.

        4E083E20          umov    x0, v17.d[0]
        4E181C10          ins     v16.d[1], x0
        4EB01E00          mov     v0.16b, v16.16b

Unrelated question to your change, but why does it allocate so much stack space?

The prologue

        A9BD7BFD          stp     fp, lr, [sp,#-48]!

seems to allocate 48 bytes and only stores 16.

There's also a weird gap in between the stores

       FD0017A0          str     d0, [fp,#40]
       FD000FA1          str     d1, [fp,#24]

Could it be thinking that internally all vectors are 16 bytes?

If they were placed next to each other you could optimize these with stp and ldp; to be exact, your code in the second BB could be a single ldr q0, ... if the stores were ordered correctly, and you wouldn't need the inserts. (Though tracking the live ranges would fix all of this.)

Looks like genTotalFrameSize returns 48 in most of these cases. Again, @CarolEidt, can you confirm why there is a gap?

@CarolEidt

Looks like genTotalFrameSize returns 48 in most of these cases. Again, @CarolEidt, can you confirm why there is a gap?

It certainly seems as if the assignment of frame locations is allocating 16 bytes for the 8-byte vectors. It would take some investigation to figure out why.

@CarolEidt

I think the problem is that Compiler::getSIMDTypeAlignment, which is called by Compiler::lvaAllocLocalAndSetVirtualOffset, always returns 16 for TARGET_ARM64.

@TamarChristinaArm

Thanks @kunalspathak!

This has changed in this PR. See my updated comments in the PR description.

       4E083E20          umov    x0, v17.d[0]
       4E181C10          ins     v16.d[1], x0
       4EB01E00          mov     v0.16b, v16.16b

Hmm, why is it moving the value between register files now, though? I would have expected the same mov v16.d[1], v17.d[0] as before.

@kunalspathak

Hmm, why is it moving the value between register files now, though? I would have expected the same mov v16.d[1], v17.d[0] as before.

Yes. Initially I was doing that optimization of generating mov dstReg[index1], srcReg[index2] in the JIT, but we decided to hold off on it for now and instead do it once we implement InsertSelectedScalar (hopefully sometime soon).
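
For reference, a sketch of how InsertSelectedScalar expresses that element-to-element move (my example, assuming the AdvSimd.Arm64 signature that shipped in .NET 5):

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.Arm;

    static class InsertSketch
    {
        // Copies element 0 of 'upper' into element 1 of 'result' as a single
        // ins v_result.d[1], v_upper.d[0], with no general-register hop.
        public static Vector128<ulong> CombineHalves(Vector128<ulong> result, Vector128<ulong> upper)
            => AdvSimd.Arm64.InsertSelectedScalar(result, 1, upper, 0);
    }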


@ghost locked as resolved and limited conversation to collaborators on Dec 9, 2020

Labels

area-CodeGen-coreclr

CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI