Add support for BSWAP intrinsic by GrabYourPitchforks · Pull Request #18398 · dotnet/coreclr
This repository was archived by the owner on Jan 23, 2023. It is now read-only.
Addresses https://github.com/dotnet/coreclr/issues/14122
The general idea is that we expose APIs (for internal consumption) which are JIT intrinsics, replaced with BSWAP instructions instead of method calls at runtime. I've put them on BitConverter for now but haven't exposed them via the reference API. The System.Memory package already exposes BinaryPrimitives.ReverseEndianness APIs, and those can call through to these implementations due to the special relationship between corefx packages and System.Private.CoreLib.
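To make the plumbing concrete, here is a hypothetical sketch of that shape. The ByteSwap name is only a placeholder drawn from this discussion (it was never exposed in the reference API, and the final change ended up special-casing BinaryPrimitives.ReverseEndianness directly instead):

```csharp
// Hypothetical sketch only: the interesting part of the real intrinsic is not its C# body,
// since the JIT replaces the call site with a single BSWAP instruction where supported.
internal static class ByteSwapSketch
{
    // Stand-in for the proposed internal BitConverter.ByteSwap intrinsic.
    internal static uint ByteSwap(uint value)
    {
        // Software fallback used when the intrinsic expansion isn't available.
        return (value >> 24)
             | ((value & 0x00FF0000u) >> 8)
             | ((value & 0x0000FF00u) << 8)
             | (value << 24);
    }

    // BinaryPrimitives.ReverseEndianness in System.Memory can then call through to the
    // CoreLib helper via the special corefx / System.Private.CoreLib relationship.
    public static uint ReverseEndianness(uint value) => ByteSwap(value);
}
```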
Here's the x64 codegen of several test_bswap routines that I compiled and tested.

```
test_bswap(UInt32):
00007ffd`ac470080 8bc1            mov     eax,ecx
00007ffd`ac470082 0fc8            bswap   eax
00007ffd`ac470084 c3              ret

test_bswap(UInt64):
00007ffd`ab6a0080 488bc1          mov     rax,rcx
00007ffd`ab6a0083 480fc8          bswap   rax
00007ffd`ab6a0086 c3              ret

test_bswap(UInt16):
00007ffd`ab6d0080 0fb7c1          movzx   eax,cx
00007ffd`ab6d0083 66c1c808        ror     ax,8
00007ffd`ab6d0087 c3              ret
```
Open issues:

- How would I even begin to test this in a sane fashion? Right now I tested this by using a DynamicMethod to reference these intrinsic APIs, then compiling and observing the codegen in windbg (see the sketch after this list). That works for seeing what's going on, but I can't use it for things like perf runs.
- An alternative implementation would be to recognize byte-swap IL patterns directly, similar to how rotation IL patterns are recognized. I didn't pursue this because I thought it would be more error-prone than using special intrinsic methods, but perhaps somebody knows a more foolproof way of doing this.
- I didn't attempt to make this work on ARM because I don't have a test box. The ARM equivalents are REV/REV16/REV64. I may need to suppress the intrinsic entirely on ARM for now in order to avoid JIT breaks.
- Calling ReverseEndianness(<some const uint64>) works but isn't optimized properly. I think I need to update morph, but I'll need assistance knowing where to look.
- Ideally, if we detect a GT_BSWAP instruction immediately before or after a memory access, we can replace both operations with movbe on compatible architectures. I don't think this would have much impact on runtime, but it could save a scratch register and help avoid stack spills in some scenarios.
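For reference, a minimal sketch of that DynamicMethod-based inspection (this version calls the public BinaryPrimitives.ReverseEndianness rather than the internal intrinsic APIs used in the original experiment):

```csharp
using System;
using System.Buffers.Binary;
using System.Reflection.Emit;

class BswapInspection
{
    static void Main()
    {
        // Build: uint test_bswap(uint value) => BinaryPrimitives.ReverseEndianness(value);
        var dm = new DynamicMethod("test_bswap", typeof(uint), new[] { typeof(uint) });
        var il = dm.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Call, typeof(BinaryPrimitives).GetMethod(
            nameof(BinaryPrimitives.ReverseEndianness), new[] { typeof(uint) }));
        il.Emit(OpCodes.Ret);

        // Creating and invoking the delegate forces the method to be JIT-compiled, after
        // which the generated code can be inspected from an attached debugger such as windbg.
        var testBswap = (Func<uint, uint>)dm.CreateDelegate(typeof(Func<uint, uint>));
        Console.WriteLine($"0x{testBswap(0x12345678):X8}"); // expect 0x78563412
        Console.ReadLine(); // keep the process alive so the debugger can attach
    }
}
```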
I managed to get some perf numbers from before and after the change. The baselines here are the existing implementations of BinaryPrimitives.ReverseEndianness. Measurements are the total time taken to call the method 1 billion times in a loop (a sketch of the measurement loop follows the numbers). I confirmed via codegen inspection that no intermediate calculations were being discarded.

Before bswap enablement:
- ushort: 915 ms
- uint: 1120 ms
- ulong: 2120 ms
After bswap enablement:
- ushort: 330 ms
- uint: 400 ms
- ulong: 590 ms
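A rough sketch of the kind of measurement loop behind these numbers (the original harness may have differed in details; feeding the result back into the input is one way to keep the JIT from discarding the work):

```csharp
using System;
using System.Buffers.Binary;
using System.Diagnostics;

class ReverseEndiannessBenchmark
{
    const long Iterations = 1_000_000_000;

    static void Main()
    {
        uint value = 0xDEADBEEF;
        var sw = Stopwatch.StartNew();
        for (long i = 0; i < Iterations; i++)
        {
            // Feed the result back into the input so the intermediate calculation can't be discarded.
            value = BinaryPrimitives.ReverseEndianness(value);
        }
        sw.Stop();
        Console.WriteLine($"uint: {sw.ElapsedMilliseconds} ms (result 0x{value:X8})");
    }
}
```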
```cpp
void CodeGen::genCodeForBswap(GenTree* tree)
{
    // TODO: If we're swapping immediately after a read from memory or immediately before
```
Presumably that's why you didn't just use genCodeForNegNot?
i.e. because the handling of a memory operand (containment) is different in this case, since MOVBE has both a source memory location and a register destination (unlike neg and not, which take a single operand that's modified).
To handle the case where the operand is in memory, you would add support in Lowering to allow the operand to be contained (or regOptional) if it is the right kind of node (a local or a GT_IND with an address of a form that movbe can handle), which would cause it to be used directly from memory. Although this is a unary operation, the containment analysis is different from GT_NEG and GT_NOT because the memory form has the result in a register (so the source and dest don't have to be the same to do it from memory, which we don't really support for GT_NEG/GT_NOT currently). So it would be more like the check for whether a source operand of a binary op can be contained in Lowering::ContainCheckBinary().
@mikedn Per your other comment, I lifted knowledge of the INS_bswap instruction directly into this method, so now I think the logic is different enough that it's worth keeping negnot and bswap split. Is there still a preference to merge the two?

@CarolEidt This is useful feedback, thanks! I'm going to open a separate issue for movbe support since I think it's complex enough to warrant its own PR. (It requires a cpuid capability check, so it affects AoT scenarios, and I haven't yet created a C++ benchmark app demonstrating that it's a worthwhile optimization.)
```cpp
    instruction ins = genGetInsForOper(tree->OperGet(), targetType);
    inst_RV(ins, targetReg, targetType);
```
You should just use INS_bswap here.
```cpp
    // The BSWAP instruction is undefined for 16-bit inputs. Instead, use ROR 8 on the
    // low word register bits. Additionally, BSWAP is undefined for 64-bit inputs on
```
We could probably make decomposition generate 2 BSWAPs from a long BSWAP on x86.
That's how this works internally. If the developer targets x86 but calls swap(ulong), the call site remains a normal method call. The implementation of swap(ulong) is written in terms of swap(uint), which gets correctly flagged as an intrinsic and converted to bswap.

The flip side is that this is the same reason swap(<const ulong>) doesn't currently get optimized properly, unfortunately. :(
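For illustration, a sketch of that managed decomposition (this mirrors the shape described above; the exact corefx code may differ):

```csharp
// swap(ulong) written in terms of swap(uint): the two halves are swapped individually
// and exchanged, so a 32-bit target can still use the 32-bit bswap intrinsic twice.
static ulong ReverseEndianness64(ulong value)
{
    uint lower = (uint)value;
    uint upper = (uint)(value >> 32);
    return ((ulong)ReverseEndianness32(lower) << 32) | ReverseEndianness32(upper);
}

// Software stand-in for the 32-bit swap; in the real code this call becomes the intrinsic.
static uint ReverseEndianness32(uint value)
{
    return (value >> 24)
         | ((value & 0x00FF0000u) >> 8)
         | ((value & 0x0000FF00u) << 8)
         | (value << 24);
}
```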
I see. It might make more sense to rely on the JIT's decomposition on x86 though, if only because it avoids such ifdefs and inconsistencies between architectures. But it's not obvious what the best solution is; we'll see what the team thinks about it.
I agree with @mikedn that it might be cleaner to handle the "longs on 32-bit target" in the JIT. However, I'm interested in what the perf issues are that you're seeing - is it because it interferes with inlining? Or that it generates extra copies? Or something else? The decomposition approach means that the front-end (importer and optimizations) of the JIT will treat these as operations on 64-bit values. Then, the backend "decomposes" these into the appropriate sequence of operations on 32-bit values. In general, this allows the JIT to do a better job of optimizing these in the front-end (though sometimes the decomposition itself can expose code sequences that might have been optimized if they had been expanded earlier; the classic "phase ordering" problem).
The perf issue I mentioned was due to not enlightening morph's const folding logic to account for GT_BSWAP. This has been resolved with the latest iteration of the PR.

If you both still believe the best design is to have the JIT fully handle 64-bit byte swaps rather than falling back to the managed decomposition implementation, I can investigate that, but I need some assistance being pointed in the right direction. Is CodeGen::genCodeForShiftLong a good source of inspiration to turn to?
```cpp
    case TYP_SHORT:
    case TYP_USHORT:
        retNode = gtNewOperNode(GT_ROR, callType, impPopStack().val, gtNewIconNode(8));
```
That's probably risky business. With some exceptions, the JIT does not use short/ushort nodes, and evaluation is usually done using int/long.
Can you elaborate a bit on the risk? I believe I'm switching on the target method's return type, not the local value node type.
You're switching on the call type, but calls are one of those few nodes that may use small integer types. The others are indirs and lclvar nodes (with indirs being the most common). And those end up producing int values by automatically extending from small int to int.

The net effect is that you may end up with bugs for various reasons:

- ror codegen uses the "actual" node type and emits a 32-bit instruction
- ror codegen uses the node type and emits a 16-bit instruction, but the consuming node expects a 32-bit value. The upper 16 bits of the 32-bit value may be garbage.
- constant folding may mishandle small int nodes (assuming it pays attention to ror)

I think the biggest danger is that this may appear to work in most cases and fail in some rare and odd circumstances.
I was hesitant to add this particular overload in the first place, but it turns out that byte swapping 16-bit values is fairly common in networking stacks, so it seems like this scenario might deserve first-class support in some manner. I used the ror reg_low16, 8 idiom here because it's the way MSVC++ implements byte swaps over WORD data types. In my experiments, there was always a movzx reg_32, reg_low16 either immediately before or immediately after the ror reg_low16, 8 instruction, so that should account for the garbage bits in the high WORD.

But admittedly this is outside my area of expertise, so I'll defer to your judgment here. If you have suggestions for a safer way to do this, I'm all ears. :)
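For reference, the managed equivalent of that 16-bit idiom is simply a rotate by 8; a one-line sketch (not the PR's exact code):

```csharp
// Swapping the two bytes of a ushort is a 16-bit rotate by 8, which maps onto ror reg16, 8.
static ushort ReverseEndianness16(ushort value) => (ushort)((value << 8) | (value >> 8));
```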
> If you have suggestions for a safer way to do this, I'm all ears. :)

Right now, the only idea that comes to mind is to have a GT_BSWAP16 that is defined as "swaps the lowest 2 bytes of an integer value". Though it's kind of peculiar, I don't think there's any similar case in the JIT today.

cc @CarolEidt Any thoughts on how a 16-bit byte swap could be represented in the JIT's IR? It's somewhat problematic due to the JIT's tendency to do everything with 32/64-bit integers.
I also think that the right way to deal with the 16-bit case is to use something like GT_BSWAP16. The question is whether it should implicitly clear the upper 16 bits, or whether to have the JIT always generate an explicit cast to cause those bits to be cleared. In the latter case, the cast could theoretically be eliminated if the consumer uses only the low bits.
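Whichever IR representation is chosen, the observable managed behavior is fixed; a small illustration (the values are arbitrary examples):

```csharp
using System;
using System.Buffers.Binary;

class Bswap16UpperBits
{
    static void Main()
    {
        ushort swapped = BinaryPrimitives.ReverseEndianness((ushort)0x1234); // 0x3412
        uint widened = swapped; // the upper 16 bits of the widened value must be zero
        Console.WriteLine($"0x{widened:X8}"); // prints 0x00003412
    }
}
```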
> The System.Memory package already exposes BinaryPrimitives.ReverseEndianness APIs, and those can call through to these implementations

We have tried similar hops in the past (e.g. with Span), but they never survived for long. It may be better to recognize BinaryPrimitives.ReverseEndianness as an intrinsic directly - perhaps move BinaryPrimitives to CoreLib.
Rebased and responded to some PR comments. This should address the majority of the reliability feedback that reviewers have given (thanks all!), but it's still WIP.

Summary of changes:

- Condition the bswap intrinsic logic on a check for x86 / x64
- Removed GT_BSWAP logic from genGetInsForOper and moved it all into genCodeForBswap
- Use GT_BSWAP16 instead of GT_BSWAP for 16-bit byte swap operations
- Enlighten morph's const folding to understand GT_BSWAP/GT_BSWAP16

This fixed a handful of problems the original prototype had, including mishandling of 64-bit const values. I tested both consts and non-consts across all three overloads on both x86 and x64, and the behavior was correct in all cases. If a const is provided, regardless of arch, constant folding will take over. If a non-const is provided, it emits a ror reg.16, 8 or bswap reg instruction as appropriate, except for the 64-bit overload on 32-bit archs, which gets decomposed into two bswap reg instructions. (A quick usage sketch follows.)
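A short usage sketch of the const and non-const behaviors described above (the values here are arbitrary examples, not the PR's test data):

```csharp
using System;
using System.Buffers.Binary;

class ReverseEndiannessUsage
{
    static void Main(string[] args)
    {
        // Constant input: constant folding takes over on every architecture.
        ulong folded = BinaryPrimitives.ReverseEndianness(0x0123456789ABCDEFul); // 0xEFCDAB8967452301

        // Non-constant input: emits bswap (or ror reg16, 8 for the 16-bit overload);
        // on 32-bit archs the 64-bit overload is decomposed into two 32-bit swaps.
        ulong nonConst = 0x1122334455667788ul + (ulong)args.Length;
        Console.WriteLine($"0x{folded:X16}");
        Console.WriteLine($"0x{BinaryPrimitives.ReverseEndianness(nonConst):X16}");
    }
}
```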
To-do:

- Not expose these methods on BitConverter; instead consider moving BinaryPrimitives into corelib (needs API review?)
- Add tests (how?)
- Add support for movbe (probably not as part of this PR? significantly impacts AoT compilation.)
> Not expose these methods on BitConverter; instead consider moving BinaryPrimitives into corelib (needs API review?)

Moving implementations of existing APIs to CoreLib does not need API review. It should be done in a separate PR that just moves the implementation and does not have any other changes.
I believe all open issues are resolved as of the latest iteration of the PR. Here's the current status:

- This causes the JIT to special-case BinaryPrimitives.ReverseEndianness instead of introducing a new internal method BitConverter.ByteSwap.
- On 32-bit x86, the 16-bit and 32-bit overloads of ReverseEndianness are replaced with intrinsics. The 64-bit overload calls into the 32-bit overloads as helper methods.
- On x64, the 16-bit, 32-bit, and 64-bit overloads of ReverseEndianness are treated as intrinsics.
- On arm, none of the overloads are treated as intrinsics, and the runtime falls back to the existing software implementation.
- I verified that constant folding is handled as expected by the changes in gentree.cpp.
- There are tests both here and in corefx verifying the correct behavior of these APIs.

The only thing I wasn't able to get to was providing the 64-bit fallback implementation entirely in the VM. If this is still important, I hope it's acceptable to allow this PR to go through as-is (modulo any new feedback) and to open a new work item for providing the VM fallback.
Removing WIP from title.
GrabYourPitchforks changed the title from "[WIP] Add support for BSWAP intrinsic" to "Add support for BSWAP intrinsic".
I rebased the PR since there were some recent commits into master that addressed flaky CI + unit tests.
JIT recognizes calls to BinaryPrimitives.ReverseEndianness and emits a bswap instruction
- Only works for x86 / x64; ARM still uses fallback logic
@mikedn I also found this code in morph.cpp's Compiler::fgMorphSmpOp routine:

```cpp
case GT_NOT:
case GT_NEG:
    /* Any constant cases should have been folded earlier */
    noway_assert(!op1->OperIsConst() || !opts.OptEnabled(CLFLG_CONSTANTFOLD) || optValnumCSE_phase);
    break;
```

Should this be enlightened with GT_BSWAP[16] support? It seems like it would fit the right pattern: we'd want the assert to fire if constant folding wasn't properly performed.
@dotnet/jit-contrib Any final feedback on this? I have an open question re: Compiler::fgMorphSmpOp, but it doesn't seem to affect correctness.
Sorry, I missed your earlier question. I don't think that the fgMorphSmpOp assert is really useful. It's not like having a constant op node is breaking some kind of rule. It's surprising that a noway_assert is used there.
Just one small comment on a comment.
dotnet-maestro-bot pushed a commit to dotnet-maestro-bot/corefx that referenced this pull request
With this change, the JIT will recognize a call to BinaryPrimitives.ReverseEndianness and will emit a bswap instruction. This logic is currently only hooked up for x86 and x64; ARM still uses fallback logic. If the JIT can't emit a bswap instruction (for example, trying to emit a 64-bit bswap in a 32-bit process), it will fall back to a software implementation, so the APIs will work across all architectures.
Signed-off-by: dotnet-bot dotnet-bot@microsoft.com
stephentoub pushed a commit to dotnet/corefx that referenced this pull request
```csharp
            Console.WriteLine($"Expected: 0x{nonConstUInt64Expected:X16}");
        }
        return Pass;
```
@GrabYourPitchforks I might be wrong, but it looks like this test never fails, even if something goes wrong. The code prints to the output but returns Pass anyway, and Fail is never used.

Besides that, this is a great PR and a great example of introducing new intrinsics!
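A sketch of the stricter check suggested above, which actually returns Fail on a mismatch (Pass = 100 follows the usual coreclr test convention; the input and expected values below are illustrative, not the PR's actual test data):

```csharp
using System;
using System.Buffers.Binary;

static class ReverseEndiannessTest
{
    const int Pass = 100;
    const int Fail = -1;

    static int Main()
    {
        ulong nonConstUInt64Input = 0x0123456789ABCDEFul;
        ulong nonConstUInt64Expected = 0xEFCDAB8967452301ul;

        ulong actual = BinaryPrimitives.ReverseEndianness(nonConstUInt64Input);
        if (actual != nonConstUInt64Expected)
        {
            Console.WriteLine($"Expected: 0x{nonConstUInt64Expected:X16}");
            Console.WriteLine($"Actual:   0x{actual:X16}");
            return Fail;
        }
        return Pass;
    }
}
```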
dotnet-maestro-bot pushed a commit to dotnet-maestro-bot/corert that referenced this pull request
jkotas pushed a commit to dotnet/corert that referenced this pull request
A-And pushed a commit to A-And/coreclr that referenced this pull request
This was referenced on Feb 28, 2019.
picenka21 pushed a commit to picenka21/runtime that referenced this pull request
Commit migrated from dotnet/coreclr@f72025c