Managed StelemRef and LdelemaRef by VSadov · Pull Request #32722 · dotnet/runtime
Conversation
Moving StelemRef and LdelemaRef JIT helpers to managed code.
- significantly improves GC suspension latency in code that frequently uses these helpers (it is not uncommon to use array element accessors in loops)
- reduces some duplication of implementations in C++ and in platform/OS-specific asm flavors
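For readers unfamiliar with these helpers, below is a plain-C# sketch of the kind of checks a covariant array store helper performs. Everything here (the StelemRefSketch and StoreElement names, the structure) is illustrative only; the actual managed helper in this PR compares raw method tables and tail-calls the GC write barrier, which plain C# cannot express.

using System;

static class StelemRefSketch
{
    // Illustrative only: the real helper operates on method tables and does a
    // raw write through the write barrier instead of a checked array store.
    public static void StoreElement(object[] array, int index, object value)
    {
        if (array is null)
            throw new NullReferenceException();
        if ((uint)index >= (uint)array.Length)
            throw new IndexOutOfRangeException();

        Type elementType = array.GetType().GetElementType();

        // Fast paths: a null store is always legal for a reference array, and
        // an exact type match needs no further checking.
        if (value is null || value.GetType() == elementType)
        {
            array[index] = value;
            return;
        }

        // Slow path: full assignability check for the covariant store.
        if (!elementType.IsInstanceOfType(value))
            throw new ArrayTypeMismatchException();

        array[index] = value;
    }
}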
Fixes: #32683
VSadov changed the title from [WIP] Managed StelemRef and LdelemaRef to Managed StelemRef and LdelemaRef
VSadov marked this pull request as ready for review
I think this one is ready for review.
The change significantly improves responsiveness to GC suspension in tight loops with these helpers. From 200-500ms it drops to less than 20ms, and most of the time much less than that.
Perf is comparable but slightly worse. In directed microbenchmarks, the fastest scenarios could be 10-20% slower.
The code is roughly the same as before. The biggest difference is the PreStub indirection, which is responsible for the extra cycles. These are very fast helpers, so even small overheads are noticeable.
With some hacks the PreStub can be defeated. After that the perf becomes roughly the same as baseline in fastest scenarios (exact match, null) or actually faster when typecheck is needed.
I am not adding the PreStub hacks here, since they are just hacks and not production ready. We should consider that as a separate change. That would help other managed JIT helpers too.
The idea is to tell MethodDesc that it is a JIT helper and then whenever native code is updated, update the cell in the JIT helper table as well.
The indirection is because of tiered compilation. It may be better to focus on #965 and fix it everywhere.
Even without tiering we have indirection. We ask for multicallable entry point very early, so we always get a stub. That is expected, since we do not want to force JIT-ting at that point.
Later the stub may be patched to point to actual native code (or better native code), but JIT helper table still points to the stub.
Not sure if this is a subset of #965 . In some generalized sense - perhaps, but just fixing the scenario in #965 may not fix this one.
From the perf perspective:
On the following sample, with default behavior (i.e. tiering enabled, corlib crossgenned, WKS GC) I get about:
186ms before the change,
210ms after.
That is ~13% regression, but it is a bit contextual. Some code changes could make it a bit better or worse.
using System;
using System.Diagnostics;

class Program
{
    static object[] arr = new Exception[10000];
    static Exception val = new Exception();

    static void Work()
    {
        var v = val;
        var a = arr;
        for (int i = 0; i < 100000000; i++)
        {
            a[1] = v;
        }
    }

    static void Main()
    {
        for (; ; )
        {
            var sw = Stopwatch.StartNew();
            Work();
            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
        }
    }
}
That is ~ 13% regression
Does it vary between platforms?
I did not check other platforms, but win-x64 seems to have the most optimized code originally, so I assumed that if there is any regression, it would be worst on win-x64.
Is any other platform interesting in particular?
Re: GetValueInternal
Did you mean like the following:
notExactMatch:
    if (elementType == (void*)RuntimeTypeHandle.GetValueInternal(typeof(object).TypeHandle))
        goto doWrite;
That does not seem to intrinsify and is much slower.
00007ffa`d1611c79 48b948525dd1fa7f0000 mov rcx, 7FFAD15D5248h
00007ffa`d1611c83 e8b8fbb15f call coreclr!JIT_GetRuntimeType (00007ffb`31131840)
00007ffa`d1611c88 488bc8 mov rcx, rax
00007ffa`d1611c8b e8e045d15f call coreclr!RuntimeTypeHandle::GetValueInternal (00007ffb`31326270)
00007ffa`d1611c90 483bf8 cmp rdi, rax
The following seems an interesting alternative:
notExactMatch:
    if (arr.GetType() == typeof(object[]))
        goto doWrite;
It is slightly more code, but overall has similar performance to the variant with a cached static, without the static.
I will check that other scenarios are not affected much. If they are not, I will keep this variant.
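For context, here is a sketch of what the "cached static" variant mentioned above could look like. The names are mine, and it assumes RuntimeTypeHandle.Value as the raw handle; it is not the PR's code.

using System;

static class TypeCheckSketch
{
    // The static caches the raw type handle so the hot-path comparison is a
    // single load and compare, at the cost of a static field access.
    static readonly IntPtr s_objectArrayHandle = typeof(object[]).TypeHandle.Value;

    static bool IsExactlyObjectArray(object[] arr)
    {
        return arr.GetType().TypeHandle.Value == s_objectArrayHandle;
    }
}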
Is any other platform interesting in particular?
Win x86; any arm64 (one of Win or Linux is enough - the numbers should be the same)
On x86 the same sample as posted above shows ~30% regression (340ms vs. 260ms baseline).
Again - most of the regression disappears if I add the back-patching hack and let the helper be JIT-ed (and the helper table updated) before measuring the Work method.
In fact the managed implementation becomes a few % faster, but these measurements are sensitive and can move a few percent with seemingly unrelated changes, so I would consider that noise.
How bad is the back-patching hack? 30% sounds like too much to take. I think I would be ok with taking a workaround that gets cleaned up in the future.
An example of the hack - VSadov@edb8ae9
It is a minimal implementation that supports just one helper. It is basically at a proof-of-concept stage.
To make it more reasonable, we need:
- To support more than one helper, we need to associate the helper ID with a MethodDesc. There is a shortage of bits on the instance though, and this would be a rare case. Perhaps use a bit in the MethodDesc to indicate it is a helper + store the MethodDesc ref in the helper table somewhere (it is not a big table, we can just scan it for this).
- We have a case where the same MethodDesc represents two helpers. Not a big issue though. We could pick one and leave the other for now, or just duplicate the helper.
- We need to make sure this is the place that sees all cases when native code is updated. I am new to this area, so this may not be the [only] choke point.
This feels fragile. It means that the code JITed before the helper got (tiered) JITed will be stuck using the slow version of the helper.
What is the code quality of the R2R version of the helper? Would it be an option to just use that?
The code that was jitted before the helper was updated will still see the latest/best code, just via a jmp.
I think a bigger concern would be if we update native code twice - to unoptimized version and then to optimized. If we patch the helper on the first update, some code may end up forever calling the slow version. That can be mitigated by patching only when we update to the last/best variant. I assume we can know that.
Updating all the jit-ed code when indirection is no longer needed is a much bigger problem. I assume that is what #965 covers. Once there is mechanism for that, this case could plug into it.
The R2R version has other issues - it does not tailcall (we are sensitive to tail-calling the barrier), statics are accessed via a call, etc. It could cost more than an extra jmp.
Right now it is hard to measure. If I just disable tiering, I end up with both R2R codegen and the stub. And that is obviously slower.
On Linux arm64 with the same sample:
- baseline: 1251ms
- with changes: 1347ms (~7% regression)
The presence of the PreStub hack does not seem to have any meaningful effect. (I did check in the debugger that the hack works on ARM64 and does eliminate the indirection.)
My guess is that the whole thing is more expensive than on x64 and possibly dominated by the cost of write barrier, so extra jump matters much less.
I think we need to do something about the regression. The options that I can think of:
- (easier) JIT the helper synchronously. It will add one or two methods that are JITed during startup; that is not the end of the world.
- (harder) Fix the R2R tailcall problem
The #1 fix implies actually forcing the method to JIT - like in the PrepareMethod sense, right?
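For reference, RuntimeHelpers.PrepareMethod is the managed-facing way to force a method to be JITed up front. A minimal sketch follows; Warmup and SomeHelper are hypothetical names, and the runtime would do the equivalent internally rather than via reflection.

using System.Reflection;
using System.Runtime.CompilerServices;

static class Warmup
{
    // Hypothetical stand-in for a helper like StelemRef.
    internal static void SomeHelper() { }

    static void Main()
    {
        // PrepareMethod compiles the target eagerly, so subsequent calls bind
        // to real native code instead of going through the PreStub.
        MethodInfo m = typeof(Warmup).GetMethod(
            nameof(SomeHelper), BindingFlags.NonPublic | BindingFlags.Static);
        RuntimeHelpers.PrepareMethod(m.MethodHandle);
    }
}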
// Do the instrumentation and publish atomically, so that the
// instrumentation data always matches the published code.
CrstHolder gcCoverLock(&m_GCCoverCrst);
crst not initialized...
@@ -195,6 +195,25 @@ void ECall::PopulateManagedCastHelpers()
pDest = pMD->GetMultiCallableAddrOfCode();
SetJitHelperFunction(CORINFO_HELP_UNBOX, pDest);
// Array element accessors are more perf sensitive than other managed helpers and indirection
// costs introduced by PreStub could be noticeable (7% to 30% depending on platform).
Add a comment that this should be revisited once #5857 is fixed?
@davidwrighton This will make us JIT two more small methods during startup for now. I think it is ok to take this and work on fixing the JITing separately. Are you ok with this as well?
Thanks!
@jkotas, adding 2 tiny methods to JIT on startup is not significant. If it solves a real problem, it's ok. (Not ideal, but ok.)
The helper calls in the dasm look like they always remain helper calls. A question though: could they inline, and could, for example, the if (elementType == type) test in LdelemaRef be hoisted out of the loop by the JIT?
What you are suggesting is introducing these calls early - as method calls - and then letting the regular inlining/CSE deal with them appropriately.
The thought did occur while implementing these. It would be interesting, especially for Ldelema.
Stelem could be tricky since there is a call to the write barrier method, and the JIT does not know that it has the same semantics as an assignment.
Yea, I was having a look at the asm since it's now emitted for JitDisasm=* 😄
StelemRef is more bulky; however Ldelema would likely be more highly used? (Reads are generally more frequent than writes, else why write.)
Additionally, if it is used in a for (int i = 0; i < a.Length; i++) type loop and Ldelema inlines, there is a possibility for the array bounds check to also be eliminated (as it matches the for condition).
Ldelema is used when an element is used by reference - foo(ref arr[1]); ref var x = ref arr[2]; etc. Not sure how common element byrefs are overall. Could be less common than writes. They can be used in tight loops though.
Ordinary reads are just reads, without helpers.
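A self-contained version of those fragments (foo renamed to a hypothetical Bump), showing the element-by-reference uses that go through LdelemaRef:

using System;

class ByRefElements
{
    static void Bump(ref object slot) => slot = new object();

    static void Main()
    {
        object[] arr = new object[3];

        // Both forms need the element's address, so the compiler emits ldelema;
        // for a reference-type array LdelemaRef does the covariance type check.
        Bump(ref arr[1]);
        ref var x = ref arr[2];
        x = "hello";

        Console.WriteLine(arr[1] != null && arr[2] != null);
    }
}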
ghost locked as resolved and limited conversation to collaborators
je ThrowNullReferenceException
; we only want the lower 32-bits of edx, it might be dirty
or edx, edx
Original code truncates the index to 32 bits.