Make "async ValueTask/ValueTask" methods ammortized allocation-free by stephentoub · Pull Request #26310 · dotnet/coreclr (original) (raw)

~5.5% throughput regression when the cache is off

That case should be unaffected by parallelism. If you see opportunities in this PR to address that, I'd love to hear about it.

I also put your example into BenchmarkDotNet form:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Running;
using System;
using System.Threading.Tasks;

[MemoryDiagnoser]
public class Program
{
    static void Main(string[] args) =>
        BenchmarkSwitcher.FromAssemblies(new[] { typeof(Program).Assembly }).Run(args);

    [Benchmark]
    public void Parallel()
    {
        var tasks = new Task[Environment.ProcessorCount * 4];
        for (int i = 0; i < tasks.Length; i++) tasks[i] = Work();
        Task.WaitAll(tasks);
    }

    static async ValueTask<int> Step(int i)
    {
        if (i >= 999) await Task.Delay(1);
        else await Task.Yield();
        return i + 1;
    }

    static async Task Work()
    {
        var r = new Random();
        for (int i = 0; i < 100000; i++)
        {
            await Step(r.Next(1000));
        }
    }
}
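(If it helps for reproducing the new/old comparison below: with a recent BenchmarkDotNet, two CoreRun builds can be compared in one run via the --coreRun argument, e.g. `dotnet run -c Release -- --filter *Parallel* --coreRun \old\corerun.exe \new\corerun.exe`; the paths are placeholders for wherever the two builds live.)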

and I get:

| Method   | Toolchain        | Mean    | Error    | StdDev   | Ratio | Gen 0      | Gen 1     | Gen 2 | Allocated |
|----------|------------------|--------:|---------:|---------:|------:|-----------:|----------:|------:|----------:|
| Parallel | \new\corerun.exe | 1.685 s | 0.0161 s | 0.0150 s |  1.00 | 99000.0000 | 2000.0000 |     - |  586.7 MB |
| Parallel | \old\corerun.exe | 1.692 s | 0.0218 s | 0.0204 s |  1.00 | 99000.0000 | 2000.0000 |     - |  586.7 MB |

Maybe there's a machine-specific aspect to it? (There's also both randomness and "saturate the machine" parallelism here, so there's going to be variability.)

Note that the goal is zero perf impact on an example like yours when the cache is off: the off path is meant to use exactly the same infrastructure as is used today, with as little ceremony as we can muster to make turning on the cache optional. There are also a few tweaks intended to improve ValueTask perf even when the cache is off.
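One example of keeping the off path free: the switch can be read once into a static readonly field, which the JIT can typically treat as a constant in optimized code so the disabled branch effectively disappears. A minimal sketch (the environment variable name here is made up, not the one this PR uses):

using System;

internal static class AsyncValueTaskPoolingSwitch
{
    // Read once at startup; subsequent checks are a cheap static field test that
    // the JIT can typically fold away once the value is known.
    internal static readonly bool Enabled =
        Environment.GetEnvironmentVariable("EXAMPLE_POOL_ASYNC_VALUETASKS") == "1";
}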

Also note that a large portion of the time spent in your example is in the Task.Delay(1), which on Windows is going to incur an ~15ms delay. If I delete that part, I get:

| Method   | Toolchain        | Mean     | Error   | StdDev  | Ratio | RatioSD | Gen 0      | Gen 1     | Gen 2 | Allocated |
|----------|------------------|---------:|--------:|--------:|------:|--------:|-----------:|----------:|------:|----------:|
| Parallel | \new\corerun.exe | 541.3 ms | 8.09 ms | 7.57 ms |  0.96 |    0.02 | 92000.0000 | 2000.0000 |     - | 549.34 MB |
| Parallel | \old\corerun.exe | 562.8 ms | 4.99 ms | 4.67 ms |  1.00 |    0.00 | 92000.0000 | 2000.0000 |     - | 549.34 MB |

~2% throughput regression and more volatile performance when the cache is on

My machine isn't as quiet as I'd like for a parallelism test, but I just ran it three times (without the Task.Delay, which dominates) and got:

| Method   | Toolchain        | Mean     | Error    | StdDev   | Ratio | RatioSD | Gen 0      | Gen 1     | Gen 2 | Allocated |
|----------|------------------|---------:|---------:|---------:|------:|--------:|-----------:|----------:|------:|----------:|
| Parallel | \new\corerun.exe | 538.8 ms | 10.68 ms | 14.26 ms |  0.97 |    0.03 | 46000.0000 | 1000.0000 |     - | 275.13 MB |
| Parallel | \old\corerun.exe | 556.5 ms |  8.63 ms |  8.08 ms |  1.00 |    0.00 | 92000.0000 | 2000.0000 |     - | 549.34 MB |

| Method   | Toolchain        | Mean     | Error    | StdDev   | Ratio | RatioSD | Gen 0      | Gen 1     | Gen 2 | Allocated |
|----------|------------------|---------:|---------:|---------:|------:|--------:|-----------:|----------:|------:|----------:|
| Parallel | \new\corerun.exe | 547.6 ms | 10.70 ms | 16.97 ms |  0.95 |    0.02 | 47000.0000 | 1000.0000 |     - | 280.52 MB |
| Parallel | \old\corerun.exe | 571.8 ms |  2.94 ms |  2.75 ms |  1.00 |    0.00 | 92000.0000 | 2000.0000 |     - | 549.34 MB |

| Method   | Toolchain        | Mean     | Error    | StdDev   | Ratio | RatioSD | Gen 0      | Gen 1     | Gen 2 | Allocated |
|----------|------------------|---------:|---------:|---------:|------:|--------:|-----------:|----------:|------:|----------:|
| Parallel | \new\corerun.exe | 547.6 ms | 10.83 ms | 22.13 ms |  0.98 |    0.03 | 49000.0000 | 1000.0000 |     - | 296.21 MB |
| Parallel | \old\corerun.exe | 556.9 ms |  5.75 ms |  5.38 ms |  1.00 |    0.00 | 92000.0000 | 2000.0000 |     - | 549.34 MB |

Putting the Task.Delay back in, I again saw basically no difference in throughput (obviously the allocation is much different):

| Method   | Toolchain        | Mean    | Error    | StdDev   | Ratio | RatioSD | Gen 0      | Gen 1     | Gen 2 | Allocated |
|----------|------------------|--------:|---------:|---------:|------:|--------:|-----------:|----------:|------:|----------:|
| Parallel | \new\corerun.exe | 1.922 s | 0.0383 s | 0.0790 s |  1.00 |    0.05 | 43000.0000 | 2000.0000 |     - | 257.68 MB |
| Parallel | \old\corerun.exe | 1.924 s | 0.0379 s | 0.0729 s |  1.00 |    0.00 | 99000.0000 | 3000.0000 |     - | 586.73 MB |

new / off: 2,659,316 bytes (2% regression)

There's a layer of helpers this is going through in order to enable the opt-in switch. If we were to switch to an always-on model, I'd expect this to go away.

new / on: 2,810,236 bytes (8% regression)

Every async ValueTask or async ValueTask<T> method now references two different sets of builders, and every async ValueTask{<T>} builder now carries the caching support as well. If we were to switch to an always-on model, I'd hope a good chunk of this would go away.
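To illustrate the shape of that (a heavily simplified, hypothetical sketch, not the PR's actual builder code, which lives in AsyncValueTaskMethodBuilder{<T>}): the builder ends up carrying both the existing Task-backed path and the pooled path, and picks between them when producing the awaitable:

using System.Runtime.CompilerServices;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

internal struct ExampleValueTaskMethodBuilder<T>
{
    // Stand-in for the opt-in switch sketched earlier.
    private static readonly bool s_useCache = false;

    private AsyncTaskMethodBuilder<T> _taskBuilder; // existing Task-backed machinery
    private IValueTaskSource<T> _pooledSource;      // pooled machinery, used only when the cache is on

    // The ValueTask handed back to the caller is backed by one or the other.
    public ValueTask<T> Task =>
        s_useCache && _pooledSource != null
            ? new ValueTask<T>(_pooledSource, token: 0) // token handling elided for brevity
            : new ValueTask<T>(_taskBuilder.Task);
}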

There's another option as well. Right now the switch flips between doing exactly what we do today (async completions are backed by Task instances) and using the cache (async completions are backed by pooled IValueTaskSource instances). I could instead make the switch control only whether the cache is used, i.e. always back completions with IValueTaskSource instances, but when the switch is off, behave as if no instances could ever be retrieved from the cache (and don't try to return anything to it). The upside is that async ValueTask{<T>} methods wouldn't pull in as much from Task<T>, since they wouldn't need to utilize it. The downside is that the visible there-but-not-supported behaviors Task ends up surfacing through ValueTask (e.g. .GetAwaiter().GetResult() working on incomplete ValueTasks even though we say that's not allowed) wouldn't be configurable with the switch.
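For concreteness, a minimal sketch of that second option (hypothetical and heavily simplified; the PR itself pools the async state machine box rather than a separate source object): completions are always backed by an IValueTaskSource<T> built on ManualResetValueTaskSourceCore<T>, and the switch only decides whether instances are ever rented from or returned to the pool:

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

internal sealed class PooledValueTaskSource<T> : IValueTaskSource<T>
{
    // Stand-in for the opt-in switch: off means the pool never yields or accepts an instance.
    private static readonly bool s_poolingEnabled = false;
    private static readonly ConcurrentQueue<PooledValueTaskSource<T>> s_pool =
        new ConcurrentQueue<PooledValueTaskSource<T>>();

    private ManualResetValueTaskSourceCore<T> _core;

    public static PooledValueTaskSource<T> Rent() =>
        // With the switch off, behave as if the pool were always empty: allocate a fresh instance.
        s_poolingEnabled && s_pool.TryDequeue(out var source) ? source : new PooledValueTaskSource<T>();

    public ValueTask<T> AsValueTask() => new ValueTask<T>(this, _core.Version);

    public void SetResult(T result) => _core.SetResult(result);

    public T GetResult(short token)
    {
        T result = _core.GetResult(token);
        _core.Reset();                              // make the instance reusable
        if (s_poolingEnabled) s_pool.Enqueue(this); // return it only when pooling is on
        return result;
    }

    public ValueTaskSourceStatus GetStatus(short token) => _core.GetStatus(token);

    public void OnCompleted(Action<object> continuation, object state, short token, ValueTaskSourceOnCompletedFlags flags) =>
        _core.OnCompleted(continuation, state, token, flags);
}

Either way the switch stays cheap; with it off the pool is simply never touched, and the difference from today is only that the backing object is an IValueTaskSource rather than a Task.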

Having said this, I am fine with experimenting with this to see whether somebody can demonstrate measurable throughput improvement thanks to the caching.

I'm having trouble reconciling that with your stats and comments about GC tuning, as it sounds like your opinion is that (a) this shouldn't be necessary, and (b) if it is, it's because of a GC perf bug that should be fixed... and then see (a). 😉

Is your preference that I just close this PR?