Optimize GC.AllocateUninitializedArray and use it in StringBuilder by adamsitnik · Pull Request #27364 · dotnet/coreclr
I wanted to use `GC.AllocateUninitializedArray` in `StringBuilder`, but it was initially too slow. Calling it for small buffers caused a quite noticeable performance degradation.

I've tuned it so that it does not slow down `StringBuilder` on the "unlucky path" (small arrays) and does improve performance on the "lucky path" (big arrays). This should make the API more profitable to use in other places in the future.
Changes:
- Remove one branch by changing the `int` to `uint` (`if (length < 0)`). This changes the behavior of this internal API for `length < 0`: previously the caller would get `IndexOutOfRangeException`. Since this is an internal API, I hope it's OK.
- Increase the threshold from 256 to 2048 bytes (please see the results below).
- Enforce inlining of `GC.AllocateUninitializedArray` and move the expensive native call to a separate method so that the inlined code size does not grow too much.
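
To make the three changes above concrete, here is a minimal sketch of the general pattern. It is not the code from this PR: the class name `UninitializedArraySketch`, the method names `Allocate` and `AllocateSlow`, and the way the unsigned comparison is written are assumptions for illustration. Only the 2048-byte threshold comes from the description above. The slow path uses the public `GC.AllocateUninitializedArray<T>` (available in .NET 5 and later) as a stand-in for the internal native call.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class UninitializedArraySketch
{
    // Threshold in bytes, taken from the PR description. Everything else here is hypothetical.
    private const int ThresholdBytes = 2048;

    // Forced inlining keeps the cheap checks in the caller, so small allocations
    // cost about the same as a plain "new T[length]".
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static T[] Allocate<T>(int length)
    {
        // Arrays that hold references must always be zeroed.
        if (RuntimeHelpers.IsReferenceOrContainsReferences<T>())
            return new T[length];

        // One unsigned comparison covers both "length is negative" and
        // "length is below the threshold": a negative int becomes a very large
        // uint, fails this test, and is left to the slow path to reject.
        if ((uint)length < (uint)(ThresholdBytes / Unsafe.SizeOf<T>()))
            return new T[length];

        return AllocateSlow<T>(length);
    }

    // Kept out of line so the inlined fast path stays small.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static T[] AllocateSlow<T>(int length)
        => GC.AllocateUninitializedArray<T>(length); // stand-in for the native call
}
```

Under these assumptions, `byte` arrays shorter than 2048 elements and `char` arrays shorter than 1024 elements take the `new T[]` path, and anything at or above those lengths goes through the out-of-line method.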
Micro benchmarks for the GC API:

    using System;
    using System.Reflection;
    using BenchmarkDotNet.Attributes;

    [GenericTypeArguments(typeof(byte))]
    [GenericTypeArguments(typeof(char))]
    [GenericTypeArguments(typeof(object))]
    public class Perf_GC<T>
    {
        // Both benchmarks go through a delegate for an apples-to-apples comparison.
        private readonly Func<int, T[]> _allocateUninitializedArrayDelegate = CreateDelegate<int>(typeof(GC), "AllocateUninitializedArray");
        private readonly Func<int, T[]> _allocateArrayDelegate = CreateDelegate<int>(typeof(Mimic), "AllocateArray");

        [Params(256, 256 * 2, 256 * 3, 256 * 4, 256 * 6, 256 * 8)]
        public int Length;

        [Benchmark]
        public T[] AllocateUninitializedArray() => _allocateUninitializedArrayDelegate(Length);

        [Benchmark]
        public T[] AllocateArray() => _allocateArrayDelegate(Length);

        private static Func<N, T[]> CreateDelegate<N>(Type type, string methodName)
        {
            // This method is not a part of .NET Standard, so we need to use reflection.
            // "?." keeps a missing method from throwing before the null check below.
            var method = type
                .GetMethod(methodName, BindingFlags.NonPublic | BindingFlags.Static)
                ?.MakeGenericMethod(typeof(T));

            return method != null ? (Func<N, T[]>)method.CreateDelegate(typeof(Func<N, T[]>)) : null;
        }
    }

    public static class Mimic
    {
        internal static T[] AllocateArray<T>(int size) => new T[size];
    }

I've simplified the default BDN output to make it easier to compare the results. In the table below, "Before" is the execution time of `GC.AllocateUninitializedArray` before my changes and "After" is the execution time with my changes. The `new T[]` column contains the time for calling the `new` operator, to give a baseline for comparison.

| Type | Length | Before | After | `new T[]` |
|---|---|---|---|---|
| Byte | 256 | 78.63 ns | 18.31 ns | 18.17 ns |
| Char | 256 | 79.33 ns | 31.95 ns | 31.66 ns |
| Object | 256 | 113.34 ns | 113.34 ns | 113.03 ns |
| Byte | 512 | 79.37 ns | 31.38 ns | 31.75 ns |
| Char | 512 | 87.60 ns | 58.12 ns | 57.71 ns |
| Object | 512 | 229.02 ns | 229.30 ns | 227.78 ns |
| Byte | 768 | 83.24 ns | 45.51 ns | 45.85 ns |
| Char | 768 | 95.92 ns | 85.20 ns | 84.34 ns |
| Object | 768 | 353.66 ns | 347.39 ns | 349.48 ns |
| Byte | 1024 | 85.99 ns | 58.31 ns | 57.58 ns |
| Char | 1024 | 99.46 ns | 100.62 ns | 112.01 ns |
| Object | 1024 | 457.07 ns | 455.94 ns | 457.47 ns |
| Byte | 1536 | 92.40 ns | 84.84 ns | 84.44 ns |
| Char | 1536 | 111.75 ns | 112.97 ns | 168.02 ns |
| Object | 1536 | 653.64 ns | 649.47 ns | 643.37 ns |
| Byte | 2048 | 100.61 ns | 101.04 ns | 111.81 ns |
| Char | 2048 | 126.52 ns | 125.31 ns | 226.94 ns |
| Object | 2048 | 830.92 ns | 838.90 ns | 836.48 ns |