Rewrite Buffer.BlockCopy in C# by jkotas · Pull Request #27216 · dotnet/coreclr

I hope that hardware will add memmove instruction one day that will be superior to hand tuned implementations in every dimension.

I'm pretty sure this is what ERMSB (Enhanced REP MOVSB/STOSB) is meant to be, and it does basically what you ask. The two problems are that it isn't supported everywhere yet and that it has some overhead, which makes it undesirable for small copies.

Trying to hand-tune memmove in software is a losing battle.

I would agree that hand-tuning for the best perf is likely a losing battle, but many of the rules around copying blocks of memory are well-defined and documented at this point (namely in the respective architecture manuals). It basically comes down to handling sizes less than 128 bytes and then everything else. The split at 128 bytes exists because that is how much data a prefetch will grab.

If we exposed intrinsics for the above, we could just have some code like:

    if (size < 128) {
        // small copy
    } else if (Cpuid.IsGenuineIntel && Ermsb.IsSupported) {
        Ermsb.MoveBytes(src, dst, size);
    } else if (size < threshold) {
        // large copy in 64-byte chunks using non-temporal loads/stores
    } else {
        // invoke native memcpy
    }

This should provide overall decent performance and fairly closely match what is recommended by the architecture manuals and done by other memcpy implementations.
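
For illustration, here is a rough sketch of the same dispatch in C, since the `Cpuid` and `Ermsb` intrinsics above are not exposed in C#. It assumes GCC or Clang on x86 (it falls back to plain `memcpy` elsewhere) and non-overlapping buffers. The names `has_ermsb`, `copy_rep_movsb`, and `block_copy` are made up for this example. The ERMSB check reads CPUID leaf 7, subleaf 0, EBX bit 9. The vendor check and the non-temporal branch are left out, so anything that does not take the `rep movsb` path simply goes to `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#define SKETCH_X86_GNU 1
#include <cpuid.h>
#else
#define SKETCH_X86_GNU 0
#endif

/* Returns 1 if CPUID reports ERMSB (leaf 7, subleaf 0, EBX bit 9), else 0.
   On non-x86 targets or other compilers this sketch always returns 0. */
static int has_ermsb(void) {
#if SKETCH_X86_GNU
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return (int)((ebx >> 9) & 1u);
#endif
    return 0;
}

/* Forward byte copy via `rep movsb`. Assumes the buffers do not overlap
   and that the direction flag is clear, as the usual ABIs guarantee. */
static void copy_rep_movsb(void *dst, const void *src, size_t n) {
#if SKETCH_X86_GNU
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);
#endif
}

/* Size-based dispatch mirroring the snippet above (hypothetical helper). */
void block_copy(void *dst, const void *src, size_t n) {
    assert(n == 0 || (dst != NULL && src != NULL));
    if (n < 128) {
        /* Small copy: a plain byte loop stands in for a tuned small-size path. */
        unsigned char *d = dst;
        const unsigned char *s = src;
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else if (has_ermsb()) {
        copy_rep_movsb(dst, src, n);
    } else {
        /* Non-temporal chunked path omitted; defer to the C library. */
        memcpy(dst, src, n);
    }
}
```

A real implementation would likely cache the CPUID result rather than query it on every call, and would need a backward copy path to get memmove semantics for overlapping buffers.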