Initial 128-bit SIMD proposal by stoklund · Pull Request #1 · WebAssembly/simd

@rossberg-chromium, in my opinion we don't need separate types, but possibly a few extra instructions.

To give some background, Intel's SIMD instruction sets have multiple versions of a few instructions. For example, pxor, xorps, and xorpd are architecturally identical implementations of v128.xor, but current micro-architectures will issue pxor to the integer execution stack and the two others to the floating point stack. There has never been a micro-architectural difference between the float-flavored xorps and the double-flavored xorpd as far as I know.

There is a 1-cycle bypass delay when an instruction in the integer stack depends on a result computed in the floating point stack or vice versa. This additional latency only matters if the dependency is on the critical path. As soon as there are a few cycles between the instructions, operands are read from the register file and not from pipeline bypasses. Then there is no longer any difference.
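To make the critical-path point concrete, here is a toy sketch, not a model of any real micro-architecture. The function name `consumer_start`, the domain labels, and the `earliest_issue` parameter are all made up for illustration; the only number taken from the discussion above is the 1-cycle delay.

```python
def consumer_start(producer_done, producer_domain, consumer_domain,
                   earliest_issue, bypass_delay=1):
    """Return the cycle at which a consuming instruction can start.

    producer_done   -- cycle at which the producer's result is available
    earliest_issue  -- cycle at which the consumer could issue for other
                       reasons (e.g. waiting on its other operands)
    bypass_delay    -- assumed extra latency for crossing domains
    """
    operand_ready = producer_done
    if producer_domain != consumer_domain:
        # Crossing between the integer and floating point stacks
        # delays the forwarded operand by the bypass delay.
        operand_ready += bypass_delay
    # The consumer starts when both the operand and the consumer are ready.
    return max(earliest_issue, operand_ready)
```

With these assumptions, a consumer that is waiting directly on the producer starts one cycle later after a domain crossing, while a consumer that could not issue until a few cycles later anyway starts at the same cycle in both cases.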

Other ISAs don't make this distinction; it is an Intel-only thing.

If we want to let WebAssembly producers give the code generator hints about these micro-architectural details, we could do it by providing float-flavored versions of the logical, load/store, and shuffle/swizzle operations. These new operations would be identical to the existing ones except for giving this hint to the code generator. I don't think a new type is required.

I would prefer that we don't do this in the initial proposal, and only add the new instructions if they have demonstrable performance benefits. I am expecting improvements to real-world benchmarks to be in the noise.

Without the hints, a simple SSE code generator would apply a basic heuristic, such as looking at the instructions that computed the inputs and choosing xorps if they are floating-point instructions and pxor otherwise. LLVM's algorithm is a bit more involved, but a little goes a long way.
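A minimal sketch of that basic heuristic might look like the following. It is not LLVM's algorithm, and the function name `choose_xor` and the small `FLOAT_OPS` table are hypothetical; a real code generator would classify its own instruction set. It assumes the literal reading of the rule: pick xorps only when every input was computed by a floating-point instruction, and pxor in all other cases, including mixed or unknown inputs.

```python
# Illustrative subset of SSE mnemonics that execute in the floating point stack.
FLOAT_OPS = {"addps", "subps", "mulps", "divps", "andps", "orps", "xorps"}


def choose_xor(input_producers):
    """Pick an x86 encoding for v128.xor.

    input_producers -- mnemonics of the instructions that computed each
                       input of the xor (empty if unknown).
    """
    if input_producers and all(op in FLOAT_OPS for op in input_producers):
        # All inputs come from the floating point stack: stay there.
        return "xorps"
    # Integer, mixed, or unknown producers: default to the integer form.
    return "pxor"
```

For example, `choose_xor(["addps", "mulps"])` selects xorps, while `choose_xor(["paddd", "addps"])` falls back to pxor.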