How does Out of Order execution work with conditional instructions, Ex: CMOVcc in Intel or ADDNE (Add not equal) in ARM
"I know they can only correctly execute after instructions before them in Re-Order Buffer are committed."
No, they only need their own inputs to be ready: those specific previous instructions executed, not retired / committed.
Conditional-move instructions (and ARM predicated execution) treat the flags input as a data dependency, just like add-with-carry, or just like an integer input register. The conditional instruction can't be sent to an execution unit until all 3 of its inputs (flags, the source operand, and the old value of the destination) are ready (see Footnote 1). (Or on ARM, flags + however many inputs the predicated instruction normally has.)
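As a rough illustration of those three inputs, here is a minimal C sketch. The function name select_lt is made up for this example, and the asm in the comments is only what a compiler might plausibly emit for x86-64 System V; whether you actually get a cmov or a branch depends on the compiler, its flags, and the target.

```c
/* Hypothetical helper (not from the original answer):
 * return x if a < b (signed), otherwise y.
 *
 * One plausible x86-64 lowering, shown only as an assumption:
 *     mov   eax, ecx    ; result = y
 *     cmp   edi, esi    ; writes FLAGS from a - b
 *     cmovl eax, edx    ; reads FLAGS, edx (x), and the old eax (y)
 *     ret
 * The cmovl can't execute until cmp has produced FLAGS and both
 * register inputs are ready; nothing about it is predicted.
 */
int select_lt(int a, int b, int x, int y)
{
    int result = y;   /* old value of the destination */
    if (a < b)        /* the compare that produces the flags input */
        result = x;   /* the source operand of the conditional move */
    return result;
}
```

Note that even when the condition is false, the result still depends on the compare having executed: the old destination value is an input that has to be passed through.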
Unlike with control dependencies (branches), they don't predict or speculate what the flags will be, so a cmovcc instead of a jcc can create a loop-carried dependency chain and end up being worse than a predictable branch. The Stack Overflow question "gcc optimization flag -O3 makes code slower than -O2" is an example of that.
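To make the loop-carried chain concrete, here is a small C sketch of a running maximum written two ways. The function names are made up for this example, and the source form does not force the code generation: a compiler is free to emit a branch, a cmov, or vectorized code for either version, so treat the comments as a description of the two possible lowerings rather than a guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example. If this compiles to cmp + jcc, the branch is
 * predicted, so when new maxima are rare (e.g. mostly-sorted input) the
 * next iteration's work can start speculatively before this compare
 * finishes. A misprediction costs a pipeline flush, though. */
uint32_t max_branchy(const uint32_t *v, size_t n)
{
    uint32_t best = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > best)
            best = v[i];
    }
    return best;
}

/* Hypothetical example. If this compiles to cmp + cmov, every iteration's
 * cmp needs the previous iteration's cmov result (best), and the cmov
 * needs the flags from that cmp. That is a serial chain of roughly
 * compare latency + cmov latency per element, with no prediction. */
uint32_t max_select(const uint32_t *v, size_t n)
{
    uint32_t best = 0;
    for (size_t i = 0; i < n; i++)
        best = (v[i] > best) ? v[i] : best;
    return best;
}
```

Both functions return the same value; only the shape of the dependency chain differs. Which one is faster depends on how predictable the branch is, which is the trade-off described above.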
Linus Torvalds explains in more detail why cmov often sucks: https://yarchive.net/comp/linux/cmov.html
(ARM predicated execution might be handled slightly differently. It has to logically NOP the instruction, even for a load or store to an invalid address. This might be handled with just fault suppression for conditional loads. I don't know if an instruction with a false predicate still costs any latency in the dependency chain for the destination register.)
Footnote 1: This is why cmovcc and adc are 2 uops on Intel before Broadwell: a single uop couldn't have 3 input dependencies. Haswell introduced support for 3-input uops for FMA.
cmov instructions that read CF and one of the SPAZO flags (i.e. cmova and cmovbe, which read CF and ZF) are actually still 2 uops on Skylake. See this Q&A for detail: it seems that those two separately-renamed groups of flags are both separate inputs, avoiding flag-merging. See also https://uops.info/ for uop counts.
See also http://agner.org/optimize/, and https://stackoverflow.com/tags/x86/info for more about x86 microarch details, and optimization guides.