x86: Align jump targets to 1-byte boundaries · Rust-for-Linux/linux@be6cb02 (original) (raw)

Skip to content

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sign up

Appearance settings

Commit be6cb02

author

Ingo Molnar

committed

x86: Align jump targets to 1-byte boundaries

The following NOP in a hot function caught my attention: > 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) That's a dead NOP that bloats the function a bit, added for the default 16-byte alignment that GCC applies for jump targets. I realize that x86 CPU manufacturers recommend 16-byte jump target alignments (it's in the Intel optimization manual), to help their relatively narrow decoder prefetch alignment and uop cache constraints, but the cost of that is very significant: text data bss dec filename 12566391 1617840 1089536 15273767 vmlinux.align.16-byte 12224951 1617840 1089536 14932327 vmlinux.align.1-byte By using 1-byte jump target alignment (i.e. no alignment at all) we get an almost 3% reduction in kernel size (!) - and a probably similar reduction in I$ footprint. Now, the usual justification for jump target alignment is the following: - modern decoders tend to have 16-byte (effective) decoder prefetch windows. (AMD documents it higher but measurements suggest the effective prefetch window on curretn uarchs is still around 16 bytes) - on Intel there's also the uop-cache with cachelines that have 16-byte granularity and limited associativity. - older x86 uarchs had a penalty for decoder fetches that crossed 16-byte boundaries. These limits are mostly gone from recent uarchs. So if a forward jump target is aligned to cacheline boundary then prefetches will start from a new prefetch-cacheline and there's higher chance for decoding in fewer steps and packing tightly. But I think that argument is flawed for typical optimized kernel code flows: forward jumps often go to 'cold' (uncommon) pieces of code, and aligning cold code to cache lines does not bring a lot of advantages (they are uncommon), while it causes collateral damage: - their alignment 'spreads out' the cache footprint, it shifts followup hot code further out - plus it slows down even 'cold' code that immediately follows 'hot' code (like in the above case), which could have benefited from the partial cacheline that comes off the end of hot code. But even in the cache-hot case the 16 byte alignment brings disadvantages: - it spreads out the cache footprint, possibly making the code fall out of the L1 I$. - On Intel CPUs, recent microarchitectures have plenty of uop cache (typically doubling every 3 years) - while the size of the L1 cache grows much less aggressively. So workloads are rarely uop cache limited. The only situation where alignment might matter are tight loops that could fit into a single 16 byte chunk - but those are pretty rare in the kernel: if they exist they tend to be pointer chasing or generic memory ops, which both tend to be cache miss (or cache allocation) intensive and are not decoder bandwidth limited. So the balance of arguments strongly favors packing kernel instructions tightly versus maximizing for decoder bandwidth: this patch changes the jump target alignment from 16 bytes to 1 byte (tightly packed, unaligned). Acked-by: Denys Vlasenko dvlasenk@redhat.com Cc: Andy Lutomirski luto@amacapital.net Cc: Aswin Chandramouleeswaran aswin@hp.com Cc: Borislav Petkov bp@alien8.de Cc: Brian Gerst brgerst@gmail.com Cc: Davidlohr Bueso dave@stgolabs.net Cc: H. Peter Anvin hpa@zytor.com Cc: Jason Low jason.low2@hp.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Paul E. McKenney paulmck@linux.vnet.ibm.com Cc: Peter Zijlstra a.p.zijlstra@chello.nl Cc: Peter Zijlstra peterz@infradead.org Cc: Thomas Gleixner tglx@linutronix.de Cc: Tim Chen tim.c.chen@linux.intel.com Link: http://lkml.kernel.org/r/20150410120846.GA17101@gmail.comSigned-off-by: Ingo Molnar mingo@kernel.org

File tree

1 file changed

lines changed

1 file changed

lines changed

Original file line number Diff line number Diff line change
@@ -77,6 +77,9 @@ else
77 77 KBUILD_AFLAGS += -m64
78 78 KBUILD_CFLAGS += -m64
79 79
80 +# Align jump targets to 1 byte, not the default 16 bytes:
81 +KBUILD_CFLAGS += -falign-jumps=1
82 +
80 83 # Don't autogenerate traditional x87 instructions
81 84 KBUILD_CFLAGS += $(call cc-option,-mno-80387)
82 85 KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)