Hi Fangrui,
Not sure why you started a new conversation when you could have just replied to the existing thread.
On Thu, Feb 27, 2020 at 6:34 PM Fangrui Song <maskray@google.com> wrote:
I met with the Propeller team today (we work for the same company, but it
was my first time meeting two members of the team :) ).
One thing I have been reassured of:
* There is no general disassembly work. General disassembly work would
assuredly frighten off developers: it is inherently unreliable, heavy on
memory usage, and difficult to reconcile with CFI, debug information, etc.
A minimal amount of plumbing work (https://reviews.llvm.org/D68065) is
acceptable: locating the jump relocation, detecting the jump type,
inverting the direction of a jump, and deleting trailing bytes of an
input section. Existing linker relaxation schemes already do similar
things. Deleting a trailing jump is similar to RISC-V, where sections can
shrink (not implemented in lld; I have R_RISCV_ALIGN and R_RISCV_RELAX in
mind). binutils supports deleting bytes for a few other architectures,
e.g. msp430, sh, mips, ft32, rl78. With just a minimal amount of
disassembly work, conceptually the framework should not be too hard to
port to another target.
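To make the scope of that plumbing concrete, here is a rough C++ sketch
(not Propeller's or D68065's actual code; the helper names are invented,
and only the x86-64 opcode values are standard):

// Hypothetical sketch of the per-jump "plumbing" a linker relaxation
// pass needs on x86-64; helper and parameter names are made up.
#include <cstddef>
#include <cstdint>

// Invert the condition of a Jcc at the given offset in a section's data.
// Short Jcc is a single opcode byte 0x70..0x7F; near Jcc is 0x0F 0x80..0x8F.
// Flipping the low bit of the condition inverts it (JE <-> JNE, JL <-> JGE).
bool invertJcc(uint8_t *sec, size_t off) {
  if (sec[off] >= 0x70 && sec[off] <= 0x7F) {          // short Jcc rel8
    sec[off] ^= 1;
    return true;
  }
  if (sec[off] == 0x0F &&
      sec[off + 1] >= 0x80 && sec[off + 1] <= 0x8F) {  // near Jcc rel32
    sec[off + 1] ^= 1;
    return true;
  }
  return false;                                        // not a Jcc we handle
}

// If the last instruction of the section is an unconditional JMP to the
// section that now follows it in the layout, the linker can drop those
// trailing bytes (shrinking the output), similar in spirit to RISC-V
// relaxation.
size_t bytesToDrop(const uint8_t *sec, size_t size, bool targetIsFallthrough) {
  if (!targetIsFallthrough)
    return 0;
  if (size >= 5 && sec[size - 5] == 0xE9)              // jmp rel32 (5 bytes)
    return 5;
  if (size >= 2 && sec[size - 2] == 0xEB)              // jmp rel8 (2 bytes)
    return 2;
  return 0;
}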
One thing I was not aware of (perhaps the description did not make it clear) is that
Propeller intends to **reorder basic block sections across translation units**.
This was the intention of basic block sections from the very beginning.
This is something that full LTO can do but ThinLTO cannot.
Our internal systems cannot afford a full LTO (**can we fix the bottleneck of full LTO** [1]?)
for large executables, and I believe some other users are in the same camp.
With ThinLTO, a post-link optimization scheme will inevitably require
help from the linker/compiler. It seems we have two routes:
## Route 1: Current Propeller framework
lld does whole-program reordering of basic block sections. In the future we
can extend it to overalign some sections and pad gaps with NOPs. What else can
we do? Source code/IR/MCInst is lost at this stage; without general
disassembly work, it may be difficult to do more optimization.
This makes me concerned about another thing: Intel's Jump Conditional Code Erratum.
https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
Put simply, a Jcc instruction whose address ≡ 30 or 31 (mod 32) should be
avoided. There are assembler-level (MC) mitigations (function sections are
overaligned to 32), but because we use basic block sections
(sh_addralign < 32) and need reordering, we have to redo some of that work
at the linking stage.
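For concreteness, a minimal C++ sketch of that simplified condition (the real
erratum is broader, also covering jumps that cross or end on a 32-byte
boundary; the function names here are made up):

#include <cstdint>

// True if a Jcc placed at `addr` hits the problematic offsets under the
// simplified rule above (addr % 32 == 30 or 31).
bool jccNeedsMitigation(uint64_t addr) {
  return (addr % 32) >= 30;
}

// Number of NOP padding bytes that would push such a Jcc to the next
// 32-byte boundary under the same simplified rule.
uint64_t paddingFor(uint64_t addr) {
  return jccNeedsMitigation(addr) ? 32 - (addr % 32) : 0;
}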
After losing the MCInst representation, it is not clear to me how we can
insert NOPs/segment override prefixes without doing disassembly work in the linker.
Route 2 does the heavy lifting in the compiler, which can naturally reuse the
assembler-level mitigation, CFI and debug information generation, and probably
other things. (How much will debug information be bloated?)
## Route 2: Add another link stage, similar to the Thin Link used by ThinLTO.
Regular ThinLTO with minimized bitcode files:
all: compile thin_link thinlto_backend final_link

compile a.o b.o a.indexing.o b.indexing.o: a.c b.c
        $(clang) -O2 -c -flto=thin -fthin-link-bitcode=a.indexing.o a.c
        $(clang) -O2 -c -flto=thin -fthin-link-bitcode=b.indexing.o b.c

thin_link lto/a.o.thinlto.bc lto/b.o.thinlto.bc a.rsp: a.indexing.o b.indexing.o
        $(clang) -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp -Wl,--thinlto-prefix-replace=';lto' -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' a.indexing.o b.indexing.o

thinlto_backend lto/a.o lto/b.o: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc
        $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc a.o -o lto/a.o
        $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc b.o -o lto/b.o

final_link exe: lto/a.o lto/b.o a.rsp
        # Propeller does basic block section reordering here.
        $(clang) -fuse-ld=lld @a.rsp -o exe
We need to replace the two stages thinlto_backend and final_link with
three.
I am not sure I fully follow what you mean here, but it seems to be along the lines of going back to MIR to do the optimizations. We are considering this, and we even discussed it with Eli in the original thread:
For example, we are looking at inserting prefetch instructions at specific points in the binary. We would not be disassembling native code to do that but would be doing it in MIR.
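As a very rough illustration only (this is not Propeller code; the pass name,
the insertion policy, and the assumption that it lives in the X86 backend are
all invented for this sketch), inserting a prefetch at the MIR level could
look something like:

#include "X86.h"
#include "X86InstrInfo.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/TargetSubtargetInfo.h"

using namespace llvm;

namespace {
// Hypothetical late pass: the real decision of where to prefetch would be
// driven by profile data; here we blindly prefetch [rdi] at the top of
// every basic block, purely to show the mechanics.
struct InsertPrefetches : MachineFunctionPass {
  static char ID;
  InsertPrefetches() : MachineFunctionPass(ID) {}

  bool runOnMachineFunction(MachineFunction &MF) override {
    const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
    for (MachineBasicBlock &MBB : MF) {
      BuildMI(MBB, MBB.begin(), DebugLoc(), TII->get(X86::PREFETCHT0))
          .addReg(X86::RDI) // base
          .addImm(1)        // scale
          .addReg(0)        // index
          .addImm(0)        // displacement
          .addReg(0);       // segment
    }
    return true;
  }
};
} // namespace

char InsertPrefetches::ID = 0;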
Propelled ThinLTO with minimized bitcode files:
propelled_thinlto_backend lto/a.mir lto/b.mir: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc
        # Propeller emits something similar to a Machine IR file.
        # a.o and b.o are both IR files.
        $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc -fpropeller a.o -o lto/a.mir
        $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc -fpropeller b.o -o lto/b.mir

propeller_link propeller/a.o propeller/b.o: lto/a.mir lto/b.mir
        # Propeller collects the input Machine IR files and spawns threads
        # to generate object files in parallel.
        $(clang) -fpropeller-backend -fpropeller-prefix-replace='lto;propeller' lto/a.mir lto/b.mir

final_link exe: propeller/a.o propeller/b.o
        # GNU ld/gold/lld links the object files.
        $(clang) $^ -o exe
A .mir file may be much larger than an object file. So lto/a.mir may
actually be an object file annotated with some extra information, or some
representation lower-level than Machine IR (there should be a guarantee that
the produced object file keeps the basic block structure unchanged
=> otherwise the basic block profiling information will not be very useful).
[1]: **Can we fix the bottleneck of full LTO?**
I wonder whether we have reached a "local maximum" with ThinLTO.
If full LTO were nearly as fast as ThinLTO, how would we design a post-link optimization framework?
If full LTO did not have the scalability problem, presumably we would not
need to do so much work in the linker.
Full LTO has very high overheads for medium to large binaries. As a data point, I ran a full LTO optimization of a binary with 350M of text and had to kill the process after its RSS grew to 175G. I could not get it to run even on my beefy machine with 192G of RAM.
Hope this helps address some of your concerns.
Thanks
Sri