Autodiff batching by ZuseZ4 · Pull Request #137880 · rust-lang/rust

rustbot added the labels A-attributes (Area: Attributes (`#[…]`, `#![…]`)), S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) and T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)

Mar 2, 2025

ZuseZ4 marked this pull request as ready for review

April 3, 2025 03:15

bors added the label S-waiting-on-bors (Status: Waiting on bors to run and complete tests. Bors will change the label on completion.) and removed the label S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.)

Apr 3, 2025

Zalathar added a commit to Zalathar/rust that referenced this pull request

Apr 4, 2025

Autodiff batching

Enzyme supports batching, a technique best known from machine learning, where it is used when training neural networks. A training loop normally passes in some data (e.g. an image) and a target vector in each iteration. Based on how close the prediction is to the target, we compute the loss, then use backpropagation to compute the gradients and update the weights. Processing one sample per iteration is quite inefficient, so in practice you pass in a batch of 8, 16, or more images and targets and compute the gradients for all of them at once, which allows better optimizations.

Enzyme supports batching in two ways. The first one (which I implemented here) just accepts a batch size N, and then each Dual/Duplicated argument has not one but N shadow arguments. So instead of

for i in 0..100 {
   df(x[i], y[i], 1234);
}

you can now do

for i in (0..100).step_by(4) {
   df(x[i+0], x[i+1], x[i+2], x[i+3], y[i+0], y[i+1], y[i+2], y[i+3], 1234);
}

which gives the same results but allows better compiler optimizations. See the test case for details.
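
To make the "N shadow arguments" idea concrete, here is a minimal hand-written sketch of what batched forward mode computes. It does not use Enzyme or the autodiff attribute at all: `f`, `df`, and `df_batched` are hypothetical names, and the derivative code is written by hand as a stand-in for what an autodiff tool would generate. The only point is that one batched call with N shadows returns the same N directional derivatives as N unbatched calls.

```rust
// Hypothetical example function: f(x) = x0 * x1 + sin(x0).
fn f(x: &[f64; 2]) -> f64 {
    x[0] * x[1] + x[0].sin()
}

// Unbatched forward mode (hand-written, not generated by Enzyme):
// one shadow `dx` per call. Returns (f(x), derivative of f at x along dx).
fn df(x: &[f64; 2], dx: &[f64; 2]) -> (f64, f64) {
    let primal = f(x);
    let tangent = (x[1] + x[0].cos()) * dx[0] + x[0] * dx[1];
    (primal, tangent)
}

// Batched forward mode with batch size N: the primal input `x` is passed
// once, together with N shadows. The primal value and the partial
// derivatives are computed once and reused for every shadow.
fn df_batched<const N: usize>(x: &[f64; 2], dxs: &[[f64; 2]; N]) -> (f64, [f64; N]) {
    let primal = f(x);
    let g0 = x[1] + x[0].cos(); // df/dx0
    let g1 = x[0]; // df/dx1
    let mut tangents = [0.0; N];
    for (t, dx) in tangents.iter_mut().zip(dxs.iter()) {
        *t = g0 * dx[0] + g1 * dx[1];
    }
    (primal, tangents)
}
```

Calling `df_batched(&x, &dxs)` with four shadows in `dxs` should agree, entry by entry, with four separate `df(&x, &dxs[k])` calls, while evaluating the shared primal work only once.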

There is a second variant, where we can mark certain arguments so that, instead of having to pass in N shadow arguments, Enzyme assumes that the argument is N times longer. I.e., instead of accepting 4 slices with 12 floats each, we would accept one slice with 48 floats. I'll implement this over the next few days.
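
The difference between the two calling conventions can be sketched by hand as follows. This is an illustration only, not Enzyme's API: `tangent`, `df_separate`, and `df_flat` are hypothetical names, the derivative of the example function (the sum of squares) is written manually, and the contiguous layout in which shadow k occupies elements `k * len .. (k + 1) * len` of the long slice is an assumption made for this sketch.

```rust
// Directional derivative of f(x) = sum of x_i^2 along dx (hand-written).
fn tangent(x: &[f64], dx: &[f64]) -> f64 {
    assert_eq!(x.len(), dx.len());
    x.iter().zip(dx).map(|(xi, dxi)| 2.0 * xi * dxi).sum()
}

// First variant: N separate shadow slices, each as long as `x`.
fn df_separate(x: &[f64], dxs: &[&[f64]]) -> Vec<f64> {
    dxs.iter().map(|dx| tangent(x, dx)).collect()
}

// Second variant: one shadow slice that is `width` times longer than `x`.
// Assumed layout for this sketch: shadow k is stored contiguously at
// dx_flat[k * x.len() .. (k + 1) * x.len()].
fn df_flat(x: &[f64], dx_flat: &[f64], width: usize) -> Vec<f64> {
    assert!(!x.is_empty());
    assert_eq!(dx_flat.len(), width * x.len());
    dx_flat
        .chunks_exact(x.len())
        .map(|dx| tangent(x, dx))
        .collect()
}
```

With an `x` of 12 floats, `df_separate` takes 4 shadow slices of 12 floats each, while `df_flat` takes a single slice of 48 floats and a width of 4; under the layout assumed here, both return the same 4 derivative values.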

I will also add more tests for both modes.

For anyone who prefers a more interactive explanation, here's a video of Tim's LLVM dev talk, where he presents his work: https://www.youtube.com/watch?v=edvaLAL5RqU

I'll also add more docs to the dev guide and user docs in another PR.

r? ghost

Tracking:

bors added a commit to rust-lang-ci/rust that referenced this pull request

Apr 4, 2025

Zalathar added a commit to Zalathar/rust that referenced this pull request

Apr 4, 2025


bors added a commit to rust-lang-ci/rust that referenced this pull request

Apr 5, 2025

rust-timer added a commit to rust-lang-ci/rust that referenced this pull request

Apr 5, 2025

Rollup merge of rust-lang#137880 - EnzymeAD:autodiff-batching, r=oli-obk
