Tutorial combining DDP with Pipeline Parallelism to Train Transformer models by pritamdamania87 · Pull Request #1347 · pytorch/tutorials


Conversation


pritamdamania87

Summary: A tutorial which places one Pipe on GPUs 0 and 1 and another Pipe on GPUs 2 and 3. The two pipes are replicated via DDP; one process drives GPUs 0 and 1 and the other drives GPUs 2 and 3.
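
For orientation, a minimal sketch of the layout this summary describes is below; the stage modules, ports, chunk count, and RPC setup are illustrative assumptions, not code taken from the PR.

```python
# Hypothetical sketch: two processes, each driving a two-GPU Pipe replica
# that is kept in sync with the other replica via DDP.
import os
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe
from torch.nn.parallel import DistributedDataParallel as DDP

def run_worker(rank, world_size=2):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    # Process group for DDP: rank 0 drives GPUs 0/1, rank 1 drives GPUs 2/3.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Pipe needs the RPC framework; here each process sets up its own
    # single-worker RPC instance on its own port (an assumption that works
    # because each pipe stays local to its process).
    rpc.init_rpc(f"worker{rank}", rank=0, world_size=1,
                 rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
                     init_method=f"tcp://localhost:{29600 + rank}"))

    dev0, dev1 = 2 * rank, 2 * rank + 1  # this replica's two GPUs
    stages = nn.Sequential(
        nn.Linear(1024, 1024).to(dev0),  # placeholder first stage
        nn.Linear(1024, 1024).to(dev1),  # placeholder second stage
    )
    # One Pipe per process; checkpoint="never" because, as of PyTorch 1.8,
    # Pipe checkpointing does not work together with DDP.
    model = DDP(Pipe(stages, chunks=8, checkpoint="never"))
    # ... training loop: DDP averages gradients across the two pipe replicas ...
    rpc.shutdown()

if __name__ == "__main__":
    mp.spawn(run_worker, nprocs=2)
```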

@pritamdamania committed … models.


mrzzd


Thanks Pritam! Looks great.
I wonder if it would help to reduce some of the modeling complexity (or the comments) in this file, since the main point is the pipelining, not the other aspects. There is an extensive description of, say, the loss function and the input generation, where the tutorial could instead refer to other tutorials for those parts.

pritamdamania87


> I wonder if it would help to reduce some of the modeling complexity (or the comments) in this file, since the main point is the pipelining, not the other aspects. There is an extensive description of, say, the loss function and the input generation, where the tutorial could instead refer to other tutorials for those parts.

I agree that a lot of this tutorial repeats material from other tutorials, but I feel it is still useful to keep those sections so the tutorial stays as standalone as possible.

@pritamdamania

mrzzd

# Evaluate the model with the test dataset
# -------------------------------------
#
# Apply the best model to check the result with the test dataset.


Could we find the best model across replicas? Say, do an all-reduce and print only if this replica has the best loss.
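
A minimal sketch of this suggestion, assuming the DDP process group is already initialized; the helper name and variables are illustrative, not from the PR.

```python
import torch
import torch.distributed as dist

def has_best_loss(best_val_loss, device):
    """Return True only on the rank whose best_val_loss is the global minimum."""
    local = torch.tensor([best_val_loss], device=device)
    global_min = local.clone()
    dist.all_reduce(global_min, op=dist.ReduceOp.MIN)
    return bool(torch.isclose(local, global_min).item())

# Usage sketch: every rank joins the all-reduce, but only the best one prints.
# if has_best_loss(best_val_loss, device=f"cuda:{2 * rank}"):
#     print(f"Test loss: {evaluate(best_model, test_data):.2f}")
```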


As I understand it, they should be equivalent, right? With DDP they start off with the same parameters and the gradients are synced every iteration.


I see. Each of them may come to a different val_loss and best_val_loss, but I see that in the end it should not matter which one you choose.

Base automatically changed from master to main (February 16, 2021 19:33)

Base automatically changed from main to master (February 16, 2021 19:37)

rohan-varma


LGTM, thanks for adding this tutorial!

# ``PositionalEncoding`` module injects some information about the
# relative or absolute position of the tokens in the sequence. The
# positional encodings have the same dimension as the embeddings so that
# the two can be summed. Here, we use ``sine`` and ``cosine`` functions of


We have already mentioned the tutorial above, right?
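
For reference, a standard sinusoidal implementation of what the quoted comment describes looks roughly like the following; it mirrors the usual transformer-tutorial module rather than this PR's exact code.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Adds sine/cosine position information to token embeddings of size d_model."""
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 0, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (seq_len, batch, d_model); same dimension as the encodings, so they sum.
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)
```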

# Need to use 'checkpoint=never' since as of PyTorch 1.8, Pipe checkpointing
# doesn't work with DDP.
from torch.distributed.pipeline.sync import Pipe
model = Pipe(


Not really related to the tutorial, but as a follow-up, it might be useful to see the performance win we get by using pipeline parallelism here. I'm assuming this would also work if the user just used a regular nn.Sequential not wrapped with Pipe and manually handled the split across multiple devices, but it would be a lot less performant.
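
A hypothetical baseline for that comparison: the same two-stage split placed on the devices by hand, with no Pipe and therefore no micro-batch overlap between the GPUs.

```python
import torch.nn as nn

class ManualTwoStage(nn.Module):
    """Naive model parallelism: stage1 waits for all of stage0's output."""
    def __init__(self, stage0: nn.Module, stage1: nn.Module, dev0, dev1):
        super().__init__()
        self.stage0 = stage0.to(dev0)
        self.stage1 = stage1.to(dev1)
        self.dev1 = dev1

    def forward(self, x):
        x = self.stage0(x)
        # Explicit device-to-device copy of the whole batch; Pipe instead moves
        # micro-batches and overlaps the two stages, which is where the expected
        # speedup comes from.
        return self.stage1(x.to(self.dev1))
```

Timing a training step of this module against the Pipe-wrapped version of the same split on the same devices would quantify the pipelining win.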


mrzzd


Looks good to me, thanks!


@pritamdamania

@pritamdamania87

@brianjo changed the base branch from master to 1.8-RC5-TEST (March 4, 2021 17:03)

brianjo added a commit that referenced this pull request (Mar 4, 2021)

Co-authored-by: Brian Johnson brianjo@fb.com

Co-authored-by: Guanheng Zhang zhangguanheng@devfair0197.h2.fair Co-authored-by: Brian Johnson brianjo@fb.com

Co-authored-by: Guanheng Zhang zhangguanheng@devfair0197.h2.fair Co-authored-by: Brian Johnson brianjo@fb.com

Co-authored-by: Brian Johnson brianjo@fb.com

Co-authored-by: Brian Johnson brianjo@fb.com

Summary: Tutorial which places a pipe on GPUs 0 and 1 and another Pipe on GPUs 2 and 3. Both pipe replicas are replicated via DDP. One process drives GPUs 0 and 1 and another drives GPUs 2 and 3.

Co-authored-by: pritam pritam.damania@fb.com

Hopefully that's the last one

Last one

Co-authored-by: moto 855818+mthrok@users.noreply.github.com Co-authored-by: Guanheng George Zhang 6156351+zhangguanheng66@users.noreply.github.com Co-authored-by: Guanheng Zhang zhangguanheng@devfair0197.h2.fair Co-authored-by: James Reed jamesreed@fb.com Co-authored-by: Horace He horacehe2007@yahoo.com Co-authored-by: Pritam Damania 9958665+pritamdamania87@users.noreply.github.com Co-authored-by: pritam pritam.damania@fb.com Co-authored-by: Nikita Shulga nshulga@fb.com

rodrigo-techera pushed a commit to Experience-Monks/tutorials that referenced this pull request (Nov 29, 2021)
