[Tensor Parallelism] Megatron-LM to transformers · Issue #10321 · huggingface/transformers

🚀 Feature request

Splitting off the discussion that started here: #10301 (comment) to track a potential future feature of transformers: Tensor Parallelism (Horizontal Model Parallelism). For broader context please see the Parallelism notes.

Let's start with an important clarification: MP (Model Parallelism) can mean several different things:

  1. Vertical MP - slice the model vertically into groups of layers, placing one or more full layers on each GPU. In this sense Vertical MP is a simple version of Pipeline Parallelism (PP) with chunks=1.
  2. Horizontal MP - slice the layers horizontally, placing a slice of every layer (and hence of the full model) on each GPU - e.g. Megatron-LM (see the sketch after this list).
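
To make the distinction concrete, here is a minimal single-process sketch of the Megatron-LM style column-parallel split of one Linear layer. The shards stand in for GPUs, the concatenation stands in for the all-gather step, and all names are illustrative rather than transformers or Megatron-LM API:

```python
import torch

torch.manual_seed(0)
hidden, out_features, world_size = 8, 16, 4

full_weight = torch.randn(out_features, hidden)
x = torch.randn(2, hidden)  # (batch, hidden)

# Column-parallel split: each shard owns out_features // world_size output columns.
shards = full_weight.chunk(world_size, dim=0)

# Each "GPU" computes its slice of the output independently.
partial_outputs = [x @ w.t() for w in shards]

# Gathering the partial outputs reproduces the full layer's output.
y_parallel = torch.cat(partial_outputs, dim=-1)
y_full = x @ full_weight.t()
assert torch.allclose(y_parallel, y_full, atol=1e-6)
```

In a real multi-GPU setup each shard would live on its own device and the final concatenation would be a collective communication, but the arithmetic is the same.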

At the moment I believe Megatron-LM is the only implementation of Horizontal MP. @anton-l has ported that model to transformers, except for the Horizontal MP parts, since transformers doesn't yet support it. There is already naive Vertical MP in t5 and gpt2 thanks to @alexorona's work, I have ported Bart too but it's unmerged, and there is an ongoing effort to figure out how to implement the Pipeline. All of these will have to co-operate with each other and share common tools.
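
For contrast, naive Vertical MP (the t5/gpt2 approach mentioned above) boils down to placing whole blocks on different devices and moving activations between them during the forward pass. A rough sketch, with the device map, class name, and block loop all being assumptions rather than the actual transformers code:

```python
import torch
import torch.nn as nn

class VerticallySplitModel(nn.Module):
    def __init__(self, blocks, device_map):
        super().__init__()
        # device_map: {device: [block indices]} - each full block lives on exactly one device.
        self.blocks = nn.ModuleList(blocks)
        self.block_to_device = {
            i: device for device, indices in device_map.items() for i in indices
        }
        for i, block in enumerate(self.blocks):
            block.to(self.block_to_device[i])

    def forward(self, hidden_states):
        for i, block in enumerate(self.blocks):
            # Move activations to whichever device holds the next block, then run it there.
            hidden_states = block(hidden_states.to(self.block_to_device[i]))
        return hidden_states
```

Only one device is busy at a time with this scheme, which is why the Pipeline work (overlapping micro-batches across devices) is the natural next step.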

@anton-l started sharing what needs to be done to make that important feature available - and then down the road potentially make it available to other (all?) transformers models.

@anton-l, the floor is yours.