GitHub - outerbounds/metaflow-deepspeed

Introduction

Deepspeed is a highly scalable framework from Microsoft for distributed training and model serving. The Metaflow @deepspeed decorator helps you run these workflows inside of Metaflow tasks.

Features

Installation

Install this experimental module:

pip install metaflow-deepspeed

Getting Started

After installing the module, you can import the deepspeed decorator and use it in your Metaflow steps. The decorator exposes the current.deepspeed.run method, to which you can map the terminal commands you would otherwise use to run Deepspeed.

from metaflow import FlowSpec, step, deepspeed, current, batch, environment

class HelloDeepspeed(FlowSpec):

    @step
    def start(self):
        # Fan out into two parallel tasks, one per node.
        self.next(self.train, num_parallel=2)

    @environment(vars={"NCCL_SOCKET_IFNAME": "eth0"})
    @batch(gpu=8, cpu=64, memory=256000)
    @deepspeed
    @step
    def train(self):
        # Launch the Deepspeed program across the nodes of this step.
        current.deepspeed.run(
            entrypoint="my_torch_dist_script.py"
        )
        self.next(self.join)

    @step
    def join(self, inputs):
        # Join step required after the parallel train step.
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    HelloDeepspeed()
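
To make the idea of mapping terminal commands concrete, the sketch below shows how keyword arguments such as `entrypoint` could be turned into a `deepspeed` launcher command line. It is an illustration only, not the metaflow-deepspeed implementation: `hostfile_lines`, `build_deepspeed_command`, the node names, and the `epochs` argument are all hypothetical. The only Deepspeed details assumed are the launcher's `--hostfile` flag and the `<hostname> slots=<n>` hostfile format.

```python
import shlex
from typing import Dict, List, Optional


def hostfile_lines(hosts: List[str], slots_per_host: int) -> List[str]:
    """Render hostfile lines: one '<hostname> slots=<n>' entry per node."""
    return [f"{host} slots={slots_per_host}" for host in hosts]


def build_deepspeed_command(
    entrypoint: str,
    hostfile_path: str,
    entrypoint_args: Optional[Dict[str, object]] = None,
) -> List[str]:
    """Assemble an argv list equivalent to typing the command in a terminal."""
    cmd = ["deepspeed", "--hostfile", hostfile_path, entrypoint]
    # Everything after the script name is passed through to the script itself.
    for name, value in (entrypoint_args or {}).items():
        cmd.extend([f"--{name}", str(value)])
    return cmd


# Two nodes with 8 GPUs each, mirroring num_parallel=2 and gpu=8 in the flow.
lines = hostfile_lines(["node-0", "node-1"], slots_per_host=8)
cmd = build_deepspeed_command(
    "my_torch_dist_script.py", "hostfile.txt", {"epochs": 3}
)
print("\n".join(lines))
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if it is later handed to `subprocess.run`.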

Examples

| Directory | Description |
| --- | --- |
| CPU Check | The easiest way to check your Deepspeed infrastructure on CPUs. |
| Hello Deepspeed | The easiest way to check your Deepspeed infrastructure on GPUs. |
| BERT | Train your BERT model using Deepspeed. |
| Dolly | A multi-node implementation of Databricks' Dolly. |
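
An infrastructure check can be as simple as each process reporting where it sits in the job. The sketch below is not the code from the directories listed above. It assumes the launcher exports the conventional `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables used by PyTorch-style distributed launchers, and it falls back to a single-process layout when they are absent. `DistEnv` and `read_dist_env` are hypothetical names.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its own node
    world_size: int  # total number of processes in the job


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read rank information, defaulting to a single-process layout."""
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


if __name__ == "__main__":
    info = read_dist_env()
    print(f"rank {info.rank} of {info.world_size} (local rank {info.local_rank})")
```

With two nodes of 8 GPUs each, a healthy job would be expected to report a world size of 16 with global ranks 0 through 15.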

Cloud-specific use cases

| Directory | Description |
| --- | --- |
| Automatically upload a directory on AWS | Push a checkpoint of any directory to S3 after the Deepspeed process completes. |
| Automatically upload a directory on Azure | Push a checkpoint of any directory to Azure Blob storage after the Deepspeed process completes. |
| Use Metaflow S3 client from the Deepspeed process | Upload arbitrary bytes to S3 storage from the Deepspeed process. |
| Use Metaflow Azure Blob client from the Deepspeed process | Upload arbitrary bytes to Azure Blob storage from the Deepspeed process. |
| Use a Metaflow Huggingface checkpoint on S3 | Push a checkpoint to S3 at the end of each epoch using a customizable Huggingface callback. See the implementation to build your own. |
| Use a Metaflow Huggingface checkpoint on Azure | Push a checkpoint to Azure Blob storage at the end of each epoch using a customizable Huggingface callback. See the implementation to build your own. |
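
Pushing a directory to object storage after training mostly comes down to walking the directory and deriving an object key for each file. The sketch below shows only that step, using the standard library. `collect_key_paths`, the `prefix` parameter, and the file names are hypothetical, and the upload call itself is deliberately left out.

```python
import os
import tempfile
from typing import List, Tuple


def collect_key_paths(root: str, prefix: str = "") -> List[Tuple[str, str]]:
    """Return sorted (object_key, local_path) pairs for every file under root."""
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            local_path = os.path.join(dirpath, filename)
            # Keys use forward slashes regardless of the local OS separator.
            rel = os.path.relpath(local_path, root).replace(os.sep, "/")
            key = f"{prefix.rstrip('/')}/{rel}" if prefix else rel
            pairs.append((key, local_path))
    return sorted(pairs)


# Demonstrate on a throwaway checkpoint directory.
with tempfile.TemporaryDirectory() as ckpt_dir:
    os.makedirs(os.path.join(ckpt_dir, "epoch_1"))
    for name in ("config.json", os.path.join("epoch_1", "model.bin")):
        with open(os.path.join(ckpt_dir, name), "w") as f:
            f.write("placeholder")
    keys = [key for key, _path in collect_key_paths(ckpt_dir, prefix="checkpoints")]
    print(keys)
```

Pairs of this shape are what bulk upload helpers such as Metaflow's `S3.put_files` are designed to accept. Check the client documentation for the exact signature before relying on it.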

License

metaflow-deepspeed is distributed under the Apache License.