# Distributed Training
The `t2t-trainer` supports both synchronous and asynchronous distributed training.
Note that it is almost always more efficient to train on a single machine with multiple GPUs/TPUs. Async training is less stable than sync training, and sync training is much faster on 1 machine than on multiple. For these reasons, we almost always train on single machines with multiple GPUs/TPUs.
T2T uses TensorFlow Estimators, so distributed training is configured with the `TF_CONFIG` environment variable that is read by the `RunConfig`, along with a set of flags that T2T uses to distribute the computation.
When using multiple machines, it is necessary that all nodes use the same `--output_dir`, which means that it should be set to a Google Cloud Storage bucket (`gs://...`) or a directory on a shared network filesystem.
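A rough sketch of what every node's launch has in common is below; the bucket path is a placeholder, and the per-job `TF_CONFIG` values and flags are generated by the utility described in the next section.

```
# All nodes share the same --output_dir (the GCS path below is a placeholder);
# the TF_CONFIG string and the remaining per-job flags are produced by
# t2t-make-tf-configs, described next.
$ export TF_CONFIG='<JSON generated for this job>'
$ t2t-trainer \
  --output_dir=gs://my-t2t-bucket/transformer_ende \
  --model=transformer \
  --hparams_set=transformer_base \
  --problem=translate_ende_wmt32k
```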
## Utility to produce `TF_CONFIG` and flags
`t2t-make-tf-configs` generates the `TF_CONFIG` JSON strings and the necessary command-line flags for the jobs.
Given a set of master and parameter server addresses, the script outputs, for each job, a line with the `TF_CONFIG` environment variable and the command-line flags necessary for distributed training. For each job, you should invoke the `t2t-trainer` with the `TF_CONFIG` value and flags that are output.
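The overall workflow looks roughly like this (a sketch with placeholder addresses; complete end-to-end runs are shown in the Examples section below):

```
# 1) Generate one line of TF_CONFIG + flags per job (addresses are placeholders).
$ t2t-make-tf-configs --masters='<master_ip:port>' --ps='<ps_ip:port>,<ps_ip:port>'

# 2) On each job's machine, export that job's TF_CONFIG and pass its flags
#    to t2t-trainer along with the usual model/problem flags.
$ export TF_CONFIG='<JSON printed for this job>'
$ t2t-trainer <flags printed for this job> \
  --model=transformer \
  --hparams_set=transformer_base \
  --problem=translate_ende_wmt32k
```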
## Eval jobs
Eval jobs should set the following flags, and do not need the `TF_CONFIG` environment variable to be set, as eval jobs run locally and do not communicate with the other jobs (the eval jobs read the model checkpoints that the trainer writes out):

- `--schedule=continuous_eval_on_train_data` or `--schedule=continuous_eval` (for dev data)
- `--worker_job='/job:localhost'`
- `--output_dir=$TRAIN_DIR`

Note that evaluation does not work in distributed mode. That is, distributed jobs should always use `--schedule=train`.
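For example, an eval job on dev data could be launched like this (a sketch; the model, hparams, and problem flags match the training examples below):

```
# Runs locally and reads checkpoints from the shared $TRAIN_DIR.
$ t2t-trainer \
  --schedule=continuous_eval \
  --worker_job='/job:localhost' \
  --output_dir=$TRAIN_DIR \
  --model=transformer \
  --hparams_set=transformer_base \
  --problem=translate_ende_wmt32k
```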
## Examples
### Sync training across multiple workers
In this scenario, you wish to do synchronous training across multiple workers. Note that it is easier to simply use 1 worker with multiple GPUs and set `--worker_gpu=8`, but there may be cases where you want to have multiple machines.
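For reference, the single-machine alternative is simply the following (a sketch, reusing the model and problem flags from the commands below):

```
# Single machine with 8 GPUs; no TF_CONFIG or cluster flags are needed.
$ t2t-trainer \
  --worker_gpu=8 \
  --model=transformer \
  --hparams_set=transformer_base \
  --problem=translate_ende_wmt32k \
  --output_dir=$TRAIN_DIR
```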
You will need 1 `ip:port` for the master and then 1 `ip:port` for each worker.
For this example we’ll use 2 workers and these addresses:
```
# Master
10.0.0.1:5555
# Worker 1
10.0.0.2:5555
# Worker 2
10.0.0.3:5555
```
Next we generate the `TF_CONFIG` and command-line flags for each job.
```
$ t2t-make-tf-configs --masters='10.0.0.1:5555' --ps='10.0.0.2:5555,10.0.0.3:5555'

Assuming SYNC distributed training with a single master and 2 workers

'{"cluster": {"master": ["10.0.0.1:5555"], "ps": ["10.0.0.2:5555", "10.0.0.3:5555"]}, "environment": "cloud", "task": {"index": 0, "type": "master"}}' --master=grpc://10.0.0.1:5555 --ps_replicas=2 --worker_replicas=1 --worker_gpu=0 --worker_id=0 --ps_gpu=1 --sync --schedule=train --worker_job='/job:master'
'{"cluster": {"master": ["10.0.0.1:5555"], "ps": ["10.0.0.2:5555", "10.0.0.3:5555"]}, "environment": "cloud", "task": {"index": 0, "type": "ps"}}' --schedule=run_std_server
'{"cluster": {"master": ["10.0.0.1:5555"], "ps": ["10.0.0.2:5555", "10.0.0.3:5555"]}, "environment": "cloud", "task": {"index": 1, "type": "ps"}}' --schedule=run_std_server
```
The output here is 1 line per job. Each line contains the `TF_CONFIG` to set for that job as well as the command-line flags for that job.
It is a bit confusing that the workers are being passed to the `--ps` flag, but this is correct. When running in `--sync` mode, the `ps` jobs are actually the workers. You can see in the async example below that when `--sync=False`, i.e. async mode, the `ps` jobs are in fact used as parameter servers.
Here’s how we would start each job on its respective machine (the commands below assume that you’re ssh’d into that job’s machine):
Master:
```
$ export TF_CONFIG='{"cluster": {"master": ["10.0.0.1:5555"], "ps": ["10.0.0.2:5555", "10.0.0.3:5555"]}, "environment": "cloud", "task": {"index": 0, "type": "master"}}'
$ t2t-trainer \
  --master=grpc://10.0.0.1:5555 \
  --ps_replicas=2 \
  --worker_replicas=1 \
  --worker_gpu=0 \
  --worker_id=0 \
  --ps_gpu=1 \
  --sync \
  --schedule=train \
  --worker_job='/job:master' \
  --model=transformer \
  --hparams_set=transformer_base \
  --problem=translate_ende_wmt32k
```
Worker 1:
```
$ export TF_CONFIG='{"cluster": {"master": ["10.0.0.1:5555"], "ps": ["10.0.0.2:5555", "10.0.0.3:5555"]}, "environment": "cloud", "task": {"index": 0, "type": "ps"}}'
$ t2t-trainer --schedule=run_std_server
```
Worker 2:
```
$ export TF_CONFIG='{"cluster": {"master": ["10.0.0.1:5555"], "ps": ["10.0.0.2:5555", "10.0.0.3:5555"]}, "environment": "cloud", "task": {"index": 1, "type": "ps"}}'
$ t2t-trainer --schedule=run_std_server
```
Note that if you have more than 1 GPU on each worker machine, make sure to adjust the `--ps_gpu` flag passed to the master accordingly.
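For instance, if each worker machine had 8 GPUs, only the `--ps_gpu` flag passed to the master would change (a sketch; in `--sync` mode the `ps` jobs are the workers, so `--ps_gpu` describes the GPUs on those machines):

```
# A sketch of the master command assuming 8 GPUs per worker machine:
# everything is as above except --ps_gpu, which goes from 1 to 8.
$ t2t-trainer \
  --master=grpc://10.0.0.1:5555 \
  --ps_replicas=2 \
  --worker_replicas=1 \
  --worker_gpu=0 \
  --worker_id=0 \
  --ps_gpu=8 \
  --sync \
  --schedule=train \
  --worker_job='/job:master' \
  --model=transformer \
  --hparams_set=transformer_base \
  --problem=translate_ende_wmt32k
```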
### Async training across multiple workers
In this scenario, you wish to do asynchronous training across multiple workers with one or more shared parameter servers.
Note that async training is usually less stable than sync training, and for that reason we almost always prefer sync training, but there may be cases where you want to do async distributed training.
For this example we’ll use 2 workers and 2 parameter servers:
```
# Worker 1
10.0.0.1:5555
# Worker 2
10.0.0.2:5555
# PS 1
10.0.0.3:5555
# PS 2
10.0.0.4:5555
```
Next we generate the `TF_CONFIG` and command-line flags for each job.
```
$ t2t-make-tf-configs --masters='10.0.0.1:5555,10.0.0.2:5555' --ps='10.0.0.3:5555,10.0.0.4:5555'

Assuming ASYNC distributed training with 2 workers and 2 parameter servers

'{"task": {"index": 0, "type": "chief"}, "cluster": {"chief": ["10.0.0.1:5555"], "ps": ["10.0.0.3:5555", "10.0.0.4:5555"], "worker": ["10.0.0.2:5555"]}, "environment": "cloud"}' --master=grpc://10.0.0.1:5555 --ps_replicas=2 --worker_replicas=2 --worker_gpu=1 --worker_id=0 --ps_gpu=0 --schedule=train --worker_job='/job:chief'
'{"task": {"index": 0, "type": "worker"}, "cluster": {"chief": ["10.0.0.1:5555"], "ps": ["10.0.0.3:5555", "10.0.0.4:5555"], "worker": ["10.0.0.2:5555"]}, "environment": "cloud"}' --master=grpc://10.0.0.2:5555 --ps_replicas=2 --worker_replicas=2 --worker_gpu=1 --worker_id=1 --ps_gpu=0 --schedule=train --worker_job='/job:worker'
'{"task": {"index": 0, "type": "ps"}, "cluster": {"chief": ["10.0.0.1:5555"], "ps": ["10.0.0.3:5555", "10.0.0.4:5555"], "worker": ["10.0.0.2:5555"]}, "environment": "cloud"}' --schedule=run_std_server
'{"task": {"index": 1, "type": "ps"}, "cluster": {"chief": ["10.0.0.1:5555"], "ps": ["10.0.0.3:5555", "10.0.0.4:5555"], "worker": ["10.0.0.2:5555"]}, "environment": "cloud"}' --schedule=run_std_server
```
Here’s how we would start each job on its respective machine (the commands below assume that you’re ssh’d into that job’s machine):
Worker 1:
```
$ export TF_CONFIG='{"task": {"index": 0, "type": "chief"}, "cluster": {"chief": ["10.0.0.1:5555"], "ps": ["10.0.0.3:5555", "10.0.0.4:5555"], "worker": ["10.0.0.2:5555"]}, "environment": "cloud"}'
$ t2t-trainer \
  --master=grpc://10.0.0.1:5555 \
  --ps_replicas=2 \
  --worker_replicas=2 \
  --worker_gpu=1 \
  --worker_id=0 \
  --ps_gpu=0 \
  --schedule=train \
  --worker_job='/job:chief' \
  --model=transformer \
  --hparams_set=transformer_base \
  --problem=translate_ende_wmt32k
```
Worker 2:
```
$ export TF_CONFIG='{"task": {"index": 0, "type": "worker"}, "cluster": {"chief": ["10.0.0.1:5555"], "ps": ["10.0.0.3:5555", "10.0.0.4:5555"], "worker": ["10.0.0.2:5555"]}, "environment": "cloud"}'
$ t2t-trainer \
  --master=grpc://10.0.0.2:5555 \
  --ps_replicas=2 \
  --worker_replicas=2 \
  --worker_gpu=1 \
  --worker_id=1 \
  --ps_gpu=0 \
  --schedule=train \
  --worker_job='/job:worker' \
  --model=transformer \
  --hparams_set=transformer_base \
  --problem=translate_ende_wmt32k
```
PS 1:
```
$ export TF_CONFIG='{"task": {"index": 0, "type": "ps"}, "cluster": {"chief": ["10.0.0.1:5555"], "ps": ["10.0.0.3:5555", "10.0.0.4:5555"], "worker": ["10.0.0.2:5555"]}, "environment": "cloud"}'
$ t2t-trainer --schedule=run_std_server
```
PS 2:
```
$ export TF_CONFIG='{"task": {"index": 1, "type": "ps"}, "cluster": {"chief": ["10.0.0.1:5555"], "ps": ["10.0.0.3:5555", "10.0.0.4:5555"], "worker": ["10.0.0.2:5555"]}, "environment": "cloud"}'
$ t2t-trainer --schedule=run_std_server
```
Increase `--worker_gpu` on each of the workers if you have multiple GPUs. If the parameter servers are also using GPUs, set `--ps_gpu` to the number of GPUs on the parameter servers.
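For example, if each worker machine had 4 GPUs and the parameter servers stayed CPU-only, Worker 1's command above might become (a sketch; only `--worker_gpu` changes):

```
# A sketch assuming 4 GPUs per worker machine and CPU-only parameter servers.
$ t2t-trainer \
  --master=grpc://10.0.0.1:5555 \
  --ps_replicas=2 \
  --worker_replicas=2 \
  --worker_gpu=4 \
  --worker_id=0 \
  --ps_gpu=0 \
  --schedule=train \
  --worker_job='/job:chief' \
  --model=transformer \
  --hparams_set=transformer_base \
  --problem=translate_ende_wmt32k
```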