# Running on Cloud ML Engine

Google Cloud Platform offers a managed training environment for TensorFlow models called Cloud ML Engine, and you can easily launch Tensor2Tensor on it, including for hyperparameter tuning.

## Launch

It's the same `t2t-trainer` you know and love with the addition of the `--cloud_mlengine` flag, which by default will launch on a 1-GPU machine in the default compute region. See the docs for `gcloud compute` to learn how to set the default compute region.

```
# Note that both the data dir and output dir have to be on GCS
DATA_DIR=gs://my-bucket/data
OUTPUT_DIR=gs://my-bucket/train

t2t-trainer \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR \
  --cloud_mlengine
```

Passing `--worker_gpu=4` or `--worker_gpu=8` will automatically launch on a machine with 4 or 8 GPUs.

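For example, a 4-GPU launch is the same command as above with one extra flag:

```
# Same job as above, but launched on a single machine with 4 GPUs
t2t-trainer \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR \
  --worker_gpu=4 \
  --cloud_mlengine
```
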
You can additionally pass the `--cloud_mlengine_master_type` flag to select another kind of machine (see the docs for `masterType` for options, including ML Engine machine types and their specs). If you provide this flag yourself, make sure you pass the correct value for `--worker_gpu` (for non-GPU machines, pass `--worker_gpu=0`).

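As an illustrative sketch, a CPU-only launch might look like the following; the machine type shown (`complex_model_l`) is just an example value, so check the `masterType` docs for the currently supported list:

```
# Illustrative CPU-only launch; verify the machine type name against
# the ML Engine masterType docs before using it.
t2t-trainer \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR \
  --cloud_mlengine \
  --cloud_mlengine_master_type=complex_model_l \
  --worker_gpu=0
```
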
Note: `t2t-trainer` currently only supports launching on single machines, possibly with multiple GPUs. Multi-machine setups are not yet supported out of the box with the `--cloud_mlengine` flag, though multi-machine training should in principle work just fine. Contributions/testers welcome.

## `--t2t_usr_dir`

Launching on Cloud ML Engine works with `--t2t_usr_dir` as well, as long as the directory is fully self-contained (i.e. its imports only refer to other modules in the directory). If you need additional PyPI dependencies, include a `requirements.txt` file in the directory specified by `--t2t_usr_dir`.

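For illustration, a self-contained user directory might be laid out like this (all module and file names here are hypothetical):

```
my_usr_dir/
  __init__.py        # imports the submodules so they get registered
  my_problem.py      # e.g. a custom registered Problem
  my_model.py        # imports only from within my_usr_dir
  requirements.txt   # extra PyPI packages, one per line
```

You would then pass `--t2t_usr_dir=my_usr_dir` on the command line as usual.
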
## Hyperparameter Tuning

Hyperparameter tuning with `t2t-trainer` and Cloud ML Engine is also a breeze with `--hparams_range` and the `--autotune_*` flags:

```
t2t-trainer \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR \
  --cloud_mlengine \
  --hparams_range=transformer_base_range \
  --autotune_objective='metrics-translate_ende_wmt32k/neg_log_perplexity' \
  --autotune_maximize \
  --autotune_max_trials=100 \
  --autotune_parallel_trials=3
```

The `--hparams_range` flag specifies the search space and should be registered with `@register_ranged_hparams`. It defines a `RangedHParams` object that sets search ranges and scales for various parameters. See `transformer_base_range` in `transformer.py` for an example.

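As a minimal sketch (the parameter names and ranges below are illustrative, not the actual values in `transformer_base_range`), a ranged hparams set looks roughly like this:

```python
from tensor2tensor.utils import registry


@registry.register_ranged_hparams
def transformer_my_range(rhp):
  """Illustrative search space; see transformer_base_range for real values."""
  # Sample the learning rate on a log scale between 0.01 and 0.3.
  rhp.set_float("learning_rate", 0.01, 0.3, scale=rhp.LOG_SCALE)
  # Choose the warmup schedule from a fixed set of values.
  rhp.set_discrete("learning_rate_warmup_steps", [1000, 4000, 16000])
  # Sample dropout uniformly in [0.1, 0.4].
  rhp.set_float("layer_prepostprocess_dropout", 0.1, 0.4)
```

You'd then pass `--hparams_range=transformer_my_range` to `t2t-trainer`.
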
The metric name passed as `--autotune_objective` should be exactly what you'd see in TensorBoard. To minimize a metric, set `--autotune_maximize=False`.

You control how many total trials to run with `--autotune_max_trials` and the number of jobs to launch in parallel with `--autotune_parallel_trials`.

Happy tuning!