ymcui/LAMB_Optimizer_TF: LAMB Optimizer for Large Batch Training (TensorFlow version)

This is a simple implementation of the LAMB optimizer, which was introduced in the paper "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes".

An earlier version of the paper was titled "Reducing BERT Pre-Training Time from 3 Days to 76 Minutes".

Update: an official implementation of the LAMB optimizer is now available: https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py

Notes

Algorithm

The LAMB optimizer was originally designed for large-batch training of neural networks, but, as the authors indicate, it can also be used with small batch sizes.

algorithm.png
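As a rough illustration of the algorithm, here is a minimal pure-Python sketch of one LAMB step for a single layer. It follows the formulation in the paper (Adam moments with bias correction, decoupled weight decay, and a layer-wise trust ratio). The function name `lamb_step`, its parameters, and the default values are assumptions made for this sketch; it is not the code in this repository, which may differ in details such as bias correction.

```python
import math


def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB update for a single layer (hypothetical helper).

    w, g, m, v are equal-length lists of floats: weights, gradients,
    first moment, and second moment. t is the step count, starting at 1.
    """
    # Adam-style exponential moving averages of the gradient and its square.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]

    # Bias correction, as in the paper's formulation.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]

    # Adam direction plus decoupled weight decay.
    update = [mh / (math.sqrt(vh) + eps) + weight_decay * wi
              for mh, vh, wi in zip(m_hat, v_hat, w)]

    # Layer-wise trust ratio: ||w|| / ||update||, falling back to 1.0
    # when either norm is zero (an assumption of this sketch).
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    u_norm = math.sqrt(sum(ui * ui for ui in update))
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    w = [wi - lr * trust_ratio * ui for wi, ui in zip(w, update)]
    return w, m, v


# Example: one step on a two-parameter "layer".
weights, m0, v0 = [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]
new_weights, m1, v1 = lamb_step(weights, [0.1, -0.2], m0, v0, t=1)
```

Because of the trust ratio, the length of the step taken on a layer is `lr` times the norm of that layer's weights, regardless of the gradient's magnitude. This normalization is what the algorithm relies on to keep large-batch updates stable.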

Usage

The implementation is based on the BERT repository, which uses AdamWeightDecayOptimizer (defined in optimization.py) for pre-training and fine-tuning.

Results on MNIST

Here are the results for five classical neural networks (MLP, CNN, Bi-RNN, Bi-GRU, Bi-LSTM) trained with different optimizers (Adam, AdamW, LAMB).

I only list results for batch sizes 64, 128, 1024, and 16384. For full results, please see FULL_RESULTS.md.

Batch=64

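| Optimizer | MLP | CNN | Bi-RNN | Bi-GRU | Bi-LSTM | Note |
| --- | --- | --- | --- | --- | --- | --- |
| Adam | 97.03 | 98.93 | 96.24 | 98.92 | 99.04 | Just ordinary Adam |
| AdamW | 97.11 | 99.01 | 96.50 | 99.11 | 99.04 | Used in BERT |
| LAMB | 98.27 | 99.33 | 97.73 | 98.83 | 98.94 | New optimizer for large batch |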

Batch=128

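| Optimizer | MLP | CNN | Bi-RNN | Bi-GRU | Bi-LSTM | Note |
| --- | --- | --- | --- | --- | --- | --- |
| Adam | 96.38 | 98.76 | 97.73 | 99.08 | 99.09 | Just ordinary Adam |
| AdamW | 96.57 | 98.72 | 98.05 | 98.96 | 99.00 | Used in BERT |
| LAMB | 97.90 | 99.20 | 98.04 | 98.87 | 98.76 | New optimizer for large batch |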

Batch=1024

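| Optimizer | MLP | CNN | Bi-RNN | Bi-GRU | Bi-LSTM | Note |
| --- | --- | --- | --- | --- | --- | --- |
| Adam | 93.05 | 97.92 | 98.10 | 98.94 | 98.67 | Just ordinary Adam |
| AdamW | 93.67 | 98.00 | 98.19 | 98.86 | 98.82 | Used in BERT |
| LAMB | 97.68 | 98.82 | 98.27 | 98.61 | 98.47 | New optimizer for large batch |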

Batch=16384

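| Optimizer | MLP | CNN | Bi-RNN | Bi-GRU | Bi-LSTM | Note |
| --- | --- | --- | --- | --- | --- | --- |
| Adam | 88.46 | 95.06 | 95.98 | 97.81 | 97.74 | Just ordinary Adam |
| AdamW | 91.46 | 96.57 | 96.34 | 98.45 | 98.39 | Used in BERT |
| LAMB | 93.23 | 97.89 | 93.76 | 87.60 | 80.36 | New optimizer for large batch |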

Several Conclusions

Note: These conclusions are drawn only from the results above.

Reproducibility

Check mnist_tensorflow.ipynb for details.

Note: Runs on GPU/TPU will not produce exactly the same results, even with a fixed random seed.
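One likely contributor to this (an assumption here, not something verified for the notebook) is that floating-point addition is not associative, so a parallel device that sums values in a different order can produce a slightly different result from the same inputs. The sketch below shows the effect on the CPU with two hypothetical helpers, `sequential_sum` and `pairwise_sum`, that add the same numbers in different orders.

```python
def sequential_sum(xs):
    """Add values left to right, as a simple serial loop would."""
    total = 0.0
    for x in xs:
        total += x
    return total


def pairwise_sum(xs):
    """Add values in a tree-shaped order, similar to a parallel reduction."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])


values = [0.1, 0.2, 0.3]
serial = sequential_sum(values)   # (0.1 + 0.2) + 0.3
tree = pairwise_sum(values)       # 0.1 + (0.2 + 0.3)
print(serial == tree)             # False: the two orders round differently
```

The difference is tiny for a single sum, but over many training steps such rounding differences can accumulate into visibly different accuracy numbers between runs.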

References

Issues

For help or issues, please submit a GitHub issue.