Train With Mixed Precision
9.1. General FAQs
Q: What additional resources are available for how to use mixed precision?
A: Here are some additional resources that can help with understanding mixed precision:
- Mixed Precision Training (ICLR 2018).
- Mixed-Precision Training of Deep Neural Networks (NVIDIA Developer Blog).
Q: What is Automatic Mixed Precision (AMP) and how can it help with training my model?
A: Automatic Mixed Precision (AMP) makes all the required adjustments to train models using mixed precision, providing two benefits over manual operations:
- Developers need not modify network model code, reducing development and maintenance effort.
- Using AMP maintains forward and backward compatibility with all the APIs for defining and running models.
The benefits of mixed precision training are:
- Speed up of math-intensive operations, such as linear and convolution layers, by using Tensor Cores.
- Speed up memory-limited operations by accessing half the bytes compared to single-precision.
- Reduction of memory requirements for training models, enabling larger models or larger minibatches.
For more information, refer to Automatic Mixed Precision for Deep Learning.
Q: How does AMP automate mixed precision?
A: Using mixed precision training requires two steps:
- Porting the model to use the FP16 data type where appropriate.
- Using loss scaling to preserve small gradient values.
AMP automates both of these steps. In particular, in TF-AMP this is controlled by means of a single environment variable.
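For instance, a minimal sketch of the environment-variable approach in an NGC TensorFlow container might look like the following (the variable name is the one described in the TensorFlow FAQs below, and it is assumed to be set before the graph is built):
`
import os

# Enable the TF-AMP graph rewrite before TensorFlow builds the training graph.
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"

import tensorflow as tf
# ... define and train the model as usual; no changes to the model code are needed.
`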
Q: How does dynamic scaling work?
A: Dynamic loss scaling basically attempts to ride the edge of the highest loss scale it can use without causing gradient overflow, to make full use of the FP16 dynamic range.
It does so by beginning with a high loss scale value (say, 2^24), then in each iteration, checking the gradients for overflows (infs/NaNs). If none of the gradients overflowed, gradients are unscaled (in FP32) and optimizer.step() is applied as usual. If an overflow was detected, optimizer.step is patched to skip the actual weight update (so that the inf/NaN gradients do not pollute the weights) and the loss scale is reduced by some factor F (F=2 by default). This takes care of reducing the loss scale to a range where overflows are not produced. However, it's only half the story.
What if, at some later point, training has stabilized and a higher loss scale is permissible? For example, later in training, gradient magnitudes tend to be smaller, and may require a higher loss scale to prevent underflow. Therefore, dynamic loss scaling also attempts to increase the loss scale by a factor of F every N iterations (N=2000 by default). If increasing the loss scale causes an overflow once more, the step is skipped and the loss scale is reduced back to the pre-increase value as usual. In this way, by:
- Reducing the loss scale whenever a gradient overflow is encountered, and
- Intermittently attempting to increase the loss scale, the goal of riding the edge of the highest loss scale that can be used without causing overflow is (roughly) accomplished.
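For illustration only, the control flow described above can be sketched in PyTorch-style Python. This is not the framework's implementation; model, optimizer, criterion, data_loader, and the has_overflow helper are assumed or hypothetical:
`
import torch

loss_scale = 2.0 ** 24     # start high to make full use of the FP16 range
growth_interval = 2000     # N: try a higher loss scale every N successful steps
factor = 2.0               # F: growth/backoff factor
good_steps = 0

def has_overflow(params):
    # Hypothetical helper: True if any gradient contains an inf/NaN.
    return any(p.grad is not None and not torch.isfinite(p.grad).all() for p in params)

for inputs, targets in data_loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    (loss * loss_scale).backward()            # scale the loss to protect small gradients
    if has_overflow(model.parameters()):
        loss_scale /= factor                  # overflow: skip the update, reduce the scale
        good_steps = 0
        continue
    for p in model.parameters():              # unscale gradients (in FP32) before the update
        if p.grad is not None:
            p.grad.div_(loss_scale)
    optimizer.step()
    good_steps += 1
    if good_steps % growth_interval == 0:     # periodically attempt a higher loss scale
        loss_scale *= factor
`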
Q: How do you increase the batch size when AMP is enabled? Do you just increase the batch size by 2?
A: It depends on how much memory you saved, which depends on the model. A quick way is to run `watch -n 0.5 nvidia-smi` from a separate terminal while you launch your run, to see how much device memory you're using. In general, using a larger batch per GPU tends to improve utilization, as long as you obey the guidelines to allow Tensor Core usage (refer to Issue #221 for more information).
Q: How is AllowList/DenyList/InferList determined? What are the corresponding ops that are in each list?
A: We determine these based on our experience with numeric stability from our research. AllowList operations are operations that take advantage of our GPU Tensor Cores. DenyList operations are operations that may overflow the range of FP16, or require the higher precision of FP32. InferList operations are operations that are safely done in either FP32 or FP16. Typical ops included in each list are:
- AllowList: Convolutions, Fully-connected layers
- DenyList: Large reductions, Cross entropy loss, L1 Loss, Exponential
- InferList: Element-wise operations (add, multiply by a constant)
To view, modify, and recompile to experiment, or to use environment variables in our containers to modify the AllowList/DenyList, see:
- For TensorFlow, to modify, use this.
- For PyTorch, to modify, use this; you will need to reinstall APEX, however, you don't need to recompile PyTorch at this time.
Q: What are the minimum hardware and software requirements to use AMP?
A: In order to run AMP effectively, you need Tensor Cores in your GPU; for training, we recommend V100, and for inference, we recommend T4. You can access this hardware through cloud service providers (AWS, Azure, or Google Cloud).
On the framework side, TensorFlow 1.14 supports AMP natively, and AMP is also available in NVIDIA's containers 19.07+. In PyTorch 1.0, AMP is available through APEX.
Q: How do I enable AMP for my deep learning training?
A: Enabling AMP is framework dependent:
In TensorFlow, AMP is controlled by wrapping the optimizer as follows:
`
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
`
In PyTorch, AMP is available through the APEX extension:
`
from apex import amp

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
# Scale the loss so that small gradient values survive in FP16.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
`
In MXNet, AMP is available through the contrib library:
`
from mxnet import autograd
from mxnet.contrib import amp

amp.init()                 # patch MXNet operators for mixed precision
amp.init_trainer(trainer)  # let AMP manage the trainer's loss scaling
with amp.scale_loss(loss, trainer) as scaled_loss:
    autograd.backward(scaled_loss)
`
Q: What are the models that are suitable for AMP? And what kind of speed-up can I expect?
A: All models are suitable for AMP, although the speed-up may vary from model to model. The following table provides some examples of the speed-up for different models:
Table 3. FP32 Speedup and Mixed Precision Models
| Model Script | Framework | Data Set | FP32 Accuracy | Mixed Precision Accuracy | FP32 Throughput | Mixed Precision Throughput | Speed-up |
|---|---|---|---|---|---|---|---|
| BERT Q&A | TensorFlow | SQuAD | 90.83 Top 1% | 90.99 Top 1% | 66.65 sentences/sec | 129.16 sentences/sec | 1.94 |
| SSD w/RN50 | TensorFlow | COCO 2017 | 0.268 mAP | 0.269 mAP | 569 images/sec | 752 images/sec | 1.32 |
| GNMT | PyTorch | WMT16 English to German | 24.16 BLEU | 24.22 BLEU | 314,831 tokens/sec | 738,521 tokens/sec | 2.35 |
| Neural Collaborative Filter | PyTorch | MovieLens 20M | 0.959 HR | 0.960 HR | 55,004,590 samples/sec | 99,332,230 samples/sec | 1.81 |
| U-Net Industrial | TensorFlow | DAGM 2007 | 0.965-0.988 | 0.960-0.988 | 445 images/sec | 491 images/sec | 1.10 |
| ResNet-50 v1.5 | MXNet | ImageNet | 76.67 Top 1% | 76.49 Top 1% | 2,957 images/sec | 10,263 images/sec | 3.47 |
| Tacotron 2 / WaveGlow 1.0 | PyTorch | LJ Speech Dataset | 0.3629/-6.1087 | 0.3645/-6.0258 | 10,843 tok/s 257,687 smp/s | 12,742 tok/s 500,375 smp/s | 1.18/1.94 |
Values are measured with the model running on DGX-1V 8GPU 16G, DGX-1V 8GPU 32G, or DGX-2V 16GPU 32G.
When enabling AMP, there are other aspects to consider such as the reduction in memory and in bandwidth needed to train the mixed precision model.
Q: How much faster will my model run with AMP?
A: There are no precise rules for mixed precision speedups, but here are a few guidelines:
- The more time is spent in matrix multiplication (linear layers) or convolutions, the more Tensor Cores can accelerate the model. This means that "bigger" models often see larger speedups.
- In particular, very small linear and convolution layers will see limited benefit from AMP, since there is not enough math to fully exploit Tensor Cores.
- Mixed precision models use less memory than FP32, so it is possible to increase the batch size when running with AMP. Therefore, you can often increase the speedup by increasing the batch size after enabling AMP.
Q: How do I see reduced memory consumption?
A: In TensorFlow, set the `allow_growth` flag so that it only allocates the memory it needs, and view usage in `nvidia-smi`. For PyTorch, `nvidia-smi` can show memory utilization. The best way to test is to try a larger batch size that would otherwise have led to out-of-memory when AMP is not enabled.
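For example, assuming the TensorFlow 1.x session API, the `allow_growth` flag mentioned above can be set as follows so that `nvidia-smi` reports the memory the model actually needs rather than a pre-allocated pool:
`
import tensorflow as tf

# Grow GPU memory on demand instead of reserving it all up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
`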
Q: What if I have already implemented manual mixed precision, how will AMP further improve my model performance? What benefits should I expect from AMP?
A: If the code is already written in such a way as to follow the NVIDIA Mixed Precision Training Guide, then AMP will leave things as they are.
Q: Why do I observe only a little speedup with AMP turned on?
A: First, you need to identify the bottleneck in your workflow: is it data I/O or compute bound? To find out what is limiting the performance of your workflow, use DLProf to profile it.
If the slowest part of the workflow is in the GPU, check whether the layers of your model are actually making use of mixed precision. This can be done in a TensorBoard extension after profiling your network with DLProf, or manually by profiling with Nsight Systems or `nvprof` and looking for kernel names that include the strings `[i|s|h]884` or `[i|s|h]1688` (for example, `volta_h884gemm_…` or `turing_fp16_s1688cudnn_fp16_…`). Some layers of a network are DenyListed, meaning that they cannot use mixed precision for accuracy reasons; the DenyList is framework dependent. Refer to the framework-specific resources for more information.
Furthermore, Tensor Cores optimize GEMM (generalized (dense) matrix-matrix multiply) operations; there are restrictions on the dimensions of the matrices in order to effectively optimize such operations (see the sketch after this list):
- For A x B where A has size (M, K) and B has size (K, N):
- N, M, K should be multiples of 8
- GEMMs in fully connected layers:
- Batch size, input features, output features should be multiples of 8
- GEMMs in RNNs:
- Batch size, hidden size, embedding size, and dictionary size should be multiples of 8
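As a simple illustration of the multiples-of-8 rule, layer dimensions can be padded up to the next multiple of 8 before building the model; `pad_to_multiple_of_8` below is a hypothetical helper, not a framework API:
`
def pad_to_multiple_of_8(n):
    # Round a dimension up to the next multiple of 8 so its GEMMs can use Tensor Cores.
    return ((n + 7) // 8) * 8

vocab_size = pad_to_multiple_of_8(33708)   # 33708 -> 33712
hidden_size = pad_to_multiple_of_8(1019)   # 1019  -> 1024
batch_size = pad_to_multiple_of_8(96)      # 96 is already a multiple of 8
`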
Q: Is accuracy worse when AMP is turned on?
A: AMP is designed to leave accuracy unchanged with respect to FP32 training. In practice, we have never observed noticeable degradation of accuracy when training with AMP.
Q: What if the model code crashes after I have enabled AMP?
A: First, make sure that your model doesn’t crash without using AMP. Then, if you have experienced such issues after enabling AMP, file a bug.
Q: How do I know that AMP is working for me or Tensor Cores are being enabled?
A: The log outputs whether AMP is working, and the message is framework specific. In TensorFlow, for instance, TF-AMP log messages are of the form:
`
Converted 405/4897 nodes to float16 precision using 2 cast(s) to float16 (excluding Const and Variable casts)
`
9.2. TensorFlow FAQs
Q: Is Automatic Mixed Precision (AMP) dependent on a TensorFlow version or can any TensorFlow version enable AMP?
A: AMP is available in the NGC TensorFlow containers starting from 19.03, where it is enabled using the TF_ENABLE_AUTO_MIXED_PRECISION=1 environment variable. It is now also enabled by wrapping the optimizer object as follows:
`
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
`
More information is available in the following webinar. Starting with TensorFlow 1.14, AMP will be available natively in the framework.
Q: What is the scheme for TensorFlow to decide which operations to cast to FP16 (which level of the graph or where to decide)? Does TensorFlow also keep a DenyList and an AllowList like PyTorch?
A: Our GTC Silicon Valley session S91029, Automated Mixed-Precision Tools for TensorFlow Training discusses how this works. TensorFlow also uses the DenyList and AllowList concepts, but with some subtle differences because TensorFlow has the advantage of a static graph to analyze and convert.
Q: What is TF-AMP and what is the goal?
A: The top-level goal is that our customers who use TensorFlow to train on V100 have a great mixed precision training experience, utilizing all of the acceleration the hardware offers. That means accuracy that matches FP32 and real speedups without much manual effort. In practice, achieving that goal requires a few things to happen:
- Correctly porting the model to mixed precision. This means updating dtypes in code to FP16, as well as making sure that numerically "unsafe" operations stay in FP32.
- Using loss scaling to avoid gradient flush-to-zero (important for accuracy).
- The existence of fast FP16 kernels for the relevant operations, along with the software stack from the user down to the kernels ensuring that those kernels get called correctly.
Q: How is TF-AMP implemented?
A: TF-AMP optimizes the model graph mainly by:
- Inserting the appropriate cast operations into your TensorFlow graph to use FP16 execution and storage where appropriate; this enables both the use of Tensor Cores and savings in memory storage and bandwidth.
- Turning on automatic loss scaling inside the Optimizer object.
It is possible to separately enable the automatic insertion of cast operations and automatic loss scaling. For more details, refer to this NVIDIA TensorFlow User Guide.
It must be emphasized that this is only one part of making mixed precision successful; the most important part is to ensure that these changes do not reduce accuracy.
Q: Is AMP dependent on a TensorFlow version or can any TensorFlow version enable AMP?
A: AMP is available in the NGC TensorFlow containers:
- The environment variable method for enabling TF-AMP is available starting in 19.03.
- The optimizer-wrapper method for enabling TF-AMP is available starting in the 19.06 container.
Furthermore, AMP is available with the official distribution of TensorFlow starting with version 1.14. More information is available in the following webinar.
Q: How does AMP know which layer of the model to optimize?
A: AMP maintains lists of the layers that can be optimized:
- AllowList: cast everything to FP16
- DenyList: cast everything to FP32
- Everything else: cast everything to match the widest input type (can’t allow type mismatch)
The TensorFlow list is located here. Compared to other frameworks, TensorFlow has the advantage of a static graph to analyze and convert.
Our GTC Silicon Valley session S91029, Automated Mixed-Precision Tools for TensorFlow Training discusses how this works in more detail.
Q: How can I see what changes automatic mixed precision makes to my model?
A: Because automatic mixed precision operates at the level of TensorFlow graphs, it can be challenging to quickly grasp the changes it makes: often it will tweak thousands of TensorFlow operations, but those correspond to many fewer logical layers. You can set the environment variable TF_CPP_VMODULE="auto_mixed_precision=2"
to see a full log of the decisions automatic mixed precision makes (note that this may generate a lot of output).
Q: Why do I see only FP32 datatypes in my saved model GraphDef?
A: When you save a model graph or inspect the graph with Session.graph or Session.graph_def, TensorFlow returns the unoptimized version of the graph. TF-AMP works as an optimization pass over the original graph, so its changes are not included in the unoptimized graph. You can set the environment variable TF_AMP_LOG_PATH=some_directory, and TF-AMP will save pre- and post-optimization copies of each graph it processes to that directory.
Note:
There will be many hard-to-distinguish graph files since TensorFlow processes initialization (for example) as a disjoint graph.
Q: Why do I see step=0 repeated multiple times when training with TF-AMP?
A: The automatic loss scaling algorithm that TF-AMP enables can choose to “skip” training iterations as it searches for the optimal loss scale. When it does so, it does not increment the global step count. Since most of the skips occur at the beginning of training (usually fewer than ten iterations), this behavior manifests as multiple iterations where the step counter stays at zero.
Q: How are user-defined custom TF operations handled?
A: By default, TF-AMP will leave alone any op types it doesn’t know about, including custom operations. That means the types of the op’s inputs and outputs are not changed, and TF-AMP will insert casts as necessary to interoperate with the rest of the (possibly-changed) graph. If you would like to make TF-AMP aware of a custom op type, there are three environment variables you can use:
TF_AMP_ALLOWLIST_ADD
These are ops for which it is worth casting the inputs to FP16 to get FP16 execution. Mostly, they are ops that can take advantage of Tensor Cores.
TF_AMP_INFERLIST_ADD
These are ops for which FP16 execution is available, so they can use FP16 if the inputs happen to already be in FP16 because of an upstream AllowList op.
TF_AMP_DENYLIST_ADD
These are ops for which FP32 is necessary for numerical precision, and the outputs are not safe to cast back to FP16. Example ops include Exp and Log.
Each of these environment variables takes a comma-separated list of string op names. For example, you might set export TF_AMP_ALLOWLIST_ADD=MyOp1,MyOp2. The op name is the string name used in the call to REGISTER_OP, which corresponds to the name attribute on the operation’s OpDef.
Q: Can I change the algorithmic behavior of automatic mixed precision?
A: The primary lever for controlling automatic mixed precision behavior is to manipulate what ops lie on each of the AllowList, InferList, and DenyList. You can add ops to each using the three environment variables above, and there is a corresponding variable TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_{ALLOWLIST,INFERLIST,DENYLIST}_REMOVE
to take built-in ops off of each list.
Q: Why doesn’t my model achieve full accuracy when I enable AMP?
A: The most likely explanation is that loss scaling is not being applied during gradient evaluation. This can happen if the optimizer is not wrapped by tf.train.experimental.enable_mixed_precision_graph_rewrite() or if gradients are computed directly using tf.gradients() rather than with Optimizer.minimize() or Optimizer.compute_gradients().
Q: Do we have examples or documentation showing how to use AMP with tf.gradients() along with static and/or dynamic loss scaling?
A: For static loss scaling, it’s straightforward:
`
loss = some_loss()
loss *= loss_scale  # Scale by the loss scale
scaled_grads = tf.gradients(loss, …)  # Compute gradients
# Now unscale, handling sparse grads
grads = []
for scaled_grad in scaled_grads:
    if scaled_grad is not None:
        if isinstance(scaled_grad, tf.IndexedSlices):
            grads.append(tf.IndexedSlices(
                scaled_grad.values * (1. / loss_scale),
                scaled_grad.indices,
                scaled_grad.dense_shape))
        else:
            grads.append(scaled_grad * (1. / loss_scale))
    else:
        grads.append(None)
# Now use grads as you would normally
`
9.3. PyTorch FAQs
Q: Is Automatic Mixed Precision (AMP) dependent on a PyTorch version or can any PyTorch version enable AMP?
A: AMP with CUDA and CPP extensions requires PyTorch 1.0 or later. The Python-only build might be able to work with PyTorch 0.4, however, 1.0+ is strongly recommended.
Q: How does dynamic scaling choose a good scaling factor?
A: Dynamic loss scaling basically attempts to ride the edge of the highest loss scale it can use without causing gradient overflow, to make full use of the FP16 dynamic range.
It does so by beginning with a high loss scale value (say, 2^24), then in each iteration, checking the gradients for overflows (infs/NaNs). If none of the gradients overflowed, gradients are unscaled (in FP32) and optimizer.step() is applied as usual. If an overflow was detected, optimizer.step is patched to skip the actual weight update (so that the inf/NaN gradients do not pollute the weights) and the loss scale is reduced by some factor F (F=2 by default). This takes care of reducing the loss scale to a range where overflows are not produced. However, it's only half the story.
What if, at some later point, training has stabilized and a higher loss scale is permissible? For example, later in training, gradient magnitudes tend to be smaller, and may require a higher loss scale to prevent underflow. Therefore, dynamic loss scaling also attempts to increase the loss scale by a factor of F every N iterations (N=2000 by default). If increasing the loss scale causes an overflow once more, the step is skipped and the loss scale is reduced back to the pre-increase value as usual. In this way, by:
- Reducing the loss scale whenever a gradient overflow is encountered, and
- Intermittently attempting to increase the loss scale, the goal of riding the edge of the highest loss scale that can be used without causing overflow is (roughly) accomplished.
Q: How do you increase the batch size when AMP is enabled? Do you just increase the batch size by 8?
A: It depends on how much memory you saved, which depends on the model. A quick-and-dirty way is to run `watch -n 0.5 nvidia-smi` from a separate terminal while you launch your run, to see how much device memory you're using. In general, using a larger batch per GPU tends to improve utilization, as long as you obey the guidelines to allow Tensor Core usage (refer to Issue #221 for more information).
Q: How do I use O0, O1, O2, and O3? Which is recommended for AMP? What are the differences?
A:
- Use O0 for the baseline FP32 run.
- Use O1 for AMP; this is the recommended level (see the sketch after this list).
Note:
In the future, AMP O1 functionality will be moved upstream.
- O2 is slightly faster, but could be harder to converge/stabilize, or may not converge to FP32 results. In O2, all the ops are in FP16, so it is generally not recommended.
- Use O3 for everything in FP16, with no primary weights. O3 is intended for performance comparison to see the AMP overhead.
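A minimal sketch of choosing an opt level with APEX (opt_level is the standard apex.amp.initialize argument; the model and optimizer are assumed to be defined already):
`
from apex import amp

# O1 (recommended): cast to FP16 where safe, keep FP32 primary weights,
# and use dynamic loss scaling.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# For comparison runs, swap in "O0" (FP32 baseline) or "O3" (pure FP16).
`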
Q: Can AMP save checkpoints of the model in FP32?
A: With O1, checkpoints of the model will be saved in FP32; with O2, checkpoints of the model will not be saved in FP32, and the optimizer primary weights must be saved separately. The best practice is always to use O1 to save checkpoints.
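If you are using APEX, a checkpointing sketch that also captures the loss-scaler state is shown below; it assumes apex.amp's state_dict()/load_state_dict() helpers and an arbitrary file name:
`
import torch
from apex import amp

# Save model, optimizer, and AMP (loss scaler) state together.
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "amp": amp.state_dict(),
}
torch.save(checkpoint, "checkpoint.pt")

# Later, restore after calling amp.initialize() on the rebuilt model and optimizer.
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
amp.load_state_dict(checkpoint["amp"])
`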
9.4. MXNet FAQs
Q: Is Automatic Mixed Precision (AMP) dependent on an MXNet version or can any MXNet version enable AMP?
A: AMP is available in the NGC MXNet container starting from 19.04. Starting with MXNet 1.5, AMP will be available natively in the upstream framework.