Scale Up Deep Learning in Parallel, on GPUs, and in the Cloud - MATLAB & Simulink
Training deep networks is computationally intensive and can take many hours of computing time; however, neural networks are inherently parallel algorithms. You can take advantage of this parallelism by running in parallel using high-performance GPUs and computer clusters.
Training using a GPU or multiple GPUs is recommended. Use a single CPU or multiple CPUs only if you do not have a GPU. CPUs are normally much slower than GPUs for both training and inference, and running on a single GPU typically offers much better performance than running on multiple CPU cores.
If you do not have a suitable GPU, you can rent high-performance GPUs and clusters in the cloud. For more information on how to access MATLAB® in the cloud for deep learning, see Deep Learning in the Cloud.
Using a GPU or parallel options requires Parallel Computing Toolbox™. Using a GPU also requires a supported GPU device. For information on supported devices, see GPU Computing Requirements (Parallel Computing Toolbox). Using a remote cluster also requires MATLAB Parallel Server™.
Tip
For trainnet workflows, GPU support is automatic. By default, the trainnet function uses a GPU if one is available. If you have access to a machine with multiple GPUs, specify the ExecutionEnvironment training option as "multi-gpu".
To run custom training workflows on the GPU, use minibatchqueue to automatically convert data to gpuArray objects.
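For example, a minimal sketch of multi-GPU training with trainnet might look like the following. The datastore imds, the layer array layers, and the training parameters are placeholder assumptions; only the ExecutionEnvironment option is the point of the example.

```matlab
% imds is an assumed imageDatastore and layers an assumed layer array.
% ExecutionEnvironment="multi-gpu" uses all available local GPUs.
options = trainingOptions("sgdm", ...
    MaxEpochs=10, ...
    MiniBatchSize=128, ...
    ExecutionEnvironment="multi-gpu");

net = trainnet(imds,layers,"crossentropy",options);
```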
You can use parallel resources to scale up deep learning for a single network. You can also train multiple networks simultaneously. The following sections show the available options for deep learning in parallel in MATLAB:
Note
If you run MATLAB on a single remote machine, for example, a cloud machine that you connect to via SSH or remote desktop protocol, then follow the steps for local resources. For more information on connecting to cloud resources, see Deep Learning in the Cloud.
Train Single Network in Parallel
Use Local Resources to Train Single Network in Parallel
The following table shows you the available options for training and inference with a single network on your local workstation.
Resource | trainnet Workflows | Custom Training Workflows | Required Products |
---|---|---|---|
Single CPU | Automatic if no GPU is available. Training using a single CPU is not recommended. | Training using a single CPU is not recommended. | MATLAB, Deep Learning Toolbox™ |
Multiple CPU cores | Training using multiple CPU cores is not recommended if you have access to a GPU. | Training using multiple CPU cores is not recommended if you have access to a GPU. | MATLAB, Deep Learning Toolbox, Parallel Computing Toolbox |
Single GPU | Automatic. By default, training and inference run on the GPU if one is available. Alternatively, specify the ExecutionEnvironment training option as "gpu". | Use minibatchqueue to automatically convert data to gpuArray objects. For more information, see Run Custom Training Loops on a GPU and in Parallel. For an example, see Train Network Using Custom Training Loop. | |
Multiple GPUs | Specify the ExecutionEnvironment training option as "multi-gpu". For an example, see Train Network Using Automatic Multi-GPU Support. | Start a local parallel pool with as many workers as available GPUs. For more information, see Deep Learning with MATLAB on Multiple GPUs. Use parpool to execute training or inference with a portion of a mini-batch on each worker. Convert each partial mini-batch of data to gpuArray objects. For training, aggregate gradients, loss, and state parameters after each iteration. For more information, see Run Custom Training Loops on a GPU and in Parallel. For an example, see Train Network in Parallel with Custom Training Loop. Set the executionEnvironment variable to "auto" or "gpu". | |
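As a sketch of the custom training workflow rows above, the following shows how minibatchqueue can move data to the GPU automatically. The datastore ds, the mini-batch size, the "SSCB" format, and the helper function preprocessMiniBatch are assumptions for illustration.

```matlab
% ds is an assumed datastore; preprocessMiniBatch is a hypothetical helper
% that formats one mini-batch. With OutputEnvironment="auto", minibatchqueue
% returns gpuArray data when a supported GPU is available.
mbq = minibatchqueue(ds, ...
    MiniBatchSize=128, ...
    MiniBatchFcn=@preprocessMiniBatch, ...
    MiniBatchFormat="SSCB", ...
    OutputEnvironment="auto");

while hasdata(mbq)
    X = next(mbq);   % X is a gpuArray dlarray when a GPU is available
    % ... evaluate the model gradients and update the learnables here ...
end
```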
Use Remote Cluster Resources to Train Single Network in Parallel
The following table shows you the available options for training and inference with a single network on a remote cluster.
Resource | trainnet Workflows | Custom Training Workflows | Required Products |
---|---|---|---|
Any | Specify the desired cluster as your default cluster profile. For more information, see Manage Cluster Profiles and Automatic Pool Creation. Specify the ExecutionEnvironment training option as "parallel-auto". If the pool has access to GPUs, then only workers with a unique GPU perform training computation and excess workers become idle. If the pool does not have GPUs, then training takes place on all available CPU workers instead. | Specify the desired cluster as your default cluster profile. For more information, see Manage Cluster Profiles and Automatic Pool Creation. Use parpool to execute training or inference with a portion of a mini-batch on each worker. For training, aggregate gradients, loss, and state parameters after each iteration. By default, the software performs calculations using only the CPU. For an example, see Train Network in Parallel with Custom Training Loop. Set the executionEnvironment variable to "cpu". | MATLAB, Deep Learning Toolbox, Parallel Computing Toolbox, MATLAB Parallel Server |
Multiple CPUs | Training using multiple CPU cores is not recommended if you have access to a GPU. Specify the desired cluster as your default cluster profile. For more information, see Manage Cluster Profiles and Automatic Pool Creation. Specify the ExecutionEnvironment training option as "parallel-cpu". If the pool has access to GPUs, the GPUs are not used. | Training using multiple CPU cores is not recommended if you have access to a GPU. | |
Multiple GPUs | Specify the desired cluster as your default cluster profile. For more information, see Manage Cluster Profiles and Automatic Pool Creation. Specify the ExecutionEnvironment training option as "parallel-auto" or "parallel-gpu". If you use the "parallel-auto" option and the pool has access to GPUs, then only workers with a unique GPU perform training computation and excess workers become idle. If the pool does not have GPUs, then training takes place on all available CPU workers instead. If you use the "parallel-gpu" option, then workers with a unique GPU perform training and excess workers become idle. If the pool does not have GPUs, then the software throws an error. | Start a parallel pool in the desired cluster with as many workers as available GPUs. For more information, see Deep Learning with MATLAB on Multiple GPUs. Use parpool to execute training or inference with a portion of a mini-batch on each worker. Convert each partial mini-batch of data to gpuArray objects. For training, aggregate gradients, loss, and state parameters after each iteration. For more information, see Run Custom Training Loops on a GPU and in Parallel. For an example, see Train Network in Parallel with Custom Training Loop. Set the executionEnvironment variable to "auto" or "gpu". | |
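For instance, a trainnet workflow on a remote cluster might look like the following sketch. The profile name "MyCluster" is a hypothetical cluster profile created in the Cluster Profile Manager, and imds and layers are assumed to exist as in the earlier examples.

```matlab
% "MyCluster" is a hypothetical cluster profile. Setting it as the default
% lets trainnet open a pool on that cluster automatically.
parallel.defaultClusterProfile("MyCluster");

options = trainingOptions("adam", ...
    MiniBatchSize=256, ...
    ExecutionEnvironment="parallel-auto");   % use cluster GPUs if available, else CPUs

net = trainnet(imds,layers,"crossentropy",options);
```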
Train Multiple Networks in Parallel
Use Local or Remote Cluster Resources to Train Multiple Networks in Parallel
To train multiple networks in parallel, train each network on a different parallel worker. You can modify the network or training parameters on each worker to perform parameter sweeps in parallel.
Use parfor (Parallel Computing Toolbox) or parfeval (Parallel Computing Toolbox) to train a single network on each worker. To run in the background without blocking your local MATLAB, use parfeval. You can plot results using the OutputFcn training option.
You can run locally or using a remote cluster. Using a remote cluster requires MATLAB Parallel Server.
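For example, a learning-rate sweep using parfor might look like the following sketch. The datastore imds, the layer array layers, and the set of learning rates are placeholder assumptions.

```matlab
% Sweep over learning rates, training one network per parallel worker.
% imds and layers are assumed to exist; learnRates is a hypothetical sweep.
learnRates = [0.01 0.005 0.001 0.0005];
nets = cell(1,numel(learnRates));

parfor i = 1:numel(learnRates)
    options = trainingOptions("sgdm", ...
        InitialLearnRate=learnRates(i), ...
        MaxEpochs=10, ...
        Verbose=false);
    nets{i} = trainnet(imds,layers,"crossentropy",options);
end
```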
Use Experiment Manager to Train Multiple Networks in Parallel
You can use Experiment Manager to run trials on multiple parallel workers simultaneously. Set up your parallel environment and, on the Experiment Manager toolstrip, set Mode to Simultaneous before running your experiment. Experiment Manager runs as many simultaneous trials as there are workers in your parallel pool. For more information, see Run Experiments in Parallel.
Batch Deep Learning
You can offload deep learning computations to run in the background using the batch (Parallel Computing Toolbox) function. This means that you can continue using MATLAB while your computation runs in the background, or you can close your client MATLAB and fetch results later.
You can run batch jobs in a local or remote cluster. To offload your deep learning computations, use batch to submit a script or function that runs in the cluster. You can perform any kind of deep learning computation as a batch job, including parallel computations. For an example, see Send Deep Learning Batch Job to Cluster.
When you submit a batch job as a script, by default, workspace variables are copied from the client to the workers. To avoid copying workspace variables to the workers, submit batch jobs as functions.
To run in parallel, use a script or function that contains the same code that you would use to run in parallel locally or in a cluster. For example, your script or function can run trainnet with the ExecutionEnvironment training option set to "parallel-auto", or run a custom training loop in parallel. Use batch to submit the script or function to the cluster and use the Pool option to specify the number of workers you want to use. For more information on running parallel computations with batch, see Run Batch Parallel Jobs (Parallel Computing Toolbox).
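For example, submitting a parallel training function as a batch job might look like the following sketch. Here, trainMyNetwork is a hypothetical function that calls trainnet with the ExecutionEnvironment option set to "parallel-auto" and returns the trained network, and "MyCluster" is a hypothetical cluster profile.

```matlab
% trainMyNetwork is a hypothetical function with one output and no inputs.
% Pool=4 requests four additional workers for the parallel pool inside the job.
job = batch(@trainMyNetwork,1,{}, ...
    Pool=4, ...
    Profile="MyCluster");
```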
To run deep learning computation on multiple networks, it is recommended to submit a single batch job for each network. Doing so avoids the overhead required to start a parallel pool in the cluster and allows you to use the job monitor to observe the progress of each network computation individually.
You can submit multiple batch jobs. If the submitted jobs require more workers than are currently available in the cluster, then later jobs are queued until earlier jobs have finished. Queued jobs start when enough workers are available to run the job.
Depending on your cluster, there might be several options for providing training data to your cluster workers. If your data set is large, you should consult your cluster administrator for advice.
- If your cluster has a shared network file system, then this is the simplest method for sharing data, as workers can directly access the data. Because the default search paths of the workers might not be the same as those of your client MATLAB, ensure that workers in the cluster have access to the required files by specifying paths to add to the workers using the AdditionalPaths option of the batch function. For more information, see Share Code with Workers (Parallel Computing Toolbox).
- You can use cloud storage, which is particularly useful if you are using a cloud cluster for training. Cloud storage can be convenient, but accessing training data from remote storage can slow down network training. For more information, see Work with Deep Learning Data in the Cloud.
- You can copy the training data to each worker. This ensures that cluster workers have local access to the training data, which improves training speed, but creating multiple copies of the data might not be appropriate if your data set is large. For an example showing how to copy the data to each worker, see Send Deep Learning Batch Job to Cluster.
To retrieve results after the job is finished, use the fetchOutputs (Parallel Computing Toolbox) function. fetchOutputs retrieves all variables in the batch worker workspace. When you submit batch jobs as a script, by default, workspace variables are copied from the client to the workers. To avoid retrieving unnecessary copies of these workspace variables along with your results, submit batch jobs as functions instead of as scripts.
You can use the diary (Parallel Computing Toolbox) function to capture command line output while running batch jobs. This can be useful when executing the trainnet function with the Verbose option set to true.
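Continuing the batch job sketch above, you might retrieve results and the captured command-line output like this; the job variable is the batch job from the earlier example.

```matlab
wait(job);                   % block until the batch job finishes
results = fetchOutputs(job); % retrieve results from the finished job
diary(job);                  % display captured command-line output, such as trainnet progress
```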
Manage Cluster Profiles and Automatic Pool Creation
Parallel Computing Toolbox comes preconfigured with the cluster profile Processes for running parallel code on your local desktop machine. By default, MATLAB starts all parallel pools using the Processes cluster profile. If you want to run code on a remote cluster, you must start a parallel pool using the remote cluster profile. You can manage cluster profiles using the Cluster Profile Manager. For more information about managing cluster profiles, see Discover Clusters and Use Cluster Profiles (Parallel Computing Toolbox).
Some functions, including trainnet, parfor, and parfeval, can automatically start a parallel pool. To take advantage of automatic parallel pool creation, set your desired cluster as the default cluster profile in the Cluster Profile Manager. Alternatively, you can create the pool manually and specify the desired cluster resource when you create the pool.
If you want to use multiple GPUs in a remote cluster to train multiple networks in parallel or for custom training loops, best practice is to manually start a parallel pool in the desired cluster with as many workers as available GPUs. For more information, see Deep Learning with MATLAB on Multiple GPUs.
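A minimal sketch of starting such a pool is shown below. The profile name "MyCluster" and the number of GPUs are assumptions; for a local pool, you could instead query the count with gpuDeviceCount("available").

```matlab
% "MyCluster" is a hypothetical cluster profile. Start one worker per GPU.
numGPUs = 4;                            % assumed number of GPUs in the cluster
pool = parpool("MyCluster",numGPUs);
```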
Deep Learning Precision
For best performance, it is recommended to use a GPU for all deep learning workflows. Because single-precision and double-precision performance of GPUs can differ substantially, it is important to know in which precision computations are performed. Typically, GPUs offer much better performance for calculations in single precision.
If you only use a GPU for deep learning, then single-precision performance is one of the most important characteristics of a GPU. If you also use a GPU for other computations using Parallel Computing Toolbox, then high double-precision performance is important. This is because many functions in MATLAB use double-precision arithmetic by default. For more information, see Perform Calculations in Single Precision (Parallel Computing Toolbox).
By default, the software performs computations using single-precision, floating-point arithmetic to train a neural network using the trainnet function. The trainnet function returns a network with single-precision learnables and state parameters.
When you use prediction or validation functions with a dlnetwork object with single-precision learnable and state parameters, the software performs the computations using single-precision, floating-point arithmetic.
When you use prediction or validation functions with a dlnetwork object with double-precision learnable and state parameters:
- If the input data is single precision, the software performs the computations using single-precision, floating-point arithmetic.
- If the input data is double precision, the software performs the computations using double-precision, floating-point arithmetic.
For custom training workflows, it is recommended to convert data to single precision for training and inference. If you use minibatchqueue to manage mini-batches, your data is converted to single precision by default.
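For example, if you prepare data manually rather than with minibatchqueue, a minimal sketch of the conversion might look like this; X is assumed to be numeric image data in SSCB layout.

```matlab
% Convert assumed numeric training data X to single precision before
% wrapping it in a formatted dlarray for a custom training loop.
X = single(X);
X = dlarray(X,"SSCB");
```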
Reproducibility
To provide the best performance, deep learning using a GPU in MATLAB is not guaranteed to be deterministic. Depending on your network architecture, under some conditions you might get different results when using a GPU to train two identical networks or make two predictions using the same network and data. If you require determinism when performing deep learning operations using a GPU, use the deep.gpu.deterministicAlgorithms function (since R2024b).
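For example, a sketch of a fully seeded, deterministic GPU setup might look like the following. Note that deterministic algorithms can reduce performance, and the exact seeding values are arbitrary choices for illustration.

```matlab
% Seed the CPU and GPU random number generators, then request deterministic
% GPU algorithms (since R2024b). Deterministic algorithms can be slower.
rng(0)
gpurng(0)
previousState = deep.gpu.deterministicAlgorithms(true);

% ... train the network or run inference here ...

% Restore the previous setting afterward.
deep.gpu.deterministicAlgorithms(previousState);
```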
See Also
trainnet | trainingOptions | dlnetwork | minibatchqueue | Deep Network Designer | Experiment Manager
Topics
- Deep Learning with MATLAB on Multiple GPUs
- Resolve GPU Memory Issues
- Run MATLAB Using GPUs in the Cloud (Parallel Computing Toolbox)
- Deep Learning with Big Data
- Deep Learning in the Cloud
- Train Deep Learning Networks in Parallel
- Use batch to Train Multiple Deep Learning Networks
- Work with Deep Learning Data in the Cloud
- Run Experiments in Parallel