Improve parfor Performance - MATLAB & Simulink
You can improve the performance of `parfor`-loops in several ways: create arrays in parallel inside the loop, profile `parfor`-loops, slice arrays, and optimize your code on local workers before running on a cluster.
Where to Create Arrays
When you create a large array in the client before your `parfor`-loop and access it within the loop, you might observe slow execution of your code. To improve performance, tell each MATLAB® worker to create its own arrays, or portions of them, in parallel. This saves the time of transferring the data from client to workers, because each worker creates its own copy inside the loop. Consider changing your usual practice of initializing variables before a `for`-loop; you might find that creating arrays in parallel inside the loop improves performance.
Performance improvement depends on several factors, including:
- size of the arrays
- time needed to create arrays
- worker access to all or part of the arrays
- number of loop iterations that each worker performs
Consider all of these factors when deciding whether to convert `for`-loops to `parfor`-loops. For more details, see Convert for-Loops into parfor-Loops.
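As a minimal sketch of such a conversion (the loop body and sizes here are illustrative, not from a specific example in this page): each iteration is independent and writes to a sliced output variable, so the loop qualifies for `parfor`.

```matlab
% Serial version: iterations are independent, so this is a parfor candidate.
y = zeros(1, 100);
for i = 1:100
    y(i) = max(abs(eig(rand(50))));
end

% Parallel version: same loop body; y is a sliced output variable,
% so each worker returns only the elements it computed.
y = zeros(1, 100);
parfor i = 1:100
    y(i) = max(abs(eig(rand(50))));
end
```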
As an alternative, consider using the `parallel.pool.Constant` function to establish variables on the pool workers before the loop. These variables remain on the workers after the loop finishes and stay available across multiple `parfor`-loops. Using `parallel.pool.Constant` can improve performance because the data is transferred to the workers only once.
In this example, you first create a large data set `D` and execute a `parfor`-loop that accesses `D`. Then you use `D` to build a `parallel.pool.Constant` object, which lets you reuse the data by copying `D` to each worker only once. Measure the elapsed time using `tic` and `toc` for each case and note the difference.
function constantDemo
D = rand(1e7, 1);
tic
for i = 1:20
    a = 0;
    parfor j = 1:60
        a = a + sum(D);
    end
end
toc
tic
D = parallel.pool.Constant(D);
for i = 1:20
    b = 0;
    parfor j = 1:60
        b = b + sum(D.Value);
    end
end
toc
end
constantDemo
Starting parallel pool (parpool) using the 'Processes' profile ... connected to 4 workers.
Elapsed time is 63.839702 seconds.
Elapsed time is 10.194815 seconds.
In the second case, you send the data to the workers only once. You can enhance the performance of the `parfor`-loop by using the `parallel.pool.Constant` object.
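If the data can be generated programmatically, you can avoid the transfer entirely by passing a function handle to `parallel.pool.Constant`, so that each worker builds its own copy locally. This is a sketch; the array size and the reduction are illustrative.

```matlab
% Each worker evaluates the function handle once to build its own copy of
% the data, so nothing is transferred from the client. Note: with rand,
% each worker gets a *different* random vector; seed the generator on the
% workers if they must all agree on the data.
C = parallel.pool.Constant(@() rand(1e7, 1));

s = 0;
parfor j = 1:60
    s = s + sum(C.Value);   % C.Value is the worker-local array
end
```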
Profiling parfor-loops
You can profile a `parfor`-loop by measuring the elapsed time using `tic` and `toc`. You can also measure how much data is transferred to and from the workers in the parallel pool by using `ticBytes` and `tocBytes`. Note that this is different from profiling MATLAB code in the usual sense with the MATLAB Profiler; see Profile Your Code to Improve Performance.
This example calculates the spectral radius of a matrix and converts a `for`-loop into a `parfor`-loop. Measure the resulting speedup and the amount of transferred data.
- In the MATLAB Editor, enter the following `for`-loop. Add `tic` and `toc` to measure the elapsed time. Save the file as `MyForLoop.m`.
function a = MyForLoop(A)
tic
for i = 1:200
a(i) = max(abs(eig(rand(A))));
end
toc
end
- Run the code, and note the elapsed time.
Elapsed time is 31.935373 seconds.
- In `MyForLoop.m`, replace the `for`-loop with a `parfor`-loop. Add `ticBytes` and `tocBytes` to measure how much data is transferred to and from the workers in the parallel pool. Save the file as `MyParforLoop.m`.
ticBytes(gcp);
parfor i = 1:200
a(i) = max(abs(eig(rand(A))));
end
tocBytes(gcp)
- Run the new code, and run it again. Note that the first run is slower than the second, because the parallel pool has to start and the code has to be made available to the workers. Note the elapsed time for the second run.
By default, MATLAB automatically opens a parallel pool of workers on your local machine.
Starting parallel pool (parpool) using the 'Processes' profile ... connected to 4 workers.
...
BytesSentToWorkers BytesReceivedFromWorkers
__________________ ________________________
1 15340 7024
2 13328 5712
3 13328 5704
4 13328 5728
Total 55324 24168
Elapsed time is 10.760068 seconds.
The elapsed time is 31.9 seconds in serial and 10.8 seconds in parallel, which shows that this code benefits from conversion to a `parfor`-loop.
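You can also capture the byte counts programmatically instead of printing them, by keeping the state object that `ticBytes` returns and passing it to `tocBytes`. This is a sketch, assuming a pool is already open; the matrix size 300 is illustrative.

```matlab
% tocBytes returns one row per worker:
% [BytesSentToWorkers, BytesReceivedFromWorkers]
startState = ticBytes(gcp);
parfor i = 1:200
    a(i) = max(abs(eig(rand(300))));
end
bytes = tocBytes(gcp, startState);

totalSent     = sum(bytes(:, 1));   % total bytes sent to all workers
totalReceived = sum(bytes(:, 2));   % total bytes received from all workers
```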
Slicing Arrays
If a variable is initialized before a `parfor`-loop and then used inside the loop, it has to be passed to each MATLAB worker evaluating the loop iterations. Only those variables used inside the loop are passed from the client workspace. However, if all occurrences of the variable are indexed by the loop variable, each worker receives only the part of the array it needs.
As an example, you first run a `parfor`-loop using a sliced variable and measure the elapsed time.
% Sliced version
M = 100; N = 1e6; data = rand(M, N);
tic
parfor idx = 1:M
    out2(idx) = sum(data(idx, :)) ./ N;
end
toc
Elapsed time is 2.261504 seconds.
Now suppose that you accidentally use a reference to the variable `data` instead of `N` inside the `parfor`-loop. The problem here is that the call to `size(data, 2)` converts the sliced variable `data` into a broadcast (non-sliced) variable.
% Accidentally non-sliced version
clear
M = 100; N = 1e6; data = rand(M, N);
tic
parfor idx = 1:M
    out2(idx) = sum(data(idx, :)) ./ size(data, 2);
end
toc
Elapsed time is 8.369071 seconds.
Note that the elapsed time is greater for the accidentally broadcast variable.
In this case, you can easily avoid the non-sliced usage of `data`, because the result of `size(data, 2)` is a constant that can be computed outside the loop. In general, perform computations that depend only on broadcast data before the loop starts, because broadcast data cannot be modified inside the loop. Here the computation is trivial and produces a scalar, so you benefit from moving it out of the loop.
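Hoisting the loop-invariant call restores the sliced access pattern, so each worker again receives only the rows it needs. A sketch of the corrected loop:

```matlab
M = 100; N = 1e6;
data = rand(M, N);

% Compute the loop-invariant value once, before the loop.
n = size(data, 2);

parfor idx = 1:M
    % data is now only ever indexed by the loop variable, so it stays sliced.
    out2(idx) = sum(data(idx, :)) ./ n;
end
```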
Optimizing on Local vs. Cluster Workers
Running your code on local workers offers the convenience of testing your application without requiring cluster resources. However, local workers have limitations. Because data transfer does not occur over a network, transfer behavior on local workers might not be indicative of how it typically occurs over a network.
With local workers, because all the MATLAB worker sessions run on the same machine, you might not see any improvement in execution time from a `parfor`-loop. This depends on many factors, including how many processors and cores your machine has. A cluster can have more cores available than your local machine; if MATLAB can multithread your code, then the only way to go faster is to apply more cores to the problem by using a cluster.
You might experiment to see whether it is faster to create the arrays before the loop (first example below), or to have each worker create its own arrays inside the loop (second example).
Try the following examples running a parallel pool locally, and notice the difference in execution time for each loop. First open a local parallel pool:
parpool
Run the following examples, and then run them again. Note that the first run of each case is slower than the second, because the parallel pool has to start and the code has to be made available to the workers. Note the elapsed time of the second run for each case.
Create arrays before the loop:
tic;
n = 200;
M = magic(n);
R = rand(n);
parfor i = 1:n
    A(i) = sum(M(i,:).*R(n+1-i,:));
end
toc
Create arrays inside the loop:
tic;
n = 200;
parfor i = 1:n
    M = magic(n);
    R = rand(n);
    A(i) = sum(M(i,:).*R(n+1-i,:));
end
toc
Running on a remote cluster, you might find different behavior, because workers can create their arrays simultaneously, saving transfer time. Therefore, code that is optimized for local workers might not be optimal for cluster workers, and vice versa.