Partition a Datastore in Parallel - MATLAB & Simulink (original) (raw)

Partitioning a datastore in parallel, with a portion of the datastore on each worker in a parallel pool, can provide benefits in many cases:

Read Data from Datastore in Parallel

This example shows how to use the partition function to parallelize the reading of data from a datastore. It uses a small datastore of airline data provided in MATLABĀ®, and finds the mean of the non-NaN values from its 'ArrDelay' column.

Serial Execution

A simple way to calculate the mean is to divide the sum of all the non-NaN values by the number of non-NaN values. The code in the sumAndCountArrivalDelay helper function does this for the datastore first in a non-parallel way.

First, delete any existing parallel pools.

Create a datastore from the collection of worksheets in airlinesmall_subset.xlsx and select the ArrDelay variables to import.

Use the function sumAndCountArrivalDelay to calculate the mean without any parallel execution. Use the tic and toc functions to time the execution, here and in the later parallel cases.

ds = spreadsheetDatastore(repmat({'airlinesmall_subset.xlsx'},20,1)); ds.SelectedVariableNames = 'ArrDelay'; reset(ds); tic [total,count] = sumAndCountArrivalDelay(ds)

Parallel Execution

The partition function allows you to partition the datastore into smaller parts, each represented as a datastore itself. These smaller datastores work completely independently of each other, so that you can work with them inside of parallel language features such as parfor loops and spmd blocks.

Use Automatic Partitions

You can use the numpartitions function to specify the number of partitions, which is based on the datastore itself and the parallel pool size. This does not necessarily equal the number of workers in the pool. Set the number of loop iterations to the number of partitions (N).

The following code starts a parallel pool on a local cluster, then partitions the datastore among workers for iterating over the loop. This code calls the helper function parforSumAndCountArrivalDelay, which includes a parfor loop to amass the count and sum totals in parallel loop iterations.

p = parpool('Processes',4);

Starting parallel pool (parpool) using the 'Processes' profile ... Connected to parallel pool with 4 workers.

reset(ds); tic [total,count] = parforSumAndCountArrivalDelay(ds)

Specify Number of Partitions

Rather than let the software calculate the number of partitions, you can explicitly set this value, so that the data can be appropriately partitioned to fit your algorithm. For example, to parallelize data from within an spmd block, you can specify the number of workers (spmdSize) as the number of partitions to use. The spmdSumAndCountArrivalDelay helper function uses an spmd block to perform a parallel read, and explicitly sets the number of partitions equal to the number of workers.

reset(ds); tic [total,count] = spmdSumAndCountArrivalDelay(ds)

When you are done with your computation, you can delete the current parallel pool.

Helper Functions

Create a helper function to amass the count and sum in a non-parallel way.

function [total,count] = sumAndCountArrivalDelay(ds) total = 0; count = 0; while hasdata(ds) data = read(ds); total = total + sum(data.ArrDelay,1,'OmitNaN'); count = count + sum(~isnan(data.ArrDelay)); end end

Create a helper function to amass the count and sum in parallel using parfor.

function [total, count] = parforSumAndCountArrivalDelay(ds) N = numpartitions(ds,gcp); total = 0; count = 0;
parfor ii = 1:N % Get partition ii of the datastore. subds = partition(ds,N,ii);

    [localTotal,localCount] = sumAndCountArrivalDelay(subds);
    total = total + localTotal;
    count = count + localCount;
end

end

Create a helper function to amass the count and sum in parallel using spmd.

function [total,count] = spmdSumAndCountArrivalDelay(ds) spmd subds = partition(ds,spmdSize,spmdIndex); [total,count] = sumAndCountArrivalDelay(subds);
end total = sum([total{:}]); count = sum([count{:}]); end

See Also

datastore | spreadsheetDatastore

More About