Distributing Arrays to Parallel Workers - MATLAB & Simulink

Using Distributed Arrays to Partition Data Across Workers

Depending on how your data fits in memory, choose one of the following methods:

Load Distributed Arrays in Parallel Using datastore

If your data does not fit in the memory of your local machine, but does fit in the memory of your cluster, you can use datastore with the distributed function to create distributed arrays and partition the data among your workers.

This example shows how to create and load distributed arrays using datastore. Create a datastore from a tabular file of airline flight data. This data set is too small to show equal partitioning of the data over the workers. To simulate a larger data set, artificially increase the size of the datastore using repmat.

files = repmat({'airlinesmall.csv'}, 10, 1);
ds = tabularTextDatastore(files);

Select the example variables.

ds.SelectedVariableNames = {'DepTime','DepDelay'};
ds.TreatAsMissing = 'NA';

Create a distributed table by reading the datastore in parallel. Partition the datastore with one partition per worker. Each worker then reads all data from the corresponding partition. The files must be in a shared location that is accessible by the workers.
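The call that performs this step is elided in this extract. Assuming the datastore ds from above, creating the distributed table is presumably a single call to distributed, which accepts a datastore and triggers the parallel pool startup shown below:

```matlab
% Read the datastore in parallel into a distributed table.
% Starts a parallel pool automatically if none is open.
dt = distributed(ds);
```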

Starting parallel pool (parpool) using the 'Processes' profile ... connected to 4 workers.

Display summary information about the distributed table.
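The command producing this display is not shown in the extract; for a distributed table, it is presumably the summary function:

```matlab
summary(dt)   % per-variable min, max, and NaN counts, as shown below
```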

Variables:

DepTime: 1,235,230×1 double
    Values:

        min          1
        max       2505
        NaNs    23,510

DepDelay: 1,235,230×1 double
    Values:

        min      -1036
        max       1438
        NaNs    23,510

Determine the size of the distributed table.
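The sizing call itself is elided here; for the distributed table dt it would be:

```matlab
size(dt)   % number of rows and variables of the distributed table
```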

Return the first few rows of dt.
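The call is not shown in the extract; assuming head supports distributed tables in your release, it would be:

```matlab
head(dt)   % gather the first few rows of dt to the client
```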

ans =

DepTime    DepDelay
_______    ________

 642       12      
1021        1      
2055       20      
1332       12      
 629       -1      
1446       63      
 928       -2      
 859       -1      
1833        3      
1041        1      

Finally, check how much data each worker has loaded.
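The inspection code is elided above. One way to reproduce this kind of per-worker display, assuming the variable dt2 in the output is the codistributed view of dt inside an spmd block, is:

```matlab
spmd
    % Inside spmd, the distributed table dt is seen as a codistributed
    % array; displaying it (no semicolon) prints the rows stored on
    % each worker along with its LocalPart and Codistributor.
    dt2 = dt
end
```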

Worker 1:

This worker stores dt2(1:370569,:).

      LocalPart: [370569×2 table]
  Codistributor: [1×1 codistributor1d]

Worker 2:

This worker stores dt2(370570:617615,:).

      LocalPart: [247046×2 table]
  Codistributor: [1×1 codistributor1d]

Worker 3:

This worker stores dt2(617616:988184,:).

      LocalPart: [370569×2 table]
  Codistributor: [1×1 codistributor1d]

Worker 4:

This worker stores dt2(988185:1235230,:).

      LocalPart: [247046×2 table]
  Codistributor: [1×1 codistributor1d]

Note that the data is partitioned approximately equally over the workers: the 10 datastore partitions cannot be split evenly across 4 workers, so two workers hold three partitions and two hold two. For more details on datastore, see What Is a Datastore?

For more details about workflows for big data, see Choose a Parallel Computing Solution.

Alternative Methods for Creating Distributed and Codistributed Arrays

If your data fits in the memory of your local machine, you can use distributed arrays to partition the data among your workers. Use the distributed function to create a distributed array in the MATLAB client, and store its data on the workers of the open parallel pool. A distributed array is distributed in one dimension, as evenly as possible along that dimension among the workers. You cannot control the details of distribution when creating a distributed array.

You can create a distributed array in several ways:

- Use the distributed function to distribute an existing array from the client workspace to the workers of an open parallel pool.
- Use a construction function such as rand, ones, or eye with the 'distributed' option to build the array directly on the workers.
- Create a codistributed array inside an spmd statement, and then access it as a distributed array outside the statement.

The first two techniques do not involve spmd in creating the array, but you can use spmd to manipulate arrays created this way. For example:

Create an array in the client workspace, and then make it a distributed array.

parpool('Processes',2)   % Create pool
W = ones(6,6);
W = distributed(W);      % Distribute to the workers
spmd
    T = W*2;             % Calculation performed on workers, in parallel.
                         % T and W are both codistributed arrays here.
end
T                        % View results in client.
whos                     % T and W are both distributed arrays here.
delete(gcp)              % Stop pool

Alternatively, you can use the codistributed function, which allows you to control more options such as dimensions and partitions, but is often more complicated. You can create a codistributed array by executing on the workers themselves, either inside an spmd statement or inside a communicating job. When creating a codistributed array, you can control all aspects of distribution, including dimensions and partitions.

The relationship between distributed and codistributed arrays is one of perspective. Codistributed arrays are partitioned among the workers from which you execute code to create or manipulate them. When you create a distributed array in the client, you can access it as a codistributed array inside an spmd statement. When you create a codistributed array in an spmd statement, you can access it as a distributed array in the client. Only spmd statements let you access the same array data from the two different perspectives.
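A minimal sketch of the two perspectives, assuming an open parallel pool (the variable names are illustrative, not from the original example):

```matlab
D = distributed(magic(8));        % 'distributed' from the client's perspective
spmd
    % The same data is 'codistributed' from the workers' perspective.
    tf = isa(D,'codistributed');  % true on every worker
    L  = getLocalPart(D);         % the rows of D stored on this worker
end
class(D)                          % back in the client: 'distributed'
```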

You can create a codistributed array in several ways: by partitioning a larger array, by building it from smaller arrays stored on each worker, or by using a construction function such as zeros or rand with a codistributor argument.

Create a codistributed array inside an spmd statement using a nondefault distribution scheme. First, define 1-D distribution along the third dimension, with 4 parts on worker 1, and 12 parts on worker 2. Then create a 3-by-3-by-16 array of zeros.

parpool('Processes',2)   % Create pool
spmd
    codist = codistributor1d(3,[4,12]);
    Z = zeros(3,3,16,codist);
    Z = Z + spmdIndex;
end
Z                        % View results in client.
                         % Z is a distributed array here.
delete(gcp)              % Stop pool

For more details on codistributed arrays, see Working with Codistributed Arrays.

See Also

distributed | codistributed | tall | datastore | spmd | repmat | eye | rand
