Run mapreduce on a Hadoop Cluster - MATLAB & Simulink

Cluster Preparation

Before you can run mapreduce on a Hadoop® cluster, make sure that the cluster and client machine are properly configured. Consult your system administrator, or see Configure a Hadoop Cluster (MATLAB Parallel Server).

Output Format and Order

When running mapreduce on a Hadoop cluster with binary output (the default), the resulting KeyValueDatastore points to Hadoop Sequence files, instead of the binary MAT-files generated by mapreduce in other environments. For more information, see the 'OutputType' argument description on the mapreduce reference page.

When running mapreduce on a Hadoop cluster, the order of the key-value pairs in the output can differ from the order produced when running mapreduce in other environments. If your application depends on the arrangement of data in the output, you must sort the data according to your own requirements.
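As a minimal sketch of such a sort, the snippet below orders a result table by key. The table contents are hypothetical stand-in data, built by hand so the snippet runs without a cluster; it assumes the output has been read into a table with Key and Value variables, as readall returns for a KeyValueDatastore.

```matlab
% Hypothetical stand-in for the table returned by readall(outds).
% The keys and values here are illustrative, not real results.
result = table({'UA';'AA';'DL'}, {12.3; 4.5; 7.8}, ...
    'VariableNames', {'Key','Value'});

% Sort rows alphabetically by key so downstream code sees a stable order.
sorted = sortrows(result, 'Key');
disp(sorted)
```

Sorting after reading keeps the map and reduce functions unchanged, so the same code gives the same ordering whether the job ran locally or on Hadoop.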

Calculate Mean Delay

This example shows how to modify the MATLAB® example for calculating mean airline delays to run on a Hadoop cluster.

First, you must set environment variables and cluster properties as appropriate for your specific Hadoop configuration. See your system administrator for the values for these and other properties necessary for submitting jobs to your cluster.

setenv('HADOOP_HOME', '/path/to/hadoop/install')
cluster = parallel.cluster.Hadoop;
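As an alternative sketch, assuming Parallel Computing Toolbox is installed, the Hadoop location can be given as a property of the cluster object rather than through an environment variable. Both paths below are placeholders; the correct values depend on your site and should come from your system administrator.

```matlab
% Placeholder paths: replace with the values for your own installation.
cluster = parallel.cluster.Hadoop( ...
    'HadoopInstallFolder', '/path/to/hadoop/install');

% Location of MATLAB on the cluster worker nodes (placeholder path).
cluster.ClusterMatlabRoot = '/path/to/matlab/on/cluster';

% Display the object to review the properties before submitting work.
disp(cluster)
```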

Note

The specified outputFolder must not already exist. The mapreduce output from a Hadoop cluster cannot overwrite an existing folder.

You will lose your data if the mapreducer is changed or deleted.
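One way to avoid colliding with an existing folder is to build a fresh folder name for every run. The sketch below appends a timestamp to a base path; the base path hdfs:///home/myuser/out_ is an assumption chosen to resemble the folder used in this example, not a required location.

```matlab
% Timestamp such as 20240131_142500 (year, month, day, hour, minute, second).
stamp = char(datetime('now', 'Format', 'yyyyMMdd_HHmmss'));

% Hypothetical base path; adjust to a location you can write to.
outputFolder = ['hdfs:///home/myuser/out_' stamp];
disp(outputFolder)
```

Two runs started within the same second would still produce the same name, so this is a convenience, not a guarantee of uniqueness.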

Create a MapReducer object to specify that mapreduce should use your Hadoop cluster.

mr = mapreducer(cluster);

Create and preview the datastore. The data set is available in `matlabroot`/toolbox/matlab/demos.

ds = datastore('airlinesmall.csv','TreatAsMissing','NA',...
    'SelectedVariableNames','ArrDelay','ReadSize',1000);
preview(ds)

ArrDelay
________

 8
 8
21
13
 4
59
 3
11

Next, specify your output folder and call mapreduce to execute on the Hadoop cluster specified by mr. The call returns an output datastore (outds in the general form, meanDelay in this example). The map and reduce functions are available in `matlabroot`/toolbox/matlab/demos.

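outputFolder = 'hdfs:///home/myuser/out1';
% General form, using your own map and reduce functions:
%   outds = mapreduce(ds,@myMapperFcn,@myReducerFcn,...
%       'OutputFolder',outputFolder);
% Shown as a comment because the output folder must not already exist,
% so only one call can write to outputFolder.
meanDelay = mapreduce(ds,@meanArrivalDelayMapper,...
    @meanArrivalDelayReducer,mr,...
    'OutputFolder',outputFolder)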

Parallel mapreduce execution on the Hadoop cluster:

```
*      MAPREDUCE PROGRESS      *
Map   0% Reduce   0%
Map  66% Reduce   0%
Map 100% Reduce  66%
Map 100% Reduce 100%
```

meanDelay =

KeyValueDatastore with properties:

   Files: {
          ' .../tmp/myuser/tpc00621b1_4eef_4abc_8078_646aa916e7d9/part0.seq'
          }
ReadSize: 1 key-value pairs
FileType: 'seq'

Read the result.

readall(meanDelay)

       Key             Value
__________________    ________

'MeanArrivalDelay'    [7.1201]
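A single mean can be produced from data that is processed in separate chunks as long as each chunk contributes a partial count and a partial sum, which are combined at the end. The sketch below illustrates that pattern in plain MATLAB. It is not the shipped meanArrivalDelayMapper or meanArrivalDelayReducer code; the chunk boundaries are hypothetical, and the values are the eight previewed delays plus one NaN added to show missing-value handling.

```matlab
% Hypothetical chunks, standing in for the blocks passed to a map function.
chunks = {[8 8 21 NaN], [13 4 59], [3 11]};

% "Map" step: each chunk yields [count of non-missing values, their sum].
partials = cellfun(@(x) [nnz(~isnan(x)), sum(x, 'omitnan')], ...
    chunks, 'UniformOutput', false);

% "Reduce" step: add the partial counts and sums, then divide once.
totals = sum(vertcat(partials{:}), 1);
meanDelay = totals(2) / totals(1);
disp(meanDelay)
```

Averaging the per-chunk means instead would weight small chunks too heavily, which is why the count is carried alongside the sum.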

Although for demonstration purposes this example uses a local data set, it is likely when using Hadoop that your data set is stored in an HDFS™ file system. Likewise, you might be required to store the mapreduce output in HDFS. For details about accessing HDFS in MATLAB, see Work with Remote Data.
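As a hedged sketch of the HDFS case, the datastore from this example can point at an hdfs:/// location instead of a local file. The file path below is hypothetical, and the snippet assumes that HADOOP_HOME is already set as shown earlier and that the file has been copied into HDFS.

```matlab
% Hypothetical HDFS location of the data set; replace with your own path.
hdfsFile = 'hdfs:///home/myuser/data/airlinesmall.csv';

% Same datastore options as the local example, reading from HDFS instead.
ds = datastore(hdfsFile,'TreatAsMissing','NA',...
    'SelectedVariableNames','ArrDelay','ReadSize',1000);
preview(ds)
```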
