Include MATLAB Map and Reduce Functions into Hadoop Job

Supported Platform: Linux® only.

This example shows you how to use the mcc command to create a deployable archive consisting of MATLAB® map and reduce functions and then pass the deployable archive as a payload argument to a job submitted to a Hadoop® cluster.

Goal: Calculate the maximum arrival delay of an airline from the given dataset.

Dataset: airlinesmall.csv
Description: Airline departure and arrival information from 1987-2008.
Location: To download the airlinesmall.csv file, at the MATLAB command prompt type:
    setupExample("matlab/AddKeysValuesExample", pwd)
Ignore the AddKeysValuesExample.mlx live script file that is automatically downloaded along with the airlinesmall.csv file.

Note

This workflow requires the explicit creation of a Hadoop settings file. Follow the example for details.

Prerequisites

  1. Start this example by creating a new work folder that is visible to the MATLAB search path.
  2. Before starting MATLAB, at a terminal, set the environment variable HADOOP_PREFIX to point to the Hadoop installation folder. For example:
    Shell         Command
    csh / tcsh    % setenv HADOOP_PREFIX /usr/lib/hadoop
    bash          $ export HADOOP_PREFIX=/usr/lib/hadoop
    Note
    This example uses /usr/lib/hadoop as the directory where Hadoop is installed. Your Hadoop installation directory may be different.
    If you forget to set the HADOOP_PREFIX environment variable before starting MATLAB, set it with the MATLAB function setenv at the MATLAB command prompt as soon as you start MATLAB. For example:
    setenv('HADOOP_PREFIX','/usr/lib/hadoop')
  3. Install the MATLAB Runtime in a folder that is accessible by every worker node in the Hadoop cluster. This example uses /usr/local/MATLAB/MATLAB_Runtime/R2025a as the location of the MATLAB Runtime folder.
    If you don’t have the MATLAB Runtime, you can download it from the website at: https://www.mathworks.com/products/compiler/mcr.
    Note
    For information about the MATLAB Runtime version numbers that correspond to MATLAB releases, see this list.
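  4. Copy the map function maxArrivalDelayMapper.m from the /usr/local/MATLAB/R2025a/toolbox/matlab/demos folder to the work folder.
    maxArrivalDelayMapper.m
    function maxArrivalDelayMapper(data, info, intermKVStore)
    % Find the largest arrival delay in the current chunk of data.
    partMax = max(data.ArrDelay);
    % Store it under one key so that a single reduce call receives every partial maximum.
    add(intermKVStore,'PartialMaxArrivalDelay',partMax);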
    For more information, see Write a Map Function.
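  5. Copy the reduce function maxArrivalDelayReducer.m from the _`matlabroot`_/toolbox/matlab/demos folder to the work folder.
    maxArrivalDelayReducer.m
    function maxArrivalDelayReducer(intermKey, intermValIter, outKVStore)
    % Start below any possible delay so that the first value always replaces it.
    maxVal = -inf;
    % Keep the largest of the partial maxima emitted by the map function.
    while hasnext(intermValIter)
        maxVal = max(getnext(intermValIter), maxVal);
    end
    % Write the overall maximum as the single output key-value pair.
    add(outKVStore,'MaxArrivalDelay',maxVal);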
    For more information, see Write a Reduce Function.
  6. Create the directory /user/_`<username>`_/datasets on HDFS™ and copy the file airlinesmall.csv to that directory. Here _`<username>`_ refers to your user name in HDFS.
    $ ./hadoop fs -copyFromLocal airlinesmall.csv hdfs://host:54310/user/<username>/datasets
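
Step 6 needs the target directory to exist on HDFS before the file is copied. The following is a minimal sketch of that staging step, not part of the original example. It assumes the hadoop launcher is in the bin folder under HADOOP_PREFIX and that your HDFS user name matches your local user name. The variables HDFS_URI, HDFS_USER, and DRY_RUN and the helper functions are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: stage airlinesmall.csv on HDFS as described in step 6.
# HDFS_URI, HDFS_USER, and DRY_RUN are illustrative names, not part of the example.
set -eu

HADOOP_PREFIX="${HADOOP_PREFIX:-/usr/lib/hadoop}"   # installation folder from step 2
HDFS_URI="${HDFS_URI:-hdfs://host:54310}"           # placeholder NameNode address
HDFS_USER="${HDFS_USER:-$(id -un)}"                 # assumed to match the local user name
DRY_RUN="${DRY_RUN:-1}"                             # 1 = print commands, 0 = run them

# Print the command in dry-run mode; otherwise execute it.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

stage_dataset() {
  target="${HDFS_URI}/user/${HDFS_USER}/datasets"
  # -p creates missing parent directories and does not fail if the directory exists.
  run "${HADOOP_PREFIX}/bin/hadoop" fs -mkdir -p "${target}"
  run "${HADOOP_PREFIX}/bin/hadoop" fs -copyFromLocal airlinesmall.csv "${target}"
}

stage_dataset
```

By default the script only prints the two commands so that you can check the paths. Run it with DRY_RUN=0 to execute them against the cluster.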

Procedure

  1. Start MATLAB and verify that the HADOOP_PREFIX environment variable has been set. At the command prompt, type:

    getenv('HADOOP_PREFIX')
    If ans is empty, review the Prerequisites section above to see how you can set the HADOOP_PREFIX environment variable.

  2. Create a datastore to the file airlinesmall.csv and save it to a .mat file. This datastore object is meant to capture the structure of your actual dataset on HDFS.
    ds = datastore('airlinesmall.csv','TreatAsMissing','NA',...
    'SelectedVariableNames','ArrDelay','ReadSize',1000);
    save('infoAboutDataset.mat','ds')
    In most cases, you start by working on a small sample dataset on a local machine that is representative of the actual dataset on the cluster. This sample dataset has the same structure and variables as the actual dataset on the cluster. By creating a datastore object to the dataset on your local machine, you take a snapshot of that structure. With access to this datastore object, a Hadoop job executing on the cluster knows how to access and process the actual dataset residing on HDFS.
    Note
    In this example, the sample dataset (local) and the actual dataset on HDFS are the same.
  3. Create a configuration file (config.txt) that specifies the input type of the data, the format of the data as captured by the datastore created in the previous step, the output type of the data, the name of the map function, and the name of the reduce function.
    mw.ds.in.type = tabulartext
    mw.ds.in.format = infoAboutDataset.mat
    mw.ds.out.type = keyvalue
    mw.mapper = maxArrivalDelayMapper
    mw.reducer = maxArrivalDelayReducer
    For more information, see Configuration File for Creating Deployable Archive Using the mcc Command.
  4. Use the mcc command with the -H and -W flags to create a deployable archive. Note that the mcc command cannot package the results in an installer. The command must be entered as a single line.
    mcc -H -W 'hadoop:maxArrivalDelay,CONFIG:config.txt' maxArrivalDelayMapper.m maxArrivalDelayReducer.m -a infoAboutDataset.mat
    For more information, see mcc.
    MATLAB Compiler™ creates a shell script run_maxarrivaldelay.sh, a deployable archive maxArrivalDelay.ctf, and a log file mccExcludedfiles.log.
  5. Incorporate the deployable archive containing MATLAB map and reduce functions into a Hadoop MapReduce job from a Linux shell using the following command:
    hadoop \
    jar /usr/local/MATLAB/MATLAB_Runtime/R2025a/toolbox/mlhadoop/jar/a2.2.0/mwmapreduce.jar \
    com.mathworks.hadoop.MWMapReduceDriver \
    -D mw.mcrroot=/usr/local/MATLAB/MATLAB_Runtime/R2025a \
    maxArrivalDelay.ctf \
    hdfs://host:54310/user/<username>/datasets/airlinesmall.csv \
    hdfs://host:54310/user/<username>/results
  6. To examine the results, switch to the MATLAB desktop and create a datastore to the results on HDFS. You can then view the results using the read method.
    d = datastore('hdfs:///user/<username>/results/part*');
    read(d)
    ans =
        Key                  Value
        'MaxArrivalDelay'    [1014]
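
The job submission command in the procedure repeats the MATLAB Runtime location and the HDFS paths several times, so it can help to assemble it from variables. The following is a minimal sketch under stated assumptions, not part of the original example. MCR_ROOT, HDFS_URI, HDFS_USER, and DRY_RUN and the helper functions are hypothetical names. The jar path, driver class, and argument order are copied from the command shown in the procedure.

```shell
#!/usr/bin/env bash
# Sketch only: assemble the Hadoop job submission from a few variables.
# MCR_ROOT, HDFS_URI, HDFS_USER, and DRY_RUN are illustrative names.
set -eu

MCR_ROOT="${MCR_ROOT:-/usr/local/MATLAB/MATLAB_Runtime/R2025a}"  # MATLAB Runtime folder
HDFS_URI="${HDFS_URI:-hdfs://host:54310}"                        # placeholder NameNode address
HDFS_USER="${HDFS_USER:-$(id -un)}"                              # assumed HDFS user name
DRY_RUN="${DRY_RUN:-1}"                                          # 1 = print, 0 = run

# Print the command in dry-run mode; otherwise execute it.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

submit_job() {
  # Arguments follow the order used in the procedure:
  # driver class, runtime root, archive, input file, output folder.
  run hadoop jar "${MCR_ROOT}/toolbox/mlhadoop/jar/a2.2.0/mwmapreduce.jar" \
    com.mathworks.hadoop.MWMapReduceDriver \
    -D "mw.mcrroot=${MCR_ROOT}" \
    maxArrivalDelay.ctf \
    "${HDFS_URI}/user/${HDFS_USER}/datasets/airlinesmall.csv" \
    "${HDFS_URI}/user/${HDFS_USER}/results"
}

submit_job
```

With the default DRY_RUN=1 the script prints the assembled command on one line for review. Set DRY_RUN=0 to submit the job from the folder that contains maxArrivalDelay.ctf.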

To learn more about using the map and reduce functions, see Getting Started with MapReduce.

See Also

datastore | TabularTextDatastore | KeyValueDatastore | mcc