mapreduce

Programming technique for analyzing data sets that do not fit in memory

Syntax

outds = mapreduce(ds,mapfun,reducefun)
outds = mapreduce(ds,mapfun,reducefun,mr)
outds = mapreduce(___,Name,Value)

Description

outds = mapreduce(ds,mapfun,reducefun) applies the map function mapfun to the input datastore ds, and then passes the values associated with each unique key to the reduce function reducefun. The output datastore is a KeyValueDatastore object that points to .mat files in the current folder.


outds = mapreduce(ds,mapfun,reducefun,mr) optionally specifies the run-time configuration settings for mapreduce. The mr input is the result of a call to the mapreducer function. Typically, this argument is used with Parallel Computing Toolbox™, MATLAB® Parallel Server™, or MATLAB Compiler™. For more information, see Speed Up and Deploy MapReduce Using Other Products.

outds = mapreduce(___,Name,Value) specifies additional options with one or more Name,Value pair arguments using any of the previous syntaxes. For example, you can specify 'OutputFolder' followed by a character vector specifying a path to the output folder.

Examples


Count Flights by Airline

Use mapreduce to count the number of flights made by each unique airline carrier in a data set.

Create a datastore using the airlinesmall.csv data set. This 12-megabyte data set contains 29 columns of flight information for several airline carriers, including arrival and departure times. In this example, select UniqueCarrier (airline name) as the variable of interest. Specify the 'TreatAsMissing' name-value pair so that the datastore treats 'NA' values as missing, and specify the 'MissingValue' name-value pair to replace missing values with zeros.

ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA',...
    'MissingValue',0);
ds.SelectedVariableNames = 'UniqueCarrier';
ds.SelectedFormats = '%C';

Preview the data.

preview(ds)

ans=8×1 table
    UniqueCarrier
    _____________

     PS      
     PS      
     PS      
     PS      
     PS      
     PS      
     PS      
     PS      

Run mapreduce on the data. The map and reduce functions count the number of instances of each airline carrier name in each block of data, then combine those intermediate counts into a final count. This method leverages the intermediate sorting by unique key performed by mapreduce. The functions countMapper and countReducer are included at the end of this script.

outds = mapreduce(ds, @countMapper, @countReducer);



Map   0% Reduce   0%
Map  16% Reduce   0%
Map  32% Reduce   0%
Map  48% Reduce   0%
Map  65% Reduce   0%
Map  81% Reduce   0%
Map  97% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce  10%
Map 100% Reduce  21%
Map 100% Reduce  31%
Map 100% Reduce  41%
Map 100% Reduce  52%
Map 100% Reduce  62%
Map 100% Reduce  72%
Map 100% Reduce  83%
Map 100% Reduce  93%
Map 100% Reduce 100%

Read the final key-value pairs from the output datastore.

readall(outds)

ans=29×2 table
       Key          Value
    __________    _________

{'AA'    }    {[14930]}
{'AS'    }    {[ 2910]}
{'CO'    }    {[ 8138]}
{'DL'    }    {[16578]}
{'EA'    }    {[  920]}
{'HP'    }    {[ 3660]}
{'ML (1)'}    {[   69]}
{'NW'    }    {[10349]}
{'PA (1)'}    {[  318]}
{'PI'    }    {[  871]}
{'PS'    }    {[   83]}
{'TW'    }    {[ 3805]}
{'UA'    }    {[13286]}
{'US'    }    {[13997]}
{'WN'    }    {[15931]}
{'AQ'    }    {[  154]}
  ⋮

The map function countMapper leverages the fact that the data is categorical. The countcats and categories functions are used on each block of the input data to generate key/value pairs of the airline name and associated count.

function countMapper(data, info, intermKV)
    % Counts unique airline carrier names in each block.
    a = data.UniqueCarrier;
    c = num2cell(countcats(a));
    keys = categories(a);
    addmulti(intermKV, keys, c)
end

The reduce function countReducer reads in the intermediate data produced by the map function and adds together all of the counts to produce a single final count for each airline carrier.

function countReducer(key, intermValIter, outKV)
    % Combines counts from all blocks to produce final counts.
    count = 0;
    while hasnext(intermValIter)
        data = getnext(intermValIter);
        count = count + data;
    end
    add(outKV, key, count)
end

Input Arguments


ds — Input datastore

datastore object

Input datastore, specified as a datastore object. Use the datastore function to create a datastore object from your data set.

mapreduce works only with datastores that are deterministic. That is, if you read from the datastore, reset it with reset, and then read it again, the data returned must be the same in both cases. mapreduce calculations involving a nondeterministic datastore can produce unpredictable results. See Select Datastore for File Format or Application for more information.
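A minimal sketch of how you might check this requirement by hand, assuming ds is a datastore you have already created:

t1 = read(ds);      % read the first block
reset(ds);          % rewind the datastore to its initial state
t2 = read(ds);      % read the first block again
isequal(t1, t2)     % returns logical 1 (true) for a deterministic datastore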

mapfun — Function handle to map function

function handle

Function handle to map function. mapfun receives blocks from input datastore ds, and then uses the add and addmulti functions to add key-value pairs to an intermediate KeyValueStore object. The number of calls to the map function by mapreduce is equal to the number of blocks in the datastore (the number of blocks is determined by the ReadSize property of the datastore).
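For example, with a tabular text datastore such as the one in the example above, you can adjust the block size, and therefore the number of map calls, through the ReadSize property (the row count here is arbitrary):

ds.ReadSize = 20000;    % each read, and therefore each call to the map function, covers up to 20000 rows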

The inputs to the map function are data, info, and intermKVStore, which mapreduce automatically creates and passes to the map function:

data is the block of data that mapreduce reads from the input datastore ds.
info is a structure with information about the block of data.
intermKVStore is the intermediate KeyValueStore object to which the map function adds key-value pairs, using the add or addmulti functions.

An example of a template for the map function is

function myMapper(data, info, intermKVStore)
    % do a calculation with the data block
    add(intermKVStore, key, value)
end

Example: @myMapper

Data Types: function_handle
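As a concrete illustration (not part of the flight-counting example), here is a sketch of a map function that records the per-block maximum of a numeric variable; the variable name MyVar is hypothetical. A matching reduce function appears under the reducefun argument below.

function maxMapper(data, info, intermKVStore)
    % data is a table read from the datastore; MyVar is an assumed numeric variable.
    blockMax = max(data.MyVar);
    % Store one value per block under a shared key so that the reduce
    % function later receives all block maxima together.
    add(intermKVStore, 'BlockMax', blockMax)
end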

reducefun — Function handle to reduce function

function handle

Function handle to reduce function. mapreduce calls reducefun once for each unique key added to the intermediate KeyValueStore by the map function. In each call, mapreduce passes the values associated with the active key to reducefun as a ValueIterator object. The reducefun function loops through the values for each key using the hasnext and getnext functions. Then, after performing calculations on those values, it writes key-value pairs to the final output.

The inputs to the reduce function are intermKey, intermValIter, and outKVStore, which mapreduce automatically creates and passes to the reduce function:

intermKey is the active key from the intermediate KeyValueStore object.
intermValIter is the ValueIterator object containing all of the values associated with intermKey.
outKVStore is the final KeyValueStore object to which the reduce function adds key-value pairs; mapreduce returns these pairs in the output datastore.

An example of a template for the reduce function is

function myReducer(intermKey, intermValIter, outKVStore)
    while hasnext(intermValIter)
        X = getnext(intermValIter);
        % do a calculation with the current value, X
    end
    add(outKVStore, intermKey, value)
end

Example: @myReducer

Data Types: function_handle
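Continuing the hypothetical maximum-finding sketch from the mapfun description above, a matching reduce function combines the per-block maxima stored under the 'BlockMax' key into a single overall maximum:

function maxReducer(intermKey, intermValIter, outKVStore)
    % intermKey is 'BlockMax'; the iterator yields one maximum per block.
    overallMax = -inf;
    while hasnext(intermValIter)
        overallMax = max(overallMax, getnext(intermValIter));
    end
    add(outKVStore, 'OverallMax', overallMax)
end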

mr — Execution environment

MapReducer object

Execution environment, specified as a MapReducer object. mr is the result of a call to the mapreducer function. The default mr argument is a call to gcmr, which uses the default global execution environment for mapreduce (in MATLAB the default is mapreducer(0), which returns a SerialMapReducer object).
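For example, this sketch makes the serial execution environment explicit and passes it to mapreduce (ds, countMapper, and countReducer are taken from the example above):

mr = mapreducer(0);                                       % serial execution in MATLAB
outds = mapreduce(ds, @countMapper, @countReducer, mr);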

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: outds = mapreduce(ds, @mapfun, @reducefun, 'Display', 'off', 'OutputFolder', 'C:\Users\username\Desktop')

OutputType — Type of datastore output

'Binary' (default) | 'TabularText'

Type of datastore output, specified as 'Binary' or 'TabularText'. The default setting of 'Binary' returns a KeyValueDatastore output datastore that points to binary (.mat or .seq) files in the output folder. The 'TabularText' option returns a TabularTextDatastore output datastore that points to .txt files in the output folder.

The details for each output type are:

'Binary' (default)
    Type of datastore output: KeyValueDatastore
    Datastore points to files of type: .mat (or .seq when running against Hadoop®)
    Values that the reduce function can add: any valid MATLAB object
    Keys that the reduce function can add: character vectors, strings, or numeric scalars that are not NaN, complex, logical, or sparse
    Details: N/A

'TabularText'
    Type of datastore output: TabularTextDatastore
    Datastore points to files of type: .txt
    Values that the reduce function can add: character vectors, strings, or numeric scalars that are not NaN, complex, logical, or sparse
    Keys that the reduce function can add: character vectors, strings, or numeric scalars that are not NaN, complex, logical, or sparse
    Details: The file is UTF-8 encoded. Keys and values are tab (\t) separated. The row delimiter is \r\n on Windows®, and \n on Linux® and Mac.

Data Types: char | string
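For example, a minimal sketch (reusing the map and reduce functions from the example above) that writes the result as tab-separated text files instead of .mat files:

outds = mapreduce(ds, @countMapper, @countReducer, 'OutputType', 'TabularText');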

OutputFolder — Destination folder of mapreduce output

pwd (default) | file path

Destination folder for mapreduce output, specified as a file path. The default output folder is the current folder, pwd. You can specify a different path with a fully qualified path or with a path relative to the current folder.

Example: mapreduce(..., 'OutputFolder', 'MyOutputFolder\Results') specifies a file path relative to the current folder for the output.

Data Types: char | string

Display — Toggle for command line progress output

'on' (default) | 'off'

Toggle for command line progress output, specified as 'on' or 'off'. The default is 'on', so that mapreduce displays progress information in the command window during the map and reduce phases of execution.

Data Types: char | string

Output Arguments


outds — Output datastore

KeyValueDatastore (default) | TabularTextDatastore

Output datastore, returned as a KeyValueDatastore or TabularTextDatastore object. By default, outds is a KeyValueDatastore object that points to .mat files in the current folder. Use the Name,Value pair arguments 'OutputType' and 'OutputFolder' to return a TabularTextDatastore object or change the location of the output files, respectively.

mapreduce does not sort the key-value pairs in outds. Their order may differ when using other products with mapreduce.

To view the contents of outds, use the preview, read, or readall functions of datastore.
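A brief usage sketch, continuing from a mapreduce call that returned outds:

preview(outds)             % view the first few key-value pairs without reading everything
result = readall(outds);   % read the full result into a table with Key and Value columns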


Extended Capabilities

Automatic Parallel Support

Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.

If you have Parallel Computing Toolbox installed, when you use mapreduce, MATLAB automatically opens a parallel pool of workers on your local machine. MATLAB runs the calculations across the available workers. Control parallel behavior with the parallel preferences, including scaling up to a cluster.

For details, see Run mapreduce on a Parallel Pool (Parallel Computing Toolbox). For more general information about parallel computing, see Run MATLAB Functions with Automatic Parallel Support (Parallel Computing Toolbox).
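A minimal sketch, assuming Parallel Computing Toolbox is installed, that runs mapreduce on an explicitly chosen parallel pool:

p = gcp;                      % get the current parallel pool, starting one if necessary
mr = mapreducer(p);           % execution environment backed by that pool
outds = mapreduce(ds, @countMapper, @countReducer, mr);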

Version History

Introduced in R2014b