mapreduce - Programming technique for analyzing data sets that do

not fit in memory - MATLAB (original) (raw)

Programming technique for analyzing data sets that do not fit in memory

Syntax

Description

[outds](#bufooy6-outds) = mapreduce([ds](#bufooy6-ds),[mapfun](#bufooy6-mapfun),[reducefun](#bufooy6-reducefun)) applies map function mapfun to input datastore ds, and then passes the values associated with each unique key to reduce function reducefun. The output datastore is a KeyValueDatastore object that points to .mat files in the current folder.

example

[outds](#bufooy6-outds) = mapreduce([ds](#bufooy6-ds),[mapfun](#bufooy6-mapfun),[reducefun](#bufooy6-reducefun),[mr](#bufooy6-mr)) optionally specifies the run-time configuration settings for mapreduce. The mr input is the result of a call to the mapreducer function. Typically, this argument is used with Parallel Computing Toolbox™, MATLAB® Parallel Server™, or MATLAB Compiler™. For more information, see Speed Up and Deploy MapReduce Using Other Products.

[outds](#bufooy6-outds) = mapreduce(___,[Name,Value](#namevaluepairarguments)) specifies additional options with one or more Name,Value pair arguments using any of the previous syntaxes. For example, you can specify 'OutputFolder' followed by a character vector specifying a path to the output folder.

Examples

collapse all

Count Flights by Airline

Use mapreduce to count the number of flights made by each unique airline carrier in a data set.

Create a datastore using the airlinesmall.csv data set. This 12-megabyte data set contains 29 columns of flight information for several airline carriers, including arrival and departure times. In this example, select UniqueCarrier (airline name) as the variable of interest. Specify the 'TreatAsMissing' name-value pair so that the datastore treats 'NA' values as missing, and specify the 'MissingValue' name-value pair to replace missing values with zeros.

ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA',... 'MissingValue',0); ds.SelectedVariableNames = 'UniqueCarrier'; ds.SelectedFormats = '%C';

Preview the data.

ans=8×1 table UniqueCarrier _____________

Run mapreduce on the data. The map and reduce functions count the number of instances of each airline carrier name in each block of data, then combine those intermediate counts into a final count. This method leverages the intermediate sorting by unique key performed by mapreduce. The functions countMapper and countReducer are included at the end of this script.

outds = mapreduce(ds, @countMapper, @countReducer);

```
 MAPREDUCE PROGRESS      *
```

Map 0% Reduce 0% Map 16% Reduce 0% Map 32% Reduce 0% Map 48% Reduce 0% Map 65% Reduce 0% Map 81% Reduce 0% Map 97% Reduce 0% Map 100% Reduce 0% Map 100% Reduce 10% Map 100% Reduce 21% Map 100% Reduce 31% Map 100% Reduce 41% Map 100% Reduce 52% Map 100% Reduce 62% Map 100% Reduce 72% Map 100% Reduce 83% Map 100% Reduce 93% Map 100% Reduce 100%

ans=29×2 table Key Value
__________ _________

{'AA'    }    {[14930]}
{'AS'    }    {[ 2910]}
{'CO'    }    {[ 8138]}
{'DL'    }    {[16578]}
{'EA'    }    {[  920]}
{'HP'    }    {[ 3660]}
{'ML (1)'}    {[   69]}
{'NW'    }    {[10349]}
{'PA (1)'}    {[  318]}
{'PI'    }    {[  871]}
{'PS'    }    {[   83]}
{'TW'    }    {[ 3805]}
{'UA'    }    {[13286]}
{'US'    }    {[13997]}
{'WN'    }    {[15931]}
{'AQ'    }    {[  154]}
  ⋮

The map function countMapper leverages the fact that the data is categorical. The countcats and categories functions are used on each block of the input data to generate key/value pairs of the airline name and associated count.

function countMapper(data, info, intermKV) % Counts unique airline carrier names in each block. a = data.UniqueCarrier; c = num2cell(countcats(a)); keys = categories(a); addmulti(intermKV, keys, c) end

The reduce function countReducer reads in the intermediate data produced by the map function and adds together all of the counts to produce a single final count for each airline carrier.

function countReducer(key, intermValIter, outKV) % Combines counts from all blocks to produce final counts. count = 0; while hasnext(intermValIter) data = getnext(intermValIter); count = count + data; end add(outKV, key, count) end

Input Arguments

collapse all

`ds` — Input datastore

datastore object

Input datastore, specified as a datastore object. Use the datastore function to create a datastore object from your data set.

mapreduce works only with datastores that are deterministic. That is, if you use read on the datastore, reset the datastore with reset, and then read the datastore again, then the data returned must be the same in both cases.mapreduce calculations involving a datastore that is not deterministic can produce unpredictable results. See Select Datastore for File Format or Application for more information.

`mapfun` — Function handle to map function

function handle

Function handle to map function. mapfun receives blocks from input datastore ds, and then uses the add and addmulti functions to add key-value pairs to an intermediate KeyValueStore object. The number of calls to the map function by mapreduce is equal to the number of blocks in the datastore (the number of blocks is determined by the ReadSize property of the datastore).

The inputs to the map function are data, info, and intermKVStore, which mapreduce automatically creates and passes to the map function:

The data and info inputs are the result of a call to the read function of datastore, which mapreduce executes automatically before each call to the map function.
intermKVStore is the name of the intermediate KeyValueStore object to which the map function needs to add key-value pairs. If none of the calls to the map function add key-value pairs to intermKVStore, then mapreduce does not call the reduce function and the output datastore is empty.

An example of a template for the map function is

function myMapper(data, info, intermKVStore) %do a calculation with the data block add(intermKVStore, key, value) end

Example: @myMapper

Data Types: function_handle

`reducefun` — Function handle to reduce function

function handle

Function handle to reduce function. mapreduce calls reducefun once for each unique key added to the intermediate KeyValueStore by the map function. In each call, mapreduce passes the values associated with the active key to reducefun as a ValueIterator object. The reducefun function loops through the values for each key using the hasnext and getnext functions. Then, after performing some calculation(s), it writes key-value pairs to the final output.

The inputs to the reduce function are intermKey, intermValIter, and outKVStore, which mapreduce automatically creates and passes to the reduce function:

intermKey is the active key from the intermediate KeyValueStore object. Each call to the reduce function by mapreduce specifies a new unique key from the keys in the intermediate KeyValueStore object.
intermValIter is the ValueIterator associated with the active key, intermKey. This ValueIterator object contains all of the values associated with the active key. Scroll through the values using the hasnext and getnext functions.
outKVStore is the name for the final KeyValueStore object to which the reduce function needs to add key-value pairs. mapreduce takes the output key-value pairs from outKVStore and returns them in the output datastore, outds, which is a KeyValueDatastore object by default. If none of the calls to the reduce function add final key-value pairs to outKVStore, then the output datastore is empty.

An example of a template for the reduce function is

function myReducer(intermKey, intermValIter, outKVStore) while hasnext(intermValIter) X = getnext(intermValIter); %do a calculation with the current value, X end add(outKVStore, key, value) end

Example: @myReducer

Data Types: function_handle

`mr` — Execution environment

MapReducer object

Execution environment, specified as a MapReducer object. mr is the result of a call to the mapreducer function. The default mr argument is a call to gcmr, which uses the default global execution environment for mapreduce (in MATLAB the default is mapreducer(0), which returns a SerialMapReducer object).

Name-Value Arguments

Specify optional pairs of arguments asName1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: outds = mapreduce(ds, @mapfun, @reducefun, 'Display', 'off', 'OutputFolder', 'C:\Users\username\Desktop')

`OutputType` — Type of datastore output

'Binary' (default) | 'TabularText'

Type of datastore output, specified as 'Binary' or'TabularText'. The default setting of'Binary' returns a KeyValueDatastore output datastore that points to binary (.mat or.seq) files in the output folder. The'TabularText' option returns a tabularTextDatastore output datastore that points to .txt files in the output folder.

The table provides the details for each of the output types.

'OutputType'	Type of datastore output	Datastore points to files of type	Values that the Reduce function can add	Keys that the Reduce function can add	Details
'Binary' (default)	KeyValueDatastore	.mat (or .seq when running against Hadoop®).	Any valid MATLAB object.	Character vectors, strings, or numeric scalars that are not NaN, complex, logical, or sparse.	N/A
'TabularText'	TabularTextDatastore	.txt	Character vectors, strings, or numeric scalars that are not NaN, complex, logical, or sparse.	Character vectors, strings, or numeric scalars that are not NaN, complex, logical, or sparse.	File is UTF-8 encoded.Keys and values are tab (\t) separated.The row delimiter is \r\n on Windows®, and \n on Linux® and Mac.

Data Types: char | string

`OutputFolder` — Destination folder of `mapreduce` output

pwd (default) | file path

Destination folder for mapreduce output, specified as a file path. The default output folder is the current folder, pwd. You can specify a different path with a fully qualified path or with a path relative to the current folder.

Example: mapreduce(..., 'OutputFolder', 'MyOutputFolder\Results') specifies a file path relative to the current folder for the output.

Data Types: char | string

`Display` — Toggle for command line progress output

'on' (default) | 'off'

Toggle for command line progress output, specified as 'on' or 'off'. The default is 'on', so that mapreduce displays progress information in the command window during the map and reduce phases of execution.

Data Types: char | string

Output Arguments

collapse all

`outds` — Output datastore

KeyValueDatastore (default) | TabularTextDatastore

Output datastore, returned as a KeyValueDatastore orTabularTextDatastore object. By default,outds is a KeyValueDatastore object that points to .mat files in the current folder. Use theName,Value pair arguments for'OutputType' and 'OutputFolder' to return a tabularTextDatastore object or change the location of the output files, respectively.

mapreduce does not sort the key-value pairs in outds. Their order may differ when using other products with mapreduce.

To view the contents of outds, use the preview, read, or readall functions of datastore.

Tips

Debugging your mapreduce algorithms to examine how key-value pairs move through the different phases is always useful. To examine the movement of data, set breakpoints in your map and reduce functions. The breakpoints stop execution of mapreduce, allowing you to examine the current status of relevant variables, like the KeyValueStore or ValueIterator. For more information, see Debug MapReduce Algorithms.
Some recommendations to optimize mapreduce performance on any platform are:
- Minimize the number of calls to the map function. The easiest approach is to increase the value of the ReadSize property of the input datastore. The result is that mapreduce passes larger blocks of data to the map function, and the datastore depletes with fewer reads.
- Decrease the amount of intermediate data sent between map and reduce functions. One approach is to use unique inside a map function to combine similar keys. See Compute Mean by Group Using MapReduce for an example of this technique.

Extended Capabilities

Automatic Parallel Support

Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.

If you have Parallel Computing Toolbox installed, when you use mapreduce, MATLAB automatically opens a parallel pool of workers on your local machine. MATLAB runs the calculations across the available workers. Control parallel behavior with the parallel preferences, including scaling up to a cluster.

For details, see Run mapreduce on a Parallel Pool (Parallel Computing Toolbox). For more general information about parallel computing, see Run MATLAB Functions with Automatic Parallel Support (Parallel Computing Toolbox).

Version History

Introduced in R2014b

mapreduce - Programming technique for analyzing data sets that do

Syntax

Description

Examples

Count Flights by Airline

Input Arguments

ds — Input datastore

mapfun — Function handle to map function

reducefun — Function handle to reduce function

mr — Execution environment

Name-Value Arguments

OutputType — Type of datastore output

OutputFolder — Destination folder of mapreduce output

Display — Toggle for command line progress output

Output Arguments

outds — Output datastore

Tips

Extended Capabilities

Automatic Parallel Support

Version History

`ds` — Input datastore

`mapfun` — Function handle to map function

`reducefun` — Function handle to reduce function

`mr` — Execution environment

`OutputType` — Type of datastore output

`OutputFolder` — Destination folder of `mapreduce` output

`Display` — Toggle for command line progress output

`outds` — Output datastore