mapreduce - Programming technique for analyzing data sets that do
not fit in memory - MATLAB (original) (raw)
Programming technique for analyzing data sets that do not fit in memory
Syntax
Description
[outds](#bufooy6-outds) = mapreduce([ds](#bufooy6-ds),[mapfun](#bufooy6-mapfun),[reducefun](#bufooy6-reducefun))
applies map function mapfun
to input datastore ds
, and then passes the values associated with each unique key to reduce function reducefun
. The output datastore is a KeyValueDatastore
object that points to .mat
files in the current folder.
[outds](#bufooy6-outds) = mapreduce([ds](#bufooy6-ds),[mapfun](#bufooy6-mapfun),[reducefun](#bufooy6-reducefun),[mr](#bufooy6-mr))
optionally specifies the run-time configuration settings for mapreduce
. The mr
input is the result of a call to the mapreducer
function. Typically, this argument is used with Parallel Computing Toolbox™, MATLAB® Parallel Server™, or MATLAB Compiler™. For more information, see Speed Up and Deploy MapReduce Using Other Products.
[outds](#bufooy6-outds) = mapreduce(___,[Name,Value](#namevaluepairarguments))
specifies additional options with one or more Name,Value
pair arguments using any of the previous syntaxes. For example, you can specify 'OutputFolder'
followed by a character vector specifying a path to the output folder.
Examples
Count Flights by Airline
Use mapreduce
to count the number of flights made by each unique airline carrier in a data set.
Create a datastore using the airlinesmall.csv
data set. This 12-megabyte data set contains 29 columns of flight information for several airline carriers, including arrival and departure times. In this example, select UniqueCarrier
(airline name) as the variable of interest. Specify the 'TreatAsMissing'
name-value pair so that the datastore treats 'NA'
values as missing, and specify the 'MissingValue'
name-value pair to replace missing values with zeros.
ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA',... 'MissingValue',0); ds.SelectedVariableNames = 'UniqueCarrier'; ds.SelectedFormats = '%C';
Preview the data.
ans=8×1 table UniqueCarrier _____________
PS
PS
PS
PS
PS
PS
PS
PS
Run mapreduce
on the data. The map and reduce functions count the number of instances of each airline carrier name in each block of data, then combine those intermediate counts into a final count. This method leverages the intermediate sorting by unique key performed by mapreduce
. The functions countMapper
and countReducer
are included at the end of this script.
outds = mapreduce(ds, @countMapper, @countReducer);
MAPREDUCE PROGRESS *
Map 0% Reduce 0% Map 16% Reduce 0% Map 32% Reduce 0% Map 48% Reduce 0% Map 65% Reduce 0% Map 81% Reduce 0% Map 97% Reduce 0% Map 100% Reduce 0% Map 100% Reduce 10% Map 100% Reduce 21% Map 100% Reduce 31% Map 100% Reduce 41% Map 100% Reduce 52% Map 100% Reduce 62% Map 100% Reduce 72% Map 100% Reduce 83% Map 100% Reduce 93% Map 100% Reduce 100%
ans=29×2 table
Key Value
__________ _________
{'AA' } {[14930]}
{'AS' } {[ 2910]}
{'CO' } {[ 8138]}
{'DL' } {[16578]}
{'EA' } {[ 920]}
{'HP' } {[ 3660]}
{'ML (1)'} {[ 69]}
{'NW' } {[10349]}
{'PA (1)'} {[ 318]}
{'PI' } {[ 871]}
{'PS' } {[ 83]}
{'TW' } {[ 3805]}
{'UA' } {[13286]}
{'US' } {[13997]}
{'WN' } {[15931]}
{'AQ' } {[ 154]}
⋮
The map function countMapper
leverages the fact that the data is categorical. The countcats
and categories
functions are used on each block of the input data to generate key/value pairs of the airline name and associated count.
function countMapper(data, info, intermKV) % Counts unique airline carrier names in each block. a = data.UniqueCarrier; c = num2cell(countcats(a)); keys = categories(a); addmulti(intermKV, keys, c) end
The reduce function countReducer
reads in the intermediate data produced by the map function and adds together all of the counts to produce a single final count for each airline carrier.
function countReducer(key, intermValIter, outKV) % Combines counts from all blocks to produce final counts. count = 0; while hasnext(intermValIter) data = getnext(intermValIter); count = count + data; end add(outKV, key, count) end
Input Arguments
ds
— Input datastore
datastore object
Input datastore, specified as a datastore object. Use the datastore function to create a datastore object from your data set.
mapreduce
works only with datastores that are deterministic. That is, if you use read on the datastore, reset the datastore with reset, and then read the datastore again, then the data returned must be the same in both cases.mapreduce
calculations involving a datastore that is not deterministic can produce unpredictable results. See Select Datastore for File Format or Application for more information.
mapfun
— Function handle to map function
function handle
Function handle to map function. mapfun
receives blocks from input datastore ds
, and then uses the add
and addmulti
functions to add key-value pairs to an intermediate KeyValueStore object. The number of calls to the map function by mapreduce
is equal to the number of blocks in the datastore
(the number of blocks is determined by the ReadSize
property of the datastore).
The inputs to the map function are data
, info
, and intermKVStore
, which mapreduce
automatically creates and passes to the map function:
- The
data
andinfo
inputs are the result of a call to the read function of datastore, whichmapreduce
executes automatically before each call to the map function. intermKVStore
is the name of the intermediate KeyValueStore object to which the map function needs to add key-value pairs. If none of the calls to the map function add key-value pairs tointermKVStore
, thenmapreduce
does not call the reduce function and the output datastore is empty.
An example of a template for the map function is
function myMapper(data, info, intermKVStore) %do a calculation with the data block add(intermKVStore, key, value) end
Example: @myMapper
Data Types: function_handle
reducefun
— Function handle to reduce function
function handle
Function handle to reduce function. mapreduce
calls reducefun
once for each unique key added to the intermediate KeyValueStore by the map function. In each call, mapreduce
passes the values associated with the active key to reducefun
as a ValueIterator object. The reducefun
function loops through the values for each key using the hasnext
and getnext
functions. Then, after performing some calculation(s), it writes key-value pairs to the final output.
The inputs to the reduce function are intermKey
, intermValIter
, and outKVStore
, which mapreduce
automatically creates and passes to the reduce function:
intermKey
is the active key from the intermediate KeyValueStore object. Each call to the reduce function bymapreduce
specifies a new unique key from the keys in the intermediateKeyValueStore
object.intermValIter
is the ValueIterator associated with the active key,intermKey
. ThisValueIterator
object contains all of the values associated with the active key. Scroll through the values using thehasnext
andgetnext
functions.outKVStore
is the name for the finalKeyValueStore
object to which the reduce function needs to add key-value pairs.mapreduce
takes the output key-value pairs fromoutKVStore
and returns them in the output datastore,outds
, which is aKeyValueDatastore
object by default. If none of the calls to the reduce function add final key-value pairs tooutKVStore
, then the output datastore is empty.
An example of a template for the reduce function is
function myReducer(intermKey, intermValIter, outKVStore) while hasnext(intermValIter) X = getnext(intermValIter); %do a calculation with the current value, X end add(outKVStore, key, value) end
Example: @myReducer
Data Types: function_handle
mr
— Execution environment
MapReducer object
Execution environment, specified as a MapReducer object. mr
is the result of a call to the mapreducer function. The default mr
argument is a call to gcmr, which uses the default global execution environment for mapreduce
(in MATLAB the default is mapreducer(0)
, which returns a SerialMapReducer
object).
Name-Value Arguments
Specify optional pairs of arguments asName1=Value1,...,NameN=ValueN
, where Name
is the argument name and Value
is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name
in quotes.
Example: outds = mapreduce(ds, @mapfun, @reducefun, 'Display', 'off', 'OutputFolder', 'C:\Users\username\Desktop')
OutputType
— Type of datastore output
'Binary'
(default) | 'TabularText'
Type of datastore output, specified as 'Binary'
or'TabularText'
. The default setting of'Binary'
returns a KeyValueDatastore output datastore that points to binary (.mat
or.seq
) files in the output folder. The'TabularText'
option returns a tabularTextDatastore output datastore that points to .txt
files in the output folder.
The table provides the details for each of the output types.
'OutputType' | Type of datastore output | Datastore points to files of type | Values that the Reduce function can add | Keys that the Reduce function can add | Details |
---|---|---|---|---|---|
'Binary' (default) | KeyValueDatastore | .mat (or .seq when running against Hadoop®). | Any valid MATLAB object. | Character vectors, strings, or numeric scalars that are not NaN, complex, logical, or sparse. | N/A |
'TabularText' | TabularTextDatastore | .txt | Character vectors, strings, or numeric scalars that are not NaN, complex, logical, or sparse. | Character vectors, strings, or numeric scalars that are not NaN, complex, logical, or sparse. | File is UTF-8 encoded.Keys and values are tab (\t) separated.The row delimiter is \r\n on Windows®, and \n on Linux® and Mac. |
Data Types: char
| string
OutputFolder
— Destination folder of mapreduce
output
pwd
(default) | file path
Destination folder for mapreduce
output, specified as a file path. The default output folder is the current folder, pwd
. You can specify a different path with a fully qualified path or with a path relative to the current folder.
Example: mapreduce(..., 'OutputFolder', 'MyOutputFolder\Results')
specifies a file path relative to the current folder for the output.
Data Types: char
| string
Display
— Toggle for command line progress output
'on'
(default) | 'off'
Toggle for command line progress output, specified as 'on'
or 'off'
. The default is 'on'
, so that mapreduce
displays progress information in the command window during the map and reduce phases of execution.
Data Types: char
| string
Output Arguments
outds
— Output datastore
KeyValueDatastore
(default) | TabularTextDatastore
Output datastore, returned as a KeyValueDatastore
orTabularTextDatastore
object. By default,outds
is a KeyValueDatastore object that points to .mat
files in the current folder. Use theName,Value
pair arguments for'OutputType'
and 'OutputFolder'
to return a tabularTextDatastore object or change the location of the output files, respectively.
mapreduce
does not sort the key-value pairs in outds
. Their order may differ when using other products with mapreduce
.
To view the contents of outds
, use the preview
, read
, or readall
functions of datastore
.
Tips
- Debugging your
mapreduce
algorithms to examine how key-value pairs move through the different phases is always useful. To examine the movement of data, set breakpoints in your map and reduce functions. The breakpoints stop execution ofmapreduce
, allowing you to examine the current status of relevant variables, like theKeyValueStore
orValueIterator
. For more information, see Debug MapReduce Algorithms. - Some recommendations to optimize
mapreduce
performance on any platform are:- Minimize the number of calls to the map function. The easiest approach is to increase the value of the
ReadSize
property of the input datastore. The result is thatmapreduce
passes larger blocks of data to the map function, and the datastore depletes with fewer reads. - Decrease the amount of intermediate data sent between map and reduce functions. One approach is to use
unique
inside a map function to combine similar keys. See Compute Mean by Group Using MapReduce for an example of this technique.
- Minimize the number of calls to the map function. The easiest approach is to increase the value of the
Extended Capabilities
Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.
If you have Parallel Computing Toolbox installed, when you use mapreduce
, MATLAB automatically opens a parallel pool of workers on your local machine. MATLAB runs the calculations across the available workers. Control parallel behavior with the parallel preferences, including scaling up to a cluster.
For details, see Run mapreduce on a Parallel Pool (Parallel Computing Toolbox). For more general information about parallel computing, see Run MATLAB Functions with Automatic Parallel Support (Parallel Computing Toolbox).
Version History
Introduced in R2014b