Analyze Large Data in Database Using MapReduce - MATLAB & Simulink Example (original) (raw)

Main Content

This example determines the mean arrival delay of a large set of flight data that is stored in a database. You can access large data sets using a [databaseDatastore](matlab.io.datastore.databasedatastore.html) object with Database Toolbox™. After creating a DatabaseDatastore object, you can write a MapReduce algorithm that defines the chunking and reduction of the data. Alternatively, you can use a tall array to run algorithms on large data sets.

The DatabaseDatastore object does not support using a parallel pool with Parallel Computing Toolbox™ installed. To analyze data using tall arrays or run MapReduce algorithms, set the global execution environment to be the local MATLAB® session.

This example uses a preconfigured JDBC data source to create the database connection. For more information, see the [configureJDBCDataSource](configurejdbcdatasource.html) function.

Create DatabaseDatastore Object

Set the global execution environment to be the local MATLAB® session.

The file airlinesmall.csv contains a large set of flight data. Load this file into the Microsoft® SQL Server® database table airlinesmall. This table contains 123,523 records.

Create a database connection to the JDBC data source MSSQLServerJDBCAuth. This data source configures a JDBC driver to a Microsoft® SQL Server® database with Windows® authentication. Specify a blank user name and password.

datasource = "MSSQLServerJDBCAuth"; username = ""; password = ""; conn = database(datasource,username,password);

Create a DatabaseDatastore object using the database connection and an SQL query. This SQL query retrieves arrival-delay data from the airlinesmall table.

sqlquery = 'select ArrDelay from airlinesmall';

dbds = databaseDatastore(conn,sqlquery);

Define Mapper and Reducer Functions

To process large data sets in chunks, you can write your own mapper function. For each chunk in this example, use meanArrivalDelayMapper.m to:

The meanArrivalDelayMapper.m file contains this code.

function meanArrivalDelayMapper (data, info, intermKVStore) % Mapper function for the MeanMapReduceExample.

% Copyright 2014 The MathWorks, Inc.

% Data is an n-by-1 table of the ArrDelay. Remove missing value first: data(isnan(data.ArrDelay),:) = [];

% Record the partial counts and sums and the reducer will accumulate them. partCountSum = [length(data.ArrDelay), sum(data.ArrDelay)]; add(intermKVStore, 'PartialCountSumDelay',partCountSum);

You also can write your own reducer function. In this example, use meanArrivalDelayReducer.m to read intermediate values for the number of delays and the total arrival delay. Then, determine the overall mean arrival delay. mapreduce calls this reducer function only once because the mapper function adds just one key to KeyValueStore. The meanArrivalDelayReducer.m file contains this code.

function meanArrivalDelayReducer(intermKey, intermValIter, outKVStore) % Reducer function for the MeanMapReduceExample.

% Copyright 2014 The MathWorks, Inc.

% intermKey is 'PartialCountSumDelay' count = 0; sum = 0; while hasnext(intermValIter) countSum = getnext(intermValIter); count = count + countSum(1); sum = sum + countSum(2); end

meanDelay = sum/count;

% The key-value pair added to outKVStore will become the output of mapreduce add(outKVStore,'MeanArrivalDelay',meanDelay);

Run MapReduce Using Mapper and Reducer Functions

To determine the mean arrival delay in the flight data, run MapReduce with the DatabaseDatastore object, mapper function, and reducer function.

outds = mapreduce(dbds,@meanArrivalDelayMapper,@meanArrivalDelayReducer);



Map 0% Reduce 0% Map 15% Reduce 0% Map 30% Reduce 0% Map 46% Reduce 0% Map 61% Reduce 0% Map 76% Reduce 0% Map 92% Reduce 0% Map 100% Reduce 0% Map 100% Reduce 100%

Display Output from MapReduce

Read the table from the output datastore using readall.

outtab =

1×2 table

       Key             Value  
__________________    ________

'MeanArrivalDelay'    [7.1201]

The table has only one row containing one key-value pair.

Display the mean arrival delay from the table.

meanArrDelay = outtab.Value{1}

Close DatabaseDatastore Object and Database Connection

See Also

setdbprefs | database | databaseDatastore | readall | close | mapreduce | TabularTextDatastore

External Websites