Read and Analyze Hadoop Sequence File
This example shows how to create a datastore for a Sequence file containing key-value data. Then, you can read and process the data one block at a time. Sequence files are outputs of mapreduce operations that use Hadoop®.
Set the appropriate environment variable to the location where Hadoop is installed. In this case, set the MATLAB_HADOOP_INSTALL environment variable.
setenv('MATLAB_HADOOP_INSTALL','/mypath/hadoop-folder')
Here, hadoop-folder is the folder where Hadoop is installed and mypath is the path to that folder.
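A quick way to confirm that the variable took effect is to read it back with getenv. The following is a minimal sketch, not part of the original example: the variable name hadoopDir is hypothetical, and the path is the same placeholder used above, so replace it with the real installation folder.

```matlab
% Hypothetical install location; replace with the real Hadoop folder.
hadoopDir = '/mypath/hadoop-folder';
setenv('MATLAB_HADOOP_INSTALL', hadoopDir);

% Read the variable back to confirm that this MATLAB session sees it.
current = getenv('MATLAB_HADOOP_INSTALL');
if ~strcmp(current, hadoopDir)
    error('MATLAB_HADOOP_INSTALL was not set as expected.');
end

% Warn, rather than fail, when the folder does not exist on this machine.
% exist returns 7 when the name refers to a folder.
if exist(current, 'dir') ~= 7
    warning('Folder not found: %s', current);
end
```

Variables set with setenv last only for the current MATLAB session, so this check is most useful at the top of a script.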
Create a datastore from the sample file, mapredout.seq, using the datastore function. The sample file contains unique keys representing airline carrier codes and corresponding values that represent the number of flights operated by that carrier.
ds = datastore('mapredout.seq')
ds =
  KeyValueDatastore with properties:

       Files: {
              ' ...\matlab\toolbox\matlab\demos\mapredout.seq'
              }
    ReadSize: 1 key-value pairs
    FileType: 'seq'
datastore returns a KeyValueDatastore. The datastore function automatically determines the appropriate type of datastore to create.
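To see which datastore type was selected and what the data looks like before reading it, the class and preview functions can help. This is a sketch under the assumption that the sample file mapredout.seq is on the MATLAB path; the variable name firstPairs is illustrative only.

```matlab
ds = datastore('mapredout.seq');

% Show the concrete datastore class that the datastore function selected.
disp(class(ds))

% preview returns the first few key-value pairs as a table and does not
% change the read position of the datastore.
firstPairs = preview(ds);
disp(firstPairs.Properties.VariableNames)
```

Because preview leaves the datastore state unchanged, a later call to read still starts from the first key-value pair.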
Set the ReadSize property to six so that each call to read reads at most six key-value pairs.
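The step above can be sketched as a property assignment on the datastore. The trial read and the reset call are additions for illustration; they assume the sample file is available and are not required by the example.

```matlab
ds = datastore('mapredout.seq');

% Each call to read(ds) now returns at most six key-value pairs.
ds.ReadSize = 6;

% Trial read to check the block size, then rewind so that later reads
% start again from the first key-value pair.
T = read(ds);
disp(height(T))
reset(ds);
```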
Read subsets of the data from ds using the read function in a while loop. For each subset of data, compute the sum of the values. Store the sum for each subset in an array named sums. The while loop executes until hasdata(ds) returns false.
sums = [];
while hasdata(ds)
    T = read(ds);
    T.Value = cell2mat(T.Value);
    sums(end+1) = sum(T.Value);
end
View the last subset of key-value pairs read.
T =
    Key         Value
    ________    _____
    'WN'        15931
    'XE'        2357
    'YV'        849
    'ML (1)'    69
    'PA (1)'    318
Compute the total number of flights operated by all carriers.
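Since sums holds one partial sum per subset, the total is the sum of that array. The sketch below repeats the earlier steps so that it runs on its own; the name numflights and the readall cross-check are assumptions added for illustration.

```matlab
ds = datastore('mapredout.seq');
ds.ReadSize = 6;

% Accumulate one partial sum per subset of at most six key-value pairs.
sums = [];
while hasdata(ds)
    T = read(ds);
    sums(end+1) = sum(cell2mat(T.Value));
end

% Total number of flights across all carriers.
numflights = sum(sums)

% Optional cross-check: read every pair at once and total the values.
reset(ds);
allPairs = readall(ds);
checkTotal = sum(cell2mat(allPairs.Value));
```

Reading in blocks keeps memory use bounded for large files, while the readall cross-check is practical only when the whole file fits in memory.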
See Also
datastore | KeyValueDatastore | mapreduce | tall