Read and Analyze Large Tabular Text File - MATLAB & Simulink
This example shows how to create a datastore for a large text file containing tabular data, and then read and process the data one block at a time or one file at a time.
Create a Datastore
Create a datastore from the sample file airlinesmall.csv using the tabularTextDatastore function. When you create the datastore, you can specify that the text NA in the data is treated as missing data.
ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
After you create the datastore, you can change its behavior by setting its properties. Set the MissingValue property to 0 so that missing values are replaced with 0.
ds.MissingValue = 0;
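The effect of TreatAsMissing and MissingValue can be checked on a tiny file. The following is a minimal sketch, not part of the original example: the file name missing_demo.csv and its three rows are hypothetical, and the sketch assumes the datastore detects the column as numeric once NA is declared as missing.

```matlab
% Hypothetical demo file: one numeric column with a single NA entry.
fname = fullfile(tempdir,'missing_demo.csv');
fid = fopen(fname,'w');
fprintf(fid,'ArrDelay\n8\nNA\n21\n');
fclose(fid);

% Default behavior: text marked as missing is returned as NaN.
dsNaN = tabularTextDatastore(fname,'TreatAsMissing','NA');
a = readall(dsNaN);

% With MissingValue set to 0, the same entry is returned as 0.
dsZero = tabularTextDatastore(fname,'TreatAsMissing','NA','MissingValue',0);
b = readall(dsZero);

disp(a.ArrDelay')
disp(b.ArrDelay')
```

Note the consequence for the analysis: with MissingValue set to 0, a missing arrival delay contributes 0 to the sum and still counts as one row, so it pulls the computed average toward 0 instead of being excluded.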
In this example, select the variable for the arrival delay, ArrDelay, as the variable of interest.
ds.SelectedVariableNames = 'ArrDelay';
Preview the data using the preview function. This function does not affect the state of the datastore.
data = preview(ds)
data=8×1 table
    ArrDelay
    ________
        8
        8
       21
       13
        4
       59
        3
       11
Read Subsets of Data
By default, read reads 20000 rows at a time from a TabularTextDatastore. To read a different number of rows in each call to read, modify the ReadSize property of ds.
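As an illustration of a numeric ReadSize, the sketch below uses a separate, hypothetical 10-row file (readsize_demo.csv) so that it does not disturb ds. The variable names dsDemo and rowsPerRead are assumptions introduced here.

```matlab
% Hypothetical demo file: 10 rows in one numeric column named ArrDelay.
fname = fullfile(tempdir,'readsize_demo.csv');
writetable(table((1:10)','VariableNames',{'ArrDelay'}),fname);

dsDemo = tabularTextDatastore(fname);
dsDemo.ReadSize = 4;    % ask for at most 4 rows per call to read

% Record how many rows each call to read returns.
rowsPerRead = [];
while hasdata(dsDemo)
    T = read(dsDemo);
    rowsPerRead(end+1) = height(T);
end
disp(rowsPerRead)
```

The last block is typically smaller than ReadSize whenever the row count is not an exact multiple of it, which is why the loop in this example tracks a count for each block rather than assuming a fixed block size.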
Read subsets of the data from ds using the read function in a while loop. The loop executes until hasdata(ds) returns false.
sums = [];
counts = [];
while hasdata(ds)
    T = read(ds);
    sums(end+1) = sum(T.ArrDelay);
    counts(end+1) = length(T.ArrDelay);
end
Compute the average arrival delay.
avgArrivalDelay = sum(sums)/sum(counts)
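The reason for accumulating sums and counts separately can be written out. The notation here is introduced only for illustration: let block k contain n_k rows with values x_{k,i}, let s_k be the block sum, and let K be the number of blocks.

```latex
s_k = \sum_{i=1}^{n_k} x_{k,i},
\qquad
\bar{x} = \frac{\sum_{k=1}^{K} s_k}{\sum_{k=1}^{K} n_k}
```

This is the mean over all rows. Averaging the per-block means s_k / n_k instead would weight every block equally, which gives a different result whenever the blocks differ in size, for example when the final block is shorter than ReadSize.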
Reset the datastore to allow rereading of the data.
reset(ds)
Read One File at a Time
A datastore can contain multiple files, each with a different number of rows. You can read from the datastore one complete file at a time by setting the ReadSize property to 'file'.
ds.ReadSize = 'file';
When you change the value of ReadSize from a number to 'file' or vice versa, MATLAB® resets the datastore.
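To see file-at-a-time reading with more than one file, the sketch below builds a datastore over a temporary folder that holds two small, hypothetical files (part1.csv and part2.csv) with different row counts. The folder, file names, and the variables dsFiles and rowsPerFile are assumptions made for this illustration.

```matlab
% Hypothetical demo folder with two files of different lengths.
folder = tempname;
mkdir(folder);
writetable(table((1:3)','VariableNames',{'ArrDelay'}), ...
    fullfile(folder,'part1.csv'));
writetable(table((4:8)','VariableNames',{'ArrDelay'}), ...
    fullfile(folder,'part2.csv'));

% Create one datastore over every file in the folder.
dsFiles = tabularTextDatastore(folder);
dsFiles.ReadSize = 'file';   % each call to read returns one whole file

% Record the number of rows returned by each call to read.
rowsPerFile = [];
while hasdata(dsFiles)
    T = read(dsFiles);
    rowsPerFile(end+1) = height(T);
end
disp(rowsPerFile)
```

Each call to read should return a table whose height matches one file, so the number of loop iterations equals the number of files rather than depending on a fixed row count.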
Read from ds using the read function in a while loop, as before, and compute the average arrival delay.
sums = [];
counts = [];
while hasdata(ds)
    T = read(ds);
    sums(end+1) = sum(T.ArrDelay);
    counts(end+1) = length(T.ArrDelay);
end
avgArrivalDelay = sum(sums)/sum(counts)
See Also
tabularTextDatastore | tall | mapreduce