TabularTextDatastore - Datastore for tabular text files - MATLAB (original) (raw)

Datastore for tabular text files

Description

Use a TabularTextDatastore object to manage large collections of text files containing column-oriented or tabular data where the collection does not necessarily fit in memory. Tabular data is data that is arranged in a rectangular fashion with each row having the same number of entries. You can create aTabularTextDatastore object using thetabularTextDatastore function, specify its properties, and then import and process the data using object functions.

Creation

Syntax

Description

`ttds` = tabularTextDatastore([location](#d126e350101)) creates a datastore from the collection of data specified bylocation.

`ttds` = tabularTextDatastore([location](#d126e350101),[Name,Value](#namevaluepairarguments)) specifies additional parameters and properties for ttds using one or more name-value pair arguments. For example,tabularTextDatastore(location,"FileExtensions",[".txt",".csv"]) creates a datastore from only the files in location with extensions .txt and .csv.

example

Input Arguments

expand all

location — Files or folders included in datastore

FileSet | DsFileSet object | string array | character vector | cell array of character vectors

Files or folders to include in the datastore, specified as one of these values:

Files or folders can be local or remote:

When you specify a folder, the datastore includes only files with supported file formats and ignores files with any other format. To specify a custom list of file extensions to include in your datastore, see the FileExtensions name-value argument.

The tabularTextDatastore function supports files with .txt, .csv,.dat, .dlm,.asc, and .text extensions as well as files with no extension.

Example: "file1.csv"

Example: "../dir/data/file1"

Example: ["C:\dir\data\file1.csv","C:\dir\data\file2.dat"]

Example: "C:\dir\data\*.text"

Name-Value Arguments

Specify optional pairs of arguments asName1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: ttds = tabularTextDatastore("C:\dir\textdata",FileExtensions=[".csv",".txt"])

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: ttds = tabularTextDatastore("C:\dir\textdata","FileExtensions",[".csv",".txt"])

FileExtensions — Text file extensions

character vector | cell array of character vectors | string scalar | string array

Text file extensions, specified as the comma-separated pair consisting of "FileExtensions" and a character vector, cell array of character vectors, string scalar, or string array. The specified extensions do not require a supported format. If you want to include unsupported extensions, then specify all extensions. Use empty quotes "" to represent files without extensions.

Example: "FileExtensions",".txt"

Example: "FileExtensions",[".text",".csv"]

Data Types: char | cell | string

IncludeSubfolders — Subfolder inclusion flag

true or false | 0 or 1

Subfolder inclusion flag, specified as the name-value argument consisting of "IncludeSubfolders" andtrue, false, 0, or 1. Specify true to include all files and subfolders within each folder or false to include only the files within each folder.

When you do not specify "IncludeSubfolders", then the default value is false.

Example: "IncludeSubfolders",true

Data Types: logical | double

OutputType — Output datatype

"auto" (default) | "table" | "timetable"

Output datatype, specified as the name-value argument consisting of"OutputType" and one of these values:

The value of OutputType determines the data type returned by the preview, read, and readall functions. Use this option in conjunction with the"RowTimes" name-value pair to return timetables fromTabularTextDatastore.

Example: "OutputType","timetable"

Data Types: char | string

AlternateFileSystemRoots — Alternate file system root paths

string vector | cell array

Alternate file system root paths, specified as the name-value argument consisting of"AlternateFileSystemRoots" and a string vector or a cell array. Use"AlternateFileSystemRoots" when you create a datastore on a local machine, but need to access and process the data on another machine (possibly of a different operating system). Also, when processing data using the Parallel Computing Toolbox™ and the MATLAB® Parallel Server™, and the data is stored on your local machines with a copy of the data available on different platform cloud or cluster machines, you must use"AlternateFileSystemRoots" to associate the root paths.

The value of "AlternateFileSystemRoots" must satisfy these conditions:

For more information, see Set Up Datastore for Processing on Different Machines or Clusters.

Example: ["Z:\datasets","/mynetwork/datasets"]

Data Types: string | cell

TextType — Output data type of text variables

"char" (default) | "string"

Output data type of text variables, specified as the comma-separated pair consisting of "TextType" and either "char" or "string". If the output table from the read,readall, or preview functions contains text variables, then"TextType" specifies the data type of those variables for TabularTextDatastore. If"TextType" is "char", then the output is a cell array of character vectors. If"TextType" is "string", then the output has type string.

Data Types: char | string

DatetimeType — Type for imported date and time data

"datetime" (default) | "text"

Type for imported date and time data, specified as the comma-separated pair consisting of "DatetimeType" and one of these values: "datetime" or"text".

Value Type for Imported Date and Time Data
"datetime" MATLABdatetime data typeFor more information, see datetime.
"text" If "DatetimeType" is specified as "text", then the type for imported date and time data depends on the value specified in the"TextType" property: If "TextType" is"char", then thetabularTextdatastore imports dates as a cell array of character vectors.If "TextType" is"string", then thetabularTextdatastore imports dates as an array of strings.

If the specified TextscanFormats property contains a %D, then thetabularTextdatastore ignores the value specified in DatetimeType.

Example: "DatetimeType","datetime"

Data Types: char | string

DurationType — Output data type of duration data

"duration" (default) | "text"

Output data type of duration data, specified as the name-value argument consisting of "DurationType" and either"duration" or "text".

Value Type for Imported Duration Data
"duration" MATLABduration data typeFor more information, see duration.
"text" If "DurationType" is specified as "text", then the type for imported duration data depends on the value specified in the"TextType" parameter: If "TextType" is"char", then the importing function returns duration data as a cell array of character vectors. If "TextType" is"string", then the importing function returns duration data as an array of strings.

Data Types: char | string

DatetimeLocale — Locale to interpret dates

"en_US" (default) | character vector | string scalar

Locale to interpret dates, specified as a character vector or string scalar. The DatetimeLocale value determines how the importing function interprets text that represents dates and times.

When specifying the DatetimeLocale, use the form_`xx`__ _`YY`_, where xx is a lowercase ISO 639-1 two-letter code that specifies a language, and_YY_ is an uppercase ISO 3166-1 alpha-2 code that specifies a country.

This table lists some common values for the locale.

Locale Language Country
"de_DE" German Germany
"en_GB" English United Kingdom
"en_US" English United States
"es_ES" Spanish Spain
"fr_FR" French France
"it_IT" Italian Italy
"ja_JP" Japanese Japan
"ko_KR" Korean Korea
"nl_NL" Dutch Netherlands
"zh_CN" Chinese (simplified) China

Note

The Locale value determines how input values are interpreted. The display format and language is specified by the Locale option in the Datetime format section of the Preferences window. To change the default datetime locale, see Set Command Window Preferences.

Data Types: char | string

In addition to these name-value pairs, you also can specify the properties on this page as name-value pairs, with the exception of theFiles property.

Properties

expand all

TabularTextDatastore properties describe the files associated with a TabularTextDatastore object. Specifically, the properties describe the format of the data in the files and control how the data should be read from the datastore. When you create a TabularTextDatastore object, the datastore function uses the first file in the Files property to determine the values of the properties. With the exception of theFiles property, you can specify the value ofTabularTextDatastore properties using name-value pair arguments when you create the datastore object. To view or modify a property after creating the object, use the dot notation:

ds = tabularTextDatastore("airlinesmall.csv"); ds.TreatAsMissing = "NA"; ds.MissingValue = 0;

File Properties

Files — Files included in datastore

cell array of character vectors | string array

Files included in the datastore, resolved as a cell array of character vectors or a string array, where each character vector or string is a full path to a file. The location argument in thetabularTextDatastore anddatastore functions define these files.

The first file specified by the Files property determines the variable names and format information for all files in the datastore.

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Example: {"C:\dir\data\mydata1.csv";"C:\dir\data\mydata2.csv"}

Data Types: cell | string

FileEncoding — File encoding

"UTF-8" (default) | "US-ASCII" | "Macintosh" | ...

File encoding, specified as a character vector or a string scalar like one of these values.

"IBM866" "ISO-8859-1" "windows-847"
"KOI8-R" "ISO-8859-2" "windows-1250"
"KOI8-U" "ISO-8859-3" "windows-1251"
"Macintosh" "ISO-8859-4" "windows-1252"
"US-ASCII" "ISO-8859-5" "windows-1253"
"UTF-8" "ISO-8859-6" "windows-1254"
"ISO-8859-7" "windows-1255"
"ISO-8859-8" "windows-1256"
"ISO-8859-9" "windows-1257"
"ISO-8859-11" "windows-1258"
"ISO-8859-13"
"ISO-8859-15"

If each file in the datastore fits into memory, thenFileEncoding also can be one of these values.

"Big5" "EUC-KR" "GB18030" "Shift_JIS"
"Big5-HKSCS" "EUC-JP" "GB2312" "windows-949"
"CP949" "EUC-TW" "GBK"

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Data Types: char | string

ReadVariableNames — Read variable names

true | false

Read variable names, specified as a logical true orfalse.

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Data Types: logical

VariableNamingRule — Flag to preserve variable names

"modify" (default) | "preserve"

Flag to preserve variable names, specified as either "modify" or"preserve".

Starting in R2019b, variable names and row names can include any characters, including spaces and non-ASCII characters. Also, they can start with any characters, not just letters. Variable and row names do not have to be valid MATLAB identifiers (as determined by the isvarname function). To preserve these variable names and row names, set the value of VariableNamingRule to "preserve". Variable names are not refreshed when the value of VariableNamingRule is changed from "modify" to "preserve".

Data Types: char | string

VariableNames — Names of variables

cell array of character vectors | string array

Names of variables in the datastore, specified as a cell array of character vectors or a string array. Specify the variable names in the order in which they appear in the files. If you do not specify the variable names, they are detected from the first nonheader line in the first file of the datastore. When modifying theVariableNames property, the number of new variable names must match the number of original variable names.

To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value of theVariableNamingRule parameter to"preserve".

If ReadVariableNames is false, then VariableNames defaults to{"Var1","Var2", ...}.

Example: {"Time","Name","Quantity"}

Data Types: cell | string

Text Format Properties

— Number of lines to skip

non-negative integer

Number of lines to skip at the beginning of the file, specified as a nonnegative integer. If unspecified, thetabularTextDatastore function detects the number of lines to skip automatically.

The tabularTextDatastore function ignores the specified number of header lines before reading the variable names or data.

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Data Types: double

Delimiter — Field delimiter characters

character vector | cell array of character vectors | string scalar | string array

Field delimiter characters, specified as a character vector, cell array of character vectors, string scalar, or string array. Specify multiple delimiters in a cell array of character vectors or a string array. If unspecified, the tabularTextDatastore function detects the delimiter automatically.

Example: "|"

Example: {";","*"}

Repeated delimiter characters in a file are interpreted as separate delimiters with empty fields between them. If unspecified, the read function detects the delimiter automatically by default.

When you specify one of the following escape sequences as a delimiter, it is converted to the corresponding control character.

\b Backspace
\n Newline
\r Carriage return
\t Tab
\\ Backslash (\)

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Data Types: char | cell | string

RowDelimiter — Row delimiter character

\r\n (default) | character vector | string scalar

Row delimiter character, specified as a character vector or string scalar that must be either a single character or one of"\r", "\n", or"\r\n".

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Example: ":"

Data Types: char | string

TreatAsMissing — Text to treat as missing values

"" (default) | character vector | cell array of character vectors | string scalar | string array

Text to treat as missing values, specified as a single character vector, cell array of character vectors, string scalar, or string array. Values specified as TreatAsMissing are substituted with the value defined in the MissingValue property. For instance, if MissingValue is defined to be aNaN, and the TreatAsMissing is specified as "NA". Then, in the imported data, all occurrences of "NA" are replaced byNaN.

This option only applies to numeric fields. Also, this property is equivalent to the TreatAsEmpty name-value pair argument for the textscan function.

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Example: "NA"

Example: {"-",""}

Data Types: char | cell | string

MissingValue — Value for missing numeric fields

NaN (default) | scalar

Value for missing numeric fields in delimited text files, specified as a scalar. This property is equivalent to theEmptyValue name-value pair argument for thetextscan function.

Data Types: double

Advanced Text Format Properties

TextscanFormats — Data field format

cell array of character vectors | string array

Data field format, specified as a cell array of character vectors or a string array, where each character vector or string contains one conversion specifier.

When you specify or modify the TextscanFormats property, you can use the same conversion specifiers that the textscan function accepts for the formatSpec argument. Valid values forTextscanFormats include conversion specifiers that skip fields using an asterisk (*) character and ones that skip literal text. The number of conversion specifiers must match the number of variables in the VariableNames property.

Example: {"%s","%s","%f"}

Data Types: cell | string

ExponentCharacters — Exponent characters

"eEdD" (default) | character vector | string scalar

Exponent characters, specified as a character vector or string scalar. The default exponent characters are e,E, d, andD.

Data Types: char | string

— Comment character

"" (default) | character vector | string scalar | two-element array

Comment character used to distinguish comments in the file, specified as character vector, string scalar, or two-element string or cell array.

When reading from a TabularTextDatastore, theread function checks for comments only at the start of each field, not within a field.

When you change the value of this property, thedatastore function reevaluates the values of theTabularTextDatastore properties.

Example: "CommentStyle",{"/*", "*/"}

Data Types: char | cell | string

Whitespace — White-space characters

" \b\t" (default) | character vector | string scalar

White-space characters, specified as a character vector or a string scalar of one or more characters.

When you specify one of the following escape sequences as any white-space character, the datastore function converts that sequence to the corresponding control character.

\b Backspace
\n Newline
\r Carriage return
\t Tab
\\ Backslash (\)

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Example: " \b\t"

Data Types: char | string

MultipleDelimitersAsOne — Multiple delimiter handling

0 (false) (default) | 1 (true)

Multiple delimiter handling, specified as eithertrue or false. Iftrue, then datastore treats consecutive delimiters as a single delimiter. Repeated delimiters separated by white-space are also treated as a single delimiter.

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Properties for preview, read, readall Table

SelectedVariableNames — Variables to read

cell array of character vectors | string array

Variables to read from the file, specified as a cell array of character vectors or a string array, where each character vector or string contains the name of one variable. You can specify the variable names in any order.

To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value of theVariableNamingRule parameter to"preserve".

Example: ["Var3","Var7","Var4"]

Data Types: cell | string

SelectedFormats — Formats of selected variables

cell array of character vectors | string array

Formats of the selected variables to read, specified as a cell array of character vectors or a string array, where each character vector or string contains one conversion specifier. The variables to read are indicated by the SelectedVariableNames property. The number of character vectors or strings inSelectedFormats must match the number of variables to read.

You can use the same conversion specifiers that thetextscan function accepts, including specifiers that skip literal text. However, you cannot use a conversion specifier that skips a field. That is, the conversion specifier cannot include an asterisk character (*).

Example: ["%d","%d"]

Data Types: cell | string

ReadSize — Amount of data to read

20000 (default) | positive scalar | "file"

Amount of data to read in a call to the read function, specified as a positive scalar or "file".

When you change ReadSize from a numeric scalar to "file" or vice versa, MATLAB resets the datastore to the state where no data has been read from it.

Data Types: double | char | string

TextType — Output data type of text variables

"char" (default) | "string"

Output data type of text variables, specified as"char" or "string".TextType specifies the data type of text variables formatted with %s, %q, or [...].

Data Types: char | string

RowTimes — Name of row times variable

variable name | variable index

Name of row times variable, specified as the name-value argument consisting of"RowTimes" and a variable name (such as"Date") or a variable index (such as 3).

RowTimes is a timetable-related parameter. Each row of a timetable is associated with a time, which is captured in a time vector for the timetable. The variable specified in RowTimes must contain adatetime or a duration vector.

If the value of "OutputType" is "timetable", but you do not specify "RowTimes", then TabularTextDatastore uses the first datetime or duration variable as the row times for the timetable.

Properties for use by writeall

Folders — Folders used to construct datastore

cell array of character vectors

This property is read-only.

Folders used to construct datastore, returned as a cell array of character vectors. The cell array is oriented as a column vector. Each character vector is a path to a folder that contains data files. Thelocation argument in thetabularTextDatastore anddatastore functions definesFolders when the datastore is created.

The Folders property is reset when you modify theFiles property of aTabularTextDatastore object.

Data Types: cell

SupportedOutputFormats — Formats supported for writing

string row vector

This property is read-only.

Formats supported for writing, returned as a row vector of strings. This property specifies the possible output formats when using writeall to write output files from the datastore.

DefaultOutputFormat — Default output format

string scalar

This property is read-only.

Default output format, returned as a string scalar. This property specifies the default format when using writeall to write output files from the datastore.

Data Types: string

Object Functions

Examples

collapse all

Select Variables to Read

Using either a FileSet object or a file path, create a datastore from the sample file airlinesmall_subset.csv. Specify how to treat missing values. Select only variables of interest.

Create a FileSet object from the file airlinesmall_subset.csv. Create a datastore where values of "NA" are treated as missing and missing values are treated as 0.

fs = matlab.io.datastore.FileSet("airlinesmall_subset.csv"); ds = tabularTextDatastore(fs,"TreatAsMissing","NA",... "MissingValue",0);

View the variables in the datastore.

ans = 1x19 cell {'Year'} {'Month'} {'DayofMonth'} {'DayOfWeek'} {'DepTime'} {'CRSDepTime'} {'ArrTime'} {'CRSArrTime'} {'UniqueCarrier'} {'FlightNum'} {'TailNum'} {'ActualElapsedTime'} {'CRSElapsedTime'} {'AirTime'} {'ArrDelay'} {'DepDelay'} {'Origin'} {'Dest'} {'Distance'}

Modify the SelectedVariableNames property to specify the variables of interest.

ds.SelectedVariableNames = ["Year","Month","TailNum"];

Alternatively, you can specify the location of your data as a file path.

ds = tabularTextDatastore("airlinesmall_subset.csv");

Using either location format, you can also specify your variables of interest when you create the datastore.

ds = tabularTextDatastore("airlinesmall_subset.csv","TreatAsMissing","NA",... "MissingValue",0,"SelectedVariableNames",["Year","Month","TailNum"]);

Specify Format of Data to Read

Create a datastore from the sample file airlinesmall_subset.csv, which contains tabular data.

ds = tabularTextDatastore("airlinesmall_subset.csv","TreatAsMissing","NA",... "MissingValue",0);

Specify the variables of interest.

ds.SelectedVariableNames = ["Year","Month","UniqueCarrier"];

View the SelectedFormats property.

ans = 1x3 cell {'%f'} {'%f'} {'%q'}

The SelectedFormats property specifies how the tabularTextDatastore function interprets the format of the variables. The Year and Month variables are read as columns of floating-point values and the UniqueCarrier variable as a column of text.

Modify the SelectedFormats property to read the first two variables as signed integers and the third variable as a categorical value.

ds.SelectedFormats = ["%d","%d","%C"];

Preview the data.

T=8×3 table Year Month UniqueCarrier ____ _____ _____________

1996      1           HP      
1996      1           HP      
1996      1           HP      
1996      1           HP      
1996      1           US      
1996      1           US      
1996      1           US      
1996      1           NW      

Return Timetable from Tabular Text Datastore

Use the OutputType and RowTimes name-value pairs to make tabulartextDatastore return timetables instead of tables.

Create a datastore for outages.csv. Specify the "OutputType" name-value pair as "timetable".

ttds = tabularTextDatastore("outages.csv","OutputType","timetable"); preview(ttds)

ans=8×5 timetable OutageTime Region Loss Customers RestorationTime Cause
________________ _____________ ______ __________ ________________ ___________________

2002-02-01 12:18    {'SouthWest'}    458.98    1.8202e+06    2002-02-07 16:50    {'winter storm'   }
2003-01-23 00:49    {'SouthEast'}    530.14    2.1204e+05                 NaT    {'winter storm'   }
2003-02-07 21:15    {'SouthEast'}     289.4    1.4294e+05    2003-02-17 08:14    {'winter storm'   }
2004-04-06 05:44    {'West'     }    434.81    3.4037e+05    2004-04-06 06:10    {'equipment fault'}
2002-03-16 06:18    {'MidWest'  }    186.44    2.1275e+05    2002-03-18 23:23    {'severe storm'   }
2003-06-18 02:49    {'West'     }         0             0    2003-06-18 10:54    {'attack'         }
2004-06-20 14:39    {'West'     }    231.29           NaN    2004-06-20 19:16    {'equipment fault'}
2002-06-06 19:28    {'West'     }    311.86           NaN    2002-06-07 00:51    {'equipment fault'}

When you do not also specify 'RowTimes', tabularTextDatastore uses the first datetime or duration variable as the row times. In this case, the OutageTime variable is used for the row times.

Specify the 'RowTimes' option to use the restoration times (RestorationTime variable) as the row times, instead of the time of the power outages.

ttds = tabularTextDatastore("outages.csv","OutputType","timetable","RowTimes","RestorationTime"); preview(ttds)

ans=8×5 timetable RestorationTime Region OutageTime Loss Customers Cause
________________ _____________ ________________ ______ __________ ___________________

2002-02-07 16:50    {'SouthWest'}    2002-02-01 12:18    458.98    1.8202e+06    {'winter storm'   }
NaT                 {'SouthEast'}    2003-01-23 00:49    530.14    2.1204e+05    {'winter storm'   }
2003-02-17 08:14    {'SouthEast'}    2003-02-07 21:15     289.4    1.4294e+05    {'winter storm'   }
2004-04-06 06:10    {'West'     }    2004-04-06 05:44    434.81    3.4037e+05    {'equipment fault'}
2002-03-18 23:23    {'MidWest'  }    2002-03-16 06:18    186.44    2.1275e+05    {'severe storm'   }
2003-06-18 10:54    {'West'     }    2003-06-18 02:49         0             0    {'attack'         }
2004-06-20 19:16    {'West'     }    2004-06-20 14:39    231.29           NaN    {'equipment fault'}
2002-06-07 00:51    {'West'     }    2002-06-06 19:28    311.86           NaN    {'equipment fault'}

Limitations

Version History

Introduced in R2014b

expand all

R2024b: Read data over HTTP and HTTPS using datastore functions

You can read data from primary online sources by performing datastore operations over an internet URL.