TabularTextDatastore - Datastore for tabular text files - MATLAB (original) (raw)

Datastore for tabular text files

Description

Use a TabularTextDatastore object to manage large collections of text files containing column-oriented or tabular data where the collection does not necessarily fit in memory. Tabular data is data that is arranged in a rectangular fashion with each row having the same number of entries. You can create aTabularTextDatastore object using thetabularTextDatastore function, specify its properties, and then import and process the data using object functions.

Creation

Syntax

Description

`ttds` = tabularTextDatastore([location](#d126e353135)) creates a datastore from the collection of data specified bylocation.

`ttds` = tabularTextDatastore([location](#d126e353135),[Name,Value](#namevaluepairarguments)) specifies additional parameters and properties for ttds using one or more name-value pair arguments. For example,tabularTextDatastore(location,"FileExtensions",[".txt",".csv"]) creates a datastore from only the files in location with extensions .txt and .csv.

example

Input Arguments

expand all

Files or folders to include in the datastore, specified as one of these values:

Files or folders can be local or remote:

When you specify a folder, the datastore includes only files with supported file formats and ignores files with any other format. To specify a custom list of file extensions to include in your datastore, see the FileExtensions name-value argument.

The tabularTextDatastore function supports files with .txt, .csv,.dat, .dlm,.asc, and .text extensions as well as files with no extension.

Example: "file1.csv"

Example: "../dir/data/file1"

Example: ["C:\dir\data\file1.csv","C:\dir\data\file2.dat"]

Example: "C:\dir\data\*.text"

Name-Value Arguments

expand all

Specify optional pairs of arguments asName1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: ttds = tabularTextDatastore("C:\dir\textdata",FileExtensions=[".csv",".txt"])

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: ttds = tabularTextDatastore("C:\dir\textdata","FileExtensions",[".csv",".txt"])

Text file extensions, specified as the comma-separated pair consisting of "FileExtensions" and a character vector, cell array of character vectors, string scalar, or string array. The specified extensions do not require a supported format. If you want to include unsupported extensions, then specify all extensions. Use empty quotes "" to represent files without extensions.

Example: "FileExtensions",".txt"

Example: "FileExtensions",[".text",".csv"]

Data Types: char | cell | string

Subfolder inclusion flag, specified as the name-value argument consisting of "IncludeSubfolders" andtrue, false, 0, or 1. Specify true to include all files and subfolders within each folder or false to include only the files within each folder.

When you do not specify "IncludeSubfolders", then the default value is false.

Example: "IncludeSubfolders",true

Data Types: logical | double

Output datatype, specified as the name-value argument consisting of"OutputType" and one of these values:

The value of OutputType determines the data type returned by the preview, read, and readall functions. Use this option in conjunction with the"RowTimes" name-value pair to return timetables fromTabularTextDatastore.

Example: "OutputType","timetable"

Data Types: char | string

Alternate file system root paths, specified as the name-value argument consisting of"AlternateFileSystemRoots" and a string vector or a cell array. Use"AlternateFileSystemRoots" when you create a datastore on a local machine, but need to access and process the data on another machine (possibly of a different operating system). Also, when processing data using the Parallel Computing Toolbox™ and the MATLAB® Parallel Server™, and the data is stored on your local machines with a copy of the data available on different platform cloud or cluster machines, you must use"AlternateFileSystemRoots" to associate the root paths.

The value of "AlternateFileSystemRoots" must satisfy these conditions:

For more information, see Set Up Datastore for Processing on Different Machines or Clusters.

Example: ["Z:\datasets","/mynetwork/datasets"]

Data Types: string | cell

Output data type of text variables, specified as the comma-separated pair consisting of "TextType" and either "char" or "string". If the output table from the read,readall, or preview functions contains text variables, then"TextType" specifies the data type of those variables for TabularTextDatastore. If"TextType" is "char", then the output is a cell array of character vectors. If"TextType" is "string", then the output has type string.

Data Types: char | string

Type for imported date and time data, specified as the comma-separated pair consisting of "DatetimeType" and one of these values: "datetime" or"text".

Value Type for Imported Date and Time Data
"datetime" MATLABdatetime data typeFor more information, see datetime.
"text" If "DatetimeType" is specified as "text", then the type for imported date and time data depends on the value specified in the"TextType" property: If "TextType" is"char", then thetabularTextdatastore imports dates as a cell array of character vectors.If "TextType" is"string", then thetabularTextdatastore imports dates as an array of strings.

If the specified TextscanFormats property contains a %D, then thetabularTextdatastore ignores the value specified in DatetimeType.

Example: "DatetimeType","datetime"

Data Types: char | string

Output data type of duration data, specified as the name-value argument consisting of "DurationType" and either"duration" or "text".

Value Type for Imported Duration Data
"duration" MATLABduration data typeFor more information, see duration.
"text" If "DurationType" is specified as "text", then the type for imported duration data depends on the value specified in the"TextType" parameter: If "TextType" is"char", then the importing function returns duration data as a cell array of character vectors. If "TextType" is"string", then the importing function returns duration data as an array of strings.

Data Types: char | string

Locale to interpret dates, specified as a character vector or string scalar. The DatetimeLocale value determines how the importing function interprets text that represents dates and times.

When specifying the DatetimeLocale, use the form_`xx`__ _`YY`_, where xx is a lowercase ISO 639-1 two-letter code that specifies a language, and_YY_ is an uppercase ISO 3166-1 alpha-2 code that specifies a country.

This table lists some common values for the locale.

Locale Language Country
"de_DE" German Germany
"en_GB" English United Kingdom
"en_US" English United States
"es_ES" Spanish Spain
"fr_FR" French France
"it_IT" Italian Italy
"ja_JP" Japanese Japan
"ko_KR" Korean Korea
"nl_NL" Dutch Netherlands
"zh_CN" Chinese (simplified) China

Note

The Locale value determines how input values are interpreted. The display format and language is specified by the Locale option in the Datetime format section of the Settings window. To change the default datetime locale, see Modify Command Window Settings.

Data Types: char | string

In addition to these name-value pairs, you also can specify the properties on this page as name-value pairs, with the exception of theFiles property.

Properties

expand all

TabularTextDatastore properties describe the files associated with a TabularTextDatastore object. Specifically, the properties describe the format of the data in the files and control how the data should be read from the datastore. When you create a TabularTextDatastore object, the datastore function uses the first file in the Files property to determine the values of the properties. With the exception of theFiles property, you can specify the value ofTabularTextDatastore properties using name-value pair arguments when you create the datastore object. To view or modify a property after creating the object, use the dot notation:

ds = tabularTextDatastore("airlinesmall.csv"); ds.TreatAsMissing = "NA"; ds.MissingValue = 0;

File Properties

Files included in the datastore, resolved as a cell array of character vectors or a string array, where each character vector or string is a full path to a file. The location argument in thetabularTextDatastore anddatastore functions define these files.

The first file specified by the Files property determines the variable names and format information for all files in the datastore.

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Example: {"C:\dir\data\mydata1.csv";"C:\dir\data\mydata2.csv"}

Data Types: cell | string

File encoding, specified as a character vector or a string scalar like one of these values.

"IBM866" "ISO-8859-1" "windows-847"
"KOI8-R" "ISO-8859-2" "windows-1250"
"KOI8-U" "ISO-8859-3" "windows-1251"
"Macintosh" "ISO-8859-4" "windows-1252"
"US-ASCII" "ISO-8859-5" "windows-1253"
"UTF-8" "ISO-8859-6" "windows-1254"
"ISO-8859-7" "windows-1255"
"ISO-8859-8" "windows-1256"
"ISO-8859-9" "windows-1257"
"ISO-8859-11" "windows-1258"
"ISO-8859-13"
"ISO-8859-15"

If each file in the datastore fits into memory, thenFileEncoding also can be one of these values.

"Big5" "EUC-KR" "GB18030" "Shift_JIS"
"Big5-HKSCS" "EUC-JP" "GB2312" "windows-949"
"CP949" "EUC-TW" "GBK"

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Data Types: char | string

Read variable names, specified as a logical true orfalse.

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Data Types: logical

Flag to preserve variable names, specified as either "modify" or"preserve".

Starting in R2019b, variable names and row names can include any characters, including spaces and non-ASCII characters. Also, they can start with any characters, not just letters. Variable and row names do not have to be valid MATLAB identifiers (as determined by the isvarname function). To preserve these variable names and row names, set the value of VariableNamingRule to "preserve". Variable names are not refreshed when the value of VariableNamingRule is changed from "modify" to "preserve".

Data Types: char | string

Names of variables in the datastore, specified as a cell array of character vectors or a string array. Specify the variable names in the order in which they appear in the files. If you do not specify the variable names, they are detected from the first nonheader line in the first file of the datastore. When modifying theVariableNames property, the number of new variable names must match the number of original variable names.

To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value of theVariableNamingRule parameter to"preserve".

If ReadVariableNames is false, then VariableNames defaults to{"Var1","Var2", ...}.

Example: {"Time","Name","Quantity"}

Data Types: cell | string

Text Format Properties

Number of lines to skip at the beginning of the file, specified as a nonnegative integer. If unspecified, thetabularTextDatastore function detects the number of lines to skip automatically.

The tabularTextDatastore function ignores the specified number of header lines before reading the variable names or data.

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Data Types: double

Field delimiter characters, specified as a character vector, cell array of character vectors, string scalar, or string array. Specify multiple delimiters in a cell array of character vectors or a string array. If unspecified, the tabularTextDatastore function detects the delimiter automatically.

Example: "|"

Example: {";","*"}

Repeated delimiter characters in a file are interpreted as separate delimiters with empty fields between them. If unspecified, the read function detects the delimiter automatically by default.

When you specify one of the following escape sequences as a delimiter, it is converted to the corresponding control character.

\b Backspace
\n Newline
\r Carriage return
\t Tab
\\ Backslash (\)

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Data Types: char | cell | string

Row delimiter character, specified as a character vector or string scalar that must be either a single character or one of"\r", "\n", or"\r\n".

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Example: ":"

Data Types: char | string

Text to treat as missing values, specified as a single character vector, cell array of character vectors, string scalar, or string array. Values specified as TreatAsMissing are substituted with the value defined in the MissingValue property. For instance, if MissingValue is defined to be aNaN, and the TreatAsMissing is specified as "NA". Then, in the imported data, all occurrences of "NA" are replaced byNaN.

This option only applies to numeric fields. Also, this property is equivalent to the TreatAsEmpty name-value pair argument for the textscan function.

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Example: "NA"

Example: {"-",""}

Data Types: char | cell | string

Value for missing numeric fields in delimited text files, specified as a scalar. This property is equivalent to theEmptyValue name-value pair argument for thetextscan function.

Data Types: double

Advanced Text Format Properties

Data field format, specified as a cell array of character vectors or a string array, where each character vector or string contains one conversion specifier.

When you specify or modify the TextscanFormats property, you can use the same conversion specifiers that the textscan function accepts for the formatSpec argument. Valid values forTextscanFormats include conversion specifiers that skip fields using an asterisk (*) character and ones that skip literal text. The number of conversion specifiers must match the number of variables in the VariableNames property.

Example: {"%s","%s","%f"}

Data Types: cell | string

Exponent characters, specified as a character vector or string scalar. The default exponent characters are e,E, d, andD.

Data Types: char | string

Comment character used to distinguish comments in the file, specified as character vector, string scalar, or two-element string or cell array.

When reading from a TabularTextDatastore, theread function checks for comments only at the start of each field, not within a field.

When you change the value of this property, thedatastore function reevaluates the values of theTabularTextDatastore properties.

Example: "CommentStyle",{"/*", "*/"}

Data Types: char | cell | string

White-space characters, specified as a character vector or a string scalar of one or more characters.

When you specify one of the following escape sequences as any white-space character, the datastore function converts that sequence to the corresponding control character.

\b Backspace
\n Newline
\r Carriage return
\t Tab
\\ Backslash (\)

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Example: " \b\t"

Data Types: char | string

Multiple delimiter handling, specified as eithertrue or false. Iftrue, then datastore treats consecutive delimiters as a single delimiter. Repeated delimiters separated by white-space are also treated as a single delimiter.

When you change the value of this property, the datastore function reevaluates the values of the TabularTextDatastore properties.

Properties for preview, read, readall Table

Variables to read from the file, specified as a cell array of character vectors or a string array, where each character vector or string contains the name of one variable. You can specify the variable names in any order.

To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value of theVariableNamingRule parameter to"preserve".

Example: ["Var3","Var7","Var4"]

Data Types: cell | string

Formats of the selected variables to read, specified as a cell array of character vectors or a string array, where each character vector or string contains one conversion specifier. The variables to read are indicated by the SelectedVariableNames property. The number of character vectors or strings inSelectedFormats must match the number of variables to read.

You can use the same conversion specifiers that thetextscan function accepts, including specifiers that skip literal text. However, you cannot use a conversion specifier that skips a field. That is, the conversion specifier cannot include an asterisk character (*).

Example: ["%d","%d"]

Data Types: cell | string

Amount of data to read in a call to the read function, specified as a positive scalar or "file".

When you change ReadSize from a numeric scalar to "file" or vice versa, MATLAB resets the datastore to the state where no data has been read from it.

Data Types: double | char | string

Output data type of text variables, specified as"char" or "string".TextType specifies the data type of text variables formatted with %s, %q, or [...].

Data Types: char | string

Name of row times variable, specified as the name-value argument consisting of"RowTimes" and a variable name (such as"Date") or a variable index (such as 3).

RowTimes is a timetable-related parameter. Each row of a timetable is associated with a time, which is captured in a time vector for the timetable. The variable specified in RowTimes must contain adatetime or a duration vector.

If the value of "OutputType" is "timetable", but you do not specify "RowTimes", then TabularTextDatastore uses the first datetime or duration variable as the row times for the timetable.

Properties for use by writeall

This property is read-only.

Folders used to construct datastore, returned as a cell array of character vectors. The cell array is oriented as a column vector. Each character vector is a path to a folder that contains data files. Thelocation argument in thetabularTextDatastore anddatastore functions definesFolders when the datastore is created.

The Folders property is reset when you modify theFiles property of aTabularTextDatastore object.

Data Types: cell

This property is read-only.

Formats supported for writing, returned as a row vector of strings. This property specifies the possible output formats when using writeall to write output files from the datastore.

This property is read-only.

Default output format, returned as a string scalar. This property specifies the default format when using writeall to write output files from the datastore.

Data Types: string

Object Functions

Examples

collapse all

Using either a FileSet object or a file path, create a datastore from the sample file airlinesmall_subset.csv. Specify how to treat missing values. Select only variables of interest.

Create a FileSet object from the file airlinesmall_subset.csv. Create a datastore where values of "NA" are treated as missing and missing values are treated as 0.

fs = matlab.io.datastore.FileSet("airlinesmall_subset.csv"); ds = tabularTextDatastore(fs,"TreatAsMissing","NA",... "MissingValue",0);

View the variables in the datastore.

ans = 1×19 cell {'Year'} {'Month'} {'DayofMonth'} {'DayOfWeek'} {'DepTime'} {'CRSDepTime'} {'ArrTime'} {'CRSArrTime'} {'UniqueCarrier'} {'FlightNum'} {'TailNum'} {'ActualElapsedTime'} {'CRSElapsedTime'} {'AirTime'} {'ArrDelay'} {'DepDelay'} {'Origin'} {'Dest'} {'Distance'}

Modify the SelectedVariableNames property to specify the variables of interest.

ds.SelectedVariableNames = ["Year","Month","TailNum"];

Alternatively, you can specify the location of your data as a file path.

ds = tabularTextDatastore("airlinesmall_subset.csv");

Using either location format, you can also specify your variables of interest when you create the datastore.

ds = tabularTextDatastore("airlinesmall_subset.csv","TreatAsMissing","NA",... "MissingValue",0,"SelectedVariableNames",["Year","Month","TailNum"]);

Create a datastore from the sample file airlinesmall_subset.csv, which contains tabular data.

ds = tabularTextDatastore("airlinesmall_subset.csv","TreatAsMissing","NA",... "MissingValue",0);

Specify the variables of interest.

ds.SelectedVariableNames = ["Year","Month","UniqueCarrier"];

View the SelectedFormats property.

ans = 1×3 cell {'%f'} {'%f'} {'%q'}

The SelectedFormats property specifies how the tabularTextDatastore function interprets the format of the variables. The Year and Month variables are read as columns of floating-point values and the UniqueCarrier variable as a column of text.

Modify the SelectedFormats property to read the first two variables as signed integers and the third variable as a categorical value.

ds.SelectedFormats = ["%d","%d","%C"];

Preview the data.

T=8×3 table Year Month UniqueCarrier ____ _____ _____________

1996      1           HP      
1996      1           HP      
1996      1           HP      
1996      1           HP      
1996      1           US      
1996      1           US      
1996      1           US      
1996      1           NW      

Use the OutputType and RowTimes name-value pairs to make tabulartextDatastore return timetables instead of tables.

Create a datastore for outages.csv. Specify the "OutputType" name-value pair as "timetable".

ttds = tabularTextDatastore("outages.csv","OutputType","timetable"); preview(ttds)

ans=8×5 timetable OutageTime Region Loss Customers RestorationTime Cause
________________ _____________ ______ __________ ________________ ___________________

2002-02-01 12:18    {'SouthWest'}    458.98    1.8202e+06    2002-02-07 16:50    {'winter storm'   }
2003-01-23 00:49    {'SouthEast'}    530.14    2.1204e+05                 NaT    {'winter storm'   }
2003-02-07 21:15    {'SouthEast'}     289.4    1.4294e+05    2003-02-17 08:14    {'winter storm'   }
2004-04-06 05:44    {'West'     }    434.81    3.4037e+05    2004-04-06 06:10    {'equipment fault'}
2002-03-16 06:18    {'MidWest'  }    186.44    2.1275e+05    2002-03-18 23:23    {'severe storm'   }
2003-06-18 02:49    {'West'     }         0             0    2003-06-18 10:54    {'attack'         }
2004-06-20 14:39    {'West'     }    231.29           NaN    2004-06-20 19:16    {'equipment fault'}
2002-06-06 19:28    {'West'     }    311.86           NaN    2002-06-07 00:51    {'equipment fault'}

When you do not also specify 'RowTimes', tabularTextDatastore uses the first datetime or duration variable as the row times. In this case, the OutageTime variable is used for the row times.

Specify the 'RowTimes' option to use the restoration times (RestorationTime variable) as the row times, instead of the time of the power outages.

ttds = tabularTextDatastore("outages.csv","OutputType","timetable","RowTimes","RestorationTime"); preview(ttds)

ans=8×5 timetable RestorationTime Region OutageTime Loss Customers Cause
________________ _____________ ________________ ______ __________ ___________________

2002-02-07 16:50    {'SouthWest'}    2002-02-01 12:18    458.98    1.8202e+06    {'winter storm'   }
NaT                 {'SouthEast'}    2003-01-23 00:49    530.14    2.1204e+05    {'winter storm'   }
2003-02-17 08:14    {'SouthEast'}    2003-02-07 21:15     289.4    1.4294e+05    {'winter storm'   }
2004-04-06 06:10    {'West'     }    2004-04-06 05:44    434.81    3.4037e+05    {'equipment fault'}
2002-03-18 23:23    {'MidWest'  }    2002-03-16 06:18    186.44    2.1275e+05    {'severe storm'   }
2003-06-18 10:54    {'West'     }    2003-06-18 02:49         0             0    {'attack'         }
2004-06-20 19:16    {'West'     }    2004-06-20 14:39    231.29           NaN    {'equipment fault'}
2002-06-07 00:51    {'West'     }    2002-06-06 19:28    311.86           NaN    {'equipment fault'}

Limitations

Version History

Introduced in R2014b

expand all

You can create a datastore from multiple remote files. Specify multiple remote files using wildcard characters anywhere in the URL.

You can read data from primary online sources by performing datastore operations over an internet URL.