datastore - Create datastore for large collections of data - MATLAB

Create datastore for large collections of data

Syntax

ds = datastore(location)
ds = datastore(location,Name,Value)

Description

ds = datastore(location) creates a datastore from the collection of data specified by location. A datastore is a repository for collections of data that are too large to fit in memory. After creating ds, you can read and process the data.


ds = datastore(location,Name,Value) specifies additional parameters for ds using one or more name-value pair arguments. For example, you can create a datastore for image files by specifying 'Type','image'.


Examples


Create Datastore for Text Data

Create a datastore associated with the sample file airlinesmall.csv. This file contains airline data from the years 1987 through 2008.

To manage the import of missing data in numeric columns, use the "TreatAsMissing" and "MissingValue" name-value arguments. Replace every instance of "NA" with 0 in the imported data by specifying the value of "TreatAsMissing" as "NA" and the value of "MissingValue" as 0.

ds = datastore("airlinesmall.csv","TreatAsMissing","NA",...
    "MissingValue",0)

ds = TabularTextDatastore with properties:

                      Files: {
                             'B:\matlab\toolbox\matlab\demos\airlinesmall.csv'
                             }
                    Folders: {
                             'B:\matlab\toolbox\matlab\demos'
                             }
               FileEncoding: 'UTF-8'
   AlternateFileSystemRoots: {}
         VariableNamingRule: 'modify'
          ReadVariableNames: true
              VariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
             DatetimeLocale: en_US

  Text Format Properties:
             NumHeaderLines: 0
                  Delimiter: ','
               RowDelimiter: '\r\n'
             TreatAsMissing: 'NA'
               MissingValue: 0

  Advanced Text Format Properties:
            TextscanFormats: {'%f', '%f', '%f' ... and 26 more}
                   TextType: 'char'
         ExponentCharacters: 'eEdD'
               CommentStyle: ''
                 Whitespace: ' \b\t'
    MultipleDelimitersAsOne: false

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
            SelectedFormats: {'%f', '%f', '%f' ... and 26 more}
                   ReadSize: 20000 rows
                 OutputType: 'table'
                   RowTimes: []

  Write-specific Properties:
     SupportedOutputFormats: ["txt" "csv" "xlsx" "xls" "parquet" "parq"]
        DefaultOutputFormat: "txt"

datastore creates a TabularTextDatastore.
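After the datastore is created, the data can be processed one block at a time instead of being loaded all at once. The following is a minimal sketch, assuming airlinesmall.csv is on the MATLAB path as in the example above. The variable names totalDelay, numRows, and meanDelay are illustrative and are not part of the original example.

```matlab
% Create the datastore as in the example above.
ds = datastore("airlinesmall.csv","TreatAsMissing","NA", ...
    "MissingValue",0);

% Read only the column of interest to reduce memory use.
ds.SelectedVariableNames = "ArrDelay";

totalDelay = 0;   % running sum of arrival delays (illustrative name)
numRows = 0;      % running count of rows read (illustrative name)

% Each call to read returns at most ReadSize rows as a table.
while hasdata(ds)
    t = read(ds);
    totalDelay = totalDelay + sum(t.ArrDelay);
    numRows = numRows + height(t);
end

% Mean over all rows; "NA" entries were imported as 0 and count as 0 here.
meanDelay = totalDelay / numRows;
```

Because "MissingValue" is 0 in this sketch, missing delays pull the mean toward zero. Leaving "MissingValue" at its default and skipping missing entries when summing may be preferable for real analysis.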

Create Datastore for Image Data

Create a datastore containing all .tif files in the MATLAB® path and its subfolders.

ds = datastore(fullfile(matlabroot,"toolbox","matlab"),...
    "IncludeSubfolders",true,"FileExtensions",".tif","Type","image")

ds = ImageDatastore with properties:

                      Files: {
                             'H:\matlab\toolbox\matlab\demos\example.tif';
                             'H:\matlab\toolbox\matlab\imagesci\corn.tif'
                             }
                    Folders: {
                             'H:\matlab\toolbox\matlab'
                             }
   AlternateFileSystemRoots: {}
                   ReadSize: 1
                     Labels: {}
     SupportedOutputFormats: ["png" "jpg" "jpeg" "tif" "tiff"]
        DefaultOutputFormat: "png"
                    ReadFcn: @readDatastoreImage
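The images in the datastore can then be read one file at a time. This is a minimal sketch, assuming the same folder as the example above; the files found depend on the MATLAB installation, so no particular output is shown.

```matlab
% Create the image datastore as in the example above.
ds = datastore(fullfile(matlabroot,"toolbox","matlab"), ...
    "IncludeSubfolders",true,"FileExtensions",".tif","Type","image");

% With ReadSize equal to 1, each call to read returns one image
% and a structure describing the file it came from.
while hasdata(ds)
    [img,info] = read(ds);
    fprintf("%s: %d-by-%d\n",info.Filename,size(img,1),size(img,2));
end

% Return to the first file so the datastore can be read again.
reset(ds);
```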

Input Arguments


location — Files or folders to include in datastore

FileSet | DsFileSet object | string array | character vector | cell array of character vectors

Files or folders to include in the datastore, specified as a FileSet object, a DsFileSet object, a string array, a character vector, or a cell array of character vectors.

Files or folders can be local or remote.

When you specify a folder, the datastore includes only files with supported file formats and ignores files with any other format. To specify a custom list of file extensions to include in your datastore, see the FileExtensions name-value argument.

For KeyValueDatastore, the location argument must be MAT files or Sequence files generated by the mapreduce function. MAT files must be in a local file system or in a network file system. Sequence files can be in a local, network, or HDFS™ file system. For a datastore of type DatabaseDatastore, the location argument need not be files. For more information, see DatabaseDatastore (Database Toolbox).

Example: "file1.csv"

Example: "../dir/data/file1.jpg"

Example: ["C:\dir\data\file1.xls","C:\dir\data\file2.xlsx"]

Example: "C:\dir\data\*.mat"

Example: "hdfs:///data/file1.txt"

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: datastore("myfolder",FileExtensions=[".jpg",".tif"]) includes all files with a .jpg or .tif extension for an ImageDatastore object.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: datastore("myfolder","FileExtensions",[".jpg",".tif"]) includes all files with a .jpg or .tif extension for an ImageDatastore object.
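The two name-value syntaxes are interchangeable. The following sketch, which assumes R2021a or later and reuses the MATLAB installation folder from the image example, builds the same datastore both ways. The variable names folder, ds1, ds2, and sameFiles are illustrative.

```matlab
folder = fullfile(matlabroot,"toolbox","matlab");

% R2021a and later: Name=Value syntax with unquoted names.
ds1 = datastore(folder,Type="image",IncludeSubfolders=true, ...
    FileExtensions=".tif");

% All releases: comma-separated pairs with quoted names.
ds2 = datastore(folder,"Type","image","IncludeSubfolders",true, ...
    "FileExtensions",".tif");

% Both calls resolve the same list of files.
sameFiles = isequal(ds1.Files,ds2.Files);
```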

Type — Type of datastore

'tabulartext' | 'image' | 'spreadsheet' | 'keyvalue' | 'file' | 'tall' | ...

Type of datastore, specified as the comma-separated pair consisting of 'Type' and one of the following:

| Value of 'Type' | Description |
| --- | --- |
| 'tabulartext' | Text files containing tabular data. The encoding of the data must be ASCII or UTF-8. |
| 'image' | Image files in a format such as JPEG or PNG. Acceptable files include imformats formats. |
| 'spreadsheet' | Spreadsheet files containing one or more sheets. |
| 'keyvalue' | Key-value pair data contained in MAT files or Sequence files with data generated by mapreduce. For more information, see KeyValueDatastore. |
| 'file' | Custom format files, which require a specified read function to read the data. For more information, see FileDatastore. |
| 'tall' | MAT files or Sequence files produced by the write function of the tall data type. For more information, see TallDatastore. |
| 'parquet' | Parquet files containing column-oriented data. For more information, see ParquetDatastore. |
| 'database' | Data stored in a database. Requires Database Toolbox™. Requires specification of an additional input argument when using the type parameter. For more information, see DatabaseDatastore (Database Toolbox). |

Data Types: char | string
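The 'Type' argument determines which datastore object is returned for the same file. As a minimal sketch, assuming airlinesmall.csv is on the MATLAB path, the same file can be opened as tabular text or as a custom-format file. Using fileread as the read function is an illustrative choice, not something the original page prescribes.

```matlab
% Same file, two datastore types.
dsTable = datastore("airlinesmall.csv","Type","tabulartext");

% The 'file' type requires a read function; fileread returns the
% whole file as a character vector (illustrative choice).
dsFile = datastore("airlinesmall.csv","Type","file","ReadFcn",@fileread);

% Display the class of each returned object.
disp(class(dsTable))
disp(class(dsFile))

% Reading from the file datastore returns whatever ReadFcn returns.
txt = read(dsFile);
```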

IncludeSubfolders — Include subfolders within folder

true or false | 0 or 1

Include subfolders within a folder, specified as the comma-separated pair consisting of 'IncludeSubfolders' and true (1) or false (0). Specify true to include all files and subfolders within each folder or false to include only the files within each folder.

If you do not specify 'IncludeSubfolders', the default value is false.

The 'IncludeSubfolders' name-value pair is only valid when creating these objects:

Example: 'IncludeSubfolders',true

Data Types: logical | double

FileExtensions — Extensions of files

character vector | cell array of character vectors | string scalar | string array

Extensions of files, specified as the comma-separated pair consisting of 'FileExtensions' and a character vector, cell array of character vectors, string scalar, or string array. When specifying 'FileExtensions', also specify 'Type'. You can use the empty quotes '' to represent files without extensions.

If 'FileExtensions' is not specified, then datastore automatically includes all supported file extensions depending on the datastore type. If you want to include unsupported extensions, then specify each extension you want to include individually.

The 'FileExtensions' name-value pair is only valid when creating these objects:

Example: 'FileExtensions','.jpg'

Example: 'FileExtensions',{'.txt','.text'}

Data Types: char | cell | string
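To illustrate how 'FileExtensions' filters a folder, the following sketch writes two small comma-separated files to a temporary folder and includes only the one with a .log extension. The folder, the file names a.log and b.csv, and their contents are all hypothetical and created only for this illustration.

```matlab
% Create a temporary folder containing two small comma-separated files.
folder = tempname;
mkdir(folder);
names = ["a.log","b.csv"];   % hypothetical file names
for k = 1:numel(names)
    fid = fopen(fullfile(folder,names(k)),"w");
    fprintf(fid,"x,y\n1,2\n3,4\n");
    fclose(fid);
end

% Specify 'Type' together with 'FileExtensions'. Only files with the
% listed extension are included.
ds = datastore(folder,"Type","tabulartext","FileExtensions",".log");

numFiles = numel(ds.Files);
onlyLog = all(endsWith(ds.Files,".log"));
```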

AlternateFileSystemRoots — Alternate file system root paths

string vector | cell array

Alternate file system root paths, specified as the name-value argument consisting of "AlternateFileSystemRoots" and a string vector or a cell array. Use "AlternateFileSystemRoots" when you create a datastore on a local machine but need to access and process the data on another machine (possibly running a different operating system). Also, when you process data using Parallel Computing Toolbox™ and MATLAB® Parallel Server™, and the data is stored on your local machines with a copy available on cloud or cluster machines on a different platform, you must use "AlternateFileSystemRoots" to associate the root paths.

The value of "AlternateFileSystemRoots" must satisfy these conditions:

For more information, see Set Up Datastore for Processing on Different Machines or Clusters.

Example: ["Z:\datasets","/mynetwork/datasets"]

Data Types: string | cell

TextType — Output data type of text variables

'char' (default) | 'string'

Output data type of text variables, specified as the comma-separated pair consisting of 'TextType' and either 'char' or 'string'. If the output table from the read, readall, or preview functions contains text variables, then 'TextType' specifies the data type of those variables for TabularTextDatastore and SpreadsheetDatastore objects only. If 'TextType' is 'char', then the output is a cell array of character vectors. If 'TextType' is 'string', then the output has type string.

Data Types: char | string
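The effect of 'TextType' can be checked by previewing a text column under both settings. This is a minimal sketch, assuming airlinesmall.csv is on the MATLAB path and that its UniqueCarrier column is imported as text. The variable names are illustrative.

```matlab
% Default: text variables are cell arrays of character vectors.
dsChar = datastore("airlinesmall.csv","TreatAsMissing","NA");
dsChar.SelectedVariableNames = "UniqueCarrier";
tChar = preview(dsChar);

% 'string': text variables are string arrays.
dsStr = datastore("airlinesmall.csv","TreatAsMissing","NA", ...
    "TextType","string");
dsStr.SelectedVariableNames = "UniqueCarrier";
tStr = preview(dsStr);

% Display the class of the text column in each case.
disp(class(tChar.UniqueCarrier))
disp(class(tStr.UniqueCarrier))
```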

DatetimeType — Type for imported date and time data

'datetime' (default) | 'text'

Type for imported date and time data, specified as the comma-separated pair consisting of 'DatetimeType' and one of these values: 'datetime' or 'text'. The 'DatetimeType' argument only applies when creating a TabularTextDatastore object.

| Value | Type for Imported Date and Time Data |
| --- | --- |
| 'datetime' | MATLAB datetime data type. For more information, see datetime. |
| 'text' | If 'DatetimeType' is specified as 'text', then the type for imported date and time data depends on the value specified in the 'TextType' parameter. If 'TextType' is 'char', then the datastore returns dates as a cell array of character vectors. If 'TextType' is 'string', then the datastore returns dates as an array of strings. |

Example: 'DatetimeType','datetime'

Data Types: char | string
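The interaction between 'DatetimeType' and 'TextType' described in the table can be sketched with a small temporary file. The file, its column names Date and Value, and its contents are hypothetical and created only for this illustration.

```matlab
% Write a small CSV file with a date column (hypothetical data).
file = [tempname '.csv'];
fid = fopen(file,"w");
fprintf(fid,"Date,Value\n2020-01-15,1\n2020-02-20,2\n");
fclose(fid);

% Import dates as text; with the default 'TextType' of 'char', the
% dates are returned as a cell array of character vectors.
dsText = datastore(file,"DatetimeType","text");
tText = preview(dsText);

% Import dates as text with 'TextType' set to 'string'; the dates
% are returned as a string array.
dsStr = datastore(file,"DatetimeType","text","TextType","string");
tStr = preview(dsStr);

disp(class(tText.Date))
disp(class(tStr.Date))
```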

DurationType — Output data type of duration data

'duration' (default) | 'text'

Output data type of duration data from text files, specified as the comma-separated pair consisting of 'DurationType' and either 'duration' or 'text'.

| Value | Type for Imported Duration Data |
| --- | --- |
| 'duration' | MATLAB duration data type. For more information, see duration. |
| 'text' | If 'DurationType' is specified as 'text', then the type for imported duration data depends on the value specified in the 'TextType' parameter. If 'TextType' is 'char', then the importing function returns duration data as a cell array of character vectors. If 'TextType' is 'string', then the importing function returns duration data as an array of strings. |

Data Types: char | string

VariableNamingRule — Flag to preserve variable names

"modify" (default) | "preserve"

Flag to preserve variable names, specified as either "modify" or "preserve".

Starting in R2019b, variable names and row names can include any characters, including spaces and non-ASCII characters. Also, they can start with any characters, not just letters. Variable and row names do not have to be valid MATLAB identifiers (as determined by the isvarname function). To preserve these variable names and row names, set the value of VariableNamingRule to "preserve". Variable names are not refreshed when the value of VariableNamingRule is changed from "modify" to "preserve".

Data Types: char | string
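The difference between the two rules shows up when the column headers are not valid MATLAB identifiers. The following sketch uses a temporary file whose headers, Sample ID and Temp (C), are hypothetical and chosen only because they contain spaces and parentheses.

```matlab
% Write a small CSV file whose headers are not valid identifiers.
file = [tempname '.csv'];
fid = fopen(file,"w");
fprintf(fid,"Sample ID,Temp (C)\n1,20.5\n2,21.0\n");
fclose(fid);

% "modify" (default): headers are converted to valid identifiers.
dsModify = datastore(file,"ReadVariableNames",true, ...
    "VariableNamingRule","modify");

% "preserve": headers are kept exactly as they appear in the file.
dsPreserve = datastore(file,"ReadVariableNames",true, ...
    "VariableNamingRule","preserve");

disp(dsModify.VariableNames)
disp(dsPreserve.VariableNames)
```

Because variable names are not refreshed when the property is changed afterward, this sketch sets the rule when the datastore is created.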

In addition to these name-value pairs, you also can specify any of the properties of the following objects as name-value pairs, except for the Files property:

Output Arguments


ds — Datastore for collection of data

TabularTextDatastore | ImageDatastore | SpreadsheetDatastore | KeyValueDatastore | FileDatastore | TallDatastore | ...

Datastore for a collection of data, returned as one of these objects: TabularTextDatastore, ImageDatastore, SpreadsheetDatastore, KeyValueDatastore, FileDatastore, TallDatastore, ParquetDatastore, or DatabaseDatastore. The type of the datastore depends on the type of files or the location argument. For more information, click the datastore name in the following table:

For each of these datastore types, the Files property is a cell array of character vectors. Each character vector is an absolute path to a file resolved by the location argument.

Limitations

Version History

Introduced in R2014b


R2024b: Read data over HTTP and HTTPS using datastore functions

You can read data from primary online sources by performing datastore operations over an internet URL.