FileDatastore - Datastore with custom file reader - MATLAB

Datastore with custom file reader

Description

Use a FileDatastore object to manage large collections of custom format files where the collection does not necessarily fit in memory, or when a single large custom format file does not fit in memory. You can create a FileDatastore object using the fileDatastore function, specify its properties, and then import and process the data using object functions.

Creation

Syntax

Description

`fds = fileDatastore(location,"ReadFcn",@fcn)` creates a datastore from the collection of files specified by location and uses the function fcn to read the data from the files.

`fds = fileDatastore(location,"ReadFcn",@fcn,Name,Value)` specifies additional parameters and properties for fds using one or more name-value pair arguments. For example, you can specify which files to include in the datastore depending on their extensions with fileDatastore(location,"ReadFcn",@customreader,"FileExtensions",[".exts",".extx"]).

Input Arguments

location — Files or folders to include in datastore

FileSet | DsFileSet object | string array | character vector | cell array of character vectors

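Files or folders to include in the datastore, specified as a FileSet object, a DsFileSet object, or file paths given as a string array, character vector, or cell array of character vectors.

Files or folders can be local or remote.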

When you specify a folder, the datastore includes only files with supported file formats and ignores files with any other format. To specify a custom list of file extensions to include in your datastore, see the FileExtensions name-value argument.

Example: "file1.ext"

Example: "../dir/data/file1.ext"

Example: ["C:\dir\data\file1.exts","C:\dir\data\file2.extx"]

Example: "C:\dir\data\*.ext"

@fcn — Function that reads file data

function handle

Function that reads the file data, specified as a function handle.

The signature of the function represented by the function handle @fcn depends on the value of the specified ReadMode. The function that reads the file data must conform to one of these signatures.

ReadMode: "file" (default)
The function must have this signature:
function data = MyReadFcn(filename) ... end
filename - Name of file to read.
data - Corresponding file data.

ReadMode: "partialfile"
The function must have this signature:
function [data,userdata,done] = MyReadFcn(filename,userdata) ... end
userdata - Set and read fields of userdata to persist data between multiple FileDatastore read calls.
done - Set this logical argument to either true or false. false: continue to read the current file. true: terminate the current file read and read the next file.
data - Portion of file data.

ReadMode: "bytes"
The function must have this signature:
function data = MyReadFcn(filename,offset,size) ... end
offset - Specify the byte offset from the first byte in the file.
size - Specify the number of bytes to read during the current read operation.
data - Portion of file data of the size specified in BlockSize.
The FileDatastore increments both the offset and size inputs based on the value specified in BlockSize.

The value specified in @fcn sets the value of the ReadFcn property.

Example: @customreader

Data Types: function_handle
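As a minimal sketch of the default "file" signature, the script below writes a small text file and reads it back through a custom reader. The function name readTextLines, the sample text, and the temporary file are assumptions made for illustration; they are not part of the FileDatastore interface.

```matlab
% Create a small text file to read (hypothetical sample data).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'alpha\nbeta\ngamma\n');
fclose(fid);

% ReadMode defaults to "file", so each call to read returns one whole file.
fds = fileDatastore(fname, "ReadFcn", @readTextLines);
textLines = read(fds);      % cell array of character vectors, one per line
disp(numel(textLines))

delete(fname);

function data = readTextLines(filename)
% Hypothetical reader with the "file" signature: takes a file name and
% returns the complete contents of that file.
    txt = fileread(filename);
    data = strsplit(strtrim(txt), newline);
end
```

Local functions in a script must appear at the end of the file, which is why readTextLines follows the calling code.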

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: fds = fileDatastore("C:\dir\data",FileExtensions=[".exts",".extx"])

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: fds = fileDatastore("C:\dir\data","FileExtensions",[".exts",".extx"])

IncludeSubfolders — Subfolder inclusion flag

true or false | 0 or 1

Subfolder inclusion flag, specified as the comma-separated pair consisting of "IncludeSubfolders" and true, false, 0, or 1. Specify true to include all files and subfolders within each folder or false to include only the files within each folder.

If you do not specify "IncludeSubfolders", then the default value is false.

Example: "IncludeSubfolders",true

Data Types: logical | double

FileExtensions — Custom format file extensions

character vector | cell array of character vectors | string scalar | string array

Custom format file extensions, specified as the comma-separated pair consisting of "FileExtensions" and a character vector, cell array of character vectors, string scalar, or string array.

When you specify a file extension, the fileDatastore function creates a datastore object only for files with the specified extension. You can also create a datastore for files without any extensions by specifying "FileExtensions" as an empty character vector, ''. If you do not specify "FileExtensions", then fileDatastore automatically includes all files within a folder.

Example: "FileExtensions",''

Example: "FileExtensions",".ext"

Example: "FileExtensions",[".exts",".extx"]

Data Types: char | cell | string
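The interaction between "IncludeSubfolders" and "FileExtensions" can be sketched with a throwaway folder tree. The folder layout and file names below are hypothetical, and the built-in fileread function stands in for a custom reader.

```matlab
% Build a temporary folder tree (all names here are hypothetical).
root = tempname;
mkdir(fullfile(root, 'sub'));
names = {fullfile(root, 'a.log'), ...
         fullfile(root, 'b.dat'), ...
         fullfile(root, 'sub', 'c.log')};
for k = 1:numel(names)
    fid = fopen(names{k}, 'w');
    fprintf(fid, 'sample %d', k);
    fclose(fid);
end

% Include subfolders, but keep only files with the .log extension.
fds = fileDatastore(root, "ReadFcn", @fileread, ...
    "IncludeSubfolders", true, "FileExtensions", ".log");
disp(numel(fds.Files))   % counts a.log and sub/c.log, but not b.dat

rmdir(root, 's');
```

With "IncludeSubfolders" left at its default of false, only a.log would be expected in fds.Files.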

PreviewFcn — Function to preview input data

@ReadFcn (default) | function handle

Function to preview the input data, specified as a function handle.

If you do not specify a preview function, FileDatastore uses the value specified in @ReadFcn as the default preview function. Alternatively, you can specify your own custom preview function for your data.

The function specified by PreviewFcn must return values with the same data types that the ReadFcn returns.

Data Types: function_handle
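A custom preview function is useful when reading a whole file is expensive. The sketch below assumes a text file and a hypothetical function firstLine that returns only the first line; both it and the ReadFcn (fileread) return character vectors, so the data types match.

```matlab
% Write a small text file (hypothetical sample data).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'header line\nrow 1\nrow 2\n');
fclose(fid);

% ReadFcn returns the whole file as char; PreviewFcn returns only the
% first line, also as char.
fds = fileDatastore(fname, "ReadFcn", @fileread, "PreviewFcn", @firstLine);
previewText = preview(fds);
disp(previewText)

delete(fname);

function txt = firstLine(filename)
% Hypothetical preview function: read one line, then stop.
    fid = fopen(filename, 'r');
    closer = onCleanup(@() fclose(fid));   % close the file on exit
    txt = fgetl(fid);
end
```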

ReadMode — Portion of the file to read

"file" (default) | "partialfile" | "bytes"

Portion of the file to read, specified as "file", "partialfile", or "bytes".

"file" (default)
Use read mode "file" when your custom function, specified in ReadFcn, reads the complete file in one read operation. Based on your custom read function, the file datastore reads the complete file with each call to read. The unit of parallelization is a complete file.

"partialfile"
Use read mode "partialfile" when your custom file read function, specified in ReadFcn, reads only a portion of the file with each read operation. Based on your custom read function, the file datastore reads only a portion of the file with every call to the read function. In the "partialfile" read mode, the unit of parallelization is a complete file. Multiple read operations, in serial, are necessary to read a complete file.

"bytes"
Use read mode "bytes" when your custom function, specified in ReadFcn, reads a BlockSize sized portion of the file with each read operation. FileDatastore sets the unit of parallelization to a block of the file containing the number of bytes specified by BlockSize. Based on your custom read function, the file datastore reads BlockSize sized portions of a file with every call to the read function. Multiple read operations in parallel are necessary to read a complete file.

To use the subset and shuffle functions on a FileDatastore object, you must set "ReadMode" to "file".

Data Types: char | string
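The requirement that subset and shuffle need the "file" read mode can be illustrated with a small sketch. The three temporary text files are hypothetical sample data, and fileread stands in for a custom reader.

```matlab
% Three throwaway text files (hypothetical sample data).
files = cell(3, 1);
for k = 1:3
    files{k} = [tempname '.txt'];
    fid = fopen(files{k}, 'w');
    fprintf(fid, 'file %d', k);
    fclose(fid);
end

% ReadMode defaults to "file", which subset and shuffle require.
fds = fileDatastore(files, "ReadFcn", @fileread);

firstAndLast = subset(fds, [1 3]);   % keep the first and third files
shuffled = shuffle(fds);             % same files in random order

disp(numel(firstAndLast.Files))
disp(numel(shuffled.Files))

delete(files{:});
```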

BlockSize — Number of bytes to read

positive integer

Number of bytes to read with every read operation, specified as a positive integer.

To ensure that you can distribute multiple blocks of a file across multiple parallel MATLAB® workers, specify BlockSize as a positive integer greater than 131072 bytes (128 kilobytes).

To specify or to change the value of BlockSize, you must first set ReadMode to "bytes". FileDatastore sets the default value of BlockSize based on the value specified in ReadMode.
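The following sketch shows one way a "bytes" reader might look, using fseek and fread to return the requested block. The function name readBlock, the 256-byte sample file, and the 64-byte block size are assumptions for illustration; the small block size keeps the example short and is well below the 131072-byte threshold mentioned above for parallel distribution.

```matlab
% Write 256 bytes to a temporary binary file (hypothetical sample data).
fname = [tempname '.bin'];
fid = fopen(fname, 'w');
fwrite(fid, 0:255, 'uint8');
fclose(fid);

% BlockSize is specified together with ReadMode "bytes".
fds = fileDatastore(fname, "ReadFcn", @readBlock, ...
    "ReadMode", "bytes", "BlockSize", 64);

% Read block by block and count the bytes returned.
totalBytes = 0;
while hasdata(fds)
    block = read(fds);
    totalBytes = totalBytes + numel(block);
end
disp(totalBytes)   % should match the size of the file in bytes

delete(fname);

function data = readBlock(filename, offset, numBytes)
% Hypothetical reader with the "bytes" signature: seek to the byte
% offset, then read at most numBytes bytes.
    fid = fopen(filename, 'r');
    closer = onCleanup(@() fclose(fid));   % close the file on exit
    fseek(fid, offset, 'bof');
    data = fread(fid, numBytes, '*uint8');
end
```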

AlternateFileSystemRoots — Alternate file system root paths

string vector | cell array

Alternate file system root paths, specified as the name-value argument consisting of "AlternateFileSystemRoots" and a string vector or a cell array. Use "AlternateFileSystemRoots" when you create a datastore on a local machine, but need to access and process the data on another machine (possibly of a different operating system). Also, when processing data using the Parallel Computing Toolbox™ and the MATLAB Parallel Server™, and the data is stored on your local machines with a copy of the data available on different platform cloud or cluster machines, you must use "AlternateFileSystemRoots" to associate the root paths.

The value of "AlternateFileSystemRoots" must satisfy these conditions:

For more information, see Set Up Datastore for Processing on Different Machines or Clusters.

Example: ["Z:\datasets","/mynetwork/datasets"]

Data Types: string | cell

Properties

FileDatastore properties describe the files associated with a FileDatastore object. Except for the Files property, you can specify the value of FileDatastore properties using name-value pair arguments. To view or modify a property after creating the object, use dot notation.
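Dot notation on a FileDatastore object can be sketched as follows. The temporary file and the replacement reader (an anonymous function that upper-cases the text) are assumptions chosen only to make the change in ReadFcn visible.

```matlab
% Write a small text file (hypothetical sample data).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'hello');
fclose(fid);

fds = fileDatastore(fname, "ReadFcn", @fileread);

% View properties with dot notation.
disp(fds.ReadMode)
disp(numel(fds.Files))

% Modify a property after creation: swap in a different reader.
fds.ReadFcn = @(f) upper(fileread(f));
reset(fds);
txt = read(fds);
disp(txt)

delete(fname);
```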

Files — Files included in datastore

character vector | cell array of character vectors | string scalar | string array

Files included in the datastore, resolved as a character vector, cell array of character vectors, string scalar, or string array, where each character vector or string is a full path to a file. The location argument in the fileDatastore and datastore functions defines Files when the datastore is created.

Example: {"C:\dir\data\file1.ext";"C:\dir\data\file2.ext"}

Example: "hdfs:///data/*.mat"

Data Types: char | cell | string

Folders — Folders used to construct datastore

cell array of character vectors

This property is read-only.

Folders used to construct the datastore, returned as a cell array of character vectors. The cell array is oriented as a column vector. Each character vector is a path to a folder that contains data files. The location argument in the fileDatastore and datastore functions defines Folders when the datastore is created.

The Folders property is reset when you modify the Files property of a FileDatastore object.

Data Types: cell

ReadFcn — Function that reads file data

function handle

Function that reads the file data, specified as a function handle.

The value specified by @fcn sets the value of the ReadFcn property.

Example: @MyCustomFileReader

Data Types: function_handle

UniformRead — Vertically concatenateable flag

false (default) | true

This property is read-only.

Vertically concatenateable flag, specified as a logical true or false. Specify the value of this property when you first create the FileDatastore object.

true
Multiple reads of the FileDatastore object return uniform data that is vertically concatenateable. When the UniformRead property value is true:
The ReadFcn function must return data that is vertically concatenateable; otherwise, the readall method returns an error.
The underlying data type of the output of the tall function is the same as the data type of the output from ReadFcn.

false (default)
Multiple reads of the FileDatastore object do not return uniform data that is vertically concatenateable. When the UniformRead property value is false:
readall returns a cell array.
tall returns a tall cell array.

Example: fds = fileDatastore(location,"ReadFcn",@load,"UniformRead",true)

Data Types: logical | double
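The effect of UniformRead on readall can be sketched with two small CSV files. The file contents are hypothetical, and readmatrix and writematrix (available in R2019a and later) stand in for a custom reader and writer.

```matlab
% Two CSV files with the same number of columns (hypothetical data).
f1 = [tempname '.csv'];
f2 = [tempname '.csv'];
writematrix([1 2; 3 4], f1);
writematrix([5 6], f2);

% UniformRead = true: readall vertically concatenates the per-file data.
uniformDs = fileDatastore({f1, f2}, "ReadFcn", @readmatrix, ...
    "UniformRead", true);
stacked = readall(uniformDs);
disp(size(stacked))

% UniformRead = false (default): readall returns one cell per file.
defaultDs = fileDatastore({f1, f2}, "ReadFcn", @readmatrix);
perFile = readall(defaultDs);
disp(class(perFile))

delete(f1, f2);
```

Both files have two columns, so the per-file matrices can be stacked; files with differing column counts would not be vertically concatenateable.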

SupportedOutputFormats — Formats supported for writing

string row vector

This property is read-only.

Formats supported for writing, returned as a row vector of strings. This property specifies the possible output formats when using writeall to write output files from the datastore.

Object Functions

Examples

Create FileDatastore Object

Create a fileDatastore object using either a FileSet object or file paths.

Create a FileSet object. Create a fileDatastore object.

fs = matlab.io.datastore.FileSet("airlinesmall.parquet");
fds = fileDatastore(fs,"ReadFcn",@load)

fds =
  FileDatastore with properties:
    Files: { ' ...\matlab\toolbox\matlab\demos\airlinesmall.parquet' }
    Folders: { '...\matlab\toolbox\matlab\demos' }
    UniformRead: 0
    ReadMode: 'file'
    BlockSize: Inf
    PreviewFcn: @load
    SupportedOutputFormats: ["txt" "csv" "xlsx" "xls" "parquet" "parq" "png" "jpg" "jpeg" "tif" "tiff" "wav" "flac" "ogg" "mp4" "m4a"]
    ReadFcn: @load
    AlternateFileSystemRoots: {}

Alternatively, you can use file paths to create your fileDatastore object.

fds = fileDatastore("airlinesmall.parquet","ReadFcn",@load);

Read Datastore of MAT-Files

Create a datastore containing all the .mat files within the MATLAB® demos folder, specifying the load function to read the file data.

fds = fileDatastore(fullfile(matlabroot,"toolbox","matlab","demos"),"ReadFcn",@load,"FileExtensions",".mat")

fds =
  FileDatastore with properties:
    Files: { '...\matlab\toolbox\matlab\demos\accidents.mat'; '...\matlab\toolbox\matlab\demos\airfoil.mat'; ' ...\matlab\toolbox\matlab\demos\airlineResults.mat' ... and 38 more }
    Folders: { '...\matlab\toolbox\matlab\demos' }
    UniformRead: 0
    ReadMode: 'file'
    BlockSize: Inf
    PreviewFcn: @load
    SupportedOutputFormats: ["txt" "csv" "xlsx" "xls" "parquet" "parq" "png" "jpg" "jpeg" "tif" "tiff" "wav" "flac" "ogg" "mp4" "m4a"]
    ReadFcn: @load
    AlternateFileSystemRoots: {}

Read the first file in the datastore, and then read the second file.

data1 = read(fds);
data2 = read(fds);

Read all files in the datastore simultaneously by calling the readall function on fds.

Alternatively, read the files one at a time. Initialize a cell array to hold the data and a counter i.

dataarray = cell(numel(fds.Files), 1);
i = 1;

Reset the datastore to the first file and read the files one at a time until there is no data left. Assign the data to the array dataarray.

reset(fds);
while hasdata(fds)
    dataarray{i} = read(fds);
    i = i+1;
end

Read One Array at a Time From Large MAT-File

You can create a datastore to read from a large MAT-file that does not necessarily fit in memory. Assuming that each array in the large MAT-file fits in the available memory, create a datastore to read and process the data in three steps:

  1. Write a custom reading function that reads one array at a time from a MAT-file.
  2. Set up the parameters of the datastore function to perform partial reads.
  3. Read one array at a time from the MAT-file.

Write a custom function that reads one array at a time from a MAT-file. The function must have a signature as described in the @ReadFcn argument of FileDatastore. Save this file in your working folder or in a folder that is on the MATLAB path. For this example, a custom function load_variable is included here.

function [data,variables,done] = load_variable(filename,variables)

% If variable list is empty, 
% create list of variables from the file
if isempty(variables) 
    variables = who('-file', filename);
end

% Load a variable from the list of variables
data = load(filename, variables{1});

% Remove the newly-read variable from the list
variables(1) = []; 

% Move on to the next file if this file is done reading.
done = isempty(variables); 

end

Create and set up a FileDatastore containing accidents.mat. Specify the datastore parameters to use "partialfile" as the read mode and load_variable as the custom reading function.

fds = fileDatastore("accidents.mat","ReadMode","partialfile","ReadFcn",@load_variable);

Read the first three variables from the file using the datastore. The file accidents.mat contains nine variables and every call to read returns one variable. Therefore, to get the first three variables, call the read function three times.

data = read(fds)

data = struct with fields:
    datasources: {3x1 cell}

data = read(fds)

data = struct with fields:
    hwycols: 17

data = read(fds)

data = struct with fields:
    hwydata: [51x17 double]

Note that the sample file accidents.mat is small and fits in memory, but you can expect similar results for large MAT-files that do not fit in memory.
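To read every variable rather than the first three, the "partialfile" pattern can be combined with a hasdata loop. The sketch below is self-contained: it saves a small MAT-file with three hypothetical variables (a, b, c) and uses a local copy of the reader, named loadOneVariable here, that follows the same logic as load_variable.

```matlab
% Save a small MAT-file with three variables (hypothetical sample data).
fname = [tempname '.mat'];
a = 1;
b = 'two';
c = [3 3 3];
save(fname, 'a', 'b', 'c');

fds = fileDatastore(fname, "ReadMode", "partialfile", ...
    "ReadFcn", @loadOneVariable);

% Each read returns a struct holding one variable; loop until done.
numReads = 0;
while hasdata(fds)
    s = read(fds);
    numReads = numReads + 1;
    disp(fieldnames(s))
end

delete(fname);

function [data, variables, done] = loadOneVariable(filename, variables)
% Same logic as load_variable: on the first call for a file the list is
% empty, so build it from the file; then load and remove one name.
    if isempty(variables)
        variables = who('-file', filename);
    end
    data = load(filename, variables{1});
    variables(1) = [];
    done = isempty(variables);
end
```

The loop performs one read per variable in the file, so the same code applies unchanged to MAT-files with many large arrays.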

Limitations

Tips

Version History

Introduced in R2016a

R2024b: Read data over HTTP and HTTPS using datastore functions

You can read data from primary online sources by performing datastore operations over an internet URL.