matlab.io.datastore.Subsettable - Add subset and fine-grained parallelization support to datastore - MATLAB ([original](http://www.mathworks.com/access/helpdesk/help/matlab/ref/matlab.io.datastore.subsettable-class.html))
Namespace: matlab.io.datastore
Add subset and fine-grained parallelization support to datastore
Since R2022b
Description
matlab.io.datastore.Subsettable is an abstract mixin class that adds subset and fine-grained parallelization support to your custom datastore for use with Parallel Computing Toolbox™ and MATLAB® Parallel Server™. matlab.io.datastore.Subsettable creates fine-grained subsets with the subset method, coarse-grained partitions with the partition method, and dataset randomization with the shuffle method.
Use matlab.io.datastore.Subsettable only if you can access every data read independently for increased granularity. If not, such as in TabularTextDatastore workflows, then matlab.io.datastore.Partitionable is more appropriate.
To use this mixin class, inherit from the matlab.io.datastore.Subsettable class, in addition to inheriting from the matlab.io.Datastore base class. Type this syntax as the first line of your class definition file:
classdef MyDatastore < matlab.io.Datastore & ...
                       matlab.io.datastore.Subsettable
    ...
end
To add support for parallel processing to your custom datastore, you must:
- Inherit from the class matlab.io.datastore.Subsettable in addition to matlab.io.Datastore.
- Define the method maxpartitions.
- Define the method subsetByReadIndices.
Subsettable uses the subset method to call the implementation of subsetByReadIndices.
For more details and steps to create your custom datastore with parallel processing support, see Develop Custom Datastore.
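As a rough illustration of what these requirements enable, the following sketch spreads reads across parallel workers. It assumes that the MyHDF5Datastore class built in the Examples section of this page is on the MATLAB path, that the example.h5 sample file used on this page is available, and that Parallel Computing Toolbox is installed. The variable names (numRead, localCount) are illustrative only.

```matlab
% Sketch only: assumes MyHDF5Datastore.m (from the Examples section) is on
% the path and that Parallel Computing Toolbox is available.
ds = MyHDF5Datastore("example.h5", "/g4");

% Ask for a partition count suited to the current parallel pool.
n = numpartitions(ds, gcp);

numRead = 0;
parfor ii = 1:n
    % Each iteration works on its own independent partition of the data.
    subds = partition(ds, n, ii);
    localCount = 0;
    while hasdata(subds)
        data = read(subds); %#ok<NASGU>  (process data here)
        localCount = localCount + 1;
    end
    numRead = numRead + localCount;  % reduction across iterations
end
```

Counting reads locally and adding the total once per iteration keeps the reduction on numRead simple; in real code the loop body would process data instead of discarding it.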
Methods
Examples
Build Datastore with Subset Support
Build a datastore with subset processing support and use it to bring your data into MATLAB®.
Create a class definition file that contains the code implementing your datastore. Save this file in your working folder or in a folder that is on the MATLAB path. The name of the .m file must be the same as the name of your object constructor function. In this example, create the MyHDF5Datastore class in a file named MyHDF5Datastore.m. The .m class definition contains the following steps:
- Step 1: Inherit from the matlab.io.Datastore and matlab.io.datastore.Subsettable classes.
- Step 2: Define the constructor as well as the subsetByReadIndices and maxpartitions methods.
- Step 3: Define your custom file-reading function. Here, the MyHDF5Datastore class creates and uses the listHDF5Datasets function.
%% STEP 1
classdef MyHDF5Datastore < matlab.io.Datastore ...
                         & matlab.io.datastore.Subsettable

    properties
        Filename (1, 1) string
        Datasets (:, 1) string {mustBeNonmissing} = "/"
        CurrentDatasetIndex (1, 1) double {mustBeInteger, mustBeNonnegative} = 1
    end

    %% STEP 2
    methods
        function ds = MyHDF5Datastore(Filename, Location)
            arguments
                Filename (1, 1) string
                Location (1, 1) string {mustBeNonmissing} = "/"
            end

            ds.Filename = Filename;
            ds.Datasets = listHDF5Datasets(ds.Filename, Location);
        end

        function [data, info] = read(ds, varargin)
            if ~hasdata(ds)
                error("MyHDF5Datastore:noMoreData", "No more datasets to read.");
            end

            % Read the current dataset and wrap it in a cell so that
            % readall can vertically concatenate datasets of any size.
            dataset = ds.Datasets(ds.CurrentDatasetIndex);
            data = { h5read(ds.Filename, dataset, varargin{:}) };
            if nargout > 1
                info = h5info(ds.Filename, dataset);
            end

            % Advance to the next dataset.
            ds.CurrentDatasetIndex = ds.CurrentDatasetIndex + 1;
        end

        function tf = hasdata(ds)
            tf = ds.CurrentDatasetIndex <= numel(ds.Datasets);
        end

        function reset(ds)
            ds.CurrentDatasetIndex = 1;
        end
    end

    methods (Access = protected)
        function subds = subsetByReadIndices(ds, indices)
            % Copy the datastore and keep only the selected datasets.
            datasets = ds.Datasets(indices);
            subds = copy(ds);
            subds.Datasets = datasets;
            reset(subds);
        end

        function n = maxpartitions(ds)
            % Each dataset can be read independently.
            n = numel(ds.Datasets);
        end
    end
end

%% STEP 3
function datasets = listHDF5Datasets(filename, location, args)
    arguments
        filename (1, 1) string
        location (1, 1) string
        args.IncludeSubGroups (1, 1) logical = true
    end

    if strlength(location) == 0
        location = "/";
    end

    info = h5info(filename, location);
    datasets = listDatasetsInH5infoStruct(info, location, IncludeSubGroups=args.IncludeSubGroups);
end

function datasets = listDatasetsInH5infoStruct(S, location, args)
    arguments
        S (1, 1) struct
        location (1, 1) string
        args.IncludeSubGroups (1, 1) logical = true
    end

    datasets = string.empty(0, 1);

    if isfield(S, "Datatype")
        % S describes a single dataset.
        datasets = location;
    elseif isfield(S, "Datasets")
        % S describes a group: list its datasets, then its child groups.
        if ~isempty(S.Datasets)
            datasets = location + "/" + {S.Datasets.Name}';
        end

        if args.IncludeSubGroups
            listFcn = @(group) listDatasetsInH5infoStruct(group, group.Name, IncludeSubGroups=true);
        else
            listFcn = @(group) string(group.Name);
        end
        childDatasets = arrayfun(listFcn, S.Groups, UniformOutput=false);
        childDatasets = vertcat(childDatasets{:});
        datasets = [datasets; childDatasets];
    end
end
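Before using subsets, it can help to confirm that the basic read loop behaves as expected. The following is a minimal sketch, assuming MyHDF5Datastore.m is saved on the MATLAB path and that the example.h5 sample file referenced elsewhere on this page is available; the printed format is an illustrative choice, not part of the class.

```matlab
% Sketch only: assumes MyHDF5Datastore.m is on the MATLAB path and that
% example.h5 contains a /g4 group, as in the examples on this page.
ds = MyHDF5Datastore("example.h5", "/g4");

% Read one dataset at a time until the datastore is exhausted.
while hasdata(ds)
    [data, info] = read(ds);
    fprintf("%s has size %s\n", info.Name, mat2str(size(data{1})));
end

% Rewind so that the next read starts from the first dataset again.
reset(ds);
```

Each call to read returns a 1-by-1 cell, which is why the sketch indexes data{1} to inspect the underlying array.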
Read a Subset of a Datastore
Create a subset of datasets from a specific group of an HDF5 file.
First, create a datastore from all datasets under the /g4 group of the HDF5 file. Use the MyHDF5Datastore.m class definition file from the Build Datastore with Subset Support example.
g4ds = MyHDF5Datastore("example.h5","/g4");
data = readall(g4ds)
data=4×1 cell array
    {19x1  double}
    {36x1  double}
    {10x1  double}
    {36x19 double}
Select specific datasets from the g4ds datastore using the subset function.
subds = subset(g4ds,[2 4]);
data = readall(subds)
data=2×1 cell array
    {36x1  double}
    {36x19 double}
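The Description states that the mixin also supplies the partition and shuffle methods. The sketch below exercises both on the same datastore. It assumes the MyHDF5Datastore class and example.h5 file from this page, and that these methods behave as they do for built-in datastores; the variable names are illustrative. No output is shown because the shuffled order is random.

```matlab
% Sketch only: assumes MyHDF5Datastore.m is on the MATLAB path.
g4ds = MyHDF5Datastore("example.h5", "/g4");

% Randomize the read order; the result is a new datastore.
shuffledds = shuffle(g4ds);
shuffledData = readall(shuffledds);

% Split the datastore into two coarse-grained partitions.
firstPart  = partition(g4ds, 2, 1);
secondPart = partition(g4ds, 2, 2);
firstData  = readall(firstPart);
secondData = readall(secondPart);
```

Together, the two partitions should cover every dataset in g4ds exactly once, which is what makes them suitable for dividing work among workers.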
Tips
- For your custom datastore implementation, a best practice is not to implement the numpartitions method.
Version History
Introduced in R2022b