matlab.io.datastore.Subsettable - Add subset and fine-grained parallelization support to datastore - MATLAB ([original](http://www.mathworks.com/access/helpdesk/help/matlab/ref/matlab.io.datastore.subsettable-class.html))

Namespace: matlab.io.datastore

Add subset and fine-grained parallelization support to datastore

Since R2022b

Description

matlab.io.datastore.Subsettable is an abstract mixin class that adds subset and fine-grained parallelization support to your custom datastore for use with Parallel Computing Toolbox™ and MATLAB® Parallel Server™. It provides fine-grained subsets through the subset method, coarse-grained partitions through the partition method, and dataset randomization through the shuffle method.
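To make the three operations concrete, the following is a minimal sketch that calls subset, partition, and shuffle on a built-in arrayDatastore rather than on a custom class. The choice of arrayDatastore, the OutputType="same" option, and all variable names are assumptions made for illustration; a custom datastore that inherits from this mixin is expected to respond to the same calls.

```matlab
% Illustrative only: a built-in datastore with ten independent reads.
ads = arrayDatastore((1:10)', OutputType="same");

% Fine-grained subset: select individual reads by index.
sub = subset(ads, [2 5 9]);
subData = readall(sub);          % 3-by-1 column: 2, 5, 9

% Coarse-grained partitions: split the datastore into two parts.
part1Data = readall(partition(ads, 2, 1));
part2Data = readall(partition(ads, 2, 2));

% Randomization: same reads, in a random order.
shufData = readall(shuffle(ads));
```

Together the two partitions cover every read exactly once, and the shuffled datastore contains the same values as the original in a different order.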

Use matlab.io.datastore.Subsettable only if every read of your data can be accessed independently, which is what allows the increased granularity. If that is not the case, as in TabularTextDatastore workflows, matlab.io.datastore.Partitionable is more appropriate.

To use this mixin class, inherit from the matlab.io.datastore.Subsettable class, in addition to inheriting from the matlab.io.Datastore base class. Type this syntax as the first line of your class definition file:

classdef MyDatastore < matlab.io.Datastore & ...
                       matlab.io.datastore.Subsettable
    ...
end

To add support for parallel processing to your custom datastore, you must:

Inherit from the class matlab.io.datastore.Subsettable in addition to matlab.io.Datastore.
Define the method maxpartitions.
Define the method subsetByReadIndices. Subsettable uses the subset method to call the implementation of subsetByReadIndices.

For more details and steps to create your custom datastore with parallel processing support, see Develop Custom Datastore.
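As a rough sketch of what fine-grained parallelization looks like from the caller's side, the loop below splits a datastore into index ranges with subset and processes each range separately. It uses a built-in arrayDatastore as a stand-in for a custom class, and the chunk size and variable names are assumptions for illustration. With Parallel Computing Toolbox, the for loop could presumably be replaced by parfor, because each iteration works on its own independent subset.

```matlab
% Illustrative stand-in for a custom subsettable datastore with 12 reads.
ads = arrayDatastore((1:12)', OutputType="same");

numReads  = 12;
chunkSize = 4;                              % reads handled per iteration
numChunks = numReads / chunkSize;
totals    = zeros(numChunks, 1);

for k = 1:numChunks
    idx   = (k - 1) * chunkSize + (1:chunkSize);   % read indices for chunk k
    chunk = subset(ads, idx);                      % independent sub-datastore
    totals(k) = sum(readall(chunk));               % process this chunk only
end
% totals is [10; 26; 42]
```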

Methods

Examples


Build Datastore with Subset Support

Build a datastore with subset processing support and use it to bring your data into MATLAB®.

Create a class definition file that contains the code implementing your datastore. Save this file in your working folder or in a folder that is on the MATLAB path. The name of the .m file must be the same as the name of your object constructor function. In this example, create the MyHDF5Datastore class in a file named MyHDF5Datastore.m. The .m class definition contains the following steps:

Step 1: Inherit from the matlab.io.Datastore and matlab.io.datastore.Subsettable classes.
Step 2: Define the constructor as well as the subsetByReadIndices and maxpartitions methods.
Step 3: Define your custom file-reading function. Here, the MyHDF5Datastore class creates and uses the listHDF5Datasets function.

%% STEP 1
classdef MyHDF5Datastore < matlab.io.Datastore ...
                         & matlab.io.datastore.Subsettable

properties
    Filename            (1, 1) string
    Datasets            (:, 1) string {mustBeNonmissing} = "/"
    CurrentDatasetIndex (1, 1) double {mustBeInteger, mustBeNonnegative} = 1
end

%% STEP 2
methods
    function ds = MyHDF5Datastore(Filename, Location)
        arguments
            Filename (1, 1) string
            Location (1, 1) string {mustBeNonmissing} = "/"
        end

        ds.Filename = Filename;
        ds.Datasets = listHDF5Datasets(ds.Filename, Location);
    end

    function [data, info] = read(ds, varargin)
        if ~hasdata(ds)
            error("MyHDF5Datastore:noMoreData", "No more datasets to read.");
        end

        dataset = ds.Datasets(ds.CurrentDatasetIndex);
        data = { h5read(ds.Filename, dataset, varargin{:}) };
        if nargout > 1
            info =   h5info(ds.Filename, dataset);
        end

        ds.CurrentDatasetIndex = ds.CurrentDatasetIndex + 1;
    end

    function tf = hasdata(ds)
        tf = ds.CurrentDatasetIndex <= numel(ds.Datasets);
    end

    function reset(ds)
        ds.CurrentDatasetIndex = 1;
    end
end

methods (Access = protected)
    function subds = subsetByReadIndices(ds, indices)
        datasets = ds.Datasets(indices);

        subds = copy(ds);
        subds.Datasets = datasets;
        reset(subds);
    end

    function n = maxpartitions(ds)
        n = numel(ds.Datasets);
    end
end

end

%% STEP 3
function datasets = listHDF5Datasets(filename, location, args)
arguments
    filename (1, 1) string
    location (1, 1) string
    args.IncludeSubGroups (1, 1) logical = true
end

if strlength(location) == 0
    location = "/";
end

info = h5info(filename, location);

datasets = listDatasetsInH5infoStruct(info, location, IncludeSubGroups=args.IncludeSubGroups);

end

function datasets = listDatasetsInH5infoStruct(S, location, args)
arguments
    S (1, 1) struct
    location (1, 1) string
    args.IncludeSubGroups (1, 1) logical = true
end

datasets = string.empty(0, 1);

if isfield(S, "Datatype")
    datasets = location;
elseif isfield(S, "Datasets")
    if ~isempty(S.Datasets)
        datasets = location + "/" + {S.Datasets.Name}';
    end

    if args.IncludeSubGroups
        listFcn = @(group) listDatasetsInH5infoStruct(group, group.Name, IncludeSubGroups=true);
    else
        listFcn = @(group) string(group.Name);
    end

    childDatasets = arrayfun(listFcn, S.Groups, UniformOutput=false);
    childDatasets = vertcat(childDatasets{:});

    datasets = [datasets; childDatasets];
end

end

Read a Subset of a Datastore

Create a subset of datasets from a specific group of an HDF5 file.

First, create a datastore from all datasets under the /g4 group of the HDF5 file. Use the MyHDF5Datastore.m class definition file from the Build Datastore with Subset Support example.

g4ds = MyHDF5Datastore("example.h5","/g4");
data = readall(g4ds)

data=4×1 cell array
    {19x1 double}
    {36x1 double}
    {10x1 double}
    {36x19 double}

Select specific datasets from the g4ds datastore using the subset function.

subds = subset(g4ds,[2 4]);
data = readall(subds)

data=2×1 cell array
    {36x1 double}
    {36x19 double}

Tips

For your custom datastore implementation, a best practice is not to implement the numpartitions method.

Version History

Introduced in R2022b