matlab.io.datastore.HadoopFileBased - (Not recommended) Add Hadoop file support to datastore

Namespace: matlab.io.datastore

(Not recommended) Add Hadoop file support to datastore

Description

matlab.io.datastore.HadoopFileBased is an abstract mixin class that adds Hadoop® support to your custom datastore.

To use this mixin class, you must inherit from the matlab.io.datastore.HadoopFileBased class in addition to inheriting from the matlab.io.Datastore base class. Type the following syntax as the first line of your class definition file:

classdef MyDatastore < matlab.io.Datastore & ...
        matlab.io.datastore.HadoopFileBased
    ...
end

To add Hadoop support along with parallel processing support, use these lines in your class definition file:

classdef MyDatastore < matlab.io.Datastore & ...
        matlab.io.datastore.Partitionable & ...
        matlab.io.datastore.HadoopFileBased
    ...
end

To add support for Hadoop to your custom datastore, you must:

Inherit from the additional class matlab.io.datastore.HadoopFileBased
Define these additional methods: getLocation, initializeDatastore, and isfullfile

For more details and steps to create your custom datastore with support for Hadoop, see Develop Custom Datastore.
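As a quick sanity check after writing the class definition, you could confirm that the mixin appears among the superclasses of your class. This is only a minimal sketch: it assumes a class file named MyDatastore.m (the placeholder name used in the syntax above) is on the MATLAB path, and the variable names are illustrative.

```matlab
% Assumption: MyDatastore.m (placeholder name from the syntax above) is on the path.
supers = superclasses('MyDatastore');

% True when the Hadoop mixin is among the superclasses
hasHadoopMixin = ismember('matlab.io.datastore.HadoopFileBased',supers);

% True when the parallel-processing mixin is among the superclasses as well
hasPartitionable = ismember('matlab.io.datastore.Partitionable',supers);
```

If hasHadoopMixin is false, the class definition line is probably missing the matlab.io.datastore.HadoopFileBased entry.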

Methods

getLocation	(Not recommended) Location of files in Hadoop
initializeDatastore	(Not recommended) Initialize datastore with information from Hadoop
isfullfile	(Not recommended) Check if datastore reads full files

Examples


Implement a datastore with parallel processing and Hadoop support and use it to bring your data from the Hadoop server into MATLAB®. Then use the tall and gather functions on this data.

Create a new .m class definition file that contains the code implementing your custom datastore. You must save this file in your working folder or in a folder that is on the MATLAB path. The name of the .m file must be the same as the name of your object constructor function. For example, if you want your constructor function to have the name MyDatastoreHadoop, then the name of the script file must be MyDatastoreHadoop.m. The .m class definition file must contain these steps:

Step 1: Inherit from the datastore classes.
Step 2: Define the constructor and the required methods.
Step 3: Define your custom file reading function.

This code shows the three steps in a sample implementation of a custom datastore that can read binary files from a Hadoop server.

%% STEP 1: INHERIT FROM DATASTORE CLASSES
classdef MyDatastoreHadoop < matlab.io.Datastore & ...
        matlab.io.datastore.Partitionable & ...
        matlab.io.datastore.HadoopFileBased

properties (Access = private)
    CurrentFileIndex double
    FileSet matlab.io.datastore.DsFileSet
end

%% STEP 2: DEFINE THE CONSTRUCTOR AND THE REQUIRED METHODS
methods
    % Define your datastore constructor
    function myds = MyDatastoreHadoop(location,altRoots)
        myds.FileSet = matlab.io.datastore.DsFileSet(location,...
            'FileExtensions','.bin', ...
            'FileSplitSize',8*1024);
        myds.CurrentFileIndex = 1;

        if nargin == 2
             myds.AlternateFileSystemRoots = altRoots;
        end
        
        reset(myds);
    end
    
    % Define the hasdata method
    function tf = hasdata(myds)
        % Return true if more data is available
        tf = hasfile(myds.FileSet);
    end
    
    % Define the read method
    function [data,info] = read(myds)
        % Read data and information about the extracted data
        % See also: MyFileReader()
        if ~hasdata(myds)
            error(sprintf(['No more data to read.\nUse the reset ',... 
                 'method to reset the datastore to the start of ' ,...
                 'the data. \nBefore calling the read method, ',...
                 'check if data is available to read ',...
                 'by using the hasdata method.'])) 
        end
        
        fileInfoTbl = nextfile(myds.FileSet);
        data = MyFileReader(fileInfoTbl);
        info.Size = size(data);
        info.FileName = fileInfoTbl.FileName;
        info.Offset = fileInfoTbl.Offset;
        
        % Update CurrentFileIndex for tracking progress
        if fileInfoTbl.Offset + fileInfoTbl.SplitSize >= ...
                fileInfoTbl.FileSize
            myds.CurrentFileIndex = myds.CurrentFileIndex + 1 ;
        end
    end
    
    % Define the reset method
    function reset(myds)
        % Reset to the start of the data
        reset(myds.FileSet);
        myds.CurrentFileIndex = 1;
    end
    
    
    % Define the partition method
    function subds = partition(myds,n,ii)
        subds = copy(myds);
        subds.FileSet = partition(myds.FileSet,n,ii);
        reset(subds);
    end
end      

 
methods (Hidden = true)   

    % Define the progress method
    function frac = progress(myds)
        % Determine fraction (0 to 1) of data read from datastore
        if hasdata(myds) 
           frac = (myds.CurrentFileIndex-1)/...
                         myds.FileSet.NumFiles; 
        else 
           frac = 1;  
        end 
    end

    % Define the initializeDatastore method
    function initializeDatastore(myds,hadoopInfo)
        import matlab.io.datastore.DsFileSet;
        myds.FileSet = DsFileSet(hadoopInfo,...
            'FileSplitSize',myds.FileSet.FileSplitSize,...
            'IncludeSubfolders',true, ...
            'FileExtensions','.bin');
        reset(myds);
    end
    
    % Define the getLocation method
    function loc = getLocation(myds)
        loc = myds.FileSet;
    end
    
    % Define the isfullfile method
    function tf = isfullfile(myds)
        tf = isequal(myds.FileSet.FileSplitSize,'file'); 
    end

end
    
methods (Access = protected)
    % If you use the  FileSet property in the datastore,
    % then you must define the copyElement method. The
    % copyElement method allows methods such as readall
    % and preview to remain stateless 
    function dscopy = copyElement(ds)
        dscopy = copyElement@matlab.mixin.Copyable(ds);
        dscopy.FileSet = copy(ds.FileSet);
    end
    
    % Define the maxpartitions method
    function n = maxpartitions(myds)
        n = maxpartitions(myds.FileSet);
    end
end

end

%% STEP 3: IMPLEMENT YOUR CUSTOM FILE READING FUNCTION
function data = MyFileReader(fileInfoTbl)
% create a reader object using FileName
reader = matlab.io.datastore.DsFileReader(fileInfoTbl.FileName);

% seek to the offset
seek(reader,fileInfoTbl.Offset,'Origin','start-of-file');

% read fileInfoTbl.SplitSize amount of data
data = read(reader,fileInfoTbl.SplitSize);
end

This step completes the implementation of your custom datastore.
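Before pointing the datastore at a Hadoop server, it may help to exercise it on local files. The following is a minimal sketch, not part of the original example: it assumes MyDatastoreHadoop.m is on the MATLAB path, and the temporary folder, the file name sample.bin, and the variable names are hypothetical.

```matlab
% Assumption: MyDatastoreHadoop.m from this example is on the MATLAB path.
% The folder and file names below are hypothetical.
folder = tempname;
mkdir(folder);
payload = uint8(mod(0:19999,256));      % 20,000 bytes of sample data
fid = fopen(fullfile(folder,'sample.bin'),'w');
fwrite(fid,payload,'uint8');
fclose(fid);

% Construct the datastore on the local folder
ds = MyDatastoreHadoop(folder);

% Read split by split until no data remains
totalBytes = 0;
while hasdata(ds)
    [data,info] = read(ds);             % one split of at most 8*1024 bytes
    totalBytes = totalBytes + numel(data);
    fprintf('%s: offset %d, %d bytes\n',info.FileName,info.Offset,numel(data));
end

% Rewind to the first split so the datastore can be read again
reset(ds);
```

Because the constructor sets FileSplitSize to 8*1024, a file larger than 8 KB should come back in several reads, each reporting a different Offset.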

Next, create a datastore object using your custom datastore constructor. If your data is located at hdfs:///pathtofiles, then you can use this code.

setenv('HADOOP_HOME','/path/to/hadoop/install');
ds = MyDatastoreHadoop('hdfs:///pathtofiles');

To use tall arrays and the gather function on Apache® Spark™ with parallel cluster configuration, set the mapreducer and attach MyDatastoreHadoop.m to the cluster.

mr = mapreducer(cluster);
mr.Cluster.AttachedFiles = 'MyDatastoreHadoop.m';
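The variable cluster is not defined in this example. One way it might be created before the mapreducer call is with a parallel.cluster.Hadoop object, as sketched here. This assumes Parallel Computing Toolbox is installed, and both folder paths are placeholders, not values from this page.

```matlab
% Assumption: Parallel Computing Toolbox is installed.
% Both install paths are placeholders; substitute your own locations.
cluster = parallel.cluster.Hadoop( ...
    'HadoopInstallFolder','/path/to/hadoop/install', ...
    'SparkInstallFolder','/path/to/spark/install');
```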

Create tall array from datastore.
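A minimal sketch of this step, assuming ds is the MyDatastoreHadoop object created earlier in this example:

```matlab
% Assumption: ds is the MyDatastoreHadoop object created earlier.
% tall defers evaluation; no data is read until gather is called.
t = tall(ds);
```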

Gather the head of the tall array.
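A minimal sketch of this step, assuming t is the tall array created from the datastore in the previous step. The output name hd is illustrative.

```matlab
% Assumption: t is the tall array created from the datastore.
% head selects the first rows; gather evaluates and returns them in memory.
hd = gather(head(t));
```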

Version History

Introduced in R2017b