partition - Partition a datastore - MATLAB
Syntax
Description
subds = partition(ds,n,index) partitions datastore ds into the number of parts specified by n and returns the partition corresponding to the index index.
subds = partition(ds,'Files',index) partitions the datastore by files and returns the partition corresponding to the file of index index in the Files property.
subds = partition(ds,'Files',filename) partitions the datastore by files and returns the partition corresponding to the file specified by filename.
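To illustrate how n and index relate, the following sketch visits every index from 1 to n and counts the files in each partition. It assumes the sample images street1.jpg, peppers.png, and corn.tif (the ones used in the examples on this page) are on the MATLAB path; the names numFilesPerPart and totalFiles are illustrative only.

```matlab
% Sketch: visit every partition of a small datastore.
% Assumes the sample images below are on the MATLAB path.
ds = imageDatastore({'street1.jpg','peppers.png','corn.tif'});
n = 3;                              % number of parts (illustrative choice)
numFilesPerPart = zeros(1,n);
for index = 1:n
    subds = partition(ds,n,index);  % datastore for part number index
    numFilesPerPart(index) = numel(subds.Files);
end
% Each file belongs to exactly one partition, so the per-partition
% counts add up to the number of files in ds.
totalFiles = sum(numFilesPerPart);
```

Calling partition once per index in this way is the same pattern the parallel example uses inside parfor.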
Examples
Partition Datastore into Specific Number of Parts
Create a datastore for a large collection of files. For this example, use ten copies of the sample file airlinesmall.csv. To handle missing fields in the tabular data, specify the name-value pairs TreatAsMissing and MissingValue.
files = repmat({'airlinesmall.csv'},1,10);
ds = tabularTextDatastore(files,...
    'TreatAsMissing','NA','MissingValue',0);
Partition the datastore into three parts and return the first partition. The partition function returns approximately the first third of the data from the datastore ds.
subds = partition(ds,3,1)
subds =
TabularTextDatastore with properties:
Files: {
' ...\ExampleManager\nhossain.Bdoc.Feb13\matlab-ex96137387\airlinesmall.csv';
' ...\ExampleManager\nhossain.Bdoc.Feb13\matlab-ex96137387\airlinesmall.csv';
' ...\ExampleManager\nhossain.Bdoc.Feb13\matlab-ex96137387\airlinesmall.csv'
... and 1 more
}
Folders: {
' ...\Documents\MATLAB\ExampleManager\nhossain.Bdoc.Feb13\matlab-ex96137387'
}
FileEncoding: 'UTF-8'
AlternateFileSystemRoots: {}
VariableNamingRule: 'modify'
ReadVariableNames: true
VariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
DatetimeLocale: en_US
Text Format Properties:
NumHeaderLines: 0
Delimiter: ','
RowDelimiter: '\r\n'
TreatAsMissing: 'NA'
MissingValue: 0
Advanced Text Format Properties:
TextscanFormats: {'%f', '%f', '%f' ... and 26 more}
TextType: 'char'
ExponentCharacters: 'eEdD'
CommentStyle: ''
Whitespace: ' \b\t'
MultipleDelimitersAsOne: false
Properties that control the table returned by preview, read, readall:
SelectedVariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more}
SelectedFormats: {'%f', '%f', '%f' ... and 26 more}
ReadSize: 20000 rows
OutputType: 'table'
RowTimes: []
Write-specific Properties:
SupportedOutputFormats: ["txt" "csv" "dat" "asc" "xlsx" "xls" "parquet" "parq"]
DefaultOutputFormat: "txt"
The Files property of the datastore contains a list of files included in the datastore. Check the number of files in the Files property of the datastore ds and the partitioned datastore subds. The datastore ds contains ten files and the partition subds contains the first four files.
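The file-count check described above can be sketched as follows. The block recreates ds and subds so that it runs on its own, and it assumes airlinesmall.csv is on the MATLAB path. Using numel to take the counts is an assumption of this sketch; the expected values in the comments restate the figures given in the text.

```matlab
% Recreate the datastore and its first partition from the steps above.
files = repmat({'airlinesmall.csv'},1,10);
ds = tabularTextDatastore(files,...
    'TreatAsMissing','NA','MissingValue',0);
subds = partition(ds,3,1);

% Count the entries in each Files property.
numFilesAll  = numel(ds.Files);     % ten files, per the text above
numFilesPart = numel(subds.Files);  % four files, per the text above
fprintf('ds: %d files, subds: %d files\n',numFilesAll,numFilesPart);
```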
Partition Datastore into Default Number of Parts
Create a datastore from the sample file, mapredout.mat, which is the output file of the mapreduce function.
ds = datastore('mapredout.mat');
Get the default number of partitions for ds.
n = numpartitions(ds)
Partition the datastore into the default number of partitions and return the datastore corresponding to the first partition.
subds = partition(ds,n,1);
Read the data in subds.
while hasdata(subds)
    data = read(subds);
end
Partition Datastore by Files
Create a datastore that contains three image files.
ds = imageDatastore({'street1.jpg','peppers.png','corn.tif'})
ds =
ImageDatastore with properties:
Files: {
' ...\matlab\toolbox\matlab\demos\street1.jpg';
' ...\matlab\toolbox\matlab\imagesci\peppers.png';
' ...\matlab\toolbox\matlab\imagesci\corn.tif'
}
ReadSize: 1
Labels: {}
ReadFcn: @readDatastoreImage
Partition the datastore by files and return the part corresponding to the second file.
subds = partition(ds,'Files',2)
subds =
ImageDatastore with properties:
Files: {
' ...\matlab\toolbox\matlab\imagesci\peppers.png'
}
ReadSize: 1
Labels: {}
ReadFcn: @readDatastoreImage
subds contains one file.
Partition Data in Parallel
Create a datastore from the sample file, mapredout.mat, which is the output file of the mapreduce function.
ds = datastore('mapredout.mat');
Partition the datastore into three parts on three workers in a parallel pool.
numWorkers = 3;
p = parpool('local',numWorkers);
n = numpartitions(ds,p);

parfor ii=1:n
    subds = partition(ds,n,ii);
    while hasdata(subds)
        data = read(subds);
    end
end
Compare Data Granularities
Compare a coarse-grained partition with a fine-grained subset.
Read all the frames in the video file xylophone.mp4 and construct an ArrayDatastore object to iterate over it. The resulting object has 141 frames.
v = VideoReader("xylophone.mp4");
allFrames = read(v);
arrds = arrayDatastore(allFrames,IterationDimension=4,OutputType="cell",ReadSize=4);
To extract a specific set of adjacent frames, create four coarse-grained partitions of arrds. Extract the second partition, which has 35 frames.
partds = partition(arrds,4,2);
imshow(imtile(partds.readall()))
Extract six nonadjacent frames from arrds at specified indices using a fine-grained subset.
subds = subset(arrds,[67 79 82 69 89 33]);
imshow(imtile(subds.readall()))
Input Arguments
ds — Input datastore
datastore
Input datastore. You can use the datastore function to create a datastore object from your data.
n — Number of partitions
positive integer
Number of partitions, specified as a positive integer.
If you specify a number of partitions that is not a numerical factor of the number of files in the datastore, partition will place each of the remaining observations in the existing partitions, starting with the first partition.
The number of existing partitions that contain an additional observation is equal to the remainder obtained when dividing the number of files in the datastore by the number of partitions. For example, if your datastore object contains 23 files that you wish to partition into 3 parts, the first two partitions that partition creates will contain 8 files, and the last partition will contain 7 files.
Example: 3
Data Types: double
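The arithmetic in the 23-file example above can be reproduced directly. This is a sketch of the rule as it is described here, not of how partition is implemented internally; the variable names (numFiles, baseSize, numExtra, partSizes) are illustrative.

```matlab
% Sketch of the partition-size rule described above.
numFiles = 23;                      % files in the datastore (from the example)
n = 3;                              % requested number of partitions

baseSize = floor(numFiles/n);       % every partition gets at least this many
numExtra = mod(numFiles,n);         % remainder: partitions that get one more

partSizes = repmat(baseSize,1,n);
partSizes(1:numExtra) = partSizes(1:numExtra) + 1;  % extras go to the first partitions

disp(partSizes)                     % displays 8 8 7
```

The same rule matches the ten-file example earlier on this page: with n = 3, the remainder is 1, so the first partition holds four files.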
index — Index
positive integer
Index, specified as a positive integer.
Example: 1
Data Types: double
filename — File name
character vector | string scalar
File name, specified as a character vector or string scalar.
The value of filename must match exactly the file name contained in the Files property of the datastore. To ensure that the file names match exactly, specify filename using ds.Files{N}, where N is the index of the file in the Files property. For example, ds.Files{3} specifies the third file in the datastore ds.
Example: ds.Files{3}
Example: 'file1.csv'
Example: '../dir/data/file1.csv'
Example: 'hdfs://myserver:7867/data/file1.txt'
Data Types: char | string
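As a minimal sketch of the exact-match advice above, the following block takes the file name straight from the Files property instead of typing it by hand. It assumes the three sample images used in the examples on this page are on the MATLAB path; targetFile is an illustrative variable name.

```matlab
% Sketch: select a partition by file name taken from ds.Files.
% Assumes the sample images below are on the MATLAB path.
ds = imageDatastore({'street1.jpg','peppers.png','corn.tif'});

targetFile = ds.Files{3};                   % full name exactly as stored in ds
subds = partition(ds,'Files',targetFile);   % partition holding only that file

% The partition contains the requested file and nothing else.
isMatch = strcmp(subds.Files{1},targetFile);
```

Reading the name from ds.Files avoids mismatches caused by relative paths or differing path separators.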
Output Arguments
subds — Output datastore
datastore
Output datastore. The output datastore is of the same type as the input datastore ds.
Extended Capabilities
Thread-Based Environment
Run code in the background using MATLAB® backgroundPool or accelerate code with Parallel Computing Toolbox™ ThreadPool.
Usage notes and limitations:
- In a thread-based environment, you can use partition only with the following datastores:
  - ImageDatastore objects
  - CombinedDatastore, SequentialDatastore, or TransformedDatastore objects you create from ImageDatastore objects by using combine or transform
  You can use partition with other datastores if you have Parallel Computing Toolbox™. To do so, run the function using a process-backed parallel pool instead of using backgroundPool or ThreadPool (use either ProcessPool or ClusterPool).
For more information, see Run MATLAB Functions in Thread-Based Environment.
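A rough sketch of the thread-based case follows: it runs partition on an ImageDatastore in the background pool with parfeval. It assumes a MATLAB release in which partition supports thread-based environments and in which backgroundPool is available, and that the sample images are on the MATLAB path. The choice of three parts and index 2 is arbitrary.

```matlab
% Sketch: run partition on an ImageDatastore in the background pool.
% Assumes a release where partition is supported in thread-based
% environments and the sample images are on the MATLAB path.
ds = imageDatastore({'street1.jpg','peppers.png','corn.tif'});

% Request one output from partition(ds,3,2), evaluated in the background.
f = parfeval(backgroundPool,@partition,1,ds,3,2);

% Block until the background computation finishes and collect the result.
subds = fetchOutputs(f);
numFilesInPart = numel(subds.Files);
```

For datastore types outside the list above, the same call pattern would need a process-backed pool in place of backgroundPool.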
Version History
Introduced in R2015a