ParquetDatastore - Datastore for collection of Parquet files - MATLAB

Datastore for collection of Parquet files

Description

Use a ParquetDatastore object to manage a collection of Parquet files, where each individual Parquet file fits in memory, but the entire collection of files does not necessarily fit. You can create a ParquetDatastore object using the parquetDatastore function, specify its properties, and then import and process the data using object functions.

Creation

Syntax

`pds = parquetDatastore(location)`

`pds = parquetDatastore(location,Name,Value)`

Description

`pds = parquetDatastore(location)` creates a datastore `pds` from the collection of Parquet files specified by `location`.

`pds = parquetDatastore(location,Name,Value)` specifies additional parameters and properties for `pds` using one or more name-value pair arguments.

Input Arguments


location — Files or folders to include in datastore

FileSet | DsFileSet object | string array | character vector | cell array of character vectors

Files or folders to include in the datastore, specified as a FileSet object, a DsFileSet object, a string array, a character vector, or a cell array of character vectors.

Files or folders can be local or remote.

When you specify a folder, the datastore includes only files with supported file formats and ignores files with any other format. To specify a custom list of file extensions to include in your datastore, see the FileExtensions name-value argument.

The parquetDatastore function supports the .parquet file format.

Example: "myfile.parquet"

Example: "../dir/data/myfile.parquet"

Example: ["C:\dir\data\myfile01.parquet","C:\dir\data\myfile02.parquet"]

Example: "s3://bucketname/path_to_files/*.parquet"

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: parquetDatastore("myfolder",IncludeSubfolders=true)

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: parquetDatastore("myfolder","IncludeSubfolders",true)

FileExtensions — Extensions to include in datastore

character vector | cell array of character vectors | string scalar | string array

Extensions to include in datastore, specified as the name-value argument consisting of "FileExtensions" and a character vector, cell array of character vectors, string scalar, or string array.

Example: "FileExtensions",[".parquet",".parq"]

Example: "FileExtensions",".myformat"

Example: "FileExtensions",''

Data Types: char | cell | string

IncludeSubfolders — Subfolder inclusion flag

false (default) | true

Subfolder inclusion flag, specified as the name-value argument consisting of "IncludeSubfolders" and true or false. Specify true to include all files and subfolders within each folder or false to include only the files within each folder.

If you do not specify "IncludeSubfolders", then the default value is false.

Example: "IncludeSubfolders",true

Data Types: logical | double

OutputType — Output datatype

"auto" (default) | "table" | "timetable"

Output datatype, specified as the name-value argument consisting of "OutputType" and one of these values: "auto", "table", or "timetable".

The value of OutputType determines the data type returned by the preview, read, and readall functions. Use this option in conjunction with the "RowTimes" name-value pair to return timetables from ParquetDatastore.

Example: "OutputType","timetable"

Data Types: char | string

VariableNamingRule — Flag to preserve variable names

"modify" (default) | "preserve"

Flag to preserve variable names, specified as either "modify" or "preserve".

Starting in R2019b, variable names and row names can include any characters, including spaces and non-ASCII characters. Also, they can start with any characters, not just letters. Variable and row names do not have to be valid MATLAB identifiers (as determined by the isvarname function). To preserve these variable names and row names, set the value of VariableNamingRule to "preserve". Variable names are not refreshed when the value of VariableNamingRule is changed from "modify" to "preserve".

Data Types: char | string
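As a rough illustration of the two settings, the following sketch writes a small Parquet file whose column names are not valid MATLAB identifiers and then opens it with each rule. The file name, table contents, and column names are invented for this example. The exact form of the modified names depends on how MATLAB sanitizes them, so no output is shown.

```matlab
% Hypothetical sample data: both column names are invalid MATLAB identifiers.
T = table([458.98; 530.14], ["winter storm"; "fire"], ...
    'VariableNames', {'Loss (MW)', 'Outage Cause'});

% Write the table to a temporary Parquet file (the file name is arbitrary).
filename = fullfile(tempdir, "naming_rule_demo.parquet");
parquetwrite(filename, T);

% Default rule ("modify"): names are converted to valid MATLAB identifiers.
pdsModify = parquetDatastore(filename);
disp(pdsModify.VariableNames)

% "preserve": names are kept as they are stored in the file.
pdsPreserve = parquetDatastore(filename, "VariableNamingRule", "preserve");
disp(pdsPreserve.VariableNames)
```

Setting the rule when you create the datastore avoids the situation described above, where names are not refreshed after the property is changed.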

AlternateFileSystemRoots — Alternate file system root paths

string vector | cell array

Alternate file system root paths, specified as the name-value argument consisting of "AlternateFileSystemRoots" and a string vector or a cell array. Use "AlternateFileSystemRoots" when you create a datastore on a local machine, but need to access and process the data on another machine (possibly of a different operating system). Also, when processing data using the Parallel Computing Toolbox™ and the MATLAB Parallel Server™, and the data is stored on your local machines with a copy of the data available on different platform cloud or cluster machines, you must use "AlternateFileSystemRoots" to associate the root paths.

The value of "AlternateFileSystemRoots" must satisfy these conditions:

For more information, see Set Up Datastore for Processing on Different Machines or Clusters.

Example: ["Z:\datasets","/mynetwork/datasets"]

Data Types: string | cell

Properties


ParquetDatastore properties describe the format of the files in a datastore object, and control how the data is read from the datastore. With the exception of the Files property, you can specify the value of ParquetDatastore properties using name-value pair arguments when you create the datastore object. To view or modify a property after creating the object, use dot notation.

Files — Files included in datastore

cell array of character vectors | string array

Files included in the datastore, resolved as a cell array of character vectors or a string array, where each character vector or string is a full path to a file. The location argument defines these files.

The first file specified in the cell array determines the variable names and format information for all files in the datastore.

Example: {"C:\dir\data\file1.ext";"C:\dir\data\file2.ext"}

Data Types: cell | string

Folders — Folders used to construct datastore

cell array of character vectors

This property is read-only.

Folders used to construct datastore, returned as a cell array of character vectors. The cell array is oriented as a column vector. Each character vector is a path to a folder that contains data files. The location argument in the parquetDatastore and datastore functions defines Folders when the datastore is created.

The Folders property is reset when you modify the Files property of a ParquetDatastore object.

Data Types: cell

RowFilter — Filter to select rows to import

matlab.io.RowFilter object

Filter to select rows to import, specified as a matlab.io.RowFilter object. The matlab.io.RowFilter object designates conditions each row must satisfy to be included in your output table or timetable. If you do not specify RowFilter, then parquetDatastore imports all rows from the input Parquet file.

ReadSize — Amount of data to read per read step

"rowgroup" (default) | "file" | positive integer

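Amount of data to read per read step, specified as "rowgroup" (each call to read reads the rows of one row group), "file" (each call to read reads all of the data in one file), or a positive integer (each call to read reads at most that number of rows).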

When you change ReadSize from a positive integer to "file" or "rowgroup", or from "file" or "rowgroup" to a positive integer, MATLAB resets the datastore to an unread state, where no data has been read from it.

In a parallel processing workflow (Parallel Computing Toolbox), the data is read in steps from each parallel worker. In a serial workflow, the data is read in steps from the input location.

Data Types: string | char | double
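The stepwise reading described above can be sketched as a loop over the datastore. This minimal example assumes the outages.parquet sample file used in the examples on this page and an arbitrary ReadSize of 500 rows. The accumulator names are illustrative only.

```matlab
% Read the sample file in steps of at most 500 rows.
pds = parquetDatastore("outages.parquet", "ReadSize", 500);

totalRows = 0;
totalLoss = 0;
while hasdata(pds)
    chunk = read(pds);                                   % one read step
    totalRows = totalRows + height(chunk);               % count rows seen so far
    totalLoss = totalLoss + sum(chunk.Loss, "omitnan");  % skip missing Loss values
end

fprintf("Processed %d rows, total loss %.2f\n", totalRows, totalLoss);

% Return the datastore to its unread state so it can be read again.
reset(pds);
```

Because each step holds only one chunk in memory, the same loop applies to collections that are too large to load with readall.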

PartitionMethod — Partition unit for parallel processing

"auto" (default) | "file" | "bytes" | "rowgroup"

Since R2023b

Partition unit for parallel processing, specified as one of the values in the following table.

In a parallel processing workflow (Parallel Computing Toolbox), PartitionMethod determines the amount of data to send to each parallel worker. The amount of data sent to each worker is approximately the total number of partition units divided by the number of parallel workers. In a serial workflow, the PartitionMethod name-value argument is ignored.

| Value | Description |
| --- | --- |
| "auto" | parquetDatastore selects a partition unit based on the ReadSize name-value argument to balance the workload between parallel workers. |
| "file" | Partitions are based on the total number of files. |
| "bytes" | Partitions are based on the number of bytes specified by the BlockSize property. |
| "rowgroup" | Partitions are based on the total number of row groups. |

Granularity and speed of processing depend on the combination of PartitionMethod and ReadSize values. While PartitionMethod determines how much data to send to each parallel worker, ReadSize determines how much data to read per read step. This table shows supported PartitionMethod and ReadSize combinations and their relative granularities and partitioning times.

| Granularity, Partitioning Time | PartitionMethod | ReadSize |
| --- | --- | --- |
| High granularity, long partitioning time | "rowgroup" | "rowgroup" |
| High granularity, long partitioning time | "rowgroup" | positive integer |
| Moderate granularity, moderate partitioning time | "bytes" | "rowgroup" |
| Low granularity, short partitioning time | "file" | "file" |
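To see how a PartitionMethod and ReadSize combination plays out, the sketch below uses the "rowgroup" and "rowgroup" pairing and walks through the partitions serially with the numpartitions and partition functions, standing in for what parallel workers would each receive. It assumes R2023b or later and the outages.parquet sample file. The number of partitions depends on how the file was written, so it is not shown.

```matlab
% PartitionMethod is available since R2023b.
pds = parquetDatastore("outages.parquet", ...
    PartitionMethod="rowgroup", ReadSize="rowgroup");

n = numpartitions(pds);             % number of partitions for this datastore
rowsSeen = 0;
for k = 1:n
    subds = partition(pds, n, k);   % k-th of n partitions
    while hasdata(subds)
        t = read(subds);            % one row group per read step
        rowsSeen = rowsSeen + height(t);
    end
end

fprintf("%d partition(s), %d rows in total\n", n, rowsSeen);
```

In an actual parallel workflow, each worker would process its own partition instead of the serial for loop used here.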

BlockSize — Number of bytes per partition

128000000 (default) | positive integer

Since R2023b

Number of bytes per partition, specified as a positive integer. Specify this argument if PartitionMethod is "bytes". By default, the value of BlockSize is 128000000 bytes (128 MB).

Example: BlockSize=1000000

VariableNames — Names of variables

character vector | cell array of character vectors | string scalar | string array

Names of variables in the datastore, specified as a character vector, cell array of character vectors, string scalar, or string array. Specify the variable names in the order in which they appear in the files. If you do not specify the variable names, the datastore detects them from the first nonheader line in the first file. You can specify VariableNames with a character vector or string scalar; however, the datastore converts the property value to a cell array of character vectors and stores it in that form. When modifying the VariableNames property, the number of new variable names must match the number of original variable names.

To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value of the VariableNamingRule parameter to "preserve".

If ReadVariableNames is false, then VariableNames defaults to ["Var1","Var2", ...].

Example: ["Time","Date","Quantity"]

Data Types: char | cell | string

SelectedVariableNames — Variables to read

cell array of character vectors | string array

Variables to read from the file, specified as a cell array of character vectors or a string array, where each character vector or string contains the name of one variable. You can specify the variable names in any order.

To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value of the VariableNamingRule parameter to "preserve".

Example: ["Var3","Var7","Var4"]

Data Types: cell | string
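A short sketch of restricting the import to a subset of variables follows. It assumes the outages.parquet sample file, whose variables include Region and Loss as shown in the examples on this page.

```matlab
pds = parquetDatastore("outages.parquet");

% Import only two of the six variables, in the order given.
pds.SelectedVariableNames = ["Region", "Loss"];

t = preview(pds);
disp(t.Properties.VariableNames)
```

Selecting only the variables you need reduces the amount of data that each read step returns.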

RowTimes — Name of row times variable

variable name | variable index

Name of row times variable, specified as the name-value argument consisting of "RowTimes" and a variable name (such as "Date") or a variable index (such as 3).

RowTimes is a timetable-related parameter. Each row of a timetable is associated with a time, which is captured in a time vector for the timetable. The variable specified in RowTimes must contain a datetime or a duration vector.

If the value of "OutputType" is "timetable", but you do not specify "RowTimes", then ParquetDatastore uses the first datetime or duration variable as the row times for the timetable.

SupportedOutputFormats — Formats supported for writing

string row vector

This property is read-only.

Formats supported for writing, returned as a row vector of strings. This property specifies the possible output formats when using writeall to write output files from the datastore.

DefaultOutputFormat — Default output format

string scalar

This property is read-only.

Default output format, returned as a string scalar. This property specifies the default format when using writeall to write output files from the datastore.

Data Types: string
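The two read-only properties above can be inspected before calling writeall. The following is a minimal sketch, assuming the outages.parquet sample file and a temporary output folder chosen only for illustration. It writes the datastore in CSV format rather than the default format. The layout of the written files under the output folder depends on the writeall folder layout settings, so the sketch searches the folder recursively.

```matlab
pds = parquetDatastore("outages.parquet");

% Inspect which formats writeall can produce and which one is the default.
disp(pds.SupportedOutputFormats)
disp(pds.DefaultOutputFormat)

% Hypothetical output location in the system temporary folder.
outDir = tempname;
mkdir(outDir);

% Override the default output format.
writeall(pds, outDir, "OutputFormat", "csv");

% List the CSV files that were written, searching subfolders as well.
written = dir(fullfile(outDir, "**", "*.csv"));
disp({written.name})
```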

Object Functions

Examples


Create parquetDatastore Object

Create a parquetDatastore object using either a FileSet object or a file path.

Create a FileSet object containing the file outages.parquet. Then, create a parquetDatastore object from the FileSet object.

fs = matlab.io.datastore.FileSet("outages.parquet"); pds = parquetDatastore(fs)

pds = ParquetDatastore with properties:

                   Files: {
                          '...\matlab\toolbox\matlab\demos\outages.parquet'
                          }
                 Folders: {
                          '...\matlab\toolbox\matlab\demos'
                          }
           VariableNames: {1x6 cell}
   SelectedVariableNames: {1x6 cell}
                ReadSize: 'rowgroup'
              OutputType: 'table'
                RowTimes: []
AlternateFileSystemRoots: {}
  SupportedOutputFormats: ["txt"    "csv"    "xlsx"    "xls"    ...    ]
     DefaultOutputFormat: "parquet"
      VariableNamingRule: 'modify'

Alternatively, you can use a file path to create your parquetDatastore object.

pds = parquetDatastore("outages.parquet");

Specify Read Size for ParquetDatastore

Create a datastore for a sample Parquet file, and then read data from the file with different ReadSize values.

Create a datastore for outages.parquet, set ReadSize to 10 rows, and then read from the datastore. The value of ReadSize determines how many rows of data are read from the datastore with each call to the read function.

pds = parquetDatastore("outages.parquet","ReadSize",10); read(pds)

ans=10×6 table
  Region            OutageTime          Loss     Customers       RestorationTime            Cause
___________    ____________________    ______    __________    ____________________    _________________

"SouthWest"    01-Feb-2002 12:18:00    458.98    1.8202e+06    07-Feb-2002 16:50:00    "winter storm"   
"SouthEast"    23-Jan-2003 00:49:00    530.14    2.1204e+05                     NaT    "winter storm"   
"SouthEast"    07-Feb-2003 21:15:00     289.4    1.4294e+05    17-Feb-2003 08:14:00    "winter storm"   
"West"         06-Apr-2004 05:44:00    434.81    3.4037e+05    06-Apr-2004 06:10:00    "equipment fault"
"MidWest"      16-Mar-2002 06:18:00    186.44    2.1275e+05    18-Mar-2002 23:23:00    "severe storm"   
"West"         18-Jun-2003 02:49:00         0             0    18-Jun-2003 10:54:00    "attack"         
"West"         20-Jun-2004 14:39:00    231.29           NaN    20-Jun-2004 19:16:00    "equipment fault"
"West"         06-Jun-2002 19:28:00    311.86           NaN    07-Jun-2002 00:51:00    "equipment fault"
"NorthEast"    16-Jul-2003 16:23:00    239.93         49434    17-Jul-2003 01:12:00    "fire"           
"MidWest"      27-Sep-2004 11:09:00    286.72         66104    27-Sep-2004 16:37:00    "equipment fault"

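Set the ReadSize property value to "file" and read from the datastore. With this setting, each call to the read function reads all of the data in one file. Because this datastore contains a single file, one call returns the entire data set.

pds.ReadSize = "file"; data = read(pds)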

data=1468×6 table
  Region            OutageTime          Loss     Customers       RestorationTime            Cause
___________    ____________________    ______    __________    ____________________    _________________

"SouthWest"    01-Feb-2002 12:18:00    458.98    1.8202e+06    07-Feb-2002 16:50:00    "winter storm"   
"SouthEast"    23-Jan-2003 00:49:00    530.14    2.1204e+05                     NaT    "winter storm"   
"SouthEast"    07-Feb-2003 21:15:00     289.4    1.4294e+05    17-Feb-2003 08:14:00    "winter storm"   
"West"         06-Apr-2004 05:44:00    434.81    3.4037e+05    06-Apr-2004 06:10:00    "equipment fault"
"MidWest"      16-Mar-2002 06:18:00    186.44    2.1275e+05    18-Mar-2002 23:23:00    "severe storm"   
"West"         18-Jun-2003 02:49:00         0             0    18-Jun-2003 10:54:00    "attack"         
"West"         20-Jun-2004 14:39:00    231.29           NaN    20-Jun-2004 19:16:00    "equipment fault"
"West"         06-Jun-2002 19:28:00    311.86           NaN    07-Jun-2002 00:51:00    "equipment fault"
"NorthEast"    16-Jul-2003 16:23:00    239.93         49434    17-Jul-2003 01:12:00    "fire"           
"MidWest"      27-Sep-2004 11:09:00    286.72         66104    27-Sep-2004 16:37:00    "equipment fault"
"SouthEast"    05-Sep-2004 17:48:00    73.387         36073    05-Sep-2004 20:46:00    "equipment fault"
"West"         21-May-2004 21:45:00    159.99           NaN    22-May-2004 04:23:00    "equipment fault"
"SouthEast"    01-Sep-2002 18:22:00    95.917         36759    01-Sep-2002 19:12:00    "severe storm"   
"SouthEast"    27-Sep-2003 07:32:00       NaN    3.5517e+05    04-Oct-2003 07:02:00    "severe storm"   
"West"         12-Nov-2003 06:12:00    254.09    9.2429e+05    17-Nov-2003 02:04:00    "winter storm"   
"NorthEast"    18-Sep-2004 05:54:00         0             0                     NaT    "equipment fault"
  ⋮

You can also set the ReadSize property to "rowgroup". For more information, see the description of the ReadSize property.

Return Timetable from Parquet Datastore

Use the OutputType and RowTimes name-value pairs to make ParquetDatastore return timetables instead of tables.

Create a datastore for airlinesmall.parquet. Specify the "OutputType" name-value argument as "timetable".

pds = parquetDatastore("airlinesmall.parquet","OutputType","timetable"); preview(pds)

ans=12500×26 timetable
   Date        DayOfWeek          DepTime                 CRSDepTime                ArrTime                 CRSArrTime         UniqueCarrier    FlightNum    TailNum    ActualElapsedTime    CRSElapsedTime    AirTime    ArrDelay    DepDelay    Origin    Dest     Distance    TaxiIn     TaxiOut    Cancelled    CancellationCode    Diverted    CarrierDelay    WeatherDelay    NASDelay    SecurityDelay    LateAircraftDelay
___________    _________    ____________________    ____________________    ____________________    ____________________    _____________    _________    _______    _________________    ______________    _______    ________    ________    ______    _____    ________    _______    _______    _________    ________________    ________    ____________    ____________    ________    _____________    _________________

21-Oct-1987        3        21-Oct-1987 06:42:00    21-Oct-1987 06:30:00    21-Oct-1987 07:35:00    21-Oct-1987 07:27:00        "PS"           1503        "NA"           3180 sec           3420 sec       NaN sec     480 sec     720 sec    "LAX"     "SJC"       308      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
26-Oct-1987        1        26-Oct-1987 10:21:00    26-Oct-1987 10:20:00    26-Oct-1987 11:24:00    26-Oct-1987 11:16:00        "PS"           1550        "NA"           3780 sec           3360 sec       NaN sec     480 sec      60 sec    "SJC"     "BUR"       296      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
23-Oct-1987        5        23-Oct-1987 20:55:00    23-Oct-1987 20:35:00    23-Oct-1987 22:18:00    23-Oct-1987 21:57:00        "PS"           1589        "NA"           4980 sec           4920 sec       NaN sec    1260 sec    1200 sec    "SAN"     "SMF"       480      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
23-Oct-1987        5        23-Oct-1987 13:32:00    23-Oct-1987 13:20:00    23-Oct-1987 14:31:00    23-Oct-1987 14:18:00        "PS"           1655        "NA"           3540 sec           3480 sec       NaN sec     780 sec     720 sec    "BUR"     "SJC"       296      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
22-Oct-1987        4        22-Oct-1987 06:29:00    22-Oct-1987 06:30:00    22-Oct-1987 07:46:00    22-Oct-1987 07:42:00        "PS"           1702        "NA"           4620 sec           4320 sec       NaN sec     240 sec     -60 sec    "SMF"     "LAX"       373      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
28-Oct-1987        3        28-Oct-1987 14:46:00    28-Oct-1987 13:43:00    28-Oct-1987 15:47:00    28-Oct-1987 14:48:00        "PS"           1729        "NA"           3660 sec           3900 sec       NaN sec    3540 sec    3780 sec    "LAX"     "SJC"       308      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
08-Oct-1987        4        08-Oct-1987 09:28:00    08-Oct-1987 09:30:00    08-Oct-1987 10:52:00    08-Oct-1987 10:49:00        "PS"           1763        "NA"           5040 sec           4740 sec       NaN sec     180 sec    -120 sec    "SAN"     "SFO"       447      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
10-Oct-1987        6        10-Oct-1987 08:59:00    10-Oct-1987 09:00:00    10-Oct-1987 11:34:00    10-Oct-1987 11:23:00        "PS"           1800        "NA"           9300 sec           8580 sec       NaN sec     660 sec     -60 sec    "SEA"     "LAX"       954      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
20-Oct-1987        2        20-Oct-1987 18:33:00    20-Oct-1987 18:30:00    20-Oct-1987 19:29:00    20-Oct-1987 19:26:00        "PS"           1831        "NA"           3360 sec           3360 sec       NaN sec     180 sec     180 sec    "LAX"     "SJC"       308      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
15-Oct-1987        4        15-Oct-1987 10:41:00    15-Oct-1987 10:40:00    15-Oct-1987 11:57:00    15-Oct-1987 11:55:00        "PS"           1864        "NA"           4560 sec           4500 sec       NaN sec     120 sec      60 sec    "SFO"     "LAS"       414      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
15-Oct-1987        4        15-Oct-1987 16:08:00    15-Oct-1987 15:53:00    15-Oct-1987 16:56:00    15-Oct-1987 16:40:00        "PS"           1907        "NA"           2880 sec           2820 sec       NaN sec     960 sec     900 sec    "LAX"     "FAT"       209      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
21-Oct-1987        3        21-Oct-1987 09:49:00    21-Oct-1987 09:40:00    21-Oct-1987 10:55:00    21-Oct-1987 10:52:00        "PS"           1939        "NA"           3960 sec           4320 sec       NaN sec     180 sec     540 sec    "LGB"     "SFO"       354      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
22-Oct-1987        4        22-Oct-1987 19:02:00    22-Oct-1987 18:47:00    22-Oct-1987 20:30:00    22-Oct-1987 19:51:00        "PS"           1973        "NA"           5280 sec           3840 sec       NaN sec    2340 sec     900 sec    "LAX"     "OAK"       337      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
16-Oct-1987        5        16-Oct-1987 19:10:00    16-Oct-1987 18:38:00    16-Oct-1987 20:52:00    16-Oct-1987 19:55:00        "TW"             19        "NA"           9720 sec           8220 sec       NaN sec    3420 sec    1920 sec    "STL"     "DEN"       770      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
02-Oct-1987        5        02-Oct-1987 11:30:00    02-Oct-1987 11:33:00    02-Oct-1987 12:37:00    02-Oct-1987 12:37:00        "TW"             59        "NA"          11220 sec          11040 sec       NaN sec       0 sec    -180 sec    "STL"     "PHX"      1262      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
30-Oct-1987        5        30-Oct-1987 14:00:00    30-Oct-1987 14:00:00    30-Oct-1987 19:20:00    30-Oct-1987 19:34:00        "TW"            102        "NA"          12000 sec          12840 sec       NaN sec    -840 sec       0 sec    "SNA"     "STL"      1570      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
  ⋮

When you do not also specify "RowTimes", parquetDatastore uses the first datetime or duration variable as the row times. In this case, the Date variable is used for the row times.

Specify the "RowTimes" option to use the arrival times (ArrTime) as the row times, instead of the flight dates.

pds = parquetDatastore("airlinesmall.parquet","OutputType","timetable","RowTimes","ArrTime"); preview(pds)

ans=12500×26 timetable
      ArrTime              Date        DayOfWeek          DepTime                 CRSDepTime               CRSArrTime         UniqueCarrier    FlightNum    TailNum    ActualElapsedTime    CRSElapsedTime    AirTime    ArrDelay    DepDelay    Origin    Dest     Distance    TaxiIn     TaxiOut    Cancelled    CancellationCode    Diverted    CarrierDelay    WeatherDelay    NASDelay    SecurityDelay    LateAircraftDelay
____________________    ___________    _________    ____________________    ____________________    ____________________    _____________    _________    _______    _________________    ______________    _______    ________    ________    ______    _____    ________    _______    _______    _________    ________________    ________    ____________    ____________    ________    _____________    _________________

21-Oct-1987 07:35:00    21-Oct-1987        3        21-Oct-1987 06:42:00    21-Oct-1987 06:30:00    21-Oct-1987 07:27:00        "PS"           1503        "NA"           3180 sec           3420 sec       NaN sec     480 sec     720 sec    "LAX"     "SJC"       308      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
26-Oct-1987 11:24:00    26-Oct-1987        1        26-Oct-1987 10:21:00    26-Oct-1987 10:20:00    26-Oct-1987 11:16:00        "PS"           1550        "NA"           3780 sec           3360 sec       NaN sec     480 sec      60 sec    "SJC"     "BUR"       296      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
23-Oct-1987 22:18:00    23-Oct-1987        5        23-Oct-1987 20:55:00    23-Oct-1987 20:35:00    23-Oct-1987 21:57:00        "PS"           1589        "NA"           4980 sec           4920 sec       NaN sec    1260 sec    1200 sec    "SAN"     "SMF"       480      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
23-Oct-1987 14:31:00    23-Oct-1987        5        23-Oct-1987 13:32:00    23-Oct-1987 13:20:00    23-Oct-1987 14:18:00        "PS"           1655        "NA"           3540 sec           3480 sec       NaN sec     780 sec     720 sec    "BUR"     "SJC"       296      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
22-Oct-1987 07:46:00    22-Oct-1987        4        22-Oct-1987 06:29:00    22-Oct-1987 06:30:00    22-Oct-1987 07:42:00        "PS"           1702        "NA"           4620 sec           4320 sec       NaN sec     240 sec     -60 sec    "SMF"     "LAX"       373      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
28-Oct-1987 15:47:00    28-Oct-1987        3        28-Oct-1987 14:46:00    28-Oct-1987 13:43:00    28-Oct-1987 14:48:00        "PS"           1729        "NA"           3660 sec           3900 sec       NaN sec    3540 sec    3780 sec    "LAX"     "SJC"       308      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
08-Oct-1987 10:52:00    08-Oct-1987        4        08-Oct-1987 09:28:00    08-Oct-1987 09:30:00    08-Oct-1987 10:49:00        "PS"           1763        "NA"           5040 sec           4740 sec       NaN sec     180 sec    -120 sec    "SAN"     "SFO"       447      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
10-Oct-1987 11:34:00    10-Oct-1987        6        10-Oct-1987 08:59:00    10-Oct-1987 09:00:00    10-Oct-1987 11:23:00        "PS"           1800        "NA"           9300 sec           8580 sec       NaN sec     660 sec     -60 sec    "SEA"     "LAX"       954      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
20-Oct-1987 19:29:00    20-Oct-1987        2        20-Oct-1987 18:33:00    20-Oct-1987 18:30:00    20-Oct-1987 19:26:00        "PS"           1831        "NA"           3360 sec           3360 sec       NaN sec     180 sec     180 sec    "LAX"     "SJC"       308      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
15-Oct-1987 11:57:00    15-Oct-1987        4        15-Oct-1987 10:41:00    15-Oct-1987 10:40:00    15-Oct-1987 11:55:00        "PS"           1864        "NA"           4560 sec           4500 sec       NaN sec     120 sec      60 sec    "SFO"     "LAS"       414      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
15-Oct-1987 16:56:00    15-Oct-1987        4        15-Oct-1987 16:08:00    15-Oct-1987 15:53:00    15-Oct-1987 16:40:00        "PS"           1907        "NA"           2880 sec           2820 sec       NaN sec     960 sec     900 sec    "LAX"     "FAT"       209      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
21-Oct-1987 10:55:00    21-Oct-1987        3        21-Oct-1987 09:49:00    21-Oct-1987 09:40:00    21-Oct-1987 10:52:00        "PS"           1939        "NA"           3960 sec           4320 sec       NaN sec     180 sec     540 sec    "LGB"     "SFO"       354      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
22-Oct-1987 20:30:00    22-Oct-1987        4        22-Oct-1987 19:02:00    22-Oct-1987 18:47:00    22-Oct-1987 19:51:00        "PS"           1973        "NA"           5280 sec           3840 sec       NaN sec    2340 sec     900 sec    "LAX"     "OAK"       337      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
16-Oct-1987 20:52:00    16-Oct-1987        5        16-Oct-1987 19:10:00    16-Oct-1987 18:38:00    16-Oct-1987 19:55:00        "TW"             19        "NA"           9720 sec           8220 sec       NaN sec    3420 sec    1920 sec    "STL"     "DEN"       770      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
02-Oct-1987 12:37:00    02-Oct-1987        5        02-Oct-1987 11:30:00    02-Oct-1987 11:33:00    02-Oct-1987 12:37:00        "TW"             59        "NA"          11220 sec          11040 sec       NaN sec       0 sec    -180 sec    "STL"     "PHX"      1262      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
30-Oct-1987 19:20:00    30-Oct-1987        5        30-Oct-1987 14:00:00    30-Oct-1987 14:00:00    30-Oct-1987 19:34:00        "TW"            102        "NA"          12000 sec          12840 sec       NaN sec    -840 sec       0 sec    "SNA"     "STL"      1570      NaN sec    NaN sec      false            "NA"           false        NaN sec         NaN sec       NaN sec        NaN sec            NaN sec     
  ⋮

Conditionally Select Rows Using Row Filter

Conditionally select rows from a data set using the RowFilter property.

Create a Parquet datastore using the outages.parquet file. View the first 8 rows of the datastore.

pds = parquetDatastore("outages.parquet");
preview(pds)

ans=8×6 table
  Region            OutageTime          Loss     Customers       RestorationTime             Cause
___________    ____________________    ______    __________    ____________________    _________________

"SouthWest"    01-Feb-2002 12:18:00    458.98    1.8202e+06    07-Feb-2002 16:50:00    "winter storm"   
"SouthEast"    23-Jan-2003 00:49:00    530.14    2.1204e+05                     NaT    "winter storm"   
"SouthEast"    07-Feb-2003 21:15:00     289.4    1.4294e+05    17-Feb-2003 08:14:00    "winter storm"   
"West"         06-Apr-2004 05:44:00    434.81    3.4037e+05    06-Apr-2004 06:10:00    "equipment fault"
"MidWest"      16-Mar-2002 06:18:00    186.44    2.1275e+05    18-Mar-2002 23:23:00    "severe storm"   
"West"         18-Jun-2003 02:49:00         0             0    18-Jun-2003 10:54:00    "attack"         
"West"         20-Jun-2004 14:39:00    231.29           NaN    20-Jun-2004 19:16:00    "equipment fault"
"West"         06-Jun-2002 19:28:00    311.86           NaN    07-Jun-2002 00:51:00    "equipment fault"

Create a row filter that identifies rows with a Region of "NorthEast" and a Cause of "winter storm". Then, set the RowFilter property of the datastore to the filter. Preview the datastore and note that it now returns only rows that meet the filter conditions.

rf = rowfilter(pds);
filter = rf.Region == "NorthEast" & rf.Cause == "winter storm";
pds.RowFilter = filter;
preview(pds)

ans=8×6 table
  Region            OutageTime          Loss     Customers       RestorationTime           Cause
___________    ____________________    ______    __________    ____________________    ______________

"NorthEast"    13-Nov-2004 10:42:00       NaN    1.4227e+05    19-Nov-2004 02:31:00    "winter storm"
"NorthEast"    26-Dec-2004 22:18:00    255.45    1.0444e+05    27-Dec-2004 14:11:00    "winter storm"
"NorthEast"    17-Dec-2003 15:11:00       NaN         66692    19-Dec-2003 07:22:00    "winter storm"
"NorthEast"    28-Jan-2005 18:20:00    401.39         89683    29-Jan-2005 02:36:00    "winter storm"
"NorthEast"    04-Feb-2005 00:53:00    32.061         46182    09-Feb-2005 02:42:00    "winter storm"
"NorthEast"    16-Nov-2006 10:04:00    147.25    1.2571e+05    17-Nov-2006 10:55:00    "winter storm"
"NorthEast"    03-Feb-2007 02:19:00    293.83    1.1628e+05    04-Feb-2007 21:24:00    "winter storm"
"NorthEast"    18-Feb-2008 05:24:00    353.29         64687    20-Feb-2008 08:56:00    "winter storm"
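
A row filter can also combine text and numeric conditions, and it applies to every read operation, not only to preview. The following is a minimal sketch, assuming the same outages.parquet sample file is on the MATLAB path. The threshold of 300 and the variable names T and allMatch are illustrative choices, not part of the original example.

```matlab
% Assumes outages.parquet (the sample file used above) is on the MATLAB path.
pds = parquetDatastore("outages.parquet");

% Build a filter that mixes a text condition with a numeric condition.
% The threshold of 300 is an arbitrary value chosen for illustration.
rf = rowfilter(pds);
pds.RowFilter = rf.Region == "West" & rf.Loss > 300;

% readall applies the filter while reading, so T holds only matching rows.
T = readall(pds);

% Confirm that every returned row satisfies both conditions.
allMatch = all(T.Region == "West" & T.Loss > 300)
```

Rows whose Loss value is NaN do not satisfy the numeric comparison, so the filter excludes them.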

Limitations

Extended Capabilities

Thread-Based Environment

Run code in the background using MATLAB® backgroundPool or accelerate code with Parallel Computing Toolbox™ ThreadPool.

This function fully supports thread-based environments. For more information, see Run MATLAB Functions in Thread-Based Environment.
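
As a rough sketch of background execution, the code below schedules readall on the background pool and then collects the result. It assumes a release in which datastore functions support thread-based environments and that the outages.parquet sample file is on the MATLAB path. The variable names f and T are illustrative.

```matlab
% Assumes outages.parquet is on the MATLAB path.
pds = parquetDatastore("outages.parquet");

% Schedule readall on the background pool and request one output.
f = parfeval(backgroundPool, @readall, 1, pds);

% Other work can run here while the read proceeds in the background.

% Block until the read finishes, then collect the table.
T = fetchOutputs(f);
```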

Version History

Introduced in R2019a


R2024b: Read data over HTTP and HTTPS using datastore functions

You can read data from primary online sources by performing datastore operations over an internet URL.

R2023b: Create ParquetDatastore more efficiently with partition control in parallel environments

In parallel environments, you can create a ParquetDatastore more efficiently by specifying the unit of partition and the size of partition blocks. Specify the PartitionMethod and BlockSize name-value arguments when you create the datastore.
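
A minimal sketch of partition control follows. It assumes that "rowgroup" is an accepted PartitionMethod value in your release, so check the argument description before relying on it. The partition, numpartitions, and read functions are standard datastore functions, and the variable names are illustrative.

```matlab
% Assumes outages.parquet is on the MATLAB path and that "rowgroup" is a
% valid PartitionMethod value in your release (an assumption, see above).
pds = parquetDatastore("outages.parquet", PartitionMethod="rowgroup");

% Ask the datastore how many partitions it can be split into.
n = numpartitions(pds);

% Take the first of the n partitions and read one block from it.
sub = partition(pds, n, 1);
T1 = read(sub);
```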

R2022b: Read Parquet files containing structured data

Read structured data from Parquet files as nested tables.

R2022b: Use function in thread-based environments

This function supports thread-based environments.

R2022a: Read Parquet file data more efficiently using rowfilter to conditionally filter rows

Conditionally filter and read data faster (Predicate Pushdown) from Parquet files when using parquetread and parquetDatastore. You can create conditions for filtering by using the rowfilter function, matlab.io.RowFilter object, and RowFilter name-value argument.
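
For a single file, the same kind of condition can be passed directly to parquetread. This sketch assumes the outages.parquet sample file is on the MATLAB path and creates the filter from variable names rather than from a datastore. The variable names rf, cond, and T are illustrative.

```matlab
% Create a row filter from the names of the variables to test.
rf = rowfilter(["Region" "Cause"]);

% Combine two conditions into one filter.
cond = rf.Region == "NorthEast" & rf.Cause == "winter storm";

% Read only the rows that satisfy the filter.
T = parquetread("outages.parquet", RowFilter=cond);
```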

R2022a: Specify FileSet objects as data locations

parquetDatastore accepts FileSet objects as the locations of files to include in the datastore. FileSet objects provide increased performance compared to file paths or DsFileSet objects.
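
A short sketch of this workflow, assuming the outages.parquet sample file is on the MATLAB path. The call to which resolves the full path of the file so that the FileSet location is unambiguous. The variable names are illustrative.

```matlab
% Resolve the full path of the sample file (assumes it is on the MATLAB path).
location = which("outages.parquet");

% Describe the file collection with a FileSet object.
fs = matlab.io.datastore.FileSet(location);

% Create the datastore from the FileSet and read the data.
pds = parquetDatastore(fs);
T = readall(pds);
```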

R2021a: Use categorical data in Parquet data format

Use Parquet data that contains the categorical data type.

R2019b: Write tabular data containing any characters

Use tabular data that has variable names containing any Unicode characters, including spaces and non-ASCII characters.