spreadsheetDatastore - Datastore for spreadsheet files - MATLAB (original) (raw)

Datastore for spreadsheet files

Description

Use a spreadsheetDatastore object to manage large collections of spreadsheet files where the collection does not necessarily fit in memory. You can create a spreadsheetDatastore object using thespreadsheetDatastore function, specify its properties, and then import the data using object functions.

Creation

Syntax

Description

`ssds` = spreadsheetDatastore([location](#d126e347561)) creates a spreadsheet datastore from the collection of data specified bylocation.

example

`ssds` = spreadsheetDatastore([location](#d126e347561),[Name,Value](#namevaluepairarguments)) specifies additional parameters and properties for ssds using one or more name-value arguments. For example,spreadsheetDatastore(location,"FileExtensions",[".xlsx",".xls"]) specifies which files to include in the datastore depending on the file extensions.

Input Arguments

expand all

location — Files or folders to include in datastore

FileSet | DsFileSet object | string array | character vector | cell array of character vectors

Files or folders to include in the datastore, specified as one of these values:

Files or folders can be local or remote:

When you specify a folder, the datastore includes only files with supported file formats and ignores files with any other format. To specify a custom list of file extensions to include in your datastore, see the FileExtensions name-value argument.

The spreadsheetDatastore function supports files with the extensions: .xls, .xlsx,.xlsm, .xltx, and.xltm.

Example: "file1.xlsx"

Example: "../dir/data/file1.xlsx"

Example: ["C:\dir\data\file1.xlsx","C:\dir\data\file2.xlsx"]

Example: "s3://bucketname/path_to_files/*.xls"

Name-Value Arguments

Specify optional pairs of arguments asName1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: ssds = spreadsheetDatastore("C:\dir\spreadsheetdata",FileExtensions=[".xls",".xlsm"])

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: ssds = spreadsheetDatastore("C:\dir\spreadsheetdata","FileExtensions",[".xls",".xlsm"])

FileExtensions — Spreadsheet file extensions

character vector | cell array of character vectors | string scalar | string array

Spreadsheet file extensions, specified as the name-value argument consisting of "FileExtensions" and a character vector, cell array of character vectors, string scalar, or string array.

Example: "FileExtensions",''

Example: "FileExtensions",".xls"

Example: "FileExtensions",[".xlsx",".xlsm"]

Data Types: char | cell | string

IncludeSubfolders — Subfolder inclusion flag

true or false | 0 or 1

Subfolder inclusion flag, specified as the name-value argument consisting of "IncludeSubfolders" andtrue, false, 0, or 1. Specify true to include all files and subfolders within each folder or false to include only the files within each folder.

If you do not specify "IncludeSubfolders", then the default value is false.

Example: "IncludeSubfolders",true

Data Types: logical | double

OutputType — Output datatype

"auto" (default) | "table" | "timetable"

Output datatype, specified as the name-value argument consisting of"OutputType" and one of these values:

The value of OutputType determines the data type returned by the preview, read, and readall functions. Use this option in conjunction with the"RowTimes" name-value pair to return timetables fromspreadsheetDatastore.

Example: "OutputType","timetable"

Data Types: char | string

AlternateFileSystemRoots — Alternate file system root paths

string vector | cell array

Alternate file system root paths, specified as the name-value argument consisting of"AlternateFileSystemRoots" and a string vector or a cell array. Use"AlternateFileSystemRoots" when you create a datastore on a local machine, but need to access and process the data on another machine (possibly of a different operating system). Also, when processing data using the Parallel Computing Toolbox™ and the MATLAB® Parallel Server™, and the data is stored on your local machines with a copy of the data available on different platform cloud or cluster machines, you must use"AlternateFileSystemRoots" to associate the root paths.

The value of "AlternateFileSystemRoots" must satisfy these conditions:

For more information, see Set Up Datastore for Processing on Different Machines or Clusters.

Example: ["Z:\datasets","/mynetwork/datasets"]

Data Types: string | cell

TextType — Output data type of text variables

"char" (default) | "string"

Output data type of text variables, specified as the name-value argument consisting of "TextType" and either"char" or "string".

Data Types: char | string

In addition to these name-value pairs, you also can specify any of the properties on this page as name-value pairs (except for theFiles property).

Properties

expand all

SpreadsheetDatastore properties describe the format of the files in a datastore object, and control how the data is read from the datastore. Except for the Files property, you can specify the value of SpreadsheetDatastore properties using name-value pair arguments when you create the datastore object. To view or modify a property after creating the object, use the dot notation:

ds = spreadsheetDatastore("airlinesmall_subset.xlsx"); ds.Sheets = [1 2]; ds.Range = "A1:C6";

File Properties

Files — Files included in datastore

cell array of character vectors | string array

Files included in the datastore, resolved as a cell array of character vectors or a string array, where each character vector or string is a full path to a file. The location argument in thespreadsheetDatastore anddatastore functions define these files.

The first file specified in the cell array determines the variable names and format information for all files in the datastore.

Example: ["C:\dir\data\file1.xls";"C:\dir\data\file2.xls"]

Data Types: cell | string

— Number of lines to skip

0 (default) | positive integer

Number of lines to skip at the beginning of each sheet when reading, specified as a positive integer. When you also specify theRange property, NumHeaderLines is the number of lines to skip at the beginning of the specified block of data.

Data Types: double

Sheets — Sheets in files

"" (default) | sheet names | sheet indices

Sheets in files, specified as a character vector, cell array of character vectors, string scalar, or string array containing sheet names, or as a numeric vector of sheet indices. The empty character vector '' indicates that all sheets in the files are included.

Example: ["sheet1","sheet7"]

Example: [3 5 7]

Data Types: char | cell | string | double

Range — Row and column bounds

"" (default) | character vector | string scalar

Row and column bounds, specified as a character vector or string scalar that defines a rectangular block of data in the sheets. The empty character vector '' indicates that the bounds are the beginning of the file and the end of the data.

Example: "B1:T7"

Example: "A:C"

Data Types: char | string

ReadVariableNames — Indicator for reading variable names

true (default) | false | 1 | 0

Indicator for reading the first row of the first file in the datastore as variable names, specified as either true (1) or false (0).

Data Types: logical | double

VariableNamingRule — Flag to preserve variable names

"modify" (default) | "preserve"

Flag to preserve variable names, specified as either "modify" or"preserve".

Starting in R2019b, variable names and row names can include any characters, including spaces and non-ASCII characters. Also, they can start with any characters, not just letters. Variable and row names do not have to be valid MATLAB identifiers (as determined by the isvarname function). To preserve these variable names and row names, set the value of VariableNamingRule to "preserve". Variable names are not refreshed when the value of VariableNamingRule is changed from "modify" to "preserve".

Data Types: char | string

VariableNames — Names of variables

character vector | cell array of character vectors | string scalar | string array

Names of variables in the datastore, specified as a character vector, cell array of character vectors, string scalar, or string array. Specify the variable names in the order in which they appear in the files. If you do not specify the variable names, the datastore detects them from the first nonheader line in the first file. You can specifyVariableNames with a character vector or string scalar, however the datastore converts and stores the property value to a cell array of character vectors. When modifying the VariableNames property, the number of new variable names must match the number of original variable names.

To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value of the VariableNamingRule parameter to "preserve".

If ReadVariableNames is false, thenVariableNames defaults to ["Var1","Var2", ...].

Example: ["Time","Date","Quantity"]

Data Types: char | cell | string

VariableTypes — Variable types

"double" | "char" | "string" | "categorical" | "datetime"

Variable types, specified as "double","char", "string","categorical", or "datetime", which indicates the type of each variable when reading the data.

The list of variable types corresponds with the variables inVariableNames. Types double,char, and datetime can be automatically detected from the data. You can specifyVariableTypes as a character vector or string scalar, however spreadsheetDatastore automatically converts it to a cell array of character vectors or a string array.

Example: ["char","categorical"]

Data Types: char | cell | string

Properties for preview, read, readall Table

SelectedVariableNames — Variables to read

character vector | cell array of character vectors | string scalar | string array

Variables to read from the file, specified as a character vector, cell array of character vectors, string scalar, or string array. Each character vector or string contains the name of one variable. You can specify the variable names in any order. You can specifySelectedVariableNames with a character vector or string scalar, however spreadsheetDatastore automatically converts it to a cell array of character vectors or a string array.

To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value ofVariableNamingRule parameter to"preserve".

Example: ["Var3","Var7","Var4"]

Data Types: char | cell | string

SelectedVariableTypes — Selected variable types

"double" | "char" | "string" | "categorical" | "datetime"

Selected variable types, specified as "double","char", "string","categorical", or "datetime", which indicates the type of each selected variable when reading the data. The list of variable types corresponds with the variables inSelectedVariableNames. Typesdouble, char, anddatetime can be automatically detected from the data. You can specify SelectedVariableTypes as a character vector or string scalar, however it is automatically converted to a cell array of character vectors or string array.

Example: ["double","datetime"]

Data Types: char | cell | string

ReadSize — Amount of data to read

"file" (default) | "sheet" | positive integer scalar

Amount of data to read in a call to the read function, specified as "file" or"sheet", or as a positive integer scalar.

When you change ReadSize from an integer scalar to"file" or "sheet", or conversely, the datastore resets using the reset function.

Data Types: char | string | double

RowTimes — Name of row times variable

variable name | variable index

Name of row times variable, specified as the name-value argument consisting of"RowTimes" and a variable name (such as"Date") or a variable index (such as 3).

RowTimes is a timetable-related parameter. Each row of a timetable is associated with a time, which is captured in a time vector for the timetable. The variable specified in RowTimes must contain adatetime or a duration vector.

If the value of "OutputType" is "timetable", but you do not specify "RowTimes", then spreadsheetDatastore uses the first datetime or duration variable as the row times for the timetable.

Properties for use by writeall

Folders — Folders used to construct datastore

cell array of character vectors

This property is read-only.

Folders used to construct datastore, returned as a cell array of character vectors. The cell array is oriented as a column vector. Each character vector is a path to a folder that contains data files. Thelocation argument in thespreadsheetDatastore anddatastore functions definesFolders when the datastore is created.

The Folders property is reset when you modify theFiles property of aSpreadsheetDatastore object.

Data Types: cell

SupportedOutputFormats — Formats supported for writing

string row vector

This property is read-only.

Formats supported for writing, returned as a row vector of strings. This property specifies the possible output formats when using writeall to write output files from the datastore.

DefaultOutputFormat — Default output format

string scalar

This property is read-only.

Default output format, returned as a string scalar. This property specifies the default format when using writeall to write output files from the datastore.

Data Types: string

Object Functions

Examples

collapse all

Create SpreadsheetDatastore Object

Create a spreadsheetDatastore object using either a FileSet object or a file path.

Create a FileSet object. Create a spreadsheetDatastore object.

fs = matlab.io.datastore.FileSet("airlinesmall_subset.xlsx"); ssds = spreadsheetDatastore(fs)

ssds = SpreadsheetDatastore with properties:

                  Files: {
                         ' ...\Documents\MATLAB\Examples\airlinesmall_subset.xlsx'
                         }
                Folders: {
                         ' ...\Documents\MATLAB\Examples'
                         }

AlternateFileSystemRoots: {} Sheets: '' Range: ''

Sheet Format Properties: NumHeaderLines: 0 VariableNamingRule: 'modify' ReadVariableNames: true VariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more} VariableTypes: {'double', 'double', 'double' ... and 26 more}

Properties that control the table returned by preview, read, readall: SelectedVariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more} SelectedVariableTypes: {'double', 'double', 'double' ... and 26 more} ReadSize: 'file' OutputType: 'table' RowTimes: []

Write-specific Properties: SupportedOutputFormats: ["txt" "csv" "xlsx" "xls" "parquet" "parq"] DefaultOutputFormat: "xlsx"

Alternatively, you can specify your data with a file path.

ssds = spreadsheetDatastore("airlinesmall_subset.xlsx");

Read Spreadsheet File

Create a SpreadsheetDatastore object containing the file airlinesmall_subset.xlsx.

ssds = spreadsheetDatastore("airlinesmall_subset.xlsx")

ssds = SpreadsheetDatastore with properties:

                  Files: {
                         ' ...\Documents\MATLAB\Examples\airlinesmall_subset.xlsx'
                         }
                Folders: {
                         ' ...\Documents\MATLAB\Examples'
                         }

AlternateFileSystemRoots: {} Sheets: '' Range: ''

Sheet Format Properties: NumHeaderLines: 0 VariableNamingRule: 'modify' ReadVariableNames: true VariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more} VariableTypes: {'double', 'double', 'double' ... and 26 more}

Properties that control the table returned by preview, read, readall: SelectedVariableNames: {'Year', 'Month', 'DayofMonth' ... and 26 more} SelectedVariableTypes: {'double', 'double', 'double' ... and 26 more} ReadSize: 'file' OutputType: 'table' RowTimes: []

Write-specific Properties: SupportedOutputFormats: ["txt" "csv" "xlsx" "xls" "parquet" "parq"] DefaultOutputFormat: "xlsx"

Display the sheet names for the file. The file contains one sheet per year.

ans = 13x1 string "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008"

Specify the variable FlightNum in the second sheet as the data of interest, and preview the first eight rows.

ssds.Sheets = 2; ssds.SelectedVariableNames = "FlightNum"; preview(ssds)

ans=8×1 table FlightNum _________

  1014   
  1201   
   702   
  1184   
  1310   
  1759   
  1242   
  1558   

Read only the first three rows of variables DepTime andArrTime in the first sheet.

ssds.ReadSize = 3; ssds.Sheets = 1; ssds.SelectedVariableNames = ["DepTime","ArrTime"]; read(ssds)

ans=3×2 table DepTime ArrTime _______ _______

 2117       2305  
 1252       1511  
 1441       1708  

Read all of sheets four, five, and six.

ssds.Sheets = 4:6; readall(ssds);

Return Timetable from Spreadsheet Datastore

Use the OutputType and RowTimes name-value pairs to make SpreadsheetDatastore return timetables instead of tables.

Create a datastore for tsunamis_dated.xlsx. Specify "OutputType" as "timetable" so that SpreadsheetDatastore returns timetables instead of tables.

ssds = spreadsheetDatastore("tsunamis_dated.xlsx","OutputType","timetable"); preview(ssds)

ans=8×15 timetable Date Latitude Longitude ValidityCode Validity CauseCode Cause EarthquakeMagnitude Country Location MaxHeight IidaMagnitude Intensity NumDeaths DescDeaths Time
___________ ________ _________ ____________ ____________________ _________ ____________________________ ___________________ _____________ ____________________________ _________ _____________ _________ _________ __________ ________________________

04-Mar-1952     42.15        143.85          4          {'definite tsunami'}        1        {'Earthquake'              }            8.1            {'JAPAN'    }    {'SE. HOKKAIDO ISLAND'     }        6.5           2.7              2           33           1         04-Mar-1952 01:22:41.000
10-Jul-1958     58.34       -136.52          4          {'definite tsunami'}        3        {'Earthquake and Landslide'}            8.3            {'USA'      }    {'SE. ALASKA, AK'          }     524.26           4.6              5            5           1         10-Jul-1958 06:15:53.600
22-May-1960     -39.5         -74.5          4          {'definite tsunami'}        1        {'Earthquake'              }            9.5            {'CHILE'    }    {'CENTRAL CHILE'           }         25           4.6              4         1260           3         22-May-1960 19:11:17.000
20-Nov-1960      -6.8         -80.7          4          {'definite tsunami'}        1        {'Earthquake'              }            6.8            {'PERU'     }    {'PERU'                    }          9           3.2            2.5           66           2         20-Nov-1960 22:01:56.400
28-Mar-1964      61.1        -147.5          4          {'definite tsunami'}        3        {'Earthquake and Landslide'}            9.2            {'USA'      }    {'PRINCE WILLIAM SOUND, AK'}         67           6.1              5          221           3         28-Mar-1964 03:36:14.000
16-Jun-1964     38.65         139.2          4          {'definite tsunami'}        1        {'Earthquake'              }            7.5            {'JAPAN'    }    {'NW. HONSHU ISLAND'       }        5.8           2.7              2           26           1         16-Jun-1964 04:01:44.300
14-Aug-1968       0.2         119.8          4          {'definite tsunami'}        1        {'Earthquake'              }            7.8            {'INDONESIA'}    {'BANDA SEA'               }         10           3.3              3          200           3         14-Aug-1968 22:14:19.400
23-Feb-1969      -3.1         118.9          4          {'definite tsunami'}        1        {'Earthquake'              }            6.9            {'INDONESIA'}    {'MAKASSAR STRAIT'         }          4             2              2          600           3         23-Feb-1969 00:36:56.600

When you do not specify "RowTimes", spreadsheetDatastore uses the first datetime or duration variable as the row times. In this case, the Date variable is used for the row times. This data has two datetime variables: Date and Time. The Date variable does not include information about hours, minutes, or seconds, while the Time variable has the specific time of each event.

Specify the "RowTimes" option to use the event times (the Time variable) as the row times.

ssds = spreadsheetDatastore("tsunamis_dated.xlsx","OutputType","timetable","RowTimes","Time"); preview(ssds)

ans=8×15 timetable Time Latitude Longitude ValidityCode Validity CauseCode Cause EarthquakeMagnitude Country Location MaxHeight IidaMagnitude Intensity NumDeaths DescDeaths Date
________________________ ________ _________ ____________ ____________________ _________ ____________________________ ___________________ _____________ ____________________________ _________ _____________ _________ _________ __________ ___________

04-Mar-1952 01:22:41.000     42.15        143.85          4          {'definite tsunami'}        1        {'Earthquake'              }            8.1            {'JAPAN'    }    {'SE. HOKKAIDO ISLAND'     }        6.5           2.7              2           33           1         04-Mar-1952
10-Jul-1958 06:15:53.600     58.34       -136.52          4          {'definite tsunami'}        3        {'Earthquake and Landslide'}            8.3            {'USA'      }    {'SE. ALASKA, AK'          }     524.26           4.6              5            5           1         10-Jul-1958
22-May-1960 19:11:17.000     -39.5         -74.5          4          {'definite tsunami'}        1        {'Earthquake'              }            9.5            {'CHILE'    }    {'CENTRAL CHILE'           }         25           4.6              4         1260           3         22-May-1960
20-Nov-1960 22:01:56.400      -6.8         -80.7          4          {'definite tsunami'}        1        {'Earthquake'              }            6.8            {'PERU'     }    {'PERU'                    }          9           3.2            2.5           66           2         20-Nov-1960
28-Mar-1964 03:36:14.000      61.1        -147.5          4          {'definite tsunami'}        3        {'Earthquake and Landslide'}            9.2            {'USA'      }    {'PRINCE WILLIAM SOUND, AK'}         67           6.1              5          221           3         28-Mar-1964
16-Jun-1964 04:01:44.300     38.65         139.2          4          {'definite tsunami'}        1        {'Earthquake'              }            7.5            {'JAPAN'    }    {'NW. HONSHU ISLAND'       }        5.8           2.7              2           26           1         16-Jun-1964
14-Aug-1968 22:14:19.400       0.2         119.8          4          {'definite tsunami'}        1        {'Earthquake'              }            7.8            {'INDONESIA'}    {'BANDA SEA'               }         10           3.3              3          200           3         14-Aug-1968
23-Feb-1969 00:36:56.600      -3.1         118.9          4          {'definite tsunami'}        1        {'Earthquake'              }            6.9            {'INDONESIA'}    {'MAKASSAR STRAIT'         }          4             2              2          600           3         23-Feb-1969

Limitations

Version History

Introduced in R2016a

expand all

R2024b: Read data over HTTP and HTTPS using datastore functions

You can read data from primary online sources by performing datastore operations over an internet URL.