write - Write tall array to local and remote locations for checkpointing

Syntax

write(location,tA)
write(filepattern,tA)
write(___,Name,Value)

Description

write(location,tA) calculates the values in tall array tA and writes the array to files in the folder specified by location. The data is stored in an efficient binary format suitable for reading back using datastore(location).

write(filepattern,tA) uses the file extension from filepattern to determine the output format. filepattern must include a folder to write the files into, followed by a file name that includes a wildcard *. The wildcard represents incremental numbers for generating unique file names. For example, write('folder/myfile_*.csv',tA).

write(___,Name,Value) specifies additional options with one or more name-value pair arguments using any of the previous syntaxes. For example, you can specify the file type with 'FileType' and a valid file type ('mat', 'seq', 'parquet', 'text', or 'spreadsheet'), or you can specify a custom write function to process the data with 'WriteFcn' and a function handle.

Examples

Write a tall array to disk, and then recover the tall array by creating a new datastore for the written files. This process is useful to save your work or share a tall array with a colleague.

Create a datastore for the airlinesmall.csv data set. Select only the Year, Month, and UniqueCarrier variables, and treat 'NA' values as missing data. Convert the datastore into a tall table.

ds = datastore('airlinesmall.csv');
ds.TreatAsMissing = 'NA';
ds.SelectedVariableNames = {'Month','Year','UniqueCarrier'};
tt = tall(ds)

tt =

M×3 tall table

Month    Year    UniqueCarrier
_____    ____    _____________

 10      1987       {'PS'}    
 10      1987       {'PS'}    
 10      1987       {'PS'}    
 10      1987       {'PS'}    
 10      1987       {'PS'}    
 10      1987       {'PS'}    
 10      1987       {'PS'}    
 10      1987       {'PS'}    
  :       :            :
  :       :            :

Sort the data in descending order by year and extract the top 25 rows. The resulting tall table is unevaluated.

tt_new = topkrows(tt,25,'Year')

tt_new =

M×3 tall table

Month    Year    UniqueCarrier
_____    ____    _____________

  ?       ?            ?      
  ?       ?            ?      
  ?       ?            ?      
  :       :            :
  :       :            :

Preview deferred. Learn more.

Save the results to a new folder named ExampleData on the C:\ disk. (You can specify a different write location, especially if you are not using a Windows® computer.) The write function evaluates the tall array prior to writing the files, so there is no need to use the gather function prior to saving the data.

location = 'C:\ExampleData';
write(location,tt_new)

Writing tall data to folder C:\ExampleData
Evaluating tall expression using the Local MATLAB Session:
Pass 1 of 1: Completed in 0.25 sec
Evaluation completed in 0.65 sec

Clear tt and ds from your workspace. To recover the tall table that was written to disk, first create a new datastore that references the same folder. Then convert the datastore into a tall table. Since the tall table was evaluated before being written to disk, the display now includes a preview of the values.

clear tt ds
ds2 = datastore(location);
tt2 = tall(ds2)

tt2 =

M×3 tall table

Month    Year    UniqueCarrier
_____    ____    _____________

  1      2008       {'WN'}    
  1      2008       {'WN'}    
  1      2008       {'WN'}    
  1      2008       {'WN'}    
  1      2008       {'WN'}    
  1      2008       {'WN'}    
  1      2008       {'WN'}    
  1      2008       {'WN'}    
  :       :            :
  :       :            :

Create a tall table referencing the tsunamis.xlsx data file, which contains time-stamped data about the location, magnitude, and cause of tsunamis.

ds = spreadsheetDatastore('tsunamis.xlsx');
T = tall(ds)

T =

M×20 tall table

Latitude    Longitude    Year    Month    Day    Hour    Minute    Second    ValidityCode            Validity             CauseCode          Cause           EarthquakeMagnitude          Country                   Location             MaxHeight    IidaMagnitude    Intensity    NumDeaths    DescDeaths
________    _________    ____    _____    ___    ____    ______    ______    ____________    _________________________    _________    __________________    ___________________    ___________________    __________________________    _________    _____________    _________    _________    __________

  -3.8        128.3      1950     10       8       3       23       NaN           2          {'questionable tsunami' }        1        {'Earthquake'    }            7.6            {'INDONESIA'      }    {'JAVA TRENCH, INDONESIA'}       2.8            1.5            1.5          NaN          NaN    
  19.5         -156      1951      8      21      10       57       NaN           4          {'definite tsunami'     }        1        {'Earthquake'    }            6.9            {'USA'            }    {'HAWAII'                }       3.6            1.8            NaN          NaN          NaN    
 -9.02       157.95      1951     12      22     NaN      NaN       NaN           2          {'questionable tsunami' }        6        {'Volcano'       }            NaN            {'SOLOMON ISLANDS'}    {'KAVACHI'               }         6            2.6            NaN          NaN          NaN    
 42.15       143.85      1952      3       4       1       22        41           4          {'definite tsunami'     }        1        {'Earthquake'    }            8.1            {'JAPAN'          }    {'SE. HOKKAIDO ISLAND'   }       6.5            2.7              2           33            1    
  19.1         -155      1952      3      17       3       58       NaN           4          {'definite tsunami'     }        1        {'Earthquake'    }            4.5            {'USA'            }    {'HAWAII'                }         1            NaN            NaN          NaN          NaN    
  43.1        -82.4      1952      5       6     NaN      NaN       NaN           1          {'very doubtful tsunami'}        9        {'Meteorological'}            NaN            {'USA'            }    {'LAKE HURON, MI'        }      1.52            NaN            NaN          NaN          NaN    
 52.75        159.5      1952     11       4      16       58       NaN           4          {'definite tsunami'     }        1        {'Earthquake'    }              9            {'RUSSIA'         }    {'KAMCHATKA'             }        18            4.2              4         2236            3    
    50        156.5      1953      3      18     NaN      NaN       NaN           3          {'probable tsunami'     }        1        {'Earthquake'    }            5.8            {'RUSSIA'         }    {'N. KURIL ISLANDS'      }       1.5            0.6            NaN          NaN          NaN    
   :            :         :        :       :      :        :         :            :                      :                    :                :                      :                      :                         :                     :              :              :            :            :
   :            :         :        :       :      :        :         :            :                      :                    :                :                      :                      :                         :                     :              :              :            :            :

Combine the Year, Month, Day, Hour, Minute, and Second variables into a single datetime variable, and then remove those variables from the table. Remove any rows that contain missing data.

T.DateTime = datetime(T.Year, T.Month, T.Day, T.Hour, T.Minute, T.Second);
T(:,3:8) = [];
TT = rmmissing(T)

TT =

M×15 tall table

Latitude    Longitude    ValidityCode          Validity          CauseCode               Cause                EarthquakeMagnitude       Country                 Location              MaxHeight    IidaMagnitude    Intensity    NumDeaths    DescDeaths          DateTime      
________    _________    ____________    ____________________    _________    ____________________________    ___________________    _____________    ____________________________    _________    _____________    _________    _________    __________    ____________________

 42.15        143.85          4          {'definite tsunami'}        1        {'Earthquake'              }            8.1            {'JAPAN'    }    {'SE. HOKKAIDO ISLAND'     }        6.5           2.7              2           33           1         04-Mar-1952 01:22:41
 58.34       -136.52          4          {'definite tsunami'}        3        {'Earthquake and Landslide'}            8.3            {'USA'      }    {'SE. ALASKA, AK'          }     524.26           4.6              5            5           1         10-Jul-1958 06:15:53
 -39.5         -74.5          4          {'definite tsunami'}        1        {'Earthquake'              }            9.5            {'CHILE'    }    {'CENTRAL CHILE'           }         25           4.6              4         1260           3         22-May-1960 19:11:17
  -6.8         -80.7          4          {'definite tsunami'}        1        {'Earthquake'              }            6.8            {'PERU'     }    {'PERU'                    }          9           3.2            2.5           66           2         20-Nov-1960 22:01:56
  61.1        -147.5          4          {'definite tsunami'}        3        {'Earthquake and Landslide'}            9.2            {'USA'      }    {'PRINCE WILLIAM SOUND, AK'}         67           6.1              5          221           3         28-Mar-1964 03:36:14
 38.65         139.2          4          {'definite tsunami'}        1        {'Earthquake'              }            7.5            {'JAPAN'    }    {'NW. HONSHU ISLAND'       }        5.8           2.7              2           26           1         16-Jun-1964 04:01:44
   0.2         119.8          4          {'definite tsunami'}        1        {'Earthquake'              }            7.8            {'INDONESIA'}    {'BANDA SEA'               }         10           3.3              3          200           3         14-Aug-1968 22:14:19
  -3.1         118.9          4          {'definite tsunami'}        1        {'Earthquake'              }            6.9            {'INDONESIA'}    {'MAKASSAR STRAIT'         }          4             2              2          600           3         23-Feb-1969 00:36:56
   :            :             :                   :                  :                     :                           :                   :                       :                      :              :              :            :            :                  :
   :            :             :                   :                  :                     :                           :                   :                       :                      :              :              :            :            :                  :

Write the table as a spreadsheet file to a remote location in Amazon S3™ storage. To read or write data to Amazon S3 you must set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables using the credentials for your account. For more information, see Work with Remote Data.

setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');

location = 's3://bucketname/preprocessedData/';
write(location, TT, 'FileType', 'spreadsheet')

To read the data back, use datastore to point to the remote location where the data now resides.

ds = datastore(location);
tt = tall(ds);

Create and use a custom writing function to write data in additional formats that are not directly supported by write, such as image files.

Create an image datastore that references all of the sample images in the toolbox/matlab/demos folder. The selected images have the extensions .jpg, .tif, and .png. Convert the datastore to a tall cell array.

demoFolder = fullfile(matlabroot,'toolbox','matlab','demos');
ds = imageDatastore(demoFolder,'FileExtensions',{'.jpg' '.tif' '.png'});
T = tall(ds);

Bring one of the images into memory and display it.

I = gather(T(1));

Evaluating tall expression using the Local MATLAB Session:
Pass 1 of 1: Completed in 3 sec
Evaluation completed in 3.2 sec

imshow(I{1},'InitialMagnification',30)

write does not support image files directly, so to write the images out in a different format, you must create a new function to handle the file writing. The writing function receives two inputs from write:

info is a structure containing fields with information about the current block of data. You can use these fields to construct your own unique file name, or simply use the SuggestedFilename field to use a name suggested by write.
data is the current block of data, obtained by using read on the datastore.

The function imageWriter uses the filename suggested by write, and uses imwrite to write the image files to disk as .jpg files. Save this function in your current working folder.

function imageWriter(info, data)
  filename = info.SuggestedFilename;
  imwrite(data{:}, filename)
end

Write the images in the datastore to a new folder named exampleImages on the C:\ disk. (You can use a different location, especially if you are not using a Windows® computer.) Pass imageWriter as the custom write function using the 'WriteFcn' name-value pair argument.

location = 'C:\exampleImages\image_*.jpg';
write(location, T, 'WriteFcn', @imageWriter)

Writing tall data to folder C:\exampleImages
Evaluating tall expression using the Local MATLAB Session:
Pass 1 of 1: Completed in 0.71 sec
Evaluation completed in 0.98 sec

Display the contents of the folder where the files were written.

. image_1_000001.jpg image_3_000001.jpg image_5_000001.jpg
.. image_2_000001.jpg image_4_000001.jpg image_6_000001.jpg

To read the images back into MATLAB®, create a datastore that references the same location.

ds = imageDatastore(location);
T = tall(ds)

T =

6×1 tall cell array

{1024×2048×3 uint8}
{ 650×600×3  uint8}
{1024×2048×3 uint8}
{ 650×600×3  uint8}
{ 480×640×3  uint8}
{ 480×640×3  uint8}

Input Arguments

Folder location to write data, specified as a character vector or string. location can specify a full or relative path. The specified folder can be either of these options:

Existing empty folder
New folder that write creates

You can write data to local folders on your computer, folders on a shared network, or to remote locations in HDFS™, Azure®, or Amazon S3™. For more information about reading and writing data to remote locations, see Work with Remote Data.

Additional considerations apply for Hadoop® and Apache® Spark™:

If the folder is not available locally, then the full path of the folder must be a uniform resource locator (URL) of the form hdfs:///pathtofile.
Before writing to HDFS, set the HADOOP_HOME, HADOOP_PREFIX, or MATLAB_HADOOP_INSTALL environment variable to the folder where Hadoop is installed.
Before writing to Apache Spark, set the SPARK_HOME environment variable to the folder where Apache Spark is installed.

Example: location = 'hdfs:///some/output/folder' specifies an HDFS URL.

Example: location = '../../dir/data' specifies a relative file path.

Example: location = 'C:\Users\MyName\Desktop\data' specifies an absolute path to a Windows® desktop folder.

Example: location = 'file:///path/to/data' specifies an absolute URI path to a folder.

Data Types: char | string

Input array tA, specified as a tall array.

File naming pattern, specified as a string or a character vector. The file naming pattern must contain a folder to write the files into, followed by a file name that includes a wildcard *. The write function replaces the wildcard with sequential numbers to ensure unique file names.

Example: write('folder/data_*.txt',tA) writes the tall array tA as a series of .txt files in folder with the file names data_1.txt, data_2.txt, and so on.

Data Types: char | string
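
As a rough illustration of the wildcard behavior, the following sketch writes a small in-memory table as CSV files and lists the files that were created. The scratch folder name tallPatternDemo and the random stand-in data are assumptions made for this example. The number of files produced depends on how the tall array is partitioned.

```matlab
% Sketch only: folder name and data are hypothetical stand-ins.
outDir = fullfile(tempdir,'tallPatternDemo');
if isfolder(outDir)
    rmdir(outDir,'s');                      % write needs an empty or new folder
end

tA = tall(array2table(rand(100,3)));        % small stand-in for a real tall table
write(fullfile(outDir,'data_*.csv'),tA);    % the .csv extension selects delimited text

listing = dir(fullfile(outDir,'*.csv'));    % files generated from the pattern
disp({listing.name}')
```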

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: write('C:\myData', tX, 'FileType', 'text', 'WriteVariableNames', false) writes the tall array tX to C:\myData as a collection of text files that do not use variable names as column headings.

General Options

Type of file, specified as the comma-separated pair consisting of 'FileType' and one of the allowed file types: 'auto', 'mat', 'parquet', 'seq', 'text', or 'spreadsheet'.

Use the 'FileType' name-value pair with the location argument to specify what type of files to write. By default, write attempts to automatically detect the proper file type. You do not need to specify the 'FileType' name-value pair argument if write can determine the file type from an extension in the location or filepattern arguments. write can determine the file type from these extensions:

.mat for MATLAB® data files
.parquet or .parq for Parquet files
.seq for sequence files
.txt, .dat, or .csv for delimited text files
.xls, .xlsx, .xlsb, .xlsm, .xltx, or .xltm for spreadsheet files

Example: write('C:\myData', tX, 'FileType', 'text')
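
The following sketch shows one way to pair a plain folder location with 'FileType' and then read the result back. The folder name tallTextDemo and the random data are assumptions for illustration only.

```matlab
% Sketch only: folder name and data are hypothetical stand-ins.
outDir = fullfile(tempdir,'tallTextDemo');
if isfolder(outDir)
    rmdir(outDir,'s');                    % start from a folder that does not exist
end

tX = tall(array2table(rand(50,2)));       % small stand-in for a real tall table
write(outDir,tX,'FileType','text');       % folder has no extension, so name the type

ds = datastore(outDir);                   % datastore detects the file type on read
tBack = gather(tall(ds));                 % bring the small result into memory
disp(height(tBack))
```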

Custom writing function, specified as the comma-separated pair consisting of 'WriteFcn' and a function handle. The specified function receives blocks of data from tA and is responsible for creating the output files. You can use the 'WriteFcn' name-value pair argument to write data in a variety of formats, even if write does not directly support the output format.

Functional Signature

The custom writing function must accept two input arguments, info and data:

function myWriter(info, data)

data contains a block of data from tA.

info is a structure with fields that contain information about the block of data. You can use the fields to build a new file name that is globally unique within the final location. The structure fields are:

Field	Description
RequiredLocation	Fully qualified path to a temporary output folder. Only files written to this folder are copied to the final destination.
RequiredFilePattern	The file pattern required for output file names. This field is empty if only a folder name is specified.
SuggestedFilename	A fully qualified, globally unique file name that meets the location and naming requirements.
PartitionIndex	Index of the tall array partition being written.
NumPartitions	Total number of partitions in the tall array.
BlockIndexInPartition	Position of current data block within the partition.
IsFinalBlock	true if current block is the final block of the partition.

File Naming

The file name used for the output files determines the order in which datastore later reads the files back in. If the order of the files matters, then the best practice is to use the SuggestedFilename field to name the files, since the suggested name guarantees the file order. If you do not use the suggested file name, then the custom writing function must create globally unique, correctly ordered file names. The file names should follow the naming pattern outlined in RequiredFilePattern. When running in parallel with Parallel Computing Toolbox™, the file names must be unique and correctly ordered between workers, even though each worker writes to its own local folder.

Arrays with Multiple Partitions

You can divide a tall array into partitions to facilitate running calculations on the array in parallel with Parallel Computing Toolbox. Each partition is still composed of smaller blocks that individually fit into memory.

info contains several fields related to partitions: PartitionIndex, NumPartitions, BlockIndexInPartition, and IsFinalBlock. These fields are useful when you are writing out a single file and appending to it, which is a common task for arrays with large partitions that have been split into many blocks. The custom writing function is called once per block, and the blocks in one partition are always written in order by the same worker. However, different partitions can be written by different workers.
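
The append-to-one-file pattern described above might look like the following sketch. The function name appendWriter and the part_NNNNNN.csv naming scheme are hypothetical. The sketch assumes that each block is a table, that write was called with a folder location (so RequiredFilePattern is empty), and a MATLAB release in which writetable supports 'WriteMode'.

```matlab
function appendWriter(info, data)
% Hypothetical writer: one CSV file per partition, blocks appended in order.
% Assumes data is a table and that write was given a plain folder location.

% Zero-padded partition index keeps names unique and ordered across workers.
filename = fullfile(info.RequiredLocation, ...
    sprintf('part_%06d.csv', info.PartitionIndex));

if ~isfile(filename)
    % First block of this partition: create the file with a header row.
    writetable(data, filename);
else
    % Later blocks of the same partition: append rows without a header.
    writetable(data, filename, 'WriteMode', 'append', ...
        'WriteVariableNames', false);
end
end
```

To try the sketch, save it in the current folder and call write(location, tt, 'WriteFcn', @appendWriter) with a folder location. Because blocks in one partition arrive in order on the same worker, checking whether the file already exists is enough to separate the first block from later ones.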

Example Function

A simple writing function that writes spreadsheet files is:

function dataWriter(info, data)
  filename = info.SuggestedFilename;
  writetable(data, filename, 'FileType', 'spreadsheet')
end

To invoke dataWriter as the writing function for some data tt, use these commands.

tt = tall(array2table(rand(5000,3)));
location = '/tmp/MyData/tt_*.xlsx';
write(location, tt, 'WriteFcn', @dataWriter);

For each block, the dataWriter function uses the suggested file name in the info structure and calls writetable to write a spreadsheet file. The suggested file name takes into account the file naming pattern that is specified in the location argument.

Data Types: function_handle

Text or Spreadsheet Files

Indicator for writing variable names as column headings, specified as the comma-separated pair consisting of 'WriteVariableNames' and a numeric or logical 1 (true) or 0 (false).

Indicator	Behavior
true	Variable names are included as the column headings of the output. (default)
false	Variable names are not included in the output.

Locale for writing dates, specified as the comma-separated pair consisting of 'DateLocale' and a character vector or a string scalar. When writing datetime values to the file, use DateLocale to specify the locale in which write should write month and day-of-week names and abbreviations. The character vector or string takes the form xx_YY, where xx is a lowercase ISO 639-1 two-letter code indicating a language, and YY is an uppercase ISO 3166-1 alpha-2 code indicating a country. For a list of common values for the locale, see the Locale name-value pair argument for the datetime function.

For Excel® files, write writes variables containing datetime arrays as Excel dates and ignores the 'DateLocale' parameter value. If the datetime variables contain years prior to either 1900 or 1904, then write writes the variables as text. For more information on Excel dates, see Differences between the 1900 and the 1904 date system in Excel.

Example: 'DateLocale','ja_JP' or 'DateLocale',"ja_JP"

Data Types: char | string

Text Files Only

Field delimiter character, specified as the comma-separated pair consisting of 'Delimiter' and one of these specifiers.

Specifier	Field Delimiter
',' or 'comma'	Comma (default)
' ' or 'space'	Space
'\t' or 'tab'	Tab
';' or 'semi'	Semicolon
'|' or 'bar'	Vertical bar

You can use the 'Delimiter' name-value pair argument only for delimited text files.

Example: 'Delimiter','space' or 'Delimiter',"space"
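
As a sketch of how the text options combine, the following code writes a small table as tab-delimited text without a header row and prints the raw contents of the first output file. The folder name tallDelimDemo and the table contents are assumptions for illustration.

```matlab
% Sketch only: folder name and data are hypothetical stand-ins.
outDir = fullfile(tempdir,'tallDelimDemo');
if isfolder(outDir)
    rmdir(outDir,'s');                       % write needs an empty or new folder
end

T = table([1;2;3],{'a';'b';'c'},'VariableNames',{'Id','Label'});
tX = tall(T);

% Tab-delimited text files with no variable names on the first line.
write(outDir,tX,'FileType','text','Delimiter','tab','WriteVariableNames',false);

files = dir(outDir);
files = files(~[files.isdir]);               % keep only the written files
txt = fileread(fullfile(outDir,files(1).name));
disp(txt)
```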

Indicator for writing quoted text, specified as the comma-separated pair consisting of 'QuoteStrings' and either false or true. If 'QuoteStrings' is set to true, then write encloses the text in double quotation marks and replaces any double-quote characters that appear as part of that text with two double-quote characters. For an example, see Write Quoted Text to CSV File.

You can use the 'QuoteStrings' name-value pair argument only with delimited text files.

Character encoding scheme associated with the file, specified as the comma-separated pair consisting of 'Encoding' and 'system' or a standard character encoding scheme name like one of the values in this table. When you do not specify any encoding or specify encoding as 'system', the write function uses your system default encoding to write the file.

"Big5"	"ISO-8859-1"	"windows-874"
"Big5-HKSCS"	"ISO-8859-2"	"windows-949"
"CP949"	"ISO-8859-3"	"windows-1250"
"EUC-KR"	"ISO-8859-4"	"windows-1251"
"EUC-JP"	"ISO-8859-5"	"windows-1252"
"EUC-TW"	"ISO-8859-6"	"windows-1253"
"GB18030"	"ISO-8859-7"	"windows-1254"
"GB2312"	"ISO-8859-8"	"windows-1255"
"GBK"	"ISO-8859-9"	"windows-1256"
"IBM866"	"ISO-8859-11"	"windows-1257"
"KOI8-R"	"ISO-8859-13"	"windows-1258"
"KOI8-U"	"ISO-8859-15"	"US-ASCII"
	"Macintosh"	"UTF-8"
	"Shift_JIS"

Example: 'Encoding','system' or 'Encoding',"system" uses the system default encoding.

Spreadsheet Files Only

Target worksheet, specified as the comma-separated pair consisting of 'Sheet' and a character vector or a string scalar containing the worksheet name or a positive integer indicating the worksheet index. The worksheet name cannot contain a colon (:). To determine the names of sheets in a spreadsheet file, use sheets = sheetnames(filename). For more information, see sheetnames.

If the sheet does not exist, then write adds a new sheet at the end of the worksheet collection. If the sheet is an index larger than the number of worksheets, then write appends empty sheets until the number of worksheets in the workbook equals the sheet index. In either case, write generates a warning indicating that it has added a new worksheet.

You can use the 'Sheet' name-value pair argument only with spreadsheet files.

Example: 'Sheet',2

Example: 'Sheet','MySheetName'

Parquet Files Only

Parquet compression algorithm, specified as the comma-separated pair consisting of 'VariableCompression' and one of these values.

'snappy', 'brotli', 'gzip', or 'uncompressed'. If you specify one compression algorithm, then write compresses all variables using the same algorithm.
Alternatively, you can specify a cell array of character vectors or a string vector containing the names of the compression algorithms to use for each variable.

In general, 'snappy' has better performance for reading and writing, 'gzip' has a higher compression ratio at the cost of more CPU processing time, and 'brotli' typically produces the smallest file size at the cost of compression speed.

Example: write('C:\myData', tX, 'FileType', 'parquet', 'VariableCompression', 'brotli')

Example: write('C:\myData', tX, 'FileType', 'parquet', 'VariableCompression', {'brotli' 'snappy' 'gzip'})

Encoding scheme names, specified as the comma-separated pair consisting of 'VariableEncoding' and one of these values:

'auto': write uses 'plain' encoding for logical variables and 'dictionary' encoding for all others.
'dictionary' or 'plain': If you specify one encoding scheme, then write encodes all variables with that scheme.
Alternatively, you can specify a cell array of character vectors or a string vector containing the names of the encoding scheme to use for each variable.

In general, 'dictionary' encoding results in smaller file sizes, but 'plain' encoding can be faster for variables that do not contain many repeated values. If the size of the dictionary or number of unique values grows to be too big, then the encoding automatically reverts to plain encoding. For more information on Parquet encodings, see Parquet encoding definitions.

Example: write('myData.parquet', T, 'FileType', 'parquet', 'VariableEncoding', 'plain')

Example: write('myData.parquet', T, 'FileType', 'parquet', 'VariableEncoding', {'plain' 'dictionary' 'plain'})

Parquet version to use, specified as the comma-separated pair consisting of 'Version' and either '1.0' or '2.0'. The default, '2.0', offers the most efficient storage, but you can select '1.0' for the broadest compatibility with external applications that support the Parquet format.

Caution

Parquet version 1.0 has a limitation that it cannot round-trip variables of type uint32 (they are read back into MATLAB as int64).

Limitations

In some cases, write(location, T, 'FileType', type) creates files that do not represent the original array T exactly. If you use datastore(location) to read the files, then the result might not have the same format or contents as the original tall table.

For the 'text' and 'spreadsheet' file types, write uses these rules:
- write outputs numeric variables using longG format and categorical, character, or string variables as unquoted text.
- For nontext variables that have more than one column, write outputs multiple delimiter-separated fields on each line and constructs suitable column headings for the first line of the file.
- write outputs variables with more than two dimensions as two-dimensional variables, with trailing dimensions collapsed.
- For cell-valued variables with contents that are numeric, logical, character, or categorical, write outputs the contents of each cell as a single row, in multiple delimiter-separated fields. If the cells have a different data type, write outputs a single empty field.

Do not use the 'text' or 'spreadsheet' file types if you need to write an exact checkpoint of the tall array.

For the 'parquet' file type, there are some cases where the Parquet format cannot fully represent the MATLAB table or timetable data types. If you use parquetread or datastore to read the files, then the result might not have the same format or contents as the original tall table. For more information, see Apache Parquet Data Type Mappings.
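
One way to see the difference between an exact checkpoint and a text export is to write the same small table in both formats and compare the class of a variable after reading each back. This is only a sketch: the folder names and the table contents are hypothetical, and the class printed for the text round trip depends on how datastore imports text columns.

```matlab
% Sketch only: folder names and data are hypothetical stand-ins.
baseDir = fullfile(tempdir,'tallRoundTrip');
if isfolder(baseDir)
    rmdir(baseDir,'s');
end
mkdir(baseDir);

T = table([1;2;3],categorical({'a';'b';'a'}),'VariableNames',{'Id','Kind'});
tT = tall(T);

write(fullfile(baseDir,'asMat'),tT);                     % default binary format
write(fullfile(baseDir,'asText'),tT,'FileType','text');  % delimited text

fromMat  = gather(tall(datastore(fullfile(baseDir,'asMat'))));
fromText = gather(tall(datastore(fullfile(baseDir,'asText'))));

% Compare how the categorical variable survives each round trip.
fprintf('original: %s, mat: %s, text: %s\n', ...
    class(T.Kind), class(fromMat.Kind), class(fromText.Kind));
```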

Tips

Use the write function to create checkpoints or snapshots of your data as you work, especially when working with huge data sets. This practice allows you to reconstruct tall arrays directly from files on disk rather than re-executing all of the commands that produced the tall array.
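
A checkpointing workflow along these lines could be sketched as follows. The folder name tallCheckpoint, the random input data, and the filtering step are all placeholders for a real pipeline.

```matlab
% Sketch only: folder name, data, and preprocessing are hypothetical stand-ins.
ckptDir = fullfile(tempdir,'tallCheckpoint');

if isfolder(ckptDir)
    % A checkpoint exists: rebuild the tall array from the files on disk.
    tt = tall(datastore(ckptDir));
else
    % No checkpoint yet: run the pipeline once and save the result.
    raw = tall(array2table(rand(1000,3)));   % stand-in for real input data
    tt = raw(raw.Var1 > 0.5,:);              % stand-in for real preprocessing
    write(ckptDir,tt);                       % evaluates tt and writes binary files
    tt = tall(datastore(ckptDir));           % continue from the checkpoint
end
```

Reassigning tt from the datastore after writing means later operations start from the saved files instead of repeating the earlier steps.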

Version History

Introduced in R2016b