write - Write distributed data to an output location - MATLAB

Write distributed data to an output location

Syntax

write(location,D)
write(filepattern,D)
write(___,Name,Value)

Description

write(location,D) writes the values in the distributed array D to files in the folder location. The data is stored in an efficient binary format suitable for reading back using datastore(location). If D is not distributed along the first dimension, MATLAB® redistributes the data before writing, so that the resulting files can be reread using datastore.

write(filepattern,D) uses the file extension from filepattern to determine the output format. filepattern must include a folder to write the files into, followed by a file name that includes a wildcard *. The wildcard represents incremental numbers for generating unique file names, for example write('folder/myfile_*.csv',D).

write(___,Name,Value) specifies additional options with one or more name-value pair arguments using any of the previous syntaxes. For example, you can specify the file type with 'FileType' and a valid file type ('mat', 'seq', 'parquet', 'text', or 'spreadsheet'), or you can specify a custom write function to process the data with 'WriteFcn' and a function handle.

Examples

Write Distributed Arrays

This example shows how to write a distributed array to a file system, then read it back using a datastore.

Create a distributed array and write it to an output folder.

d = distributed.rand(5000,1);
location = 'hdfs://myHadoopCluster/some/output/folder';
write(location, d);

Recreate the distributed array from the written files.

ds = datastore(location);
d1 = distributed(ds);

Write Distributed Arrays Using File Patterns

This example shows how to write distributed arrays to different formats using a file pattern.

Create a distributed table and write it to a simple text-based format that many applications can read.

dt = distributed(array2table(rand(5000,3)));
location = "/tmp/CSVData/dt_*.csv";
write(location, dt);

Recreate the distributed table from the written files.

ds = datastore(location);
dt1 = distributed(ds);

Write and Read Back Tall and Distributed Data

You can write distributed data and read it back as tall data and vice versa.

Create a distributed table and write it to disk.

dt = distributed(array2table(rand(5000,3)));
location = "/tmp/CSVData/dt_*.csv";
write(location, dt);

Build a tall table from the written files.

ds = datastore(location);
tt = tall(ds);

Alternatively, you can read data written from tall data into distributed data. Create a tall table and write it to disk.

tt = tall(array2table(rand(5000,3)));
location = "/tmp/CSVData/dt_*.csv";
write(location, tt);

Read the data back into a distributed table.

ds = datastore(location);
dt = distributed(ds);

Write Distributed Arrays Using a Write Function

This example shows how to write distributed arrays to a file system using a custom write function.

Create a simple write function that writes out spreadsheet files.

function dataWriter(info, data)
    filename = info.SuggestedFilename;
    writetable(data, filename, "FileType", "spreadsheet");
end

Create a distributed table and write it to disk using the custom write function.

dt = distributed(array2table(rand(5000,3)));
location = "/tmp/MyData/tt_*.xlsx";
write(location, dt, "WriteFcn", @dataWriter);

Input Arguments

`location` — Folder location to write data

character vector | string

Folder location to write data, specified as a character vector or string. location can specify a full or relative path. The specified folder can be either of these options:

Existing empty folder that contains no other files
New folder that write creates

You can write data to local folders on your computer, folders on a shared network, or to remote locations, such as Amazon S3™, Windows Azure® Storage Blob, or a Hadoop® Distributed File System (HDFS™). For more information about reading and writing data to remote locations, see Work with Remote Data.

Example: location = '../../dir/data' specifies a relative file path.

Example: location = 'C:\Users\MyName\Desktop\data' specifies an absolute path to a Windows® desktop folder.

Example: location = 'file:///path/to/data' specifies an absolute URI path to a folder.

Example: location = 'hdfs://myHadoopCluster/some/output/folder' specifies an HDFS URL.

Example: location = 's3://bucketname/some/output/folder' specifies an Amazon S3 location.

Data Types: char | string

`D` — Input array

distributed array

Input array, specified as a distributed array.

`filepattern` — File naming pattern

string | character vector

File naming pattern, specified as a string or a character vector. The file naming pattern must contain a folder to write the files into, followed by a file name that includes a wildcard *. write replaces the wildcard with sequential numbers to ensure unique file names.

Example: write('folder/data_*.txt',D) writes the distributed array D as a series of .txt files in folder with the file names data_1.txt, data_2.txt, and so on.

Data Types: char | string
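
The wildcard substitution described above can be pictured with a short sketch. This is only an illustration of the naming scheme in the example, not the code that write runs; the file count numFiles is a hypothetical value chosen for the demonstration.

```matlab
% Illustration only: mimic how a wildcard pattern expands into numbered
% file names. This is not the internal implementation of write.
filepattern = 'folder/data_*.txt';   % same pattern as in the example above
numFiles = 3;                        % hypothetical number of output files

fileNames = cell(1, numFiles);
for k = 1:numFiles
    % Replace the single wildcard with the sequential file number.
    fileNames{k} = strrep(filepattern, '*', num2str(k));
end

% Show the generated names, one per row.
disp(fileNames')
```

The same pattern can later be passed to datastore, which accepts the wildcard and picks up all matching files.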

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: write('C:\myData', D, 'FileType', 'text', 'WriteVariableNames', false) writes the distributed array D to C:\myData as a collection of text files that do not use variable names as column headings.

General Options

`FileType` — Type of file

Type of file, specified as the comma-separated pair consisting of 'FileType' and one of the allowed file types: 'auto', 'mat', 'parquet', 'seq', 'text', or 'spreadsheet'.

Use the 'FileType' name-value pair with the location argument to specify what type of files to write. By default, write attempts to automatically detect the proper file type. You do not need to specify the 'FileType' name-value pair argument if write can determine the file type from an extension in the location or filepattern arguments. write can determine the file type from these extensions:

.mat for MATLAB data files
.parquet or .parq for Parquet files
.seq for sequence files
.txt, .dat, or .csv for delimited text files
.xls, .xlsx, .xlsb, .xlsm, .xltx, or .xltm for spreadsheet files

Example: write('C:\myData', D, 'FileType', 'text')
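
As a rough sketch of the extension-to-file-type mapping listed above, the following snippet derives a 'FileType' value from a file pattern. The switch statement is a hypothetical stand-in written for illustration; it is not how write performs its own detection, and the 'auto' fallback is an assumption for extensions that are not in the list.

```matlab
% Hypothetical helper logic: map a file extension to a 'FileType' value
% using the extension list documented above.
filepattern = '/tmp/CSVData/dt_*.csv';
[~, ~, ext] = fileparts(filepattern);   % ext is '.csv' for this pattern

switch lower(ext)
    case '.mat'
        fileType = 'mat';
    case {'.parquet', '.parq'}
        fileType = 'parquet';
    case '.seq'
        fileType = 'seq';
    case {'.txt', '.dat', '.csv'}
        fileType = 'text';
    case {'.xls', '.xlsx', '.xlsb', '.xlsm', '.xltx', '.xltm'}
        fileType = 'spreadsheet';
    otherwise
        % Assumption: no recognized extension, so leave detection to write.
        fileType = 'auto';
end

disp(fileType)
```

Passing the result explicitly, as in write(filepattern, D, 'FileType', fileType), is only needed when the extension alone does not identify the format.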

`WriteFcn` — Custom writing function

function handle

Custom writing function, specified as the comma-separated pair consisting of 'WriteFcn' and a function handle. The specified function receives blocks of data from D and is responsible for creating the output files. You can use the 'WriteFcn' name-value pair argument to write data in a variety of formats, even if the output format is not directly supported by write.

Functional Signature

The custom writing function must accept two input arguments, info and data:

function myWriter(info, data)

data contains a block of data from D.

info is a structure with fields that contain information about the block of data. You can use the fields to build a new file name that is globally unique within the final location. The structure fields are:

Field	Description
RequiredLocation	Fully qualified path to a temporary output folder. All output files must be written to this folder.
RequiredFilePattern	The file pattern required for output file names. This field is empty if only a folder name is specified.
SuggestedFilename	A fully qualified, globally unique file name that meets the location and naming requirements.
PartitionIndex	Index of the distributed array partition being written.
NumPartitions	Total number of partitions in the distributed array.
BlockIndexInPartition	Position of current data block within the partition.
IsFinalBlock	true if current block is the final block of the partition.

File Naming

The file name used for the output files determines the order in which datastore later reads the files back. If the order of the files matters, then the best practice is to use the SuggestedFilename field to name the files, because the suggested name guarantees the file order. If you do not use the suggested file name, the custom writing function must create globally unique, correctly ordered file names. The file names should follow the naming pattern outlined in RequiredFilePattern. The file names must be unique and correctly ordered between workers, even though each worker writes to its own local folder.

Arrays with Multiple Partitions

A distributed array is divided into partitions to facilitate running calculations on the array in parallel with Parallel Computing Toolbox™. When writing a distributed array, each of the partitions is divided into smaller blocks.

info contains several fields related to partitions: PartitionIndex, NumPartitions, BlockIndexInPartition, and IsFinalBlock. These fields are useful when you are writing out a single file and appending to it, which is a common task for arrays with large partitions that have been split into many blocks. The custom writing function is called once per block, and the blocks in one partition are always written in order on one worker. However, different partitions can be written by different workers.
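
The append-per-partition pattern mentioned above might look like the following sketch. The function name partitionWriter and the file name scheme part_NNNNN.csv are hypothetical, and the sketch assumes that each block arrives as a table of numeric variables and that write was called with a folder location (so RequiredFilePattern is empty). If a file pattern is in use, the names would need to follow RequiredFilePattern instead.

```matlab
function partitionWriter(info, data)
% Hypothetical custom writer: appends every block of a partition to one
% CSV file per partition. Assumes data is a table of numeric variables.

values = table2array(data);
if isempty(values)
    return   % nothing to write for an empty block
end

% One file per partition, placed in the required temporary output folder.
% The name part_NNNNN.csv is an assumption made for this sketch; the zero
% padding keeps the files in partition order when sorted by name.
fileName = fullfile(info.RequiredLocation, ...
    sprintf('part_%05d.csv', info.PartitionIndex));

% Blocks of one partition arrive in order on one worker, so appending
% preserves the row order within the partition.
fid = fopen(fileName, 'a');
if fid == -1
    error('partitionWriter:openFailed', 'Cannot open %s.', fileName);
end
closeFile = onCleanup(@() fclose(fid));

% Build one comma-separated row format, then print all rows at once.
rowFormat = [repmat('%.17g,', 1, size(values, 2) - 1), '%.17g\n'];
fprintf(fid, rowFormat, values');   % transpose: fprintf reads column-wise
end
```

Under these assumptions, the function could be passed as write(location, D, 'WriteFcn', @partitionWriter). Because every partition has its own file, workers never append to the same file concurrently.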

Example Function

A simple writing function that writes out spreadsheet files is:

function dataWriter(info, data)
    filename = info.SuggestedFilename;
    writetable(data, filename, 'FileType', 'spreadsheet')
end

To invoke dataWriter as the writing function for some data D, use the commands:

D = distributed(array2table(rand(5000,3)));
location = '/tmp/MyData/D_*.xlsx';
write(location, D, 'WriteFcn', @dataWriter);

For each block, the dataWriter function uses the suggested file name in the info structure and calls writetable to write out a spreadsheet file. The suggested file name takes into account the file naming pattern that is specified in the location argument.

Data Types: function_handle

Text or Spreadsheet Files

`WriteVariableNames` — Indicator for writing variable names as column headings

true or 1 (default) | false or 0

Indicator for writing variable names as column headings, specified as the comma-separated pair consisting of 'WriteVariableNames' and a numeric or logical 1 (true) or 0 (false).

Indicator	Behavior
true	Variable names are included as the column headings of the output. This is the default behavior.
false	Variable names are not included in the output.

`DateLocale` — Locale for writing dates

character vector | string scalar

Locale for writing dates, specified as the comma-separated pair consisting of 'DateLocale' and a character vector or a string scalar. When writing datetime values to the file, use DateLocale to specify the locale in which write should write month and day-of-week names and abbreviations. The character vector or string takes the form xx_YY, where xx is a lowercase ISO 639-1 two-letter code indicating a language, and YY is an uppercase ISO 3166-1 alpha-2 code indicating a country. For a list of common values for the locale, see the Locale name-value pair argument for the datetime function.

For Excel® files, write writes variables containing datetime arrays as Excel dates and ignores the 'DateLocale' parameter value. If the datetime variables contain years prior to either 1900 or 1904, then write writes the variables as text. For more information on Excel dates, see Differences between the 1900 and the 1904 date system in Excel.

Example: 'DateLocale','ja_JP' or 'DateLocale',"ja_JP"

Data Types: char | string
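
To get a feel for what a locale changes, the sketch below formats the same datetime value in two locales using char. It only illustrates how month and day-of-week names depend on the locale; the display format is an arbitrary choice for this example and is not necessarily the format that write uses in its output files.

```matlab
% Illustration of how a locale changes month and day-of-week names.
t = datetime(2021, 3, 5);
fmt = 'eeee, d MMMM yyyy';          % arbitrary format with names spelled out

enText = char(t, fmt, 'en_US');     % names in English
deText = char(t, fmt, 'de_DE');     % names in German

disp(enText)
disp(deText)
```

A locale passed as 'DateLocale' to write would affect the names in text output in the same spirit.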

Text Files Only

`Delimiter` — Field delimiter character

',' or 'comma' | ' ' or 'space' | ...

Field delimiter character, specified as the comma-separated pair consisting of 'Delimiter' and one of these specifiers:

Specifier	Field Delimiter
',' or 'comma'	Comma. This is the default behavior.
' ' or 'space'	Space
'\t' or 'tab'	Tab
';' or 'semi'	Semicolon
'|' or 'bar'	Vertical bar

You can use the 'Delimiter' name-value pair argument only for delimited text files.

Example: 'Delimiter','space' or 'Delimiter',"space"

`QuoteStrings` — Indicator for writing quoted text

false (default) | true

Indicator for writing quoted text, specified as the comma-separated pair consisting of 'QuoteStrings' and either false or true. If 'QuoteStrings' is true, then write encloses the text in double quotation marks, and replaces any double-quote characters that appear as part of that text with two double-quote characters. For an example, see Write Quoted Text to CSV File.

You can use the 'QuoteStrings' name-value pair argument only with delimited text files.
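
The quoting rule described above can be reproduced by hand in a couple of lines. This is a minimal sketch of the rule itself, using a made-up sample text, and not the code path that write uses.

```matlab
% Sketch of the quoting rule: wrap the text in double quotation marks and
% double any double-quote characters that appear inside it.
txt = 'He said "hello", then left';            % sample text (made up)
quoted = ['"', strrep(txt, '"', '""'), '"'];

disp(quoted)
```

Quoting in this way lets a reader of the file treat the comma inside the text as part of the field rather than as a delimiter.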

`Encoding` — Character encoding scheme

'UTF-8' | 'ISO-8859-1' | 'windows-1251' | 'windows-1252' | ...

Character encoding scheme associated with the file, specified as the comma-separated pair consisting of 'Encoding' and 'system' or a standard character encoding scheme name like one of the values in this table. When you do not specify any encoding or specify encoding as 'system', the write function uses your system default encoding to write the file.

"Big5"	"ISO-8859-1"	"windows-874"
"Big5-HKSCS"	"ISO-8859-2"	"windows-949"
"CP949"	"ISO-8859-3"	"windows-1250"
"EUC-KR"	"ISO-8859-4"	"windows-1251"
"EUC-JP"	"ISO-8859-5"	"windows-1252"
"EUC-TW"	"ISO-8859-6"	"windows-1253"
"GB18030"	"ISO-8859-7"	"windows-1254"
"GB2312"	"ISO-8859-8"	"windows-1255"
"GBK"	"ISO-8859-9"	"windows-1256"
"IBM866"	"ISO-8859-11"	"windows-1257"
"KOI8-R"	"ISO-8859-13"	"windows-1258"
"KOI8-U"	"ISO-8859-15"	"US-ASCII"
	"Macintosh"	"UTF-8"
	"Shift_JIS"

Example: 'Encoding','system' or 'Encoding',"system" uses the system default encoding.

Spreadsheet Files Only

`Sheet` — Target worksheet

character vector | string scalar | positive integer

Target worksheet, specified as the comma-separated pair consisting of 'Sheet' and a character vector or a string scalar containing the worksheet name or a positive integer indicating the worksheet index. The worksheet name cannot contain a colon (:). To determine the names of sheets in a spreadsheet file, use [status,sheets] = xlsfinfo(filename).

If the sheet does not exist, then write adds a new sheet at the end of the worksheet collection. If the sheet is an index larger than the number of worksheets, then write appends empty sheets until the number of worksheets in the workbook equals the sheet index. In either case, write generates a warning indicating that it has added a new worksheet.

You can use the 'Sheet' name-value pair argument only with spreadsheet files.

Example: 'Sheet',2

Example: 'Sheet','MySheetName'

Parquet Files Only

`VariableCompression` — Parquet compression algorithm

Parquet compression algorithm, specified as one of these values.

'snappy', 'brotli', 'gzip', or 'uncompressed'. If you specify one compression algorithm, then write compresses all variables using the same algorithm.
Alternatively, you can specify a cell array of character vectors or a string vector containing the names of the compression algorithms to use for each variable.

In general, 'snappy' has better performance for reading and writing, 'gzip' has a higher compression ratio at the cost of more CPU processing time, and 'brotli' typically produces the smallest file size at the cost of compression speed.

Example: write('C:\myData',D,'FileType','parquet','VariableCompression','brotli')

Example: write('C:\myData', D, 'FileType', 'parquet', 'VariableCompression', {'brotli' 'snappy' 'gzip'})

`VariableEncoding` — Encoding scheme names

'auto' (default) | 'dictionary' | 'plain' | cell array of character vectors | string vector

Encoding scheme names, specified as one of these values:

'auto': write uses 'plain' encoding for logical variables, and 'dictionary' encoding for all others.
'dictionary', 'plain': If you specify one encoding scheme, then write encodes all variables with that scheme.
Alternatively, you can specify a cell array of character vectors or a string vector containing the names of the encoding scheme to use for each variable.

In general, 'dictionary' encoding results in smaller file sizes, but 'plain' encoding can be faster for variables that do not contain many repeated values. If the size of the dictionary or number of unique values grows to be too big, then the encoding automatically reverts to plain encoding. For more information on Parquet encodings, see Parquet encoding definitions.

Example: write('myData.parquet', D, 'FileType', 'parquet', 'VariableEncoding', 'plain')

Example: write('myData.parquet', D, 'FileType', 'parquet', 'VariableEncoding', {'plain' 'dictionary' 'plain'})

`Version` — Parquet version to use

'2.0' (default) | '1.0'

Parquet version to use, specified as either '1.0' or '2.0'. By default, '2.0' offers the most efficient storage, but you can select '1.0' for the broadest compatibility with external applications that support the Parquet format.

Limitations

In some cases, write(location, D, 'FileType', type) creates files that do not represent the original array D exactly. If you use datastore(location) to read the checkpoint files, then the result might not have the same format or contents as the original distributed table.

For the 'text' and 'spreadsheet' file types, write uses these rules:

write outputs numeric variables using longG format, and categorical, character, or string variables as unquoted text.
For non-text variables that have more than one column, write outputs multiple delimiter-separated fields on each line, and constructs suitable column headings for the first line of the file.
write outputs variables with more than two dimensions as two-dimensional variables, with trailing dimensions collapsed.
For cell-valued variables, write outputs the contents of each cell as a single row, in multiple delimiter-separated fields, when the contents are numeric, logical, character, or categorical, and outputs a single empty field otherwise.

Do not use the 'text' or 'spreadsheet' file types if you need to write an exact checkpoint of the distributed array.

Tips

Use the write function to create checkpoints or snapshots of your data as you work. This practice allows you to reconstruct distributed arrays directly from files on disk rather than re-executing all of the commands that produced the distributed array.
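
A checkpoint round trip might be sketched as follows. The snippet assumes Parallel Computing Toolbox is available and uses a temporary folder name from tempname as a hypothetical checkpoint location; it relies on the default binary format, since the text and spreadsheet formats are not exact.

```matlab
% Sketch of a checkpoint round trip (requires Parallel Computing Toolbox).
d = distributed.rand(5000, 1);

% Hypothetical checkpoint location: a new folder that write creates.
checkpointFolder = tempname;
write(checkpointFolder, d);          % default binary format

% Later, rebuild the distributed array from the files on disk instead of
% recomputing it.
ds = datastore(checkpointFolder);
d1 = distributed(ds);

% Basic sanity check that the restored array has the original size.
sameSize = isequal(size(d1), size(d));
disp(sameSize)
```

For a stricter check, the gathered values of d and d1 could also be compared on the client, at the cost of bringing both arrays into local memory.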

Version History

Introduced in R2017a
