ak.to_parquet — Awkward Array 2.8.2 documentation
Defined in awkward.operations.ak_to_parquet on line 22.
ak.to_parquet(array, destination, *, list_to32=False, string_to32=True, bytestring_to32=True, emptyarray_to=None, categorical_as_dictionary=False, extensionarray=True, count_nulls=True, compression='zstd', compression_level=None, row_group_size=64 * 1024 * 1024, data_page_size=None, parquet_flavor=None, parquet_version='2.4', parquet_page_version='1.0', parquet_metadata_statistics=True, parquet_dictionary_encoding=False, parquet_byte_stream_split=False, parquet_coerce_timestamps=None, parquet_old_int96_timestamps=None, parquet_compliant_nested=False, parquet_extra_options=None, storage_options=None)#
Parameters:
- array – Array-like data (anything ak.to_layout recognizes).
- destination (path-like) – Name of the output file, file path, or remote URL passed to fsspec.core.url_to_fs for remote writing.
- list_to32 (bool) – If True, convert Awkward lists into 32-bit Arrow lists if they're small enough, even if it means an extra conversion. Otherwise, signed 32-bit ak.types.ListType maps to Arrow ListType, signed 64-bit ak.types.ListType maps to Arrow LargeListType, and unsigned 32-bit ak.types.ListType picks whichever Arrow type its values fit into.
- string_to32 (bool) – Same as the above for Arrow string and large_string.
- bytestring_to32 (bool) – Same as the above for Arrow binary and large_binary.
- emptyarray_to (None or dtype) – If None, ak.types.UnknownType maps to Arrow's null type; otherwise, it is converted to the given numeric dtype.
- categorical_as_dictionary (bool) – If True, ak.contents.IndexedArray and ak.contents.IndexedOptionArray labeled with __array__ = "categorical" are mapped to Arrow DictionaryArray; otherwise, the projection is evaluated before conversion (always the case without __array__ = "categorical").
- extensionarray (bool) – If True, this function returns extended Arrow arrays (at all levels of nesting), which preserve metadata so that Awkward → Arrow → Awkward preserves the array's ak.types.Type (though not the ak.forms.Form). If False, this function returns generic Arrow arrays that might be needed for third-party tools that don't recognize Arrow's extensions. Even with extensionarray=False, the values produced by Arrow's to_pylist method are the same as the values produced by Awkward's ak.to_list.
- count_nulls (bool) – If True, count the number of missing values at each level and include these in the resulting Arrow array, which makes some downstream applications faster. If False, skip the up-front cost of counting them.
- compression (None, str, or dict) – Compression algorithm name, passed to pyarrow.parquet.ParquetWriter. Parquet supports {"NONE", "SNAPPY", "GZIP", "BROTLI", "LZ4", "ZSTD"} (where "GZIP" is also known as "zlib" or "deflate"). If a dict, the keys are column names (the same column names that ak.forms.Form.columns returns and ak.forms.Form.select_columns accepts) and the values are compression algorithm names, to compress each column differently.
- compression_level (None, int, or dict) – Compression level, passed to pyarrow.parquet.ParquetWriter. Compression levels have different meanings for different compression algorithms: GZIP ranges from 1 to 9, but ZSTD ranges from -7 to 22, for example. Generally, higher numbers provide slower but smaller compression.
- row_group_size (int or None) – Number of entries in each row group (except the last), passed to pyarrow.parquet.ParquetWriter.write_table. If None, the Parquet default of 64 MiB is used.
- data_page_size (None or int) – Number of bytes in each data page, passed to pyarrow.parquet.ParquetWriter. If None, the Parquet default of 1 MiB is used.
- parquet_flavor (None or "spark") – If None, the output Parquet file will follow Arrow conventions; if "spark", it will follow Spark conventions. Some systems, such as Spark and Google BigQuery, might need Spark conventions, while others might need Arrow conventions. Passed to pyarrow.parquet.ParquetWriter as flavor.
- parquet_version ("1.0", "2.4", or "2.6") – Parquet file format version. Passed to pyarrow.parquet.ParquetWriter as version.
- parquet_page_version ("1.0" or "2.0") – Parquet page format version. Passed to pyarrow.parquet.ParquetWriter as data_page_version.
- parquet_metadata_statistics (bool or dict) – If True, include summary statistics for each data page in the Parquet metadata, which lets some applications search for data more quickly (by skipping pages). If a dict mapping column names to bool, include summary statistics on only the specified columns. Passed to pyarrow.parquet.ParquetWriter as write_statistics.
- parquet_dictionary_encoding (bool or dict) – If True, allow Parquet to pre-compress with dictionary encoding. If a dict mapping column names to bool, only use dictionary encoding on the specified columns. Passed to pyarrow.parquet.ParquetWriter as use_dictionary.
- parquet_byte_stream_split (bool or dict) – If True, pre-compress floating point fields (float32 or float64) with byte stream splitting, which collects all mantissas in one part of the stream and all exponents in another. Passed to pyarrow.parquet.ParquetWriter as use_byte_stream_split.
- parquet_coerce_timestamps (None, "ms", or "us") – If None, any timestamps (datetime64 data) are coerced to a given resolution depending on parquet_version: versions "1.0" and "2.4" are coerced to microseconds, but later versions use the datetime64's own units. If "ms" is explicitly specified, timestamps are coerced to milliseconds; if "us", microseconds. Passed to pyarrow.parquet.ParquetWriter as coerce_timestamps.
- parquet_old_int96_timestamps (None or bool) – If True, use Parquet's INT96 format for any timestamps (datetime64 data), taking priority over parquet_coerce_timestamps. If None, let the parquet_flavor decide. Passed to pyarrow.parquet.ParquetWriter as use_deprecated_int96_timestamps.
- parquet_compliant_nested (bool) – If True, use the Spark/BigQuery/Parquet convention for nested lists, in which each list is a one-field record with field name "element"; otherwise, use the Arrow convention, in which the field name is "item". Passed to pyarrow.parquet.ParquetWriter as use_compliant_nested_type.
- parquet_extra_options (None or dict) – Any additional options to pass to pyarrow.parquet.ParquetWriter.
- storage_options (None or dict) – Any additional options to pass to fsspec.core.url_to_fs to open a remote file for writing (see the sketch at the end of this page).
Returns: pyarrow._parquet.FileMetaData instance
Writes an Awkward Array to a Parquet file (through pyarrow).
>>> array1 = ak.Array([[1, 2, 3], [], [4, 5], [], [], [6, 7, 8, 9]])
>>> ak.to_parquet(array1, "array1.parquet")
<pyarrow._parquet.FileMetaData object at 0x7f646c38ff40>
  created_by: parquet-cpp-arrow version 9.0.0
  num_columns: 1
  num_rows: 6
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 0
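As a minimal sketch of the per-column compression options described above (the file name, the field names "x" and "y", and the chosen algorithms are illustrative, not prescribed by this function):

import awkward as ak

records = ak.Array([{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}])

# Compress column "x" with ZSTD and column "y" with SNAPPY, and start a new
# row group every 1024 entries instead of using the default size.
ak.to_parquet(
    records,
    "records.parquet",
    compression={"x": "ZSTD", "y": "SNAPPY"},
    row_group_size=1024,
)

The dict keys here are the same column names that ak.forms.Form.columns returns for this record array.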
If extensionarray is True, a custom Arrow extension type is used to store this array, preserving Awkward metadata at all levels of nesting. Otherwise, generic Arrow arrays are used, and if the array does not contain records at top-level, the Arrow table will consist of one field whose name is "". See ak.to_arrow_table for more details.
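A minimal sketch of this round-trip behavior (file names are illustrative, and the type in the comment is the expected result, not captured output):

import awkward as ak

array = ak.Array([[1, 2, 3], [], [4, 5]])

# With the default extensionarray=True, metadata is stored so that the
# Awkward type survives Awkward -> Parquet -> Awkward.
ak.to_parquet(array, "with_extension.parquet")
print(ak.from_parquet("with_extension.parquet").type)  # expected: 3 * var * int64

# With extensionarray=False, generic Arrow arrays are written; since this
# array has no records at top-level, the table's single field is named "".
ak.to_parquet(array, "without_extension.parquet", extensionarray=False)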
Parquet files can maintain the distinction between “option-type but no elements are missing” and “not option-type” at all levels, including the top level. However, there is no distinction between ?union[X, Y, Z] type and union[?X, ?Y, ?Z] type. Be aware of these type distinctions when passing data through Arrow or Parquet.
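For example, a sketch of the first distinction (the types in the comments are what one would expect, not verified output):

import awkward as ak

# An option-type array in which no element is actually missing: masking with
# an all-True array makes the type ?int64 without removing any values.
optional = ak.Array([1, 2, 3]).mask[[True, True, True]]
print(optional.type)  # 3 * ?int64

ak.to_parquet(optional, "optional.parquet")
print(ak.from_parquet("optional.parquet").type)  # expected to stay option-type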
See also ak.to_arrow, which is used as an intermediate step.
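Finally, a hedged sketch of remote writing through fsspec, as referenced by the destination and storage_options parameters (the s3:// URL is hypothetical, writing to S3 requires an fsspec backend such as s3fs, and the storage_options keys are backend-specific):

import awkward as ak

array = ak.Array([[1.1, 2.2], [], [3.3]])

# The URL is resolved by fsspec.core.url_to_fs; storage_options is forwarded
# to it so the backend (here, hypothetically s3fs) can be configured.
ak.to_parquet(
    array,
    "s3://my-bucket/data/array.parquet",
    storage_options={"anon": False},
)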