dask.dataframe.read_hdf — Dask documentation
dask.dataframe.read_hdf
dask.dataframe.read_hdf(pattern, key, start=0, stop=None, columns=None, chunksize=1000000, sorted_index=False, lock=True, mode='r')
Read HDF files into a Dask DataFrame
Read HDF files into a Dask DataFrame. This function is like pandas.read_hdf, except it can read from a single large file, from multiple files, or from multiple keys of the same file.
Parameters:
pattern : string, pathlib.Path, list
File pattern (string), pathlib.Path, buffer to read from, or list of file paths. Can contain wildcards.
key : string
Group identifier in the store. Can contain wildcards.
start : integer, optional
Row number to start at (default is 0).
stop : integer, optional
Row number to stop at (default is None, i.e. the last row).
columns : list of columns, optional
A list of columns that, if not None, limits the columns returned (default is None).
chunksize : positive integer, optional
Maximal number of rows per partition (default is 1000000).
sorted_index : boolean, optional
Whether the input HDF files have a sorted index (default is False).
lock : boolean, optional
Whether to use a lock to prevent concurrency issues (default is True).
mode : {‘a’, ‘r’, ‘r+’}, default ‘r’
Mode to use when opening file(s).
‘r’
Read-only; no data can be modified.
‘a’
Append; an existing file is opened for reading and writing, and if the file does not exist it is created.
‘r+’
Similar to ‘a’, but the file must already exist.
Returns:
dask.DataFrame
Examples
Load single file
>>> dd.read_hdf('myfile.1.hdf5', '/x')
Load multiple files
>>> dd.read_hdf('myfile.*.hdf5', '/x')
>>> dd.read_hdf(['myfile.1.hdf5', 'myfile.2.hdf5'], '/x')
Load multiple datasets
>>> dd.read_hdf('myfile.1.hdf5', '/*')