dask.dataframe.read_json
dask.dataframe.read_json(url_path, orient='records', lines=None, storage_options=None, blocksize=None, sample=1048576, encoding='utf-8', errors='strict', compression='infer', meta=None, engine=<function read_json>, include_path_column=False, path_converter=None, **kwargs)
Create a dataframe from a set of JSON files
This utilises pandas.read_json(), and most parameters are passed through - see its docstring.
Differences: orient is ‘records’ by default, with lines=True; this is appropriate for line-delimited “JSON-lines” data, the kind of JSON output that is most common in big-data scenarios, and which can be chunked when reading (see read_json()). All other options require blocksize=None, i.e., one partition per input file.
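For illustration, a minimal sketch of the two reading modes (the file names and the orient value are assumptions, not requirements of this function):
import dask.dataframe as dd
# defaults orient='records', lines=True: line-delimited files can be chunked into ~64 MiB blocks
df = dd.read_json('records.*.jsonl', blocksize=2**26)
# any other orient is not line-delimited, so blocksize must stay None (one partition per file)
df = dd.read_json('tables.*.json', orient='table', blocksize=None)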
Parameters:
url_path: str, list of str
Location to read from. If a string, can include a glob character to find a set of file names. Supports protocol specifications such as "s3://".
encoding, errors:
The text encoding to apply, e.g., "utf-8", and how to respond to errors in the conversion (see str.encode() and bytes.decode()).
orient, lines, kwargs:
Passed to pandas; if not specified, lines=True when orient='records', False otherwise.
storage_options: dict
Passed to the backend file-system implementation.
blocksize: None or int
If None, files are not blocked, and you get one partition per input file. If int, which can only be used for line-delimited JSON files, each partition will be approximately this size in bytes, to the nearest newline character.
sample: int
Number of bytes to pre-load, to provide an empty dataframe structure to any blocks without data. Only relevant when using blocksize.
compression: string or None
String like 'gzip' or 'xz'.
engine: callable or str, default pd.read_json
The underlying function that dask will use to read JSON files. By default, this will be the pandas JSON reader (pd.read_json). If a string is specified, this value will be passed under the engine keyword argument to pd.read_json (only supported for pandas>=2.0).
include_path_column: bool or str, optional
Include a column with the file path where each row in the dataframe originated. If True, a new column is added to the dataframe called path. If str, sets new column name. Default is False.
path_converter: function or None, optional
A function that takes one argument and returns a string. Used to convert paths in the path column, for instance, to strip a common prefix from all the paths.
meta: pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
Returns:
dask.DataFrame
Examples
Load single file
dd.read_json('myfile.1.json')
Load multiple files
dd.read_json('myfile.*.json')
dd.read_json(['myfile.1.json', 'myfile.2.json'])
Load large line-delimited JSON files using partitions of approximately 256 MB in size
dd.read_json('data/file*.json', blocksize=2**28)
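Provide meta to skip dtype inference (a sketch; the column names and dtypes are assumptions about the data, not defaults of this function)
dd.read_json('myfile.*.json', meta={'id': 'int64', 'name': 'object'})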
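Record the originating file for each row, here stripping a hypothetical 'data/' prefix and naming the new column 'source_file'
dd.read_json('data/*.json', include_path_column='source_file',
             path_converter=lambda path: path.replace('data/', ''))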
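Choose the underlying reader with engine (a sketch; the string form assumes pandas>=2.0 with pyarrow installed, and the pyarrow engine only reads line-delimited JSON)
import pandas as pd
dd.read_json('myfile.*.json', engine=pd.read_json)  # explicit callable; same as the default
dd.read_json('myfile.*.json', engine='pyarrow')  # passed through to pandas' engine keyword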