dask.dataframe.read_json
dask.dataframe.read_json(url_path, orient='records', lines=None, storage_options=None, blocksize=None, sample=1048576, encoding='utf-8', errors='strict', compression='infer', meta=None, engine=<function read_json>, include_path_column=False, path_converter=None, **kwargs)
Create a dataframe from a set of JSON files
This utilises pandas.read_json(), and most parameters are passed through - see its docstring.
Differences: orient is ‘records’ by default, with lines=True; this is appropriate for line-delimited “JSON-lines” data, the kind of JSON output that is most common in big-data scenarios, and which can be chunked when reading (see read_json()). All other options require blocksize=None, i.e., one partition per input file.
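For illustration, a minimal sketch of the two reading modes (the file names and the orient value are assumptions, not requirements of this function):
import dask.dataframe as dd
# defaults orient='records', lines=True: line-delimited files can be chunked into ~64 MiB blocks
df = dd.read_json('records.*.jsonl', blocksize=2**26)
# any other orient is not line-delimited, so blocksize must stay None (one partition per file)
df = dd.read_json('tables.*.json', orient='table', blocksize=None)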
Parameters:
url_path: str, list of str
Location to read from. If a string, can include a glob character to find a set of file names. Supports protocol specifications such as "s3://".
encoding, errors:
The text encoding to apply, e.g., "utf-8", and how to respond to errors in the conversion (see str.encode() and bytes.decode()).
orient, lines, kwargs:
Passed to pandas; if not specified, lines=True when orient='records', False otherwise.
storage_options: dict
Passed to the backend file-system implementation.
blocksize: None or int
If None, files are not blocked, and you get one partition per input file. If int, which can only be used for line-delimited JSON files, each partition will be approximately this size in bytes, to the nearest newline character.
sample: int
Number of bytes to pre-load, to provide an empty dataframe structure to any blocks without data. Only relevant when using blocksize.
compression: string or None
String like 'gzip' or 'xz'.
engine: callable or str, default pd.read_json
The underlying function that dask will use to read JSON files. By default, this will be the pandas JSON reader (pd.read_json). If a string is specified, this value will be passed under the engine keyword argument to pd.read_json (only supported for pandas>=2.0).
include_path_column: bool or str, optional
Include a column with the file path where each row in the dataframe originated. If True, a new column is added to the dataframe called path. If str, sets new column name. Default is False.
path_converter: function or None, optional
A function that takes one argument and returns a string. Used to convert paths in the path column, for instance, to strip a common prefix from all the paths.
meta: pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
Returns:
dask.DataFrame
Examples
Load single file
dd.read_json('myfile.1.json')
Load multiple files
dd.read_json('myfile.*.json')
dd.read_json(['myfile.1.json', 'myfile.2.json'])
Load large line-delimited JSON files using partitions of approximately 256 MB in size
dd.read_json('data/file*.json', blocksize=2**28)
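Provide meta to skip dtype inference (a sketch; the column names and dtypes are assumptions about the data, not defaults of this function)
dd.read_json('myfile.*.json', meta={'id': 'int64', 'name': 'object'})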
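Record the originating file for each row, here stripping a hypothetical 'data/' prefix and naming the new column 'source_file'
dd.read_json('data/*.json', include_path_column='source_file',
             path_converter=lambda path: path.replace('data/', ''))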
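Choose the underlying reader with engine (a sketch; the string form assumes pandas>=2.0 with pyarrow installed, and the pyarrow engine only reads line-delimited JSON)
import pandas as pd
dd.read_json('myfile.*.json', engine=pd.read_json)  # explicit callable; same as the default
dd.read_json('myfile.*.json', engine='pyarrow')  # passed through to pandas' engine keyword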