dask.dataframe.from_pandas
dask.dataframe.from_pandas(data, npartitions=None, sort=True, chunksize=None)
Construct a Dask DataFrame from a Pandas DataFrame
This splits an in-memory Pandas DataFrame into several parts and constructs a Dask DataFrame from those parts, on which Dask can operate in parallel. By default, the input is sorted by its index to produce cleanly divided partitions (with known divisions). To preserve the input ordering, make sure the input index is monotonically increasing. The sort=False option also avoids reordering, but does not result in known divisions.
Parameters:
data: pandas.DataFrame or pandas.Series
The DataFrame/Series with which to construct a Dask DataFrame/Series
npartitions: int, optional, default 1
The number of partitions of the index to create. Note that if there are duplicate values or insufficient elements in data.index, the output may have fewer partitions than requested.
chunksize: int, optional
The desired number of rows per index partition to use. Note that depending on the size and index of the dataframe, actual partition sizes may vary.
sort: bool, default True
Sort the input by index first to obtain cleanly divided partitions (with known divisions). If False, the input is not sorted and all divisions are set to None.
Returns:
dask.DataFrame or dask.Series
A dask DataFrame/Series partitioned along the index
Raises:
TypeError
If something other than a pandas.DataFrame or pandas.Series is passed in.
See also
from_array: Construct a dask.DataFrame from an array that has record dtype
read_csv: Construct a dask.DataFrame from a CSV file
Examples
>>> import pandas as pd
>>> from dask.dataframe import from_pandas
>>> df = pd.DataFrame(dict(a=list('aabbcc'), b=list(range(6))),
...                   index=pd.date_range(start='20100101', periods=6))
>>> ddf = from_pandas(df, npartitions=3)
>>> ddf.divisions
(Timestamp('2010-01-01 00:00:00'),
 Timestamp('2010-01-03 00:00:00'),
 Timestamp('2010-01-05 00:00:00'),
 Timestamp('2010-01-06 00:00:00'))
>>> ddf = from_pandas(df.a, npartitions=3)  # Works with Series too!
>>> ddf.divisions
(Timestamp('2010-01-01 00:00:00'),
 Timestamp('2010-01-03 00:00:00'),
 Timestamp('2010-01-05 00:00:00'),
 Timestamp('2010-01-06 00:00:00'))