pandas arrays, scalars, and data types — pandas 3.0.0.dev0+2103.g41968a550a documentation
Objects#
For most data types, pandas uses NumPy arrays as the concrete objects contained within an Index, Series, or DataFrame.
For some data types, pandas extends NumPy’s type system. String aliases for these types can be found at dtypes.
pandas and third-party libraries can extend NumPy’s type system (see Extension types). The top-level array() function can be used to create a new array, which may be stored in a Series, an Index, or as a column in a DataFrame.
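As a minimal sketch of the paragraph above, the following creates an extension array with pd.array() and stores it in each of the three containers. The values and the column name "x" are illustrative only.

```python
import pandas as pd

# Build a nullable float extension array; None is stored as pd.NA.
arr = pd.array([1.5, None, 3.0], dtype="Float64")

# The same array can back a Series, an Index, or a DataFrame column.
ser = pd.Series(arr, name="x")
idx = pd.Index(arr)
df = pd.DataFrame({"x": arr})

print(arr.dtype, ser.dtype, idx.dtype, df["x"].dtype)
```

In each case the extension dtype is carried over rather than being converted to a NumPy dtype.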
PyArrow#
Warning
This feature is experimental, and the API can change in a future release without warning.
The arrays.ArrowExtensionArray is backed by a pyarrow.ChunkedArray with a pyarrow.DataType instead of a NumPy array and data type. The .dtype of an arrays.ArrowExtensionArray is an ArrowDtype.
PyArrow provides array and data type support similar to NumPy's, including first-class nullability support for all data types, immutability, and more.
The table below shows the equivalent pyarrow-backed (pa), pandas extension, and NumPy (np) types that are recognized by pandas. Pyarrow-backed types must be passed into ArrowDtype to be recognized by pandas, e.g. pd.ArrowDtype(pa.bool_()).
Note
Pyarrow-backed string support is provided by both pd.StringDtype("pyarrow") and pd.ArrowDtype(pa.string()). pd.StringDtype("pyarrow") is described below in the string section and will be returned if the string alias "string[pyarrow]" is specified. pd.ArrowDtype(pa.string()) generally has better interoperability with ArrowDtype of different types.
While individual values in an arrays.ArrowExtensionArray are stored as PyArrow objects, scalars are returned as Python scalars corresponding to the data type, e.g. a PyArrow int64 will be returned as a Python int, or NA for missing values.
For more information, please see the PyArrow user guide.
Datetimes#
NumPy cannot natively represent timezone-aware datetimes. pandas supports this with the arrays.DatetimeArray extension array, which can hold timezone-naive or timezone-aware values.
Timestamp, a subclass of datetime.datetime, is pandas’ scalar type for timezone-naive or timezone-aware datetime data. NaT is the missing value for datetime data.
Properties#
Methods#
A collection of timestamps may be stored in an arrays.DatetimeArray. For timezone-aware data, the .dtype of an arrays.DatetimeArray is a DatetimeTZDtype. For timezone-naive data, np.dtype("datetime64[ns]") is used.
If the data are timezone-aware, then every value in the array must have the same timezone.
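A brief sketch of the naive and aware cases follows. The dates and the choice of "UTC" are illustrative, and the exact datetime64 unit of the naive dtype may differ between pandas versions, so it is not asserted here.

```python
import datetime

import pandas as pd

# Timestamp is the scalar type and is also a datetime.datetime.
ts = pd.Timestamp("2024-01-01 12:00", tz="UTC")
print(isinstance(ts, datetime.datetime))

# Timezone-naive data uses a NumPy datetime64 dtype; None becomes NaT.
naive = pd.Series(pd.to_datetime(["2024-01-01", None]))

# Localizing gives timezone-aware data with a DatetimeTZDtype.
aware = naive.dt.tz_localize("UTC")

print(naive.dtype, aware.dtype)
```

Both Series are backed by an arrays.DatetimeArray, which is available through the .array attribute.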
Timedeltas#
NumPy can natively represent timedeltas. pandas provides Timedelta for symmetry with Timestamp. NaT is the missing value for timedelta data.
Properties#
Methods#
A collection of Timedelta objects may be stored in a TimedeltaArray.
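For illustration, a scalar Timedelta and a small TimedeltaArray might be created as follows. The durations are arbitrary sample values.

```python
import pandas as pd

# Scalar duration.
td = pd.Timedelta(hours=1, minutes=30)

# The .array of a TimedeltaIndex is a TimedeltaArray; None becomes NaT.
arr = pd.to_timedelta(["1 day", None, "90 min"]).array

print(td.total_seconds(), type(arr).__name__)
```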
Periods#
pandas represents spans of times as Period objects.
Period#
Properties#
Methods#
A collection of Period objects may be stored in an arrays.PeriodArray. Every period in an arrays.PeriodArray must have the same freq.
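A minimal sketch of a Period scalar and a PeriodArray sharing one frequency, using monthly periods as an arbitrary example:

```python
import pandas as pd

# A single monthly span of time.
p = pd.Period("2024-03", freq="M")

# Three consecutive monthly periods; all share the same freq.
arr = pd.period_range("2024-01", periods=3, freq="M").array

print(p.start_time, arr.freqstr)
```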
Intervals#
Arbitrary intervals can be represented as Interval objects.
Properties#
A collection of intervals may be stored in an arrays.IntervalArray.
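As a short illustration, an Interval scalar supports membership tests, and an IntervalArray can be built from break points. The bounds below are sample values.

```python
import pandas as pd

# A single interval, closed on the right: (0, 5].
iv = pd.Interval(0, 5, closed="right")
print(2 in iv, 0 in iv)

# Consecutive intervals (0, 1], (1, 2], (2, 3] built from their breaks.
arr = pd.arrays.IntervalArray.from_breaks([0, 1, 2, 3])
print(len(arr), arr.closed)
```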
Nullable integer#
numpy.ndarray cannot natively represent integer data with missing values. pandas provides this through arrays.IntegerArray.
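The difference can be seen in a small comparison. This is only a sketch with made-up values: NumPy falls back to a float dtype to hold NaN, while the nullable "Int64" dtype keeps integers alongside NA.

```python
import numpy as np
import pandas as pd

# NumPy has no integer NaN, so the array is upcast to float64.
as_numpy = np.array([1, np.nan, 3])

# The nullable integer array keeps integer values and marks the gap as NA.
as_pandas = pd.array([1, None, 3], dtype="Int64")

print(as_numpy.dtype, as_pandas.dtype)
```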
Nullable float#
Categoricals#
pandas defines a custom data type for representing data that can take only a limited, fixed set of values. The dtype of a Categorical can be described by a CategoricalDtype.
Categorical data can be stored in a pandas.Categorical.
The alternative Categorical.from_codes() constructor can be used when you already have the categories and integer codes.
The dtype information is available on the Categorical.
np.asarray(categorical) works by implementing the array interface. Be aware that this converts the Categorical back to a NumPy array, so the categories and order information are not preserved!
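The points above can be sketched together: direct construction, construction from codes, the dtype, and conversion to NumPy. The category labels "low" and "high" are illustrative.

```python
import numpy as np
import pandas as pd

# Direct construction from values, with an explicit ordered category list.
cat = pd.Categorical(
    ["low", "high", "low"], categories=["low", "high"], ordered=True
)

# Equivalent construction from integer codes and known categories.
same = pd.Categorical.from_codes(
    [0, 1, 0], categories=["low", "high"], ordered=True
)

# The dtype carries the categories and the ordered flag.
print(cat.dtype)

# Converting to NumPy keeps the values but drops the categorical metadata.
plain = np.asarray(cat)
print(type(plain).__name__, plain.tolist())
```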
A Categorical can be stored in a Series or DataFrame. To create a Series of dtype category, use cat = s.astype(dtype) or Series(..., dtype=dtype), where dtype is either
- the string 'category'
- an instance of CategoricalDtype.
If the Series is of dtype CategoricalDtype, Series.cat can be used to change the categorical data. See Categorical accessor for more.
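Both ways of creating a categorical Series, and one use of the Series.cat accessor, might look like the sketch below. The labels and the renaming are illustrative.

```python
import pandas as pd

s = pd.Series(["a", "b", "a"])

# Option 1: the string alias; categories are inferred from the data.
by_alias = s.astype("category")

# Option 2: an explicit CategoricalDtype controlling categories and order.
dtype = pd.CategoricalDtype(categories=["b", "a"], ordered=True)
by_dtype = pd.Series(["a", "b", "a"], dtype=dtype)

# The .cat accessor exposes categorical-specific operations.
renamed = by_dtype.cat.rename_categories({"a": "A", "b": "B"})

print(by_alias.cat.categories.tolist(), renamed.tolist())
```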
More methods are available on Categorical:
Sparse#
Data where a single value is repeated many times (e.g. 0 or NaN) may be stored efficiently as an arrays.SparseArray.
The Series.sparse accessor may be used to access sparse-specific attributes and methods if the Series contains sparse values. See Sparse accessor and the user guide for more.
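A small sketch of a SparseArray and the Series.sparse accessor, using 0 as the repeated fill value in made-up data:

```python
import pandas as pd

# Only the values that differ from fill_value are physically stored.
arr = pd.arrays.SparseArray([0, 0, 1, 0, 0, 2], fill_value=0)
print(arr.sp_values)

# The accessor exposes sparse-specific attributes on a Series.
ser = pd.Series(arr)
print(ser.sparse.npoints, ser.sparse.fill_value, ser.sparse.density)
```

Here density is the fraction of stored (non-fill) points, which is 2 out of 6 for this data.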
Strings#
When working with text data, where each valid element is a string or missing, we recommend using StringDtype (with the alias "string").
The Series.str accessor is available for Series backed by an arrays.StringArray. See String handling for more.
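As an illustration, a Series with the "string" alias and one Series.str method could be used as follows. The sample words are arbitrary, and the concrete backing array class may depend on the configured storage.

```python
import pandas as pd

# Each element is either a string or missing.
ser = pd.Series(["apple", None, "cherry"], dtype="string")

# String methods propagate missing values instead of raising.
upper = ser.str.upper()

print(ser.dtype, upper.tolist())
```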
Nullable Boolean#
The boolean dtype (with the alias "boolean") provides support for storing boolean data (True, False) with missing values, which is not possible with a bool numpy.ndarray.
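The contrast with NumPy can be sketched briefly. With a missing value present, NumPy falls back to an object array, whereas the "boolean" dtype stores the gap as NA. The values are illustrative.

```python
import numpy as np
import pandas as pd

# NumPy cannot hold a missing value in a bool array, so it uses object dtype.
as_numpy = np.array([True, False, None])

# The nullable boolean array keeps True/False and marks the gap as NA.
as_pandas = pd.array([True, False, None], dtype="boolean")

print(as_numpy.dtype, as_pandas.dtype, as_pandas.isna())
```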