pandas arrays, scalars, and data types — pandas 3.0.0.dev0+2103.g41968a550a documentation

Objects#

For most data types, pandas uses NumPy arrays as the concrete objects contained within an Index, Series, or DataFrame.

For some data types, pandas extends NumPy’s type system. String aliases for these types can be found at dtypes.

pandas and third-party libraries can extend NumPy’s type system (see Extension types). The top-level array() function can be used to create a new array, which may be stored in a Series, in an Index, or as a column in a DataFrame.
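As a minimal sketch of the paragraph above, the following builds an extension array with pd.array() and stores it in each of the three containers. The "Int64" dtype and the column name "x" are illustrative choices, not something this page prescribes.

```python
import pandas as pd

# Build a pandas extension array with an explicit nullable integer dtype.
arr = pd.array([1, 2, None], dtype="Int64")

# The same array can back a Series, an Index, or a DataFrame column.
ser = pd.Series(arr)
idx = pd.Index(arr)
df = pd.DataFrame({"x": arr})  # "x" is an arbitrary column name

print(type(arr).__name__)
print(ser.dtype, idx.dtype, df["x"].dtype)
```

In each case the extension dtype is carried along rather than being converted to a NumPy dtype.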

PyArrow#

Warning

This feature is experimental, and the API can change in a future release without warning.

The arrays.ArrowExtensionArray is backed by a pyarrow.ChunkedArray with a pyarrow.DataType instead of a NumPy array and data type. The .dtype of an arrays.ArrowExtensionArray is an ArrowDtype.

PyArrow provides array and data type support similar to NumPy’s, including first-class nullability support for all data types, immutability, and more.

The table below shows the equivalent PyArrow-backed (pa), pandas extension, and NumPy (np) types that are recognized by pandas. PyArrow-backed types below need to be passed into ArrowDtype to be recognized by pandas, e.g. pd.ArrowDtype(pa.bool_()).

Note

PyArrow-backed string support is provided by both pd.StringDtype("pyarrow") and pd.ArrowDtype(pa.string()). pd.StringDtype("pyarrow") is described below in the string section and will be returned if the string alias "string[pyarrow]" is specified. pd.ArrowDtype(pa.string()) generally has better interoperability with ArrowDtype of different types.

While individual values in an arrays.ArrowExtensionArray are stored as PyArrow objects, scalars are returned as Python scalars corresponding to the data type, e.g. a PyArrow int64 will be returned as a Python int, or NA for missing values.

For more information, please see the PyArrow user guide.

Datetimes#

NumPy cannot natively represent timezone-aware datetimes. pandas supports this with the arrays.DatetimeArray extension array, which can hold timezone-naive or timezone-aware values.

Timestamp, a subclass of datetime.datetime, is pandas’ scalar type for timezone-naive or timezone-aware datetime data. NaT is the missing value for datetime data.
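A short illustration of the scalar type, with arbitrary example dates:

```python
import datetime

import pandas as pd

naive = pd.Timestamp("2024-03-10 12:00")            # no timezone attached
aware = pd.Timestamp("2024-03-10 12:00", tz="UTC")  # timezone-aware

# Timestamp can be used wherever a datetime.datetime is expected.
print(isinstance(naive, datetime.datetime))
print(naive.tz, aware.tz)

# NaT is the missing-value marker and is detected by pd.isna().
print(pd.isna(pd.NaT))
```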

Properties#

Methods#

A collection of timestamps may be stored in an arrays.DatetimeArray. For timezone-aware data, the .dtype of an arrays.DatetimeArray is a DatetimeTZDtype. For timezone-naive data, np.dtype("datetime64[ns]") is used.

If the data are timezone-aware, then every value in the array must have the same timezone.
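The difference between the two dtypes can be sketched with a small Series. The dates and the "UTC" timezone are illustrative; the example deliberately does not print or assert the datetime resolution, since that may vary between pandas versions.

```python
import pandas as pd

naive = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"]))
aware = naive.dt.tz_localize("UTC")  # one timezone for every value

# Timezone-naive data uses a NumPy datetime64 dtype.
print(naive.dtype)

# Timezone-aware data uses DatetimeTZDtype and is backed by a DatetimeArray.
print(aware.dtype)
print(type(aware.array).__name__)
```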

Timedeltas#

NumPy can natively represent timedeltas. pandas provides Timedelta for symmetry with Timestamp. NaT is the missing value for timedelta data.

Properties#

Methods#

A collection of Timedelta objects may be stored in an arrays.TimedeltaArray.
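A brief sketch of the scalar and array forms, using made-up durations:

```python
import pandas as pd

td = pd.Timedelta(days=1, hours=6)

# Timedelta combines with Timestamp through ordinary arithmetic.
shifted = pd.Timestamp("2024-01-01") + td
print(shifted)

# pd.array() returns a TimedeltaArray for timedelta input.
arr = pd.array(pd.to_timedelta(["1 day", "2 hours"]))
print(type(arr).__name__)

# A missing timedelta is represented by NaT.
print(pd.isna(pd.Timedelta("nat")))
```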

Periods#

pandas represents spans of time as Period objects.

Period#

Properties#

Methods#

A collection of Period objects may be stored in an arrays.PeriodArray. Every period in an arrays.PeriodArray must have the same freq.
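For illustration, the following creates monthly periods and collects them in an array. The monthly frequency "M" and the dates are arbitrary choices.

```python
import pandas as pd

# A Period covers a whole span of time, here the month of January 2024.
p = pd.Period("2024-01", freq="M")
print(p.start_time, p.end_time)

# Integer arithmetic moves by whole periods of the same frequency.
nxt = p + 1

# pd.array() infers a PeriodArray when every element shares one freq.
arr = pd.array([p, nxt, p + 2])
print(type(arr).__name__, arr.dtype)
```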

Intervals#

Arbitrary intervals can be represented as Interval objects.

Properties#

A collection of intervals may be stored in an arrays.IntervalArray.
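A small sketch of both objects, with arbitrary integer endpoints:

```python
import pandas as pd

# Intervals are closed on the right by default: (0, 5]
iv = pd.Interval(0, 5)
print(3 in iv)  # inside the interval
print(0 in iv)  # the open left endpoint is excluded
print(iv.length)

# from_breaks() builds consecutive intervals from a list of break points.
arr = pd.arrays.IntervalArray.from_breaks([0, 1, 2, 3])
print(arr)
```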

Nullable integer#

numpy.ndarray cannot natively represent integer data with missing values. pandas provides this through arrays.IntegerArray.
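The limitation and the workaround can be sketched side by side; the values are arbitrary.

```python
import pandas as pd

# With the default NumPy-backed dtype, a missing value forces integers to float.
as_float = pd.Series([1, 2, None])
print(as_float.dtype)

# The nullable "Int64" dtype keeps the integers and marks the gap with pd.NA.
nullable = pd.array([1, 2, None], dtype="Int64")
print(type(nullable).__name__, nullable.dtype)
print(nullable[2])
print(nullable.sum())  # missing values are skipped by default
```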

Nullable float#

Categoricals#

pandas defines a custom data type for representing data that can take only a limited, fixed set of values. The dtype of a Categorical can be described by a CategoricalDtype.

Categorical data can be stored in a pandas.Categorical:
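A minimal sketch, using made-up category labels:

```python
import pandas as pd

cat = pd.Categorical(
    ["low", "high", "low", "mid"],
    categories=["low", "mid", "high"],  # explicit category order
    ordered=True,
)

print(cat.categories)  # the fixed set of allowed values
print(cat.codes)       # integer position of each value within categories
print(cat.min(), cat.max())  # meaningful because the categorical is ordered
```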

The alternative Categorical.from_codes() constructor can be used when you have the categories and integer codes already:
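For example, with hypothetical codes and labels (a code of -1 marks a missing value):

```python
import pandas as pd

codes = [0, 2, 0, 1, -1]  # -1 is the sentinel for a missing value
cat = pd.Categorical.from_codes(
    codes,
    categories=["low", "mid", "high"],
    ordered=True,
)

print(cat)
```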

The dtype information is available on the Categorical through its dtype attribute.

np.asarray(categorical) works by implementing the array interface. Be aware that this converts the Categorical back to a NumPy array, so categories and order information are not preserved!
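The loss of information can be seen in a short sketch with arbitrary labels:

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["b", "a", "b"], categories=["b", "a"], ordered=True)

values = np.asarray(cat)

# The result is a plain ndarray of the values; the category list and
# the ordering defined on the Categorical are no longer attached to it.
print(type(values).__name__)
print(values.tolist())
print(hasattr(values, "categories"))
```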

A Categorical can be stored in a Series or DataFrame. To create a Series of dtype category, use cat = s.astype(dtype) or Series(..., dtype=dtype), where dtype is either the string "category" or an instance of CategoricalDtype.

If the Series is of dtype CategoricalDtype, Series.cat can be used to change the categorical data. See Categorical accessor for more.
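Both ways of creating a categorical Series, and one use of the accessor, can be sketched as follows. The labels and the renaming are illustrative only.

```python
import pandas as pd

s = pd.Series(["a", "b", "a"])

# Option 1: the string alias infers the categories from the data.
as_cat = s.astype("category")

# Option 2: a CategoricalDtype fixes the categories and their order.
dtype = pd.CategoricalDtype(categories=["b", "a"], ordered=True)
ordered = pd.Series(["a", "b", "a"], dtype=dtype)

print(list(as_cat.cat.categories))
print(ordered.cat.codes.tolist())

# The .cat accessor returns a new Series with modified categories.
renamed = as_cat.cat.rename_categories({"a": "A", "b": "B"})
print(renamed.tolist())
```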

More methods are available on Categorical:

Sparse#

Data where a single value is repeated many times (e.g. 0 or NaN) may be stored efficiently as an arrays.SparseArray.

The Series.sparse accessor may be used to access sparse-specific attributes and methods if the Series contains sparse values. See Sparse accessor and the user guide for more.
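A small sketch with mostly-zero data; the values and the fill value of 0 are arbitrary.

```python
import pandas as pd

# Only the values that differ from fill_value are physically stored.
arr = pd.arrays.SparseArray([0, 0, 1, 0, 2], fill_value=0)
s = pd.Series(arr)

print(s.dtype)
print(s.sparse.npoints)     # number of stored (non-fill) values
print(s.sparse.density)     # stored values divided by total length
print(s.sparse.fill_value)

# Convert back to an ordinary dense Series when needed.
dense = s.sparse.to_dense()
print(dense.tolist())
```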

Strings#

When working with text data, where each valid element is a string or missing, we recommend using StringDtype (with the alias "string").

The Series.str accessor is available for Series backed by an arrays.StringArray. See String handling for more.
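As a brief illustration with made-up text values:

```python
import pandas as pd

s = pd.Series(["apple", None, "cherry"], dtype="string")

# String methods are applied element-wise; missing values stay missing.
upper = s.str.upper()
lengths = s.str.len()

print(s.dtype)
print(upper.tolist())
print(lengths.tolist())
```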

Nullable Boolean#

The boolean dtype (with the alias "boolean") provides support for storing boolean data (True, False) with missing values, which is not possible with a bool numpy.ndarray.
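A minimal sketch with arbitrary values. The logical operation at the end is included only to show that the missing value does not have to propagate when the result is already determined.

```python
import pandas as pd

arr = pd.array([True, False, None], dtype="boolean")

print(type(arr).__name__, arr.dtype)
print(arr.isna().tolist())

# "anything OR True" is True, so the missing entry resolves to True here.
either = arr | True
print(either.tolist())
```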

Utilities#

Constructors#

Data type introspection#

Iterable introspection#

Scalar introspection#