Installing PyArrow — Apache Arrow v20.0.0
System Compatibility#
PyArrow is regularly built and tested on Windows, macOS and various Linux distributions. We strongly recommend using a 64-bit system.
Python Compatibility#
PyArrow is currently compatible with Python 3.9, 3.10, 3.11, 3.12 and 3.13.
Using Conda#
Install the latest version of PyArrow from conda-forge using Conda:
conda install -c conda-forge pyarrow
Note
While the pyarrow conda-forge package is the right choice for most users, both a minimal and a maximal variant of the package exist, either of which may be better for your use case. See Differences between conda-forge packages.
Using Pip#
Install the latest version from PyPI (Windows, Linux, and macOS):
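pip install pyarrow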
If you encounter any import issues with the pip wheels on Windows, you may need to install the Visual C++ Redistributable for Visual Studio 2015.
Warning
On Linux, you will need pip >= 19.0 to detect the prebuilt binary packages.
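If your pip is older than that, it can be upgraded first:
pip install --upgrade pip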
Installing nightly packages or from source#
See Python Development.
Dependencies#
Optional dependencies
- NumPy 1.16.6 or higher
- pandas 1.0 or higher
- cffi
Additional packages that PyArrow is compatible with are fsspec, and pytz, dateutil, or tzdata for timezones.
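As a sketch, optional dependencies can be installed together with PyArrow via pip; pandas and fsspec below are just example choices, so substitute whichever optional packages you actually need:

pip install pyarrow pandas fsspec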
tzdata on Windows#
While Arrow uses the OS-provided timezone database on Linux and macOS, it requires a user-provided database on Windows. To download and extract the text version of the IANA timezone database, follow the instructions in the C++ Runtime Dependencies or use the pyarrow utility function pyarrow.util.download_tzdata_on_windows(), which does the same.
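For example, the utility function can be invoked directly from Python on Windows (a minimal sketch; it takes no arguments here):

import pyarrow.util
pyarrow.util.download_tzdata_on_windows()  # downloads and extracts the IANA tzdata files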
By default, the timezone database will be detected at %USERPROFILE%\Downloads\tzdata. If the database has been downloaded to a different location, you will need to set a custom path to the database from Python:
import pyarrow as pa
pa.set_timezone_db_path("custom_path")
Differences between conda-forge packages#
On conda-forge, PyArrow is published as three separate packages, each providing varying levels of functionality. This is in contrast to PyPI, where only a single PyArrow package is provided.
The purpose of this split is to minimize the size of the installed package for most users (pyarrow), provide a smaller, minimal package for specialized use cases (pyarrow-core), while still providing a complete package for users who require it (pyarrow-all). What was historically pyarrow on conda-forge is now pyarrow-all, though most users can continue using pyarrow.
The pyarrow-core package includes the following functionality (a short usage sketch follows the list):
- Data Types and In-Memory Data Model
- Compute Functions (i.e., pyarrow.compute)
- Memory and IO Interfaces
- Streaming, Serialization, and IPC (i.e., pyarrow.ipc)
- Filesystem Interface (i.e., pyarrow.fs). Note: it is planned to move cloud filesystems (i.e., S3, GCS, etc.) into pyarrow in a future release, though Local FS will remain in pyarrow-core.
- File formats: Arrow/Feather, JSON, CSV, ORC (but not Parquet)
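As a rough illustration (not an exhaustive demo), the sketch below touches only functionality listed for pyarrow-core: the in-memory data model and a compute kernel:

import pyarrow as pa
import pyarrow.compute as pc

# Build an in-memory Arrow array and run a compute function on it
arr = pa.array([1, 2, 3, None])
print(pc.sum(arr))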
The pyarrow package adds the following (a Parquet example follows the list):
- Acero (i.e., pyarrow.acero)
- Tabular Datasets (i.e., pyarrow.dataset)
- Parquet (i.e., pyarrow.parquet)
- Substrait (i.e., pyarrow.substrait)
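A minimal sketch of what the pyarrow package adds on top of pyarrow-core, here writing and reading a Parquet file (the file name is arbitrary):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})
pq.write_table(table, "example.parquet")  # Parquet support is part of the pyarrow package
print(pq.read_table("example.parquet"))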
Finally, pyarrow-all adds (a minimal Flight client sketch follows):
- Arrow Flight RPC and Flight SQL (i.e., pyarrow.flight)
- Gandiva (i.e., pyarrow.gandiva)
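A minimal Flight client sketch, assuming pyarrow-all (or pyarrow plus libarrow-flight) is installed and a Flight server is already running at the hypothetical address below:

import pyarrow.flight as flight

client = flight.connect("grpc://localhost:8815")  # hypothetical server address
for info in client.list_flights():
    print(info.descriptor)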
The following table lists the functionality provided by each package and may be useful when deciding to use one package over another or when Creating A Custom Selection.

| Functionality | pyarrow-core | pyarrow | pyarrow-all |
| --- | --- | --- | --- |
| Data Types and In-Memory Data Model | ✓ | ✓ | ✓ |
| Compute Functions | ✓ | ✓ | ✓ |
| Memory and IO Interfaces | ✓ | ✓ | ✓ |
| Streaming, Serialization, and IPC | ✓ | ✓ | ✓ |
| Filesystem Interface | ✓ | ✓ | ✓ |
| File formats: Arrow/Feather, JSON, CSV, ORC | ✓ | ✓ | ✓ |
| Parquet | | ✓ | ✓ |
| Acero | | ✓ | ✓ |
| Tabular Datasets | | ✓ | ✓ |
| Substrait | | ✓ | ✓ |
| Arrow Flight RPC and Flight SQL | | | ✓ |
| Gandiva | | | ✓ |
Creating A Custom Selection#
If you know which components you need and want to control what's installed, you can create a custom selection of packages to include only the extra features you need. For example, to install pyarrow-core and add support for reading and writing Parquet, install libparquet alongside pyarrow-core:
conda install -c conda-forge pyarrow-core libparquet
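An optional, quick check that Parquet support is available after installing this selection:
python -c "import pyarrow.parquet"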
Or if you wish to use pyarrow but need support for Flight RPC:
conda install -c conda-forge pyarrow libarrow-flight
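Similarly, Flight support can be verified with:
python -c "import pyarrow.flight"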