Building Arrow C++ — Apache Arrow v20.0.0

System setup#

Arrow uses CMake as a build configuration system. We recommend building out-of-source: that is, in a separate build directory rather than inside the source tree, so that build artifacts stay isolated from the sources.

Building requires a C++17-capable compiler, CMake 3.16 or higher, and Git; on Linux and macOS, either make or ninja is also needed as the build tool.

On Ubuntu/Debian you can install the requirements with:

sudo apt-get install \
     build-essential \
     ninja-build \
     cmake

On Alpine Linux:

apk add autoconf \
        bash \
        cmake \
        g++ \
        gcc \
        ninja \
        make

On Fedora Linux:

sudo dnf install \
     cmake \
     gcc \
     gcc-c++ \
     ninja-build \
     make

On Arch Linux:

sudo pacman -S --needed \
     base-devel \
     ninja \
     cmake

On macOS, you can use Homebrew:

git clone https://github.com/apache/arrow.git
cd arrow
brew update && brew bundle --file=cpp/Brewfile

With vcpkg:

git clone https://github.com/apache/arrow.git
cd arrow
vcpkg install \
    --x-manifest-root cpp \
    --feature-flags=versions \
    --clean-after-build

On MSYS2:

pacman --sync --refresh --noconfirm \
       ccache \
       git \
       mingw-w64-${MSYSTEM_CARCH}-boost \
       mingw-w64-${MSYSTEM_CARCH}-brotli \
       mingw-w64-${MSYSTEM_CARCH}-cmake \
       mingw-w64-${MSYSTEM_CARCH}-gcc \
       mingw-w64-${MSYSTEM_CARCH}-gflags \
       mingw-w64-${MSYSTEM_CARCH}-glog \
       mingw-w64-${MSYSTEM_CARCH}-gtest \
       mingw-w64-${MSYSTEM_CARCH}-lz4 \
       mingw-w64-${MSYSTEM_CARCH}-protobuf \
       mingw-w64-${MSYSTEM_CARCH}-python3-numpy \
       mingw-w64-${MSYSTEM_CARCH}-rapidjson \
       mingw-w64-${MSYSTEM_CARCH}-snappy \
       mingw-w64-${MSYSTEM_CARCH}-thrift \
       mingw-w64-${MSYSTEM_CARCH}-zlib \
       mingw-w64-${MSYSTEM_CARCH}-zstd

Building#

All the instructions below assume that you have cloned the Arrow git repository and navigated to the cpp subdirectory:

$ git clone https://github.com/apache/arrow.git
$ cd arrow/cpp

CMake presets#

Using CMake version 3.21.0 or higher, some presets for various build configurations are provided. You can get a list of the available presets using cmake --list-presets:

$ cmake --list-presets   # from inside the cpp subdirectory
Available configure presets:

  "ninja-debug-minimal"  - Debug build without anything enabled
  "ninja-debug-basic"    - Debug build with tests and reduced dependencies
  "ninja-debug"          - Debug build with tests and more optional components
  [ etc. ]

You can inspect the specific options enabled by a given preset using cmake -N --preset <preset name>:

$ cmake --preset -N ninja-debug-minimal
Preset CMake variables:

  ARROW_BUILD_INTEGRATION="OFF"
  ARROW_BUILD_STATIC="OFF"
  ARROW_BUILD_TESTS="OFF"
  ARROW_EXTRA_ERROR_CONTEXT="ON"
  ARROW_WITH_RE2="OFF"
  ARROW_WITH_UTF8PROC="OFF"
  CMAKE_BUILD_TYPE="Debug"

You can also create a build from a given preset:

$ mkdir build   # from inside the cpp subdirectory
$ cd build
$ cmake .. --preset ninja-debug-minimal
Preset CMake variables:

 ARROW_BUILD_INTEGRATION="OFF"
 ARROW_BUILD_STATIC="OFF"
 ARROW_BUILD_TESTS="OFF"
 ARROW_EXTRA_ERROR_CONTEXT="ON"
 ARROW_WITH_RE2="OFF"
 ARROW_WITH_UTF8PROC="OFF"
 CMAKE_BUILD_TYPE="Debug"

-- Building using CMake version: 3.21.3 [ etc. ]

and then ask to compile the build targets:

$ cmake --build .
[142/142] Creating library symlink debug/libarrow.so.700 debug/libarrow.so

$ tree debug/
debug/
├── libarrow.so -> libarrow.so.700
├── libarrow.so.700 -> libarrow.so.700.0.0
└── libarrow.so.700.0.0

0 directories, 3 files

$ cmake --install .

When creating a build, it is possible to pass custom options besides the preset-defined ones, for example:

$ cmake .. --preset ninja-debug-minimal -DCMAKE_INSTALL_PREFIX=/usr/local

Note

The CMake presets are provided as a help to get started with Arrow development and understand common build configurations. They are not guaranteed to be immutable but may change in the future based on feedback.

Instead of relying on CMake presets, it is highly recommended that automated builds, continuous integration, release scripts, etc. use manual configuration, as outlined below.

Manual configuration#

The build system uses CMAKE_BUILD_TYPE=release by default, so if this argument is omitted then a release build will be produced.

Several build types are possible: Release, Debug, and RelWithDebInfo.

Note

These build types provide suitable optimization/debug flags by default, but you can change them by specifying -DARROW_C_FLAGS_${BUILD_TYPE}=... and/or -DARROW_CXX_FLAGS_${BUILD_TYPE}=.... ${BUILD_TYPE} is the upper-cased build type. For example, DEBUG (-DARROW_C_FLAGS_DEBUG=... / -DARROW_CXX_FLAGS_DEBUG=...) for the Debug build type and RELWITHDEBINFO (-DARROW_C_FLAGS_RELWITHDEBINFO=... / -DARROW_CXX_FLAGS_RELWITHDEBINFO=...) for the RelWithDebInfo build type.

For example, you can use -O3 as an optimization flag for the Release build type by passing -DARROW_CXX_FLAGS_RELEASE=-O3. You can use -g3 as a debug flag for the Debug build type by passing -DARROW_CXX_FLAGS_DEBUG=-g3.
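The ${BUILD_TYPE} upper-casing rule above can be sketched in shell (the build type and flag names here just illustrate the naming pattern):

```shell
# Derive the per-build-type flag variable name from a CMake build
# type by upper-casing it, as described above.
build_type="RelWithDebInfo"
upper=$(printf '%s' "$build_type" | tr '[:lower:]' '[:upper:]')
flag_var="ARROW_CXX_FLAGS_${upper}"
echo "$flag_var"   # prints ARROW_CXX_FLAGS_RELWITHDEBINFO
```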

You can also use the standard CMAKE_C_FLAGS_${BUILD_TYPE} and CMAKE_CXX_FLAGS_${BUILD_TYPE} variables, but the ARROW_C_FLAGS_${BUILD_TYPE} and ARROW_CXX_FLAGS_${BUILD_TYPE} variables are recommended. The CMAKE_C_FLAGS_${BUILD_TYPE} and CMAKE_CXX_FLAGS_${BUILD_TYPE} variables replace all default flags provided by CMake, while ARROW_C_FLAGS_${BUILD_TYPE} and ARROW_CXX_FLAGS_${BUILD_TYPE} just append the specified flags, which allows selectively overriding some of the defaults.

You can also run the default build with the flag -DARROW_EXTRA_ERROR_CONTEXT=ON; see Extra debugging help.

Minimal release build (1GB of RAM for building or more recommended):

$ mkdir build-release
$ cd build-release
$ cmake ..
$ make -j8       # if you have 8 CPU cores, otherwise adjust
$ make install
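The -j8 above assumes 8 CPU cores. A portable way to derive the core count instead of hard-coding it, assuming nproc (GNU coreutils) or sysctl (macOS) is available:

```shell
# Pick a parallelism level for make -j automatically; fall back to 4
# if neither utility exists on this system.
jobs=$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
echo "make -j${jobs}"
```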

Minimal debug build with unit tests (4GB of RAM for building or more recommended):

$ git submodule update --init --recursive
$ export ARROW_TEST_DATA=$PWD/../testing/data
$ mkdir build-debug
$ cd build-debug
$ cmake -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON ..
$ make -j8       # if you have 8 CPU cores, otherwise adjust
$ make unittest  # to run the tests
$ make install
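Before running the tests, it can help to verify that the testing submodule actually checked out; a small sketch using the same relative path as above (run from inside the cpp subdirectory):

```shell
# Sanity-check that the test-data directory from the testing
# submodule exists before exporting ARROW_TEST_DATA.
ARROW_TEST_DATA=$PWD/../testing/data
if [ ! -d "$ARROW_TEST_DATA" ]; then
    echo "warning: run 'git submodule update --init --recursive' first"
fi
export ARROW_TEST_DATA
```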

The unit tests are not built by default. After building, one can also invoke the unit tests using the ctest tool provided by CMake (note that the test target depends on Python being available).

On some Linux distributions, running the test suite might require setting an explicit locale. If you see any locale-related errors, try setting the LC_ALL environment variable (this requires the locales package or equivalent):

$ export LC_ALL="en_US.UTF-8"
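A hedged sketch that only uses en_US.UTF-8 when it is actually installed, falling back to C.UTF-8 otherwise (assumes the locale utility is available):

```shell
# Use en_US.UTF-8 when installed, otherwise fall back to the C.UTF-8
# locale that most minimal images provide.
if locale -a 2>/dev/null | grep -qi '^en_US\.utf-\?8$'; then
    export LC_ALL="en_US.UTF-8"
else
    export LC_ALL="C.UTF-8"
fi
echo "$LC_ALL"
```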

Faster builds with Ninja#

Many contributors use the Ninja build system to get faster builds. It especially speeds up incremental builds. To use ninja, pass -GNinja when calling cmake and then use the ninja command instead of make.

Unity builds#

The CMake unity builds option can make full builds significantly faster, but it also increases the memory requirements. Consider turning it on (using -DCMAKE_UNITY_BUILD=ON) if memory consumption is not an issue.

Optional Components#

By default, the C++ build system creates a fairly minimal build. We have several optional system components which you can opt into building by passing boolean flags to cmake.

Compression options available in Arrow are controlled by boolean flags such as ARROW_WITH_BROTLI, ARROW_WITH_BZ2, ARROW_WITH_LZ4, ARROW_WITH_SNAPPY, ARROW_WITH_ZLIB, and ARROW_WITH_ZSTD.

Some features of the core Arrow shared library can be switched off for improved build times if they are not required for your application; the full list of ARROW_* options can be found in cpp/cmake_modules/DefineOptions.cmake.

Note

If your use-case is limited to reading/writing Arrow data then the default options should be sufficient. However, if you wish to build any tests/benchmarks then ARROW_JSON is also required (it will be enabled automatically). If extended format support is desired then adding ARROW_PARQUET, ARROW_CSV, ARROW_JSON, or ARROW_ORC shouldn’t enable any additional components.

Note

In general, it’s a good idea to enable ARROW_COMPUTE if you anticipate using any compute kernels beyond cast. While there are (as of 12.0.0) a handful of additional kernels built in by default, this list may change in the future as it’s partly based on kernel usage in the current format implementations.

Optional Targets#

For development builds, you will often want to enable additional targets in order to exercise your changes, using the following cmake options.

Optional Checks#

The following special checks are available as well. They instrument the generated code in various ways so as to detect select classes of problems at runtime (for example when executing unit tests).

Some of those options are mutually incompatible, so you may have to build several times with different options if you want to exercise all of them.

CMake version requirements#

We support CMake 3.16 and higher.

LLVM and Clang Tools#

We are currently using LLVM for library builds and for other developer tools such as code formatting with clang-format. LLVM can be installed via most modern package managers (apt, yum, conda, Homebrew, vcpkg, chocolatey).

Build Dependency Management#

The build system supports a number of third-party dependencies (for example Boost, gflags, Protobuf, RapidJSON, Thrift, zlib, and zstd).

The CMake option ARROW_DEPENDENCY_SOURCE is a global option that instructs the build system how to resolve each dependency. There are a few options: AUTO, BUNDLED, SYSTEM, CONDA, VCPKG, and BREW.

The default method is AUTO unless you are developing within an active conda environment (detected by presence of the $CONDA_PREFIX environment variable), in which case it is CONDA.
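The default-selection rule described above can be sketched as:

```shell
# CONDA is the default when a conda environment is active (detected
# via $CONDA_PREFIX being set), otherwise AUTO.
if [ -n "${CONDA_PREFIX:-}" ]; then
    dependency_source="CONDA"
else
    dependency_source="AUTO"
fi
echo "ARROW_DEPENDENCY_SOURCE=${dependency_source}"
```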

Individual Dependency Resolution#

While -DARROW_DEPENDENCY_SOURCE=$SOURCE sets a global default for all packages, the resolution strategy can be overridden for individual packages by setting -D$PACKAGE_NAME_SOURCE=.... For example, to build Protocol Buffers from source, set:

-DProtobuf_SOURCE=BUNDLED

This variable is unfortunately case-sensitive; the name used for each package is listed above, but the most up-to-date listing can be found in cpp/cmake_modules/ThirdpartyToolchain.cmake.

Bundled Dependency Versions#

When using the BUNDLED method to build a dependency from source, the version number from cpp/thirdparty/versions.txt is used. There is also a dependency source downloader script (see below), which can be used to set up offline builds.

When using BUNDLED for dependency resolution (and if you use either the jemalloc or mimalloc allocators, which are recommended), statically linking the Arrow libraries in a third party project is more complex. See below for instructions about how to configure your build system in this case.

Offline Builds#

If you do not use the above variables to direct the Arrow build system to preinstalled dependencies, they will be built automatically by the Arrow build system. The source archive for each dependency will be downloaded via the internet, which can cause issues in environments with limited access to the internet.

To enable offline builds, you can download the source artifacts yourself and use environment variables of the form ARROW_$LIBRARY_URL to direct the build system to read from a local file rather than accessing the internet.

To make this easier for you, we have prepared a script thirdparty/download_dependencies.sh which will download the correct version of each dependency to a directory of your choosing. It will print a list of bash-style environment variable statements at the end to use for your build script.

# Download tarballs into $HOME/arrow-thirdparty
$ ./thirdparty/download_dependencies.sh $HOME/arrow-thirdparty

You can then invoke CMake to create the build directory and it will use the declared environment variables pointing to the downloaded archives instead of downloading them anew (which would otherwise happen once for each build directory).
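As a hedged sketch of the ARROW_$LIBRARY_URL convention, using zstd as the example library and a made-up archive name (the real variable statements, with the correct names and versions, are printed by download_dependencies.sh):

```shell
# Point one dependency at a local archive instead of the internet.
DEPS_DIR="$HOME/arrow-thirdparty"
export ARROW_ZSTD_URL="$DEPS_DIR/zstd.tar.gz"
echo "$ARROW_ZSTD_URL"
```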

Statically Linking#

When -DARROW_BUILD_STATIC=ON, all build dependencies built as static libraries by the Arrow build system will be merged together to create a static library arrow_bundled_dependencies. In UNIX-like environments (Linux, macOS, MinGW), this is called libarrow_bundled_dependencies.a and on Windows with Visual Studio arrow_bundled_dependencies.lib. This “dependency bundle” library is installed in the same place as the other Arrow static libraries.

If you are using CMake, the bundled dependencies will automatically be included when linking if you use the arrow_static CMake target. In other build systems, you may need to explicitly link to the dependency bundle. We created an example CMake-based build configuration to show you a working example.

On Linux and macOS, if your application does not link to the pthread library already, you must include -pthread in your linker setup. In CMake this can be accomplished with the Threads built-in package:

set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)
target_link_libraries(my_target PRIVATE Threads::Threads)

Deprecations and API Changes#

We use the macro ARROW_DEPRECATED, which wraps the C++ [[deprecated]] attribute, for APIs that have been deprecated. It is a good practice to compile third-party applications with -Werror=deprecated-declarations (for GCC/Clang, or the equivalent flags of other compilers) to proactively catch and account for API changes.

Modular Build Targets#

Since there are several major parts of the C++ project, we have provided modular CMake targets for building each library component, group of unit tests and benchmarks, and their dependencies; for example, make arrow builds only the Arrow core libraries and make parquet the Parquet libraries.

Note

If you have selected Ninja as CMake generator, replace make arrow with ninja arrow, and so on.

To build the unit tests or benchmarks, add -tests or -benchmarks to the target name. So make arrow-tests will build the Arrow core unit tests. Using the -all target, e.g. parquet-all, will build everything.
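The naming convention can be enumerated in a quick sketch (parquet used as the example component):

```shell
# Print the modular make targets derived from one component name:
# the component itself plus the -tests, -benchmarks, and -all suffixes.
for suffix in "" -tests -benchmarks -all; do
    echo "make parquet${suffix}"
done
```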

If you wish to only build and install one or more project subcomponents, we have provided the CMake option ARROW_OPTIONAL_INSTALL to only install targets that have been built. For example, if you only wish to build the Parquet libraries, its tests, and its dependencies, you can run:

cmake .. -DARROW_PARQUET=ON \
      -DARROW_OPTIONAL_INSTALL=ON \
      -DARROW_BUILD_TESTS=ON
make parquet
make install

If you omit an explicit target when invoking make, all targets will be built.

Debugging with Xcode on macOS#

Xcode is the IDE provided with macOS and can be used to develop and debug Arrow by generating an Xcode project:

cd cpp
mkdir xcode-build
cd xcode-build
cmake .. -G Xcode -DARROW_BUILD_TESTS=ON -DCMAKE_BUILD_TYPE=DEBUG
open arrow.xcodeproj

This will generate a project and open it in the Xcode app. As an alternative, the command xcodebuild will perform a command-line build using the generated project. It is recommended to use the “Automatically Create Schemes” option when first launching the project. Selecting an auto-generated scheme will allow you to build and run a unit test with breakpoints enabled.