
§Polars: DataFrames in Rust

Polars is a DataFrame library for Rust. It is based on Apache Arrow’s memory model. Apache Arrow provides very cache-efficient columnar data structures and is becoming the de facto standard for columnar data.

§Quickstart

We recommend building queries directly with polars-lazy. This allows you to combine expressions into powerful aggregations and column selections. All expressions are evaluated in parallel and queries are optimized just in time.

use polars::prelude::*;

let lf1 = LazyFrame::scan_parquet("myfile_1.parquet", Default::default())?
    .group_by([col("ham")])
    .agg([
        // expressions can be combined into powerful aggregations
        col("foo")
            .sort_by([col("ham").rank(Default::default(), None)], SortMultipleOptions::default())
            .last()
            .alias("last_foo_ranked_by_ham"),
        // every expression runs in parallel
        col("foo").cum_min(false).alias("cumulative_min_per_group"),
        // every expression runs in parallel
        col("foo").reverse().implode().alias("reverse_group"),
    ]);

let lf2 = LazyFrame::scan_parquet("myfile_2.parquet", Default::default())?
    .select([col("ham"), col("spam")]);

let df = lf1
    .join(lf2, [col("ham")], [col("ham")], JoinArgs::new(JoinType::Left))
    // now we finally materialize the result.
    .collect()?;

Because Polars uses Arrow’s memory model, its data structures can be shared zero-copy with processes in many different languages.

§Tree Of Contents

§Cookbooks

See the examples in the cookbooks.

§Data Structures

The base data structures provided by polars are DataFrame, Series, and ChunkedArray. We will provide a short, top-down view of these data structures.

§DataFrame

A DataFrame is a two-dimensional data structure backed by Series columns and can be seen as an abstraction over a Vec of Series. Operations that can be executed on a DataFrame are similar to those in a SQL-like query. You can GROUP BY, JOIN, PIVOT, etc.

§Series

Series are the type-agnostic columnar data representation of Polars. The Series struct and SeriesTrait trait provide many operations out of the box. Most type-agnostic operations are provided by Series. Type-aware operations require downcasting to the typed data structure that is wrapped by the Series. The underlying typed data structure is a ChunkedArray.

§ChunkedArray

ChunkedArray is a wrapper around an Arrow array that can contain multiple chunks, e.g. a Vec of Arrow arrays. These are the root data structures of Polars, and implement many operations. Most operations are implemented by traits defined in chunked_array::ops, or on the ChunkedArray struct itself.

§SIMD

Polars / Arrow uses packed_simd to speed up kernels with SIMD operations. SIMD is an optional feature = "nightly", and requires a nightly compiler. If you don’t need SIMD, Polars runs on stable!

§API

Polars supports an eager and a lazy API. The eager API directly yields results, but is overall more verbose and less capable of building elegant composite queries. We recommend using the lazy API whenever you can.

As neither API is async, they should be wrapped in spawn_blocking when used in an async context to avoid blocking the runtime’s async thread pool.
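For example, with Tokio (a sketch; the runtime choice and the file name are assumptions, not part of Polars), a blocking collect can be moved off the async executor:

```rust
use polars::prelude::*;

// Move the blocking collect off the async executor's thread pool.
async fn load_frame() -> PolarsResult<DataFrame> {
    tokio::task::spawn_blocking(|| {
        LazyFrame::scan_parquet("myfile.parquet", Default::default())?.collect()
    })
    .await
    .expect("spawn_blocking task panicked")
}
```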

§Expressions

Polars has a powerful concept called expressions. Polars expressions can be used in various contexts and are a functional mapping of Fn(Series) -> Series, meaning that they have Series as input and Series as output. By looking at this functional definition, we can see that the output of an Expr can also serve as the input of an Expr.

That may sound a bit strange, so let’s give an example. The following is an expression:

col("foo").sort().head(2)

The snippet above says select column "foo", then sort this column, then take the first 2 values of the sorted output. The power of expressions is that every expression produces a new expression and that they can be piped together. You can run an expression by passing it to one of Polars’ execution contexts. Here we run two expressions in the select context:

  df.lazy()
   .select([
       col("foo").sort(Default::default()).head(None),
       col("bar").filter(col("foo").eq(lit(1))).sum(),
   ])
   .collect()?;

All expressions are run in parallel, meaning that separate polars expressions are embarrassingly parallel. (Note that within an expression there may be more parallelization going on).

Understanding Polars expressions is most important when starting with the Polars library. Read more about them in the user guide.

§Eager

Read more in the pages of the following data structures / traits.

§Lazy

Unlock full potential with lazy computation. This allows query optimizations and provides Polars the full query context so that the fastest algorithm can be chosen.

Read more in the lazy module.

§Compile times

A DataFrame library typically consists of a large number of supported data types and a large number of algorithms that operate on them.

Both of these really put strain on compile times. To keep Polars lean, we make both opt-in, meaning that you only pay the compilation cost if you need it.

§Compile times and opt-in features

The opt-in features are (not including dtype features):

§Compile times and opt-in data types

As mentioned above, Polars Series are wrappers around ChunkedArray without the generic parameter T. To get rid of the generic parameter, all the possible values of T are compiled for Series. This gets more expensive the more types you want for a Series. In order to reduce compile times, we default to a minimal set of types and make additional Series types opt-in.

Note that if you get strange compile time errors, you probably need to opt-in for that Series dtype. The opt-in dtypes are:

data type    feature flag
Date         dtype-date
Datetime     dtype-datetime
Time         dtype-time
Duration     dtype-duration
Int8         dtype-i8
Int16        dtype-i16
UInt8        dtype-u8
UInt16       dtype-u16
Categorical  dtype-categorical
Struct       dtype-struct

Or you can choose one of the preconfigured presets.
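For example, a Cargo.toml entry enabling a few of these dtype features might look like this (the version number is illustrative):

```toml
[dependencies]
polars = { version = "0.41", features = ["lazy", "dtype-date", "dtype-categorical"] }
```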

§Performance

To get the best performance out of Polars we recommend compiling on a nightly compiler with the features simd and performant activated. The activated CPU features also influence the amount of SIMD acceleration we can use.

See the features we activate for our Python builds, or, if you just run locally and want to use all available features on your CPU, set RUSTFLAGS='-C target-cpu=native'.

§Custom allocator

An OLAP query engine does a lot of heap allocations. It is recommended to use a custom allocator; we have found this to influence runtime by up to ~25%. JeMalloc and Mimalloc, for instance, show a significant performance gain in both runtime and memory usage.

§Jemalloc Usage

use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;
§Cargo.toml
[dependencies]
tikv-jemallocator = { version = "*" }
§Mimalloc Usage

use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
§Cargo.toml
[dependencies]
mimalloc = { version = "*", default-features = false }
§Notes

Benchmarks have shown that on Linux and macOS JeMalloc outperforms Mimalloc on all tasks and is therefore the default allocator used for the Python bindings on Unix platforms.

§Config with ENV vars
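Several POLARS_* environment variables are read at runtime. For example, POLARS_FMT_MAX_ROWS caps how many DataFrame rows are printed. A minimal stdlib-only sketch (the helper function name is illustrative):

```rust
// POLARS_FMT_MAX_ROWS controls how many rows Polars prints per DataFrame.
// NOTE: `std::env::set_var` is `unsafe` as of the 2024 edition because it
// can race with other threads reading the environment, so call this early
// in `main`, before spawning threads.
fn limit_printed_rows(n: usize) {
    unsafe {
        std::env::set_var("POLARS_FMT_MAX_ROWS", n.to_string());
    }
}
```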

§User guide

If you want to read more, check the user guide.

pub use polars_io as io; polars-io

pub use polars_lazy as lazy; lazy

pub use polars_time as time; temporal

chunked_array

The typed heart of every Series column.

datatypes

Data types supported by Polars.

docs

error

frame

DataFrame module.

functions

Functions

prelude

series

Type agnostic columnar data structure.

testing

Testing utilities.

apply_method_all_arrow_series

df

VERSION

Polars crate version

enable_string_cache (dtype-categorical)

Enable the global string cache.

using_string_cache (dtype-categorical)

Check whether the global string cache is enabled.