Introduction (Apache DataFusion documentation)

DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format. DataFusion originated as part of the Apache Arrow project.

DataFusion offers SQL and DataFrame APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, Python bindings, extensive customization, a great community, and more.

Project Goals

DataFusion aims to be the query engine of choice for new, fast data-centric systems such as databases, dataframe libraries, machine learning, and streaming applications by leveraging the unique features of Rust and Apache Arrow.

Features

Use Cases

DataFusion can be used without modification as an embedded SQL engine or can be customized and used as a foundation for building new systems.

While most current use cases are “analytic” (throughput-oriented), some components of DataFusion, such as the plan representations, are also suitable for “streaming” and “transaction” style systems (low latency).

Here are some example systems built using DataFusion:

By using DataFusion, projects are free to focus on their specific features and avoid reimplementing general (but still necessary) features such as an expression representation, standard optimizations, parallelized streaming execution plans, file format support, etc.

Known Users

Here are some active projects using DataFusion:

Here are some less active projects that used DataFusion:

Integrations and Extensions

There are a number of community projects that extend DataFusion or provide integrations with other systems, some of which are described below:

Language Bindings

Integrations

Why DataFusion?