GitHub - apache/datafusion-ray: Apache DataFusion Ray (original) (raw)
DataFusion for Ray
Overview
DataFusion for Ray is a distributed execution framework that enables DataFusion DataFrame and SQL queries to run on a Ray cluster. This integration allows users to leverage Ray's dynamic scheduling capabilities while executing queries in a distributed fashion.
Execution Modes
DataFusion for Ray supports two execution modes:
Streaming Execution
This mode mimics the default execution strategy of DataFusion. Each operator in the query plan starts executing as soon as its inputs are available, leading to a more pipelined execution model.
Batch Execution
Note: Batch Execution is not implemented yet. Tracking issue: #69
In this mode, execution follows a staged model similar to Apache Spark. Each query stage runs to completion, producing intermediate shuffle files that are persisted and used as input for the next stage.
Getting Started
See the contributor guide for instructions on building DataFusion for Ray.
Once installed, you can run queries using DataFusion's familiar API while leveraging the distributed execution capabilities of Ray.
from example in ./examples/http_csv.py
import ray from datafusion_ray import DFRayContext, df_ray_runtime_env
ray.init(runtime_env=df_ray_runtime_env)
ctx = DFRayContext() ctx.register_csv( "aggregate_test_100", "https://github.com/apache/arrow-testing/raw/master/data/csv/aggregate_test_100.csv", )
df = ctx.sql("SELECT c1,c2,c3 FROM aggregate_test_100 LIMIT 5")
df.show()
Contributing
Contributions are welcome! Please open an issue or submit a pull request if you would like to contribute. See thecontributor guide for more information.
License
DataFusion for Ray is licensed under Apache 2.0.