Deploying Ballista with Kubernetes — Apache DataFusion Ballista documentation (original) (raw)

Ballista can be deployed to any Kubernetes cluster using the following instructions. These instructions assume that you are already comfortable managing Kubernetes deployments.

The Ballista deployment consists of:

k8s deployment for one or more scheduler processes
k8s deployment for one or more executor processes
k8s service to route traffic to the schedulers
k8s persistent volume and persistent volume claims to make local data accessible to Ballista
(optional) a keda instance for autoscaling the number of executors

Testing Locally¶

Microk8s is recommended for installing a local k8s cluster. Once Microk8s is installed, DNS must be enabled using the following command.

Build Docker Images¶

Run the following commands to download the official Docker image:

docker pull ghcr.io/apache/datafusion-ballista-standalone:0.12.0-rc4

Altenatively run the following commands to clone the source repository and build the Docker images from source:

git clone git@github.com:apache/datafusion-ballista.git -b 0.12.0 cd datafusion-ballista ./dev/build-ballista-docker.sh

This will create the following images:

apache/datafusion-ballista-benchmarks:0.12.0
apache/datafusion-ballista-cli:0.12.0
apache/datafusion-ballista-executor:0.12.0
apache/datafusion-ballista-scheduler:0.12.0
apache/datafusion-ballista-standalone:0.12.0

Publishing Docker Images¶

Once the images have been built, you can retag them and can push them to your favourite Docker registry.

docker tag apache/datafusion-ballista-scheduler:0.12.0 /datafusion-ballista-scheduler:0.12.0 docker tag apache/datafusion-ballista-executor:0.12.0 /datafusion-ballista-executor:0.12.0 docker push /datafusion-ballista-scheduler:0.12.0 docker push /datafusion-ballista-executor:0.12.0

Create Persistent Volume and Persistent Volume Claim¶

Copy the following yaml to a pv.yaml file and apply to the cluster to create a persistent volume and a persistent volume claim so that the specified host directory is available to the containers. This is where any data should be located so that Ballista can execute queries against it.

apiVersion: v1 kind: PersistentVolume metadata: name: data-pv labels: type: local spec: storageClassName: manual capacity: storage: 10Gi accessModes: - ReadWriteOnce hostPath: path: "/mnt"

apiVersion: v1 kind: PersistentVolumeClaim metadata: name: data-pv-claim spec: storageClassName: manual accessModes: - ReadWriteOnce resources: requests: storage: 3Gi

To apply this yaml:

You should see the following output:

persistentvolume/data-pv created persistentvolumeclaim/data-pv-claim created

Deploying a Ballista Cluster¶

Copy the following yaml to a cluster.yaml file and change <your-image> with the name of your Ballista Docker image.

apiVersion: v1 kind: Service metadata: name: ballista-scheduler labels: app: ballista-scheduler spec: ports: - port: 50050 name: scheduler - port: 80 name: scheduler-ui selector: app: ballista-scheduler

apiVersion: apps/v1 kind: Deployment metadata: name: ballista-scheduler spec: replicas: 1 selector: matchLabels: app: ballista-scheduler template: metadata: labels: app: ballista-scheduler ballista-cluster: ballista spec: containers: - name: ballista-scheduler image: /datafusion-ballista-scheduler:0.12.0 args: ["--bind-port=50050"] ports: - containerPort: 50050 name: flight volumeMounts: - mountPath: /mnt name: data volumes: - name: data persistentVolumeClaim: claimName: data-pv-claim

apiVersion: apps/v1 kind: Deployment metadata: name: ballista-executor spec: replicas: 2 selector: matchLabels: app: ballista-executor template: metadata: labels: app: ballista-executor ballista-cluster: ballista spec: containers: - name: ballista-executor image: /datafusion-ballista-executor:0.12.0 args: - "--bind-port=50051" - "--scheduler-host=ballista-scheduler" - "--scheduler-port=50050" ports: - containerPort: 50051 name: flight volumeMounts: - mountPath: /mnt name: data volumes: - name: data persistentVolumeClaim: claimName: data-pv-claim

kubectl apply -f cluster.yaml

This should show the following output:

service/ballista-scheduler created deployment.apps/ballista-scheduler created deployment.apps/ballista-executor created

You can also check status by running kubectl get pods:

$ kubectl get pods NAME READY STATUS RESTARTS AGE ballista-executor-78cc5b6486-4rkn4 0/1 Pending 0 42s ballista-executor-78cc5b6486-7crdm 0/1 Pending 0 42s ballista-scheduler-879f874c5-rnbd6 0/1 Pending 0 42s

You can view the scheduler logs with kubectl logs ballista-scheduler-0:

$ kubectl logs ballista-scheduler-0 [2021-02-19T00:24:01Z INFO scheduler] Ballista v0.7.0 Scheduler listening on 0.0.0.0:50050 [2021-02-19T00:24:16Z INFO ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "b5e81711-1c5c-46ec-8522-d8b359793188", host: "10.1.23.149", port: 50051 } [2021-02-19T00:24:17Z INFO ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "816e4502-a876-4ed8-b33f-86d243dcf63f", host: "10.1.23.150", port: 50051 }

Port Forwarding¶

If you want to run applications outside of the cluster and have them connect to the scheduler then it is necessary to set up port forwarding.

First, check that the ballista-scheduler service is running.

$ kubectl get services NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE kubernetes ClusterIP 10.152.183.1 443/TCP 26h ballista-scheduler ClusterIP 10.152.183.21 50050/TCP 24m

Use the following command to set up port-forwarding.

kubectl port-forward service/ballista-scheduler 50050:50050

Deleting the Ballista Cluster¶

Run the following kubectl command to delete the cluster.

kubectl delete -f cluster.yaml

Autoscaling Executors¶

Ballista supports autoscaling for executors through Keda. Keda allows scaling a deployment through custom metrics which are exposed through the Ballista scheduler, and it can even scale the number of executors down to 0 if there is no activity in the cluster.

Keda can be installed in your kubernetes cluster through a single command line:

kubectl apply -f https://github.com/kedacore/keda/releases/download/v2.7.1/keda-2.7.1.yaml

Once you have deployed Keda on your cluster, you can now deploy a new kubernetes object called ScaledObjectwhich will let Keda know how to scale your executors. In order to do that, copy the following YAML into ascale.yaml file:

apiVersion: keda.sh/v1alpha1 kind: ScaledObject metadata: name: ballista-executor spec: scaleTargetRef: name: ballista-executor minReplicaCount: 0 maxReplicaCount: 5 triggers: - type: external metadata: # Change this DNS if the scheduler isn't deployed in the "default" namespace scalerAddress: ballista-scheduler.default.svc.cluster.local:50050

And then deploy it into the cluster:

kubectl apply -f scale.yaml

If the cluster is inactive, Keda will now scale the number of executors down to 0, and will scale them up when you launch a query. Please note that Keda will perform a scan once every 30 seconds, so it might take a bit to scale the executors.

Please visit Keda’s documentation page for more information.