Running Apache Spark Applications in Docker Containers


Apache Spark is a wonderful tool for distributed computations. However, some preparation steps are required on the machine where the application will run. Assuming that your Spark cluster is already configured and ready, you still have to do the following on your workstation:

- Install a JDK and a build tool such as sbt or Maven
- Install a Git client to check out the application source code
- Download a Spark distribution matching your cluster version to get the spark-submit tool

Then you have to check out the source code from the repository, build the binary, and submit it to the Spark cluster with the spark-submit tool. It should be clear by now that one cannot simply run an Apache Spark application... right?
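
For concreteness, that manual workflow looks roughly like this. The sketch assumes an sbt project; the jar path and Scala version are only illustrative, while the repository, cluster URL, and main class reuse the example shown later in this article:

git clone https://github.com/mylogin/project.git
cd project
sbt package
# The jar path below is typical sbt output and only a guess; adjust to your build:
spark-submit \
  --master spark://my.master.com:7077 \
  --class Main \
  target/scala-2.11/project_2.11-0.1.jar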

Wrong! If you have the URL of the application source code and the URL of the Spark cluster, you can just run the application.

Let’s confine all the complexity to a Docker container: docker-spark-submit. This Docker image serves as a bridge between the application source code and the Spark runtime environment, covering all intermediate steps.

Running the application from a container provides the following benefits:

- Nothing to install on your workstation except Docker itself
- An isolated, reproducible environment for checking out, building, and submitting the application
- No leftovers on the workstation once the container exits (note the --rm flag in the example below)

Here is an example of typical usage:

docker run \
  -ti \
  --rm \
  -p 5000-5010:5000-5010 \
  -e SCM_URL="https://github.com/mylogin/project.git" \
  -e SPARK_MASTER="spark://my.master.com:7077" \
  -e SPARK_DRIVER_HOST="host.domain" \
  -e MAIN_CLASS="Main" \
  tashoyan/docker-spark-submit:spark-2.2.0

The parameters SCM_URL, SPARK_MASTER, and MAIN_CLASS are self-explanatory. The other, less intuitive but important, parameters are described below.

tashoyan/docker-spark-submit:spark-2.2.0

Choose the tag of the container image based on the version of your Spark cluster. In this example, Spark 2.2.0 is assumed.
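
If you are not sure which version your cluster runs, the Spark master web UI shows it in its page header; then pull the image tag that matches. The master URL and the default web UI port 8080 below reuse the example above and are assumptions about your setup:

# The master web UI (by default on port 8080 of the master host) shows the
# cluster's Spark version in its header: http://my.master.com:8080
# Pull the image tag that matches that version:
docker pull tashoyan/docker-spark-submit:spark-2.2.0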

-p 5000-5010:5000-5010

It is necessary to publish this range of network ports. The Spark driver program and Spark executors use these ports for communication.
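
Inside the container, the driver-side ports have to be pinned into this published range. A minimal sketch of the equivalent spark-submit settings follows; the exact values and options applied by the image are an assumption, and the jar name is a placeholder:

# Pin the driver and block manager ports into the published 5000-5010 range,
# allowing a few bind retries within that range (values are illustrative):
spark-submit \
  --conf spark.driver.port=5000 \
  --conf spark.blockManager.port=5001 \
  --conf spark.port.maxRetries=9 \
  --class Main \
  application.jar  # placeholder jar name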

-e SPARK_DRIVER_HOST="host.domain"

You have to specify the network address of the host machine where the container will be running. Spark cluster nodes must be able to resolve this address; otherwise, the executors cannot communicate with the driver program. For detailed technical information, see SPARK-4563.
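
In plain spark-submit terms, this roughly corresponds to advertising the host address to the cluster while binding to all interfaces inside the container. This is only a sketch of what the image is expected to do, with a placeholder jar name:

# Advertise the host machine's address to the executors, but bind inside the
# container (spark.driver.bindAddress was introduced by SPARK-4563):
spark-submit \
  --conf spark.driver.host=host.domain \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --class Main \
  application.jar  # placeholder jar name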

Detailed instructions, as well as some examples, are available on the project page on GitHub.

To conclude, let me emphasize that docker-spark-submit is not intended for continuous integration. The intended usage is to let people quickly try Spark applications, saving them from configuration overhead. CI practices assume separate stages for building, testing, and deploying; docker-spark-submit does not follow these practices.
