Apache Spark on AWS EKS

Creating an EKS cluster with eksctl and deploying Apache Spark on EKS using a Helm chart

--

Introduction

Apache Spark is an open-source analytical engine for executing data engineering, data science, and machine learning on single-node machines or clusters. Apache Spark natively supports Python, Java, R, and Scala, giving you a variety of languages for building your applications. Here we deploy Apache Spark with Helm, the package manager for Kubernetes.

Image from Apache Spark on Kubernetes

With spark-submit we can submit applications directly to the Kubernetes cluster. Spark creates a driver running within a Kubernetes pod. The driver creates executors, which also run within Kubernetes pods, connects to them, and executes the application code. When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in a “completed” state in the Kubernetes API until it is eventually garbage collected or manually cleaned up.

Step 1: Creating Kubernetes cluster with EKS

First, we create a cluster.yaml file, which is used to create the EKS cluster, the node groups, and the relevant IAM roles.
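The exact cluster.yaml is not shown here, but a minimal sketch of what it could look like follows; the cluster name, node group name, instance type, and node counts are illustrative assumptions (the region matches the us-west-1 endpoint that appears later in this post).

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: spark-cluster      # assumed cluster name; use your own
  region: us-west-1

nodeGroups:
  - name: spark-nodes      # assumed node group name
    instanceType: m5.xlarge
    desiredCapacity: 2
    minSize: 1
    maxSize: 3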

Once cluster.yaml is in place, use the following command to create the EKS cluster.

eksctl create cluster -f cluster.yaml

This command creates CloudFormation stacks, which in turn create the cluster, node groups, and roles. It takes around 15 minutes to complete. In case of errors, checking the CloudFormation stack events can help. The expected output is shown below.

outputs while creating the cluster

If we navigate to CloudFormation, we can see that the stacks have completed.

CloudFormation stacks completed execution
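The stack events can also be inspected from the CLI. A hedged example, assuming eksctl's default stack naming convention of eksctl-<cluster-name>-cluster and the cluster name from the sketch above:

# List the most recent stack events for the eksctl-created cluster stack
aws cloudformation describe-stack-events \
  --stack-name eksctl-spark-cluster-cluster \
  --max-items 20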

If we navigate to the EC2 console, we can see our instances deployed and in the running state.

Node groups created with eksctl

If we check the terminal logs, we can see that the kubeconfig has been written to ~/.kube/config.

eksctl command execution completes

We can check cluster access by running cluster-info or listing the nodes.

kubectl cluster-info
kubectl get nodes
checking cluster access

Now that we have our cluster in place, let's move on to the next step: deploying Apache Spark.

Step 2: Deploy Apache Spark on EKS Cluster

We deploy Apache Spark with a Helm chart. Helm is the package manager for Kubernetes, and with it we will deploy Apache Spark packaged by Bitnami.

First, we need to add the Bitnami repo to Helm.

helm repo add bitnami https://charts.bitnami.com/bitnami

(Optional) We can then verify the repo with:

helm repo list

Then install the chart:

helm install myspark bitnami/spark

myspark is the release name and can be changed (more info).

helm chart installed
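If we want to customize the deployment, chart values can be overridden at install time. A hedged example, assuming the chart exposes a worker.replicaCount parameter (run helm show values bitnami/spark to confirm the current parameter names for your chart version):

# Inspect the configurable values exposed by the chart
helm show values bitnami/spark

# Install with three worker replicas (parameter name assumed; verify against the chart docs)
helm install myspark bitnami/spark --set worker.replicaCount=3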

To confirm Spark is deployed and the pods are running, we can list the pods.

kubectl get pods
listing pods

Now let's port-forward and view the Spark UI.

kubectl port-forward --namespace default svc/myspark-master-svc 8080:80
Spark UI

Now that our Spark cluster is running, let's try spark-submit.

Step 3: Submitting a Spark job

To submit an application to the Apache Spark cluster, use the spark-submit script. We need to download Apache Spark locally to get spark-submit.
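One way to get it locally, assuming the Spark 3.3.0 build with Hadoop 3 binaries (adjust the version to match your cluster):

# Download and unpack Apache Spark so that ./bin/spark-submit is available
wget https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
tar -xzf spark-3.3.0-bin-hadoop3.tgz
cd spark-3.3.0-bin-hadoop3

With spark-submit available, the general shape of the command for a Kubernetes cluster looks like this: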

spark-submit \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.container.image=bitnami/spark:3 \
--master k8s://https://k8s-apiserver-host:k8s-apiserver-port \
--conf spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://spark-master-svc:spark-master-port \
--deploy-mode cluster \
./examples/jars/spark-examples_2.12-3.2.0.jar 1000

Since we have already created the Kubernetes cluster, one way to discover the k8s-apiserver-host URL is by running kubectl cluster-info.

$ kubectl cluster-info
Kubernetes control plane is running at https://EC01035D6C8E74A267443B86721830FE.sk1.us-west-1.eks.amazonaws.com
CoreDNS is running at https://EC01035D6C8E74A267443B86721830FE.sk1.us-west-1.eks.amazonaws.com/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

In the above example, this cluster can be used with spark-submit by specifying --master k8s://https://EC01035D6C8E74A267443B86721830FE.sk1.us-west-1.eks.amazonaws.com:443.
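Alternatively, the API server URL can be read straight from the active kubeconfig:

kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'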

The SPARK_MASTER_URL can be captured from the Spark UI.

SPARK_MASTER_URL
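It can also be derived from the cluster itself: the chart exposes the master pod through a headless service, giving a URL of the form spark://<master-pod>.<headless-service>.<namespace>.svc.cluster.local:7077. A quick way to check the service and pod names (release name myspark assumed, as above; the label selector follows common Bitnami chart conventions):

kubectl get svc myspark-master-svc myspark-headless
kubectl get pods -l app.kubernetes.io/component=master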

Here we use the bitnami/spark:3 Docker image, which already has spark-examples_2.12-3.3.0.jar in the /opt/bitnami/spark/examples/jars directory, so the command we use is as follows:

spark-submit \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.container.image=bitnami/spark:3 \
--master k8s://https://EC01035D6C8E74A267443B86721830FE.sk1.us-west-1.eks.amazonaws.com:443 \
--conf spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://myspark-master-0.myspark-headless.default.svc.cluster.local:7077 \
--deploy-mode cluster \
local:///opt/bitnami/spark/examples/jars/spark-examples_2.12-3.3.0.jar 1000

This streams the submission logs in real time.

Logs ending
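We can also watch the driver and executor pods directly with kubectl; the driver pod name is generated from the application name, so replace the placeholder below with the actual name from kubectl get pods:

# Watch pods being created and terminated during the run
kubectl get pods -w

# Tail the driver logs (replace the placeholder with the real driver pod name)
kubectl logs -f <driver-pod-name>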

Now, in the Spark UI, the job shows up under Completed Applications.

It is added to completed applications

Since we are using EKS, we can also make use of Amazon ECR to store our Docker images.
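A hedged sketch of pushing a custom Spark image to ECR (the account ID, repository name, and image name are placeholders):

# Create a repository and log Docker in to ECR
aws ecr create-repository --repository-name spark-custom
aws ecr get-login-password --region us-west-1 | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-west-1.amazonaws.com

# Tag and push a locally built Spark image
docker tag my-spark:latest <account-id>.dkr.ecr.us-west-1.amazonaws.com/spark-custom:latest
docker push <account-id>.dkr.ecr.us-west-1.amazonaws.com/spark-custom:latest

The pushed image can then be passed to spark-submit via spark.kubernetes.container.image in place of bitnami/spark:3.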

