Apache Spark on AWS EKS

Creating an EKS cluster with eksctl and deploying Apache Spark on EKS using a Helm chart

--

Introduction

Apache Spark is an open-source analytical engine for executing data engineering, data science, and machine learning on single-node machines or clusters. Apache Spark natively supports Python, Java, R, and Scala, giving you a variety of languages for building your applications. Here we deploy Apache Spark with Helm, the package manager for Kubernetes.

Image from Apache Spark on Kubernetes

With spark-submit we can submit applications directly to the Kubernetes cluster. Spark creates a driver running within a Kubernetes pod. The driver creates executors, which also run within Kubernetes pods, connects to them, and executes the application code. When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in a “completed” state in the Kubernetes API until it is eventually garbage collected or manually cleaned up.

Step 1: Creating Kubernetes cluster with EKS

First, we create a cluster.yaml file, which is used to create the EKS cluster, the node groups, and the relevant IAM roles.
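The exact cluster.yaml is not shown here, but a minimal sketch of what it could look like follows; the cluster name, node group name, instance type, and node counts are illustrative assumptions (the region matches the us-west-1 endpoint that appears later in this post).

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: spark-cluster      # assumed cluster name; use your own
  region: us-west-1

nodeGroups:
  - name: spark-nodes      # assumed node group name
    instanceType: m5.xlarge
    desiredCapacity: 2
    minSize: 1
    maxSize: 3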

Once cluster.yaml is in place, use the following command to create the EKS cluster.

eksctl create cluster -f cluster.yaml

This command creates CloudFormation stacks, which in turn create the cluster, node groups, and roles. It takes around 15 minutes to complete. In case of errors, checking the CloudFormation stack events can help. The expected output is shown below.

outputs while creating the cluster

If we navigate to CloudFormation, we can see that the stacks have completed.

CloudFormation stacks completed execution
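The stack events can also be inspected from the CLI. A hedged example, assuming eksctl's default stack naming convention of eksctl-<cluster-name>-cluster and the cluster name from the sketch above:

# List the most recent stack events for the eksctl-created cluster stack
aws cloudformation describe-stack-events \
  --stack-name eksctl-spark-cluster-cluster \
  --max-items 20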

If we navigate to the EC2 console, we can see our instances deployed and in the running state.

Node groups created with eksctl

If we check the terminal logs, we can see that the kubeconfig has been written to ~/.kube/config.

eksctl command execution completes

We can check cluster access by running cluster-info or listing the nodes.

kubectl cluster-info
kubectl get nodes
checking cluster access

Now that we have our cluster in place, let's move on to the next step: deploying Apache Spark.

Step 2: Deploy Apache Spark on EKS Cluster

We deploy Apache Spark with a Helm chart. Helm is the package manager for Kubernetes, and with it we will deploy Apache Spark packaged by Bitnami.

First, we need to add the Bitnami repo to Helm.

helm repo add bitnami https://charts.bitnami.com/bitnami

(Optional) We can then verify the repo with:

helm repo list

Then install the chart:

helm install myspark bitnami/spark

myspark is the release name and can be changed (more info).

helm chart installed
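If we want to customize the deployment, chart values can be overridden at install time. A hedged example, assuming the chart exposes a worker.replicaCount parameter (run helm show values bitnami/spark to confirm the current parameter names for your chart version):

# Inspect the configurable values exposed by the chart
helm show values bitnami/spark

# Install with three worker replicas (parameter name assumed; verify against the chart docs)
helm install myspark bitnami/spark --set worker.replicaCount=3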

To confirm Spark is deployed and the pods are running, we can list the pods.

kubectl get pods
listing pods

Now let's port-forward and view the Spark UI.

kubectl port-forward --namespace default svc/myspark-master-svc 8080:80
Spark UI

Now that our Spark cluster is running, let's try spark-submit.

Step 3: Submitting a Spark job

To submit an application to the Apache Spark cluster, use the spark-submit script. We need to download Apache Spark locally to get spark-submit.
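One way to get it locally, assuming the Spark 3.3.0 build with Hadoop 3 binaries (adjust the version to match your cluster):

# Download and unpack Apache Spark so that ./bin/spark-submit is available
wget https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
tar -xzf spark-3.3.0-bin-hadoop3.tgz
cd spark-3.3.0-bin-hadoop3

With spark-submit available, the general shape of the command for a Kubernetes cluster looks like this: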

spark-submit \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.container.image=bitnami/spark:3 \
--master k8s://https://k8s-apiserver-host:k8s-apiserver-port \
--conf spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://spark-master-svc:spark-master-port \
--deploy-mode cluster \
./examples/jars/spark-examples_2.12-3.2.0.jar 1000

Since we have already created the Kubernetes cluster, one way to discover the k8s-apiserver-host URL is by running kubectl cluster-info.

$ kubectl cluster-info
Kubernetes control plane is running at https://EC01035D6C8E74A267443B86721830FE.sk1.us-west-1.eks.amazonaws.com
CoreDNS is running at https://EC01035D6C8E74A267443B86721830FE.sk1.us-west-1.eks.amazonaws.com/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

In the above example, this cluster can be used with spark-submit by specifying --master k8s://https://EC01035D6C8E74A267443B86721830FE.sk1.us-west-1.eks.amazonaws.com:443.
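Alternatively, the API server URL can be read straight from the active kubeconfig:

kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'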

The SPARK_MASTER_URL can be captured from the Spark UI.

SPARK_MASTER_URL
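It can also be derived from the cluster itself: the chart exposes the master pod through a headless service, giving a URL of the form spark://<master-pod>.<headless-service>.<namespace>.svc.cluster.local:7077. A quick way to check the service and pod names (release name myspark assumed, as above; the label selector follows common Bitnami chart conventions):

kubectl get svc myspark-master-svc myspark-headless
kubectl get pods -l app.kubernetes.io/component=master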

Here we use the bitnami/spark:3 Docker image, which already has spark-examples_2.12-3.3.0.jar in the /opt/bitnami/spark/examples/jars directory, so the command we use is as follows:

spark-submit \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.container.image=bitnami/spark:3 \
--master k8s://https://EC01035D6C8E74A267443B86721830FE.sk1.us-west-1.eks.amazonaws.com:443 \
--conf spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://myspark-master-0.myspark-headless.default.svc.cluster.local:7077 \
--deploy-mode cluster \
local:///opt/bitnami/spark/examples/jars/spark-examples_2.12-3.3.0.jar 1000

This streams the submission logs in real time.

Logs ending
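We can also watch the driver and executor pods directly with kubectl; the driver pod name is generated from the application name, so replace the placeholder below with the actual name from kubectl get pods:

# Watch pods being created and terminated during the run
kubectl get pods -w

# Tail the driver logs (replace the placeholder with the real driver pod name)
kubectl logs -f <driver-pod-name>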

Now, in the Spark UI, the job shows up under Completed Applications.

It is added to completed applications

Since we are using EKS, we can also make use of Amazon ECR to store our Docker images.
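A hedged sketch of pushing a custom Spark image to ECR (the account ID, repository name, and image name are placeholders):

# Create a repository and log Docker in to ECR
aws ecr create-repository --repository-name spark-custom
aws ecr get-login-password --region us-west-1 | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-west-1.amazonaws.com

# Tag and push a locally built Spark image
docker tag my-spark:latest <account-id>.dkr.ecr.us-west-1.amazonaws.com/spark-custom:latest
docker push <account-id>.dkr.ecr.us-west-1.amazonaws.com/spark-custom:latest

The pushed image can then be passed to spark-submit via spark.kubernetes.container.image in place of bitnami/spark:3.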

