Apache Spark on AWS EKS
Creating an EKS cluster with eksctl and deploying Apache Spark on EKS using a Helm chart
Introduction
Apache Spark is an open-source analytical engine for executing data engineering, data science, and machine learning workloads on single-node machines or clusters. Apache Spark natively supports Python, Java, R, and Scala, giving you a variety of languages for building your applications. Here we deploy Apache Spark with Helm, the package manager for Kubernetes.
With spark-submit, we can submit applications directly to the Kubernetes cluster. Spark creates a Spark driver running within a Kubernetes pod. The driver creates executors, which also run within Kubernetes pods, connects to them, and executes the application code. When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in a “completed” state in the Kubernetes API until it is eventually garbage collected or manually cleaned up.
Step 1: Creating Kubernetes cluster with EKS
First, we create a cluster.yaml file, which defines the EKS cluster, node groups, and relevant IAM roles.
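As a minimal sketch, such a cluster.yaml might look like the following; the cluster name, region, and node group settings here are assumptions and should be adjusted to your environment:

```yaml
# Hypothetical cluster.yaml — name, region, and sizes are placeholders
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: spark-cluster   # assumed cluster name
  region: us-west-1     # matches the region used later in this article

nodeGroups:
  - name: spark-nodes
    instanceType: m5.xlarge   # memory-heavy instances suit Spark executors
    desiredCapacity: 3
    minSize: 1
    maxSize: 4
```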
Once cluster.yaml is in place, use the following command to create the EKS cluster.
eksctl create cluster -f cluster.yaml
This command creates the cluster, node groups, and roles through CloudFormation stacks. Execution takes around 15 minutes to complete. In case of errors, checking the CloudFormation stack events can help. The expected output is shown below.
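The same stack information can also be pulled from the CLI instead of the console (assuming a cluster named spark-cluster in us-west-1, as in the example config above):

```shell
# Inspect the CloudFormation stacks eksctl created for this cluster
eksctl utils describe-stacks --region us-west-1 --cluster spark-cluster
```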
If we navigate to CloudFormation, we can see the stacks completed.
If we navigate to the EC2 console, we can see our instances deployed and in the running state.
If we check the terminal logs, we can see that the kubeconfig has been written to ~/.kube/config in the home directory.
We can check cluster access by viewing the cluster info or listing the nodes:
kubectl cluster-info
kubectl get nodes
Now we have our cluster in place, so let's move on to the next step: deploying Apache Spark.
Step 2: Deploy Apache Spark on EKS Cluster
We deploy Apache Spark with a Helm chart: specifically, the Apache Spark chart packaged by Bitnami.
First, we need to add the Bitnami repo to Helm.
helm repo add bitnami https://charts.bitnami.com/bitnami
(Optional) Then we can verify the repo with
helm repo list
Then install the chart:
helm install myspark bitnami/spark
Here myspark is the release name and can be changed.
To confirm Spark is deployed and the pods are running, we can list the pods:
kubectl get pods
Now let's port-forward and view the Spark console at http://localhost:8080:
kubectl port-forward --namespace default svc/myspark-master-svc 8080:80
Now that our Spark cluster is running, let's try spark-submit.
Step 3: Submitting a Spark job
To submit an application to the Apache Spark cluster, use the spark-submit script. We need to download Apache Spark locally to get spark-submit. A generic form of the command looks like this:
spark-submit \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.container.image=bitnami/spark:3 \
--master k8s://https://k8s-apiserver-host:k8s-apiserver-port \
--conf spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://spark-master-svc:spark-master-port \
--deploy-mode cluster \
./examples/jars/spark-examples_2.12-3.2.0.jar 1000
As we have created a Kubernetes cluster, one way to discover the k8s-apiserver-host URL is by executing kubectl cluster-info.
$ kubectl cluster-info
Kubernetes control plane is running at https://EC01035D6C8E74A267443B86721830FE.sk1.us-west-1.eks.amazonaws.com
CoreDNS is running at https://EC01035D6C8E74A267443B86721830FE.sk1.us-west-1.eks.amazonaws.com/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
In the above example, this Kubernetes cluster can be used with spark-submit by specifying --master k8s://https://EC01035D6C8E74A267443B86721830FE.sk1.us-west-1.eks.amazonaws.com:443 as an argument.
SPARK_MASTER_URL can be captured from the Spark console.
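If you prefer the CLI to the console, the same URL can be assembled from the headless service that the Bitnami chart creates (the service name below assumes the default chart values and the myspark release name):

```shell
# List the services created by the Helm release; the headless service
# (myspark-headless) fronts the master pod on port 7077
kubectl get svc --namespace default

# The master URL then follows the pattern
#   spark://<master-pod>.<headless-svc>.<namespace>.svc.cluster.local:<port>
# e.g. spark://myspark-master-0.myspark-headless.default.svc.cluster.local:7077
```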
Here we use the bitnami/spark:3 Docker image, which already contains spark-examples_2.12-3.3.0.jar in the /opt/bitnami/spark/examples/jars directory, so the command we use is as follows:
spark-submit \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.container.image=bitnami/spark:3 \
--master k8s://https://EC01035D6C8E74A267443B86721830FE.sk1.us-west-1.eks.amazonaws.com:443 \
--conf spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://myspark-master-0.myspark-headless.default.svc.cluster.local:7077 \
--deploy-mode cluster \
local:///opt/bitnami/spark/examples/jars/spark-examples_2.12-3.3.0.jar 1000
The submission logs will be shown in real time.
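While the job runs, we can also watch the pods Spark creates. Spark on Kubernetes labels them with a spark-role label; the driver pod name below is illustrative and will differ in your cluster:

```shell
# Driver and executor pods are labelled by Spark on Kubernetes
kubectl get pods -l spark-role=driver
kubectl get pods -l spark-role=executor

# Tail the driver logs to follow the job output (actual pod name will differ)
kubectl logs -f spark-pi-<id>-driver
```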
Now, in the Spark UI, we can see the job listed under completed applications.
Since we are on EKS, we can also make use of Amazon ECR to store our Docker images.
References
- Running Spark on Kubernetes
- Deploying Spark jobs on Amazon EKS
- Best practices for running Spark on Amazon EKS