
Running Spark on a Multi-node Local Kubernetes Cluster: A Step-by-Step Guide for Data Engineers

Introduction

Embarking on a journey to run Apache Spark on a local Kubernetes cluster? This comprehensive guide is crafted for data engineers who seek to leverage the power of Spark within a Kubernetes environment. Whether you’re a beginner or a seasoned engineer, these steps will ensure a smooth and efficient setup.

To complement this article, all the code and configuration files used in this guide are available in the GitHub repository: playground-spark-local-kubernetes.

Step 1: Installing kubectl

Why It’s Crucial: kubectl is your gateway to interacting with your Kubernetes cluster. Let’s get it up and running!

For example, if you are on Ubuntu or another Linux distribution that supports the snap package manager, kubectl is available as a snap application:

snap install kubectl --classic
kubectl version --client

Step 2: Embrace Kind for Local Clusters

Discover Kind: A powerful tool to spin up Kubernetes clusters locally using Docker containers.

Installation Guide: Get started by following the instructions in the Kind Quick Start guide; a sample installation is shown below.
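
For example, on Linux you can download a Kind release binary directly (the version and architecture below are examples; check the Quick Start page for the current release):

    # download a Kind release binary and put it on the PATH
    curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
    chmod +x ./kind
    sudo mv ./kind /usr/local/bin/kind
    kind version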

Step 3: Crafting a Multi-node Cluster

Creating a Kind Configuration File

  1. Craft a YAML File: Start with a new .yaml file to outline your cluster’s architecture.
  2. Define Multi-Node Structure: Include a control-plane node and multiple worker nodes. Here’s a sample snippet:
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
    - role: control-plane
      extraPortMappings:
        - containerPort: 4040   # Spark UI
          hostPort: 4040
        - containerPort: 18080  # Spark History Server
          hostPort: 18080
        - containerPort: 10000  # Spark Thrift Server
          hostPort: 10000
        - containerPort: 8081   # Spark Worker
          hostPort: 8081
    - role: worker
    - role: worker
    
  3. Save Your Configuration: Name it descriptively, like config-cluster.yaml.

Launching Your Customized Cluster

  1. Create your cluster based on your configuration file:
    kind create cluster --config config-cluster.yaml --name sparkcluster
    

    Expect a smooth setup and a final message indicating a successful creation:

    Creating cluster "sparkcluster" ...
     ✓ Ensuring node image (kindest/node:v1.27.3) 🖼
     ✓ Preparing nodes 📦 📦 📦  
     ✓ Writing configuration 📜 
     ✓ Starting control-plane 🕹️ 
     ✓ Installing CNI 🔌 
     ✓ Installing StorageClass 💾 
     ✓ Joining worker nodes 🚜 
    Set kubectl context to "kind-sparkcluster"
    You can now use your cluster with:
    
    kubectl cluster-info --context kind-sparkcluster
    
    Have a nice day! 👋
    

    Running kubectl cluster-info should print information about the cluster, indicating that it is working correctly:

    kubectl cluster-info --context kind-sparkcluster
    

    Finally, run kubectl get nodes and verify that all nodes are in the Ready state, as in the example below.
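
    For a three-node cluster named sparkcluster, the output should look roughly like this (ages and versions will differ):

    kubectl get nodes
    NAME                         STATUS   ROLES           AGE   VERSION
    sparkcluster-control-plane   Ready    control-plane   2m    v1.27.3
    sparkcluster-worker          Ready    <none>          2m    v1.27.3
    sparkcluster-worker2         Ready    <none>          2m    v1.27.3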

Step 4: Preparing for Apache Spark

Downloading Apache Spark

  1. Choose Your Spark Version: Visit Apache Spark Downloads and select a pre-built package for Hadoop.
  2. Extract and Move Spark to a desired location:
    tar xvf spark-<spark-version>-bin-hadoop3.tgz
    sudo mv spark-<spark-version>-bin-hadoop3 /usr/local/spark
    
  3. Environment Setup: Add Spark to your PATH for universal access, as in the example below.
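
A minimal sketch of that setup, assuming a bash shell and the /usr/local/spark location used above (append the lines to ~/.bashrc or your shell profile):

    # make spark-submit, spark-shell and the sbin scripts available in every shell
    export SPARK_HOME=/usr/local/spark
    export PATH="$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH"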

Dockerizing Spark for Kubernetes

Preparing the Docker Image

  1. Optional: Modify the Dockerfile located at /usr/local/spark/kubernetes/dockerfiles/spark before building the image; this helps avoid potential Kerberos authentication issues. For detailed instructions and code, refer to this repository.
  2. Build the Docker Image using docker-image-tool.sh, with a tag that matches the image referenced in the following steps (spark:3.5.0):
    cd /usr/local/spark
    ./bin/docker-image-tool.sh -t 3.5.0 build
    
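To confirm the image was built before loading it into Kind (the 3.5.0 tag assumes the -t value used above):

    # list the locally built Spark image; expect the spark repository with tag 3.5.0
    docker images spark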

Step 5: Injecting Spark into Kind

  1. Load Your Spark Image into the Kind cluster:
    kind load docker-image spark:3.5.0 --name sparkcluster
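
    To verify that the image is now available on the cluster nodes, you can inspect a node's container runtime (the node name assumes the cluster name sparkcluster):

    docker exec -it sparkcluster-control-plane crictl images | grep spark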
    

Step 6: Deploying Spark Master Node

  1. Create a YAML File for Spark Master: Name it spark-master-deployment.yaml and include your Spark image details.
    kind: Deployment
    apiVersion: apps/v1
    metadata:
      name: spark-master
    spec:
      replicas: 1
      selector:
        matchLabels:
          component: spark-master
      template:
        metadata:
          labels:
            component: spark-master
        spec:
          containers:
            - name: spark-master
              image: spark:3.5.0
              # run the master in the foreground (start-master.sh daemonizes, so the container would exit)
              command: [ "/opt/spark/bin/spark-class", "org.apache.spark.deploy.master.Master" ]
              ports:
                - containerPort: 7077
                - containerPort: 8080
              resources:
                requests:
                  cpu: 1000m
    
  2. Launch and Verify the Pod:
    kubectl apply -f spark-master-deployment.yaml
    kubectl get pods
    

Enabling Web Interface Access for Spark Master

  1. Identify the Pod and Forward Its Port: Find the master pod's name with kubectl get pods, then forward the web UI port (see the snippet after this list for a way to resolve the name automatically):
    kubectl port-forward pod/spark-master-xxxxx 8080:8080
    
  2. Access the Web Interface at http://localhost:8080.
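
The pod name carries a random suffix; a small sketch using the component=spark-master label from the deployment above resolves it automatically:

    # resolve the master pod's name via its label, then forward the web UI port
    POD=$(kubectl get pods -l component=spark-master -o jsonpath='{.items[0].metadata.name}')
    kubectl port-forward "$POD" 8080:8080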

Step 7: Configuring Spark Workers

  1. Deploy Worker Nodes that point at the Spark Master, then apply the manifest as shown below it:
    kind: Deployment
    apiVersion: apps/v1
    metadata:
      name: spark-worker
    spec:
      replicas: 2
      selector:
        matchLabels:
          component: spark-worker
      template:
        metadata:
          labels:
            component: spark-worker
        spec:
          containers:
            - name: spark-worker
              image: spark:3.5.0
              command: [ "/opt/spark/bin/spark-class", "org.apache.spark.deploy.worker.Worker" ]
              args: [ "spark://spark-master:7077" ] 
              ports:
                - containerPort: 8081
              resources:
                requests:
                  cpu: 1000m
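
    Save the manifest (e.g. as spark-worker-deployment.yaml, a filename assumed here) and apply it; note that the workers can only reach the master once the spark-master Service from the next section is in place:

    kubectl apply -f spark-worker-deployment.yaml
    kubectl get pods -l component=spark-worker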
    

Kubernetes Service for Spark Master

  1. Define and apply a Kubernetes Service: The service name must match the master URL in the workers’ configuration (spark://spark-master:7077), so that the worker nodes can reach the Spark Master; apply it as shown after this list.
    kind: Service
    apiVersion: v1
    metadata:
      name: spark-master
    spec:
      ports:
        - name: webui
          port: 8080
          targetPort: 8080
        - name: spark
          port: 7077
          targetPort: 7077
      selector:
        component: spark-master
    
    
  2. Access the Web Interface: Navigate again to http://localhost:8080 in your browser and check that the two workers are listed there.
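
A minimal way to apply and inspect the service (the manifest filename is an assumption):

    kubectl apply -f spark-master-service.yaml
    kubectl get svc spark-master    # should show ports 8080 and 7077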

Step 8: Testing the cluster

Executing a Spark Application Within the Cluster

  1. Enter the Spark shell and run a Spark job (replace spark-master-xxx with your master pod's name and x.x.x.x with its IP, shown by kubectl get pods -o wide):
    kubectl exec -n default spark-master-xxx \
    -it -- /opt/spark/bin/spark-shell \
    --conf spark.driver.bindAddress=x.x.x.x \
    --conf spark.driver.host=x.x.x.x
    
    spark.range(1000000).toDF("number").agg(sum("number")).show()
    

    Access the Spark UI at http://localhost:4040 to monitor the Spark job.

Launching a Spark Job with Kubernetes

  1. Configure a Role and RoleBinding for Spark applications (the two manifests below can live in one file separated by ---); apply them as shown after this list:
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: read-pods
      namespace: default
    subjects:
    - kind: ServiceAccount
      name: default
      namespace: default
    roleRef:
      kind: Role
      name: pod-reader
      apiGroup: rbac.authorization.k8s.io
    
    
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      namespace: default
      name: pod-reader
    rules:
    - apiGroups: [""]
      resources: ["pods", "configmaps", "services", "persistentvolumeclaims"]
      verbs: ["get", "watch", "list", "create", "delete", "deletecollection"]
    
    
  2. Execute a Spark Job, for example from a shell inside the spark-master pod (10.244.2.2 is the master pod's IP in this setup; within the cluster the service address spark://spark-master:7077 should also work):
    /opt/spark/bin/spark-submit \
        --master spark://10.244.2.2:7077 \
        --deploy-mode cluster \
        --name spark-examples \
        --class org.apache.spark.examples.SparkPi \
        --conf spark.executor.instances=2 \
        /opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar 10
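
    Before submitting jobs that talk to the Kubernetes API (as in the next section), apply the Role and RoleBinding from step 1; the filename spark-rbac.yaml is an assumption:

    kubectl apply -f spark-rbac.yaml
    # confirm the default service account can now list pods
    kubectl auth can-i list pods --as=system:serviceaccount:default:default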
    
    

Configuring a Kubernetes Job for Spark

  1. Create a YAML File for the Spark Job and apply it:
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: spark-pi
    spec:
      template:
        spec:
          containers:
            - name: spark-pi
              image: spark:3.5.0
              command: ["/opt/spark/bin/spark-submit"]
              args:
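                # --master must point at the Kubernetes API server (see 'kubectl cluster-info'); 172.19.0.2:6443 is specific to this Kind setup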
                - --master
                - k8s://172.19.0.2:6443
                - --deploy-mode
                - cluster
                - --name
                - spark-pi
                - --class
                - org.apache.spark.examples.SparkPi
                - --conf
                - spark.executor.instances=1
                - --conf
                - spark.kubernetes.namespace=default
                - --conf
                - spark.kubernetes.container.image=spark:3.5.0
                - local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
                - "1000"
          restartPolicy: Never
    
    
  2. Deploy the Job:
    kubectl apply -f spark-pi-job.yaml
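
    To follow the run, watch the pods the Job spawns and read the logs (a quick check assuming the names above; Spark on Kubernetes labels its driver pods with spark-role=driver):

    kubectl get pods                               # the Job pod plus the Spark driver and executor pods
    kubectl logs job/spark-pi                      # spark-submit output from the Job pod
    kubectl logs -l spark-role=driver --tail=20    # driver log; look for "Pi is roughly ..."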
    
    

    This guide has led you through setting up Apache Spark in a Kubernetes environment using Kind, preparing you to fully utilize Spark’s capabilities in a containerized setup.

    https://github.com/cherrera20/playground-spark-local-kubernetes

