
Running Spark on a Multi-node Local Kubernetes Cluster: A Step-by-Step Guide for Data Engineers

Introduction

Embarking on a journey to run Apache Spark on a local Kubernetes cluster? This comprehensive guide is crafted for data engineers who seek to leverage the power of Spark within a Kubernetes environment. Whether you’re a beginner or a seasoned engineer, these steps will ensure a smooth and efficient setup.

To complement this article, all the code and configuration files used in this guide are available in the GitHub repository: playground-spark-local-kubernetes.

Step 1: Installing kubectl

Why It’s Crucial: kubectl is your gateway to interacting with your Kubernetes cluster. Let’s get it up and running!

For example, if you are on Ubuntu or another Linux distribution that supports the snap package manager, kubectl is available as a snap application:

snap install kubectl --classic
kubectl version --client

Step 2: Embrace Kind for Local Clusters

Discover Kind: A powerful tool to spin up Kubernetes clusters locally using Docker containers.

Installation Guide: Get started by following the instructions in the Kind Quick Start guide; a sample installation is shown below.
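
For example, on Linux you can download a Kind release binary directly (the version and architecture below are examples; check the Quick Start page for the current release):

    # download a Kind release binary and put it on the PATH
    curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
    chmod +x ./kind
    sudo mv ./kind /usr/local/bin/kind
    kind version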

Step 3: Crafting a Multi-node Cluster

Creating a Kind Configuration File

  1. Craft a YAML File: Start with a new .yaml file to outline your cluster’s architecture.
  2. Define Multi-Node Structure: Include a control-plane node and multiple worker nodes. Here’s a sample snippet:
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
    - role: control-plane
      extraPortMappings:
        - containerPort: 4040   # Spark UI
          hostPort: 4040
        - containerPort: 18080  # Spark History Server
          hostPort: 18080
        - containerPort: 10000  # Spark Thrift Server
          hostPort: 10000
        - containerPort: 8081   # Spark Worker
          hostPort: 8081
    - role: worker
    - role: worker
    
  3. Save Your Configuration: Name it descriptively, like config-cluster.yaml.

Launching Your Customized Cluster

  1. Create your cluster based on your configuration file:
    kind create cluster --config config-cluster.yaml --name sparkcluster
    

    Expect a smooth setup and a final message indicating a successful creation:

    Creating cluster "sparkcluster" ...
     ✓ Ensuring node image (kindest/node:v1.27.3) 🖼
     ✓ Preparing nodes 📦 📦 📦  
     ✓ Writing configuration 📜 
     ✓ Starting control-plane 🕹️ 
     ✓ Installing CNI 🔌 
     ✓ Installing StorageClass 💾 
     ✓ Joining worker nodes 🚜 
    Set kubectl context to "kind-sparkcluster"
    You can now use your cluster with:
    
    kubectl cluster-info --context kind-sparkcluster
    
    Have a nice day! 👋
    

    Running kubectl cluster-info should print information about the cluster, indicating that it is working correctly:

    kubectl cluster-info --context kind-sparkcluster
    

    Finally, run kubectl get nodes and verify that all nodes are in the Ready state, as in the example below.
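
    For a three-node cluster named sparkcluster, the output should look roughly like this (ages and versions will differ):

    kubectl get nodes
    NAME                         STATUS   ROLES           AGE   VERSION
    sparkcluster-control-plane   Ready    control-plane   2m    v1.27.3
    sparkcluster-worker          Ready    <none>          2m    v1.27.3
    sparkcluster-worker2         Ready    <none>          2m    v1.27.3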

Step 4: Preparing for Apache Spark

Downloading Apache Spark

  1. Choose Your Spark Version: Visit Apache Spark Downloads and select a pre-built package for Hadoop.
  2. Extract and Move Spark to a desired location:
    tar xvf spark-<spark-version>-bin-hadoop3.tgz
    sudo mv spark-<spark-version>-bin-hadoop3 /usr/local/spark
    
  3. Environment Setup: Add Spark to your PATH for universal access, as in the example below.
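
A minimal sketch of that setup, assuming a bash shell and the /usr/local/spark location used above (append the lines to ~/.bashrc or your shell profile):

    # make spark-submit, spark-shell and the sbin scripts available in every shell
    export SPARK_HOME=/usr/local/spark
    export PATH="$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH"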

Dockerizing Spark for Kubernetes

Preparing the Docker Image

  1. Optional: Modify the Dockerfile located at /usr/local/spark/kubernetes/dockerfiles/spark before building the image; this helps avoid potential Kerberos authentication issues. For detailed instructions and code, refer to this repository.
  2. Build the Docker Image using docker-image-tool.sh, with a tag that matches the image referenced in the following steps (spark:3.5.0):
    cd /usr/local/spark
    ./bin/docker-image-tool.sh -t 3.5.0 build
    
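To confirm the image was built before loading it into Kind (the 3.5.0 tag assumes the -t value used above):

    # list the locally built Spark image; expect the spark repository with tag 3.5.0
    docker images spark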

Step 5: Injecting Spark into Kind

  1. Load Your Spark Image into the Kind cluster:
    kind load docker-image spark:3.5.0 --name sparkcluster
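
    To verify that the image is now available on the cluster nodes, you can inspect a node's container runtime (the node name assumes the cluster name sparkcluster):

    docker exec -it sparkcluster-control-plane crictl images | grep spark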
    

Step 6: Deploying Spark Master Node

  1. Create a YAML File for Spark Master: Name it spark-master-deployment.yaml and include your Spark image details.
    kind: Deployment
    apiVersion: apps/v1
    metadata:
      name: spark-master
    spec:
      replicas: 1
      selector:
        matchLabels:
          component: spark-master
      template:
        metadata:
          labels:
            component: spark-master
        spec:
          containers:
            - name: spark-master
              image: spark:3.5.0
              # run the master in the foreground (start-master.sh daemonizes, so the container would exit)
              command: [ "/opt/spark/bin/spark-class", "org.apache.spark.deploy.master.Master" ]
              ports:
                - containerPort: 7077
                - containerPort: 8080
              resources:
                requests:
                  cpu: 1000m
    
  2. Launch and Verify the Pod:
    kubectl apply -f spark-master-deployment.yaml
    kubectl get pods
    

Enabling Web Interface Access for Spark Master

  1. Identify the Pod and Forward Its Port: Find the master pod's name with kubectl get pods, then forward the web UI port (see the snippet after this list for a way to resolve the name automatically):
    kubectl port-forward pod/spark-master-xxxxx 8080:8080
    
  2. Access the Web Interface at http://localhost:8080.
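
The pod name carries a random suffix; a small sketch using the component=spark-master label from the deployment above resolves it automatically:

    # resolve the master pod's name via its label, then forward the web UI port
    POD=$(kubectl get pods -l component=spark-master -o jsonpath='{.items[0].metadata.name}')
    kubectl port-forward "$POD" 8080:8080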

Step 7: Configuring Spark Workers

  1. Deploy Worker Nodes that point at the Spark Master, then apply the manifest as shown below it:
    kind: Deployment
    apiVersion: apps/v1
    metadata:
      name: spark-worker
    spec:
      replicas: 2
      selector:
        matchLabels:
          component: spark-worker
      template:
        metadata:
          labels:
            component: spark-worker
        spec:
          containers:
            - name: spark-worker
              image: spark:3.5.0
              command: [ "/opt/spark/bin/spark-class", "org.apache.spark.deploy.worker.Worker" ]
              args: [ "spark://spark-master:7077" ] 
              ports:
                - containerPort: 8081
              resources:
                requests:
                  cpu: 1000m
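
    Save the manifest (e.g. as spark-worker-deployment.yaml, a filename assumed here) and apply it; note that the workers can only reach the master once the spark-master Service from the next section is in place:

    kubectl apply -f spark-worker-deployment.yaml
    kubectl get pods -l component=spark-worker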
    

Kubernetes Service for Spark Master

  1. Define and apply a Kubernetes Service: The service name must match the master URL in the workers’ configuration (spark://spark-master:7077), so that the worker nodes can reach the Spark Master; apply it as shown after this list.
    kind: Service
    apiVersion: v1
    metadata:
      name: spark-master
    spec:
      ports:
        - name: webui
          port: 8080
          targetPort: 8080
        - name: spark
          port: 7077
          targetPort: 7077
      selector:
        component: spark-master
    
    
  2. Access the Web Interface: Navigate again to http://localhost:8080 in your browser and check that the two workers are listed there.
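
A minimal way to apply and inspect the service (the manifest filename is an assumption):

    kubectl apply -f spark-master-service.yaml
    kubectl get svc spark-master    # should show ports 8080 and 7077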

Step 8: Testing the cluster

Executing a Spark Application Within the Cluster

  1. Enter the Spark shell and run a Spark job (replace spark-master-xxx with your master pod's name and x.x.x.x with its IP, shown by kubectl get pods -o wide):
    kubectl exec -n default spark-master-xxx \
    -it -- /opt/spark/bin/spark-shell \
    --conf spark.driver.bindAddress=x.x.x.x \
    --conf spark.driver.host=x.x.x.x
    
    spark.range(1000000).toDF("number").agg(sum("number")).show()
    

    Access the Spark UI at http://localhost:4040 to monitor the Spark job.

Launching a Spark Job with Kubernetes

  1. Configure a Role and RoleBinding for Spark applications (the two manifests below can live in one file separated by ---); apply them as shown after this list:
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: read-pods
      namespace: default
    subjects:
    - kind: ServiceAccount
      name: default
      namespace: default
    roleRef:
      kind: Role
      name: pod-reader
      apiGroup: rbac.authorization.k8s.io
    
    
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      namespace: default
      name: pod-reader
    rules:
    - apiGroups: [""]
      resources: ["pods", "configmaps", "services", "persistentvolumeclaims"]
      verbs: ["get", "watch", "list", "create", "delete", "deletecollection"]
    
    
  2. Execute a Spark Job, for example from a shell inside the spark-master pod (10.244.2.2 is the master pod's IP in this setup; within the cluster the service address spark://spark-master:7077 should also work):
    /opt/spark/bin/spark-submit \
        --master spark://10.244.2.2:7077 \
        --deploy-mode cluster \
        --name spark-examples \
        --class org.apache.spark.examples.SparkPi \
        --conf spark.executor.instances=2 \
        /opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar 10
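
    Before submitting jobs that talk to the Kubernetes API (as in the next section), apply the Role and RoleBinding from step 1; the filename spark-rbac.yaml is an assumption:

    kubectl apply -f spark-rbac.yaml
    # confirm the default service account can now list pods
    kubectl auth can-i list pods --as=system:serviceaccount:default:default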
    
    

Configuring a Kubernetes Job for Spark

  1. Create a YAML File for the Spark Job and apply it:
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: spark-pi
    spec:
      template:
        spec:
          containers:
            - name: spark-pi
              image: spark:3.5.0
              command: ["/opt/spark/bin/spark-submit"]
              args:
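                # --master must point at the Kubernetes API server (see 'kubectl cluster-info'); 172.19.0.2:6443 is specific to this Kind setup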
                - --master
                - k8s://172.19.0.2:6443
                - --deploy-mode
                - cluster
                - --name
                - spark-pi
                - --class
                - org.apache.spark.examples.SparkPi
                - --conf
                - spark.executor.instances=1
                - --conf
                - spark.kubernetes.namespace=default
                - --conf
                - spark.kubernetes.container.image=spark:3.5.0
                - local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
                - "1000"
          restartPolicy: Never
    
    
  2. Deploy the Job:
    kubectl apply -f spark-pi-job.yaml
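
    To follow the run, watch the pods the Job spawns and read the logs (a quick check assuming the names above; Spark on Kubernetes labels its driver pods with spark-role=driver):

    kubectl get pods                               # the Job pod plus the Spark driver and executor pods
    kubectl logs job/spark-pi                      # spark-submit output from the Job pod
    kubectl logs -l spark-role=driver --tail=20    # driver log; look for "Pi is roughly ..."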
    
    

    This guide has led you through setting up Apache Spark in a Kubernetes environment using Kind, preparing you to fully utilize Spark’s capabilities in a containerized setup.

    https://github.com/cherrera20/playground-spark-local-kubernetes

