This is a demo of PySpark with Jupyter Notebook on Kubernetes. It is based on the official PySpark Docker image.
- Kubernetes
- Spark 3.4.1
Install Kubernetes on your machine. You can use:
- Docker Desktop with Kubernetes cluster enabled
- Minikube
- Any other Kubernetes cluster
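Whichever option you choose, make sure kubectl can reach the cluster before moving on. A quick sanity check (the context name and node list depend on your setup):

```bash
# Confirm kubectl points at the right cluster and the node(s) are Ready
kubectl config current-context
kubectl get nodes
```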
If you have the wget and tar commands available, you can use the following commands:

```bash
wget https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
tar -xzf spark-3.4.1-bin-hadoop3.tgz
```
Otherwise, you can do it manually: download Spark 3.4.1 from the [Apache archive](https://archive.apache.org/dist/spark/spark-3.4.1/) and unpack the archive into spark-3.4.1-bin-hadoop3/.
Build the Spark Docker image for Kubernetes and PySpark. You can use the following commands:

```bash
cd ./spark-3.4.1-bin-hadoop3/
bin/docker-image-tool.sh -r docker.io/yourreponame -t v3.4.1 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
```
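If your cluster pulls images from a registry rather than from a local Docker daemon, you will likely also need to push the image. A sketch, assuming you are logged in to docker.io/yourreponame:

```bash
# Push the built spark and spark-py images to the registry
bin/docker-image-tool.sh -r docker.io/yourreponame -t v3.4.1 push
```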
Before deploying the Jupyter Notebook with PySpark on Kubernetes, you need to update the mount path in the deployment.yaml file. Do this by replacing <path_to_project> in /<path_to_project>/pyspark-kuberentes/pyspark-notebook with the real path to your project. Ensure that the path is absolute.
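For orientation, the relevant part of deployment.yaml is typically a hostPath volume pointing at the notebook directory. The snippet below is only illustrative; field and volume names in the actual manifest may differ:

```yaml
# Illustrative sketch - adjust to match the real deployment.yaml in this repo
volumes:
  - name: notebook-work            # hypothetical volume name
    hostPath:
      path: /<path_to_project>/pyspark-kuberentes/pyspark-notebook  # replace <path_to_project> with your absolute path
```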
Once it's done, you can deploy the Jupyter Notebook with PySpark on Kubernetes using the following command:
```bash
kubectl apply -f ./kubernetes
```
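You can watch the pods come up while the deployment starts (press Ctrl+C to stop watching):

```bash
# Wait until the notebook pod reports Running
kubectl -n spark get pods -w
```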
To access the Jupyter Notebook, you need to get the URL of the service. You can do this by running the following commands:

```bash
kubectl get pods -n spark
kubectl -n spark logs <pyspark-notebook-pod-name>
```
In the logs you will find the URL with a token that lets you access the web interface.
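If the service is not exposed outside the cluster in your setup, one way to reach the notebook is to port-forward to the pod (8888 is the usual Jupyter port; adjust if your deployment uses a different one), then open the tokenized URL from the logs in your browser:

```bash
# Forward local port 8888 to the notebook pod
kubectl -n spark port-forward <pyspark-notebook-pod-name> 8888:8888
```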
You will see a Jupyter Notebook with a work directory containing example code that launches Spark workers and runs some computations.
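The example in the repository will differ, but code along these lines is roughly what drives Spark on Kubernetes from the notebook; the image name, namespace, and executor count below are assumptions based on the build step above:

```python
from pyspark.sql import SparkSession

# Minimal sketch: the actual notebook in this repo may configure things differently.
spark = (
    SparkSession.builder
    .appName("pyspark-k8s-demo")
    # Talk to the in-cluster Kubernetes API server.
    .master("k8s://https://kubernetes.default.svc")
    # Image built by docker-image-tool.sh above (the Python image is named spark-py).
    .config("spark.kubernetes.container.image", "docker.io/yourreponame/spark-py:v3.4.1")
    .config("spark.kubernetes.namespace", "spark")
    .config("spark.executor.instances", "2")
    .getOrCreate()
)

# A small computation distributed across the executor pods.
print(spark.range(1_000_000).selectExpr("sum(id) AS total").collect())

spark.stop()
```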
Have fun!