To this day, Apache Spark does not support cluster deploy mode for Python applications on standalone clusters. You can either remove the worker container from the docker-compose.yaml file, or leave it as is and submit your Python jobs without cluster mode.
I chose to keep both the master and the worker so that a JAR application can still be deployed in cluster mode if needed.
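For reference, a JAR application could be submitted in cluster mode along the following lines; the master URL assumes the default container name and port listed below, and the class name and JAR path are placeholders:

```bash
# Submit a JAR to the standalone master in cluster deploy mode.
# spark://spark_master:7077 assumes the default container name and port;
# the class name and JAR path are placeholders for your own application.
/opt/spark/bin/spark-submit \
  --master spark://spark_master:7077 \
  --deploy-mode cluster \
  --class com.example.Main \
  /opt/spark/apps/my-app.jar
```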
This image installs a development Spark cluster and the Snowflake Connector for Python.
Current default versions are:
- Spark: 3.2.1
- Hadoop: 3.2
- Python: 3.9
- Snowflake connector: 2.7.4
Two containers are deployed by default:

| Name | Exposed ports |
|---|---|
| spark_master | 8081, 7077 |
| spark_cluster | 8082, 7078 |
A volume is also created:

| Local | Container |
|---|---|
| sparkapps | /opt/spark/apps |
To change container names, ports, paths, or volumes, update the variable values in `docker-compose.yaml`.
For security reasons, Snowflake credentials are not hard-coded. To configure your connector, run:

`cp .env-example .env`

and fill in your credentials. The corresponding values will be used during build and run.
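The resulting `.env` could look something like the snippet below; the variable names are purely illustrative, so use the ones defined in `.env-example`:

```bash
# Illustrative only: use the variable names defined in .env-example.
SNOWFLAKE_ACCOUNT=your_account_identifier
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
```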
Please have Docker installed and running.
The `docker-compose.yaml` file is set up so that the image is built automatically. After cloning the repo, you can directly run:

`docker-compose up -d`

to start the containers in detached mode.

If you change values in the Dockerfile, you'll have to rebuild the image. You can use the following command:

`docker-compose up -d --build`
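To check that both containers are up, you can list them with docker-compose:

```bash
# List the services defined in docker-compose.yaml and their current state.
docker-compose ps
```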
To access the Spark master UI, open: http://localhost:8081

For the worker, open: http://localhost:8082
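To open a shell inside either container, one option is `docker exec`, assuming the default container names from the table above:

```bash
# Open a shell in the master container (use spark_cluster for the worker).
docker exec -it spark_master bash
```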
For a simple test, connect to the master or the worker (for example with `docker exec`, as shown above), and run the following command:

`python /opt/spark/validate.py`

It should output the current Snowflake version.
You can also submit the job with spark-submit, without cluster mode:

`/opt/spark/bin/spark-submit /opt/spark/app/validate.py`
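To run it against the standalone master instead of locally, client deploy mode is the way to go for a Python script; something like the following should work, assuming the default container name and port:

```bash
# Submit the Python job to the standalone master in client deploy mode
# (cluster mode is not supported for Python on standalone clusters).
# spark://spark_master:7077 assumes the default container name and port.
/opt/spark/bin/spark-submit \
  --master spark://spark_master:7077 \
  --deploy-mode client \
  /opt/spark/app/validate.py
```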