- Use Spark SQL and the DataFrame API for data processing (a combined sketch follows this list)
- Write SQL code in all src/main/resources/sql/task*/
- Write PySpark code for all DataFrames in pyspark_task.py
- Optimize imports (the Spark session must be created inside a function call, not at import time; see the same sketch below) for
    - pyspark_task.py
    - test_app.py
- Add parameters to test_app.py so you can invoke subsets of tests (see the conftest.py sketch after this list) for
    - DataFrame
    - SQL
    - Task group
    - Particular task
- Make sure that all tests pass
- run the command
./bash/start-docker.sh y y
- or, in the master container, execute
pytest /opt/spark-apps/test
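A minimal sketch of what pyspark_task.py could look like for the points above, assuming a lazily created session and a hypothetical task that sums transaction amounts per account; the column names, the task name, and the SQL file path are illustrative, not the project's actual ones:

```python
# Hypothetical sketch of pyspark_task.py: the session is created lazily inside
# get_spark_session() (not at import time), and the same aggregation is written
# twice -- once with the DataFrame API, once with Spark SQL read from a file.
from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F

_spark = None

def get_spark_session() -> SparkSession:
    """Create the SparkSession on first use so importing this module stays cheap."""
    global _spark
    if _spark is None:
        _spark = SparkSession.builder.appName("pyspark_tasks").getOrCreate()
    return _spark

def task1_df() -> DataFrame:
    """DataFrame API version: total transaction amount per account (assumed columns)."""
    spark = get_spark_session()
    transactions = spark.read.parquet("data/tables/transactions")
    return (transactions
            .groupBy("account_id")
            .agg(F.sum("amount").alias("total_amount")))

def task1_sql() -> DataFrame:
    """SQL version of the same logic, read from a file under src/sql/ (path assumed)."""
    spark = get_spark_session()
    spark.read.parquet("data/tables/transactions").createOrReplaceTempView("transactions")
    with open("src/sql/task1/total_amount.sql") as f:
        return spark.sql(f.read())
```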
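One possible way to add the test-subset parameters is through custom pytest options in a conftest.py; the option names and the name-based filtering below are assumptions for illustration:

```python
# Hypothetical conftest.py sketch: custom pytest options that narrow the
# parametrized tests in test_app.py down to a mode (df/sql), a task group,
# or a single task by matching against the collected test names.
import pytest

def pytest_addoption(parser):
    parser.addoption("--mode", action="store", default="all", help="df, sql or all")
    parser.addoption("--group", action="store", default=None, help="task group, e.g. task1")
    parser.addoption("--task", action="store", default=None, help="single task id, e.g. task1_1")

def pytest_collection_modifyitems(config, items):
    mode = config.getoption("--mode")
    group = config.getoption("--group")
    task = config.getoption("--task")
    skip = pytest.mark.skip(reason="filtered out by command-line options")
    for item in items:
        if mode != "all" and mode not in item.name:
            item.add_marker(skip)
        if group and group not in item.name:
            item.add_marker(skip)
        if task and task not in item.name:
            item.add_marker(skip)
```

With options like these, a subset could be selected as, for example, `pytest /opt/spark-apps/test --mode df --group task1`.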
- Implement easy mode
- Create your own data comparison framework (write your own pyspark_task_validator.py; a sketch follows this list)
- Test all created transformations for the SQL and DataFrame API using pytest-spark (write your own test_app.py)
- Add logging to all your functions using decorators (write your own project_logs.py; see the decorator sketch after this list)
- Create a Docker image and run a Spark cluster (1 master, 2 workers) on it (add your own docker-compose file and Dockerfile)
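The heart of a data comparison framework can be a single assertion helper built on exceptAll; the function name and the strict schema check are assumptions, a sketch rather than the project's actual pyspark_task_validator.py:

```python
# Hypothetical core of pyspark_task_validator.py: assert that an actual
# DataFrame matches an expected one, independent of row order.
from pyspark.sql import DataFrame

def assert_df_equal(actual: DataFrame, expected: DataFrame) -> None:
    assert actual.schema == expected.schema, (
        f"Schema mismatch:\n{actual.schema}\n{expected.schema}")
    missing = expected.exceptAll(actual)   # rows expected but not produced
    extra = actual.exceptAll(expected)     # rows produced but not expected
    assert missing.count() == 0 and extra.count() == 0, (
        f"missing rows: {missing.count()}, unexpected rows: {extra.count()}")
```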
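For the logging requirement, a decorator in project_logs.py could be as small as the sketch below (the logger name and message format are illustrative):

```python
# Hypothetical project_logs.py sketch: a decorator that logs entry, exit and
# failures of every wrapped function.
import functools
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pyspark_tasks")

def log_call(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("Started %s", func.__name__)
        try:
            result = func(*args, **kwargs)
            logger.info("Finished %s", func.__name__)
            return result
        except Exception:
            logger.exception("Failed %s", func.__name__)
            raise
    return wrapper
```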
- Implement hard mode
- Create a UI using Flask for executing the implemented tasks (a minimal sketch follows this list); you should be able to
    - Choose a task from a drop-down list
    - Choose the method of execution (SQL, DataFrame, or both) from a drop-down list
    - Start execution with a button
    - See the logs generated by your script in real time on the web page
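A minimal Flask sketch for such a UI; the task ids, the command line used to launch pyspark_task.py, the template name and the port are all assumptions, not the project's actual src/web code:

```python
# Hypothetical sketch of src/web/app.py: a page with two drop-downs and a run
# button, plus a streaming endpoint so the browser can tail the logs live.
import subprocess
from flask import Flask, Response, render_template

app = Flask(__name__)

TASKS = ["task1_1", "task1_2", "task2_1"]   # assumed task ids
MODES = ["sql", "dataframe", "both"]

@app.route("/")
def index():
    # index.html is assumed to render <select> elements for tasks and modes
    # and a button that opens /run/<task>/<mode>.
    return render_template("index.html", tasks=TASKS, modes=MODES)

@app.route("/run/<task>/<mode>")
def run(task, mode):
    def stream():
        # Run the task as a subprocess and forward its output line by line,
        # so the page shows the logs as they are produced.
        proc = subprocess.Popen(
            ["python", "/opt/spark-apps/src/pyspark_task.py", task, mode],
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
        for line in proc.stdout:
            yield line
    return Response(stream(), mimetype="text/plain")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```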
- Implement extra hard mode
- Make this solution work on any cloud
- Add CI/CD to your git project (https://circleci.com/)
- Docker (on Linux, or with WSL support to run the bash scripts)
- 6 cores, 12 GB RAM:
    - SPARK_WORKER_CORES: 2 * 3
    - SPARK_WORKER_MEMORY: 2G * 3
    - SPARK_DRIVER_MEMORY: 1G * 3
    - SPARK_EXECUTOR_MEMORY: 1G * 3
-
First-time execution needs:
- Permissions
chmod -R 755 ./*
- Docker image build
./bash/start-docker.sh y
- If the bash scripts don't work for you, run the commands below
docker build --build-arg SPARK_VERSION=3.0.2 --build-arg HADOOP_VERSION=3.2 -t cluster-apache-spark:3.0.2 ./
docker compose up -d
docker container exec -it py_spark_test_tasks-spark-master-1 /bin/bash
-
To connect to the docker container:
./bash/start-docker.sh n
- To run all tests:
./bash/start-docker.sh n y
- To re-run the failed tests:
./bash/start-docker.sh n f
- To run tasks using the UI, open the link below in your browser:
Tasks Description:
- Task_Description.txt
Inputs:
- data/tables/accounts/*.parquet
- data/tables/country_abbreviation/*.parquet
- data/tables/transactions/*.parquet
Outputs:
- data/df/task.../...
- data/sql/task.../...
Expected outputs:
- test/task1/expected_output/..
- test/task../expected_output/..
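A tiny illustration of how these directories relate, with a hypothetical task name and output format (the real paths and formats may differ):

```python
# Hypothetical end-to-end flow: read an input table, compute a result, and
# write it where the DataFrame run of a task is expected to land.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("layout_demo").getOrCreate()

accounts = spark.read.parquet("data/tables/accounts")
result = accounts.limit(10)   # stand-in for a real task's transformation

# DataFrame runs write under data/df/, SQL runs under data/sql/; the tests
# compare both against the matching test/task*/expected_output/ directory.
result.write.mode("overwrite").parquet("data/df/task1/demo_output")
```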
-
src/pyspark_task.py - DataFrame and SQL definitions
-
src/pyspark_task_validator.py - module to invoke and test the DataFrame and SQL definitions
-
src/sql/.. - SQL files with the same logic as the DataFrames
-
src/web/.. - Flask web UI for task invocation
-
test/test_app.py - definition of all tests
-
bash/start-docker.sh - script to start the project
-
bash/... - other files related to the Spark env config