- Use Spark SQL and the DataFrame API for data processing (a combined sketch follows this list)
- Write SQL code in all src/main/resources/sql/task*/
- Write PySpark code for all DataFrames in pyspark_task.py
- Optimize imports (the Spark session must be created inside a function call, not at import time; see the same sketch below) for
    - pyspark_task.py
    - test_app.py
- Add parameters to test_app.py so you can invoke subsets of tests (see the conftest.py sketch after this list) for
    - DataFrame
    - SQL
    - Task group
    - Particular task
- Make sure that all tests pass
- run the command
./bash/start-docker.sh y y
- or, in the master container, execute
pytest /opt/spark-apps/test
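A minimal sketch of what pyspark_task.py could look like for the points above, assuming a lazily created session and a hypothetical task that sums transaction amounts per account; the column names, the task name, and the SQL file path are illustrative, not the project's actual ones:

```python
# Hypothetical sketch of pyspark_task.py: the session is created lazily inside
# get_spark_session() (not at import time), and the same aggregation is written
# twice -- once with the DataFrame API, once with Spark SQL read from a file.
from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F

_spark = None

def get_spark_session() -> SparkSession:
    """Create the SparkSession on first use so importing this module stays cheap."""
    global _spark
    if _spark is None:
        _spark = SparkSession.builder.appName("pyspark_tasks").getOrCreate()
    return _spark

def task1_df() -> DataFrame:
    """DataFrame API version: total transaction amount per account (assumed columns)."""
    spark = get_spark_session()
    transactions = spark.read.parquet("data/tables/transactions")
    return (transactions
            .groupBy("account_id")
            .agg(F.sum("amount").alias("total_amount")))

def task1_sql() -> DataFrame:
    """SQL version of the same logic, read from a file under src/sql/ (path assumed)."""
    spark = get_spark_session()
    spark.read.parquet("data/tables/transactions").createOrReplaceTempView("transactions")
    with open("src/sql/task1/total_amount.sql") as f:
        return spark.sql(f.read())
```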
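One possible way to add the test-subset parameters is through custom pytest options in a conftest.py; the option names and the name-based filtering below are assumptions for illustration:

```python
# Hypothetical conftest.py sketch: custom pytest options that narrow the
# parametrized tests in test_app.py down to a mode (df/sql), a task group,
# or a single task by matching against the collected test names.
import pytest

def pytest_addoption(parser):
    parser.addoption("--mode", action="store", default="all", help="df, sql or all")
    parser.addoption("--group", action="store", default=None, help="task group, e.g. task1")
    parser.addoption("--task", action="store", default=None, help="single task id, e.g. task1_1")

def pytest_collection_modifyitems(config, items):
    mode = config.getoption("--mode")
    group = config.getoption("--group")
    task = config.getoption("--task")
    skip = pytest.mark.skip(reason="filtered out by command-line options")
    for item in items:
        if mode != "all" and mode not in item.name:
            item.add_marker(skip)
        if group and group not in item.name:
            item.add_marker(skip)
        if task and task not in item.name:
            item.add_marker(skip)
```

With options like these, a subset could be selected as, for example, `pytest /opt/spark-apps/test --mode df --group task1`.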
- Implement easy mode
- Create your own data comparison framework (write your own pyspark_task_validator.py; a sketch follows this list)
- Test all created transformations for the SQL and DataFrame API using pytest-spark (write your own test_app.py)
- Add logging to all your functions using decorators (write your own project_logs.py; see the decorator sketch after this list)
- Create a Docker image and run a Spark cluster (1 master, 2 workers) on it (add your own docker-compose file and Dockerfile)
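The heart of a data comparison framework can be a single assertion helper built on exceptAll; the function name and the strict schema check are assumptions, a sketch rather than the project's actual pyspark_task_validator.py:

```python
# Hypothetical core of pyspark_task_validator.py: assert that an actual
# DataFrame matches an expected one, independent of row order.
from pyspark.sql import DataFrame

def assert_df_equal(actual: DataFrame, expected: DataFrame) -> None:
    assert actual.schema == expected.schema, (
        f"Schema mismatch:\n{actual.schema}\n{expected.schema}")
    missing = expected.exceptAll(actual)   # rows expected but not produced
    extra = actual.exceptAll(expected)     # rows produced but not expected
    assert missing.count() == 0 and extra.count() == 0, (
        f"missing rows: {missing.count()}, unexpected rows: {extra.count()}")
```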
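For the logging requirement, a decorator in project_logs.py could be as small as the sketch below (the logger name and message format are illustrative):

```python
# Hypothetical project_logs.py sketch: a decorator that logs entry, exit and
# failures of every wrapped function.
import functools
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pyspark_tasks")

def log_call(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("Started %s", func.__name__)
        try:
            result = func(*args, **kwargs)
            logger.info("Finished %s", func.__name__)
            return result
        except Exception:
            logger.exception("Failed %s", func.__name__)
            raise
    return wrapper
```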
- Implement hard mode
- Create a UI using Flask for executing the implemented tasks (a minimal sketch follows this list); you should be able to
    - Choose a task from a drop-down list
    - Choose the method of execution (SQL, DataFrame, or both) from a drop-down list
    - Start execution with a button
    - See the logs generated by your script in real time on the web page
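A minimal Flask sketch for such a UI; the task ids, the command line used to launch pyspark_task.py, the template name and the port are all assumptions, not the project's actual src/web code:

```python
# Hypothetical sketch of src/web/app.py: a page with two drop-downs and a run
# button, plus a streaming endpoint so the browser can tail the logs live.
import subprocess
from flask import Flask, Response, render_template

app = Flask(__name__)

TASKS = ["task1_1", "task1_2", "task2_1"]   # assumed task ids
MODES = ["sql", "dataframe", "both"]

@app.route("/")
def index():
    # index.html is assumed to render <select> elements for tasks and modes
    # and a button that opens /run/<task>/<mode>.
    return render_template("index.html", tasks=TASKS, modes=MODES)

@app.route("/run/<task>/<mode>")
def run(task, mode):
    def stream():
        # Run the task as a subprocess and forward its output line by line,
        # so the page shows the logs as they are produced.
        proc = subprocess.Popen(
            ["python", "/opt/spark-apps/src/pyspark_task.py", task, mode],
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
        for line in proc.stdout:
            yield line
    return Response(stream(), mimetype="text/plain")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```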
- Implement extra hard mode
- Make this solution work on any cloud
- Add CI/CD to your git project (https://circleci.com/)
- Docker (on Linux, or with WSL support to run the bash scripts)
- 6 cores, 12 GB RAM:
    - SPARK_WORKER_CORES: 2 * 3
    - SPARK_WORKER_MEMORY: 2G * 3
    - SPARK_DRIVER_MEMORY: 1G * 3
    - SPARK_EXECUTOR_MEMORY: 1G * 3
-
First-time execution needs:
- Permissions
chmod -R 755 ./*
- Docker image build
./bash/start-docker.sh y
- If the bash scripts don't work for you, run the commands below
docker build --build-arg SPARK_VERSION=3.0.2 --build-arg HADOOP_VERSION=3.2 -t cluster-apache-spark:3.0.2 ./
docker compose up -d
docker container exec -it py_spark_test_tasks-spark-master-1 /bin/bash
-
To connect to the docker container:
./bash/start-docker.sh n
- To run all tests:
./bash/start-docker.sh n y
- To re-run the failed tests:
./bash/start-docker.sh n f
- To run tasks using the UI, open the link below in your browser:
Tasks Description:
- Task_Description.txt
Inputs:
- data/tables/accounts/*.parquet
- data/tables/country_abbreviation/*.parquet
- data/tables/transactions/*.parquet
Outputs:
- data/df/task.../...
- data/sql/task.../...
Expected outputs:
- test/task1/expected_output/..
- test/task../expected_output/..
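A tiny illustration of how these directories relate, with a hypothetical task name and output format (the real paths and formats may differ):

```python
# Hypothetical end-to-end flow: read an input table, compute a result, and
# write it where the DataFrame run of a task is expected to land.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("layout_demo").getOrCreate()

accounts = spark.read.parquet("data/tables/accounts")
result = accounts.limit(10)   # stand-in for a real task's transformation

# DataFrame runs write under data/df/, SQL runs under data/sql/; the tests
# compare both against the matching test/task*/expected_output/ directory.
result.write.mode("overwrite").parquet("data/df/task1/demo_output")
```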
-
src/pyspark_task.py - DataFrame and SQL definitions
-
src/pyspark_task_validator.py - module to invoke and test the DataFrame and SQL definitions
-
src/sql/.. - SQL files with the same logic as the DataFrames
-
src/web/.. - Flask web UI for task invocation
-
test/test_app.py - definition of all tests
-
bash/start-docker.sh - script to start the project
-
bash/... - other files related to the Spark env config