Spark K8s Learning

Repository to test and develop Spark on K8s.

Requirements

  • Docker and Docker Compose
  • Kind
  • Kubectl
  • Helm

DO IT YOUR OWN "JEITO" (i.e., adapt the steps to your setup) #BRHUEHUE

Create kind cluster

make create-kind

# To clean-up
kind delete cluster
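
To confirm the cluster is up before moving on (the context name below assumes kind's default cluster name), you can run:

# sanity-check the new cluster; adjust the context if the Makefile names the cluster differently
kubectl cluster-info --context kind-kind
kubectl get nodes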

Install Grafana and Prometheus

make helm-add
make install-prometheus

Access http://grafana.localhost/ with

user: admin
pass: prom-operator
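
If grafana.localhost does not resolve on your machine, a port-forward is a workable fallback. The service and namespace names below are assumptions based on a default kube-prometheus-stack install and may differ in this setup:

# forward Grafana to http://localhost:3000 (service/namespace names are assumptions)
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80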

External Kafka in Docker-Compose

  • Note: newer Docker releases ship Compose v2 as a plugin, so use docker compose up -d instead of docker-compose up -d

make kafka-setup

# To cleanup
# make kafka-destroy

Check with Kowl whether the data reached Kafka:

Kowl Topics
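
If you prefer the CLI over Kowl, you can also list topics from inside the broker container. The kafka service name and the kafka-topics binary are assumptions about the Confluent compose images used here:

# list topics directly from the broker (service name and binary are assumptions)
docker compose exec kafka kafka-topics --bootstrap-server localhost:9092 --list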

Create the Spark Images and Send Them to the Kind Cluster

# Create Spark Image
make build-spark

# Create Operator Image
make build-operator

# Send image to Kind Cluster
make send-images
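
To verify the images actually landed inside the kind node (the node name below assumes the default single-node cluster named kind):

# images loaded into kind show up in the node's containerd store
docker exec -it kind-control-plane crictl images | grep spark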

Spark Operator

make helm-add
make install-spark
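
Before submitting jobs, it is worth checking that the operator pod is running. The namespace depends on how the Makefile installs the chart; spark-operator below is only an assumption:

# check the operator deployment (namespace is an assumption)
kubectl get pods -n spark-operator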

Copy the jibaro library and scripts with:

make copy

This command will send files to s3://spark-artifacts.
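
Since the bucket lives in MinIO rather than AWS, you can double-check the upload with the AWS CLI pointed at the local endpoint. The minio/miniominio credentials and the http://localhost:9000 endpoint match the ones used for the Airflow connection below:

# list the uploaded artifacts in MinIO
AWS_ACCESS_KEY_ID=minio AWS_SECRET_ACCESS_KEY=miniominio \
  aws --endpoint-url http://localhost:9000 s3 ls s3://spark-artifacts/ --recursive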

Test Spark Job

Test Spark submit:

kubectl apply -f tests/spark-test.yaml

Check if the job has completed:

kubectl get sparkapplications -n spark
NAME                 STATUS      ATTEMPTS   START                  FINISH                 AGE
spark-yuriniitsuma   COMPLETED   1          2022-05-29T06:43:26Z   2022-05-29T06:43:57Z   49s
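
If the application fails, or you just want to see its output, the driver pod logs are the place to look. The operator names the driver pod after the application by convention; adjust the name if yours differs:

# tail the driver logs for the test application (pod name follows the <app>-driver convention)
kubectl logs -n spark spark-yuriniitsuma-driver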

Delete job:

kubectl delete sparkapplications spark-yuriniitsuma -n spark

Generate data in the postgres database

Generate data in dbserver1.inventory.products:

bash external-services/generate_data.sh
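
To confirm that rows were actually written, you can query the table from inside the postgres container. The service name, database name, and postgres/postgres credentials below are assumptions about the compose setup; run the command from the directory that holds the compose file:

# count rows in the products table (service/database/credentials are assumptions)
docker compose exec postgres psql -U postgres -d postgres -c "SELECT COUNT(*) FROM inventory.products;"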

Airflow

You need to install Airflow on your host machine so it can easily reach the kind cluster. Obviously, you also need Python 3 and pip3 installed.

# Install Airflow
python3 -m venv venv && source venv/bin/activate && \
pip install apache-airflow==2.4.3 \
    psycopg2-binary \
    apache-airflow-providers-cncf-kubernetes \
    apache-airflow-providers-amazon

Run airflow db init once so Airflow creates its config files.

Change the config below in $HOME/airflow/airflow.cfg:

dags_folder = $LOCAL_TO_REPO/airflow/dags
executor = LocalExecutor
load_examples = False
sql_alchemy_conn = postgresql+psycopg2://postgres:postgres@localhost/airflow
remote_logging = True
remote_log_conn_id = AirflowS3Logs
remote_base_log_folder = s3://airflow-logs/logs

Change the config with sed (the commands below use BSD/macOS sed syntax; on GNU/Linux sed, drop the empty '' after -i):

sed -i '' 's/^executor = .*/executor = LocalExecutor/g' $HOME/airflow/airflow.cfg
sed -i '' 's/^load_examples = .*/load_examples = False/g' $HOME/airflow/airflow.cfg
sed -i '' 's/^sql_alchemy_conn = .*/sql_alchemy_conn = postgresql+psycopg2:\/\/postgres:postgres@localhost\/airflow/g' $HOME/airflow/airflow.cfg
sed -i '' 's:^dags_folder = .*:dags_folder = '`pwd`'/airflow\/dags:g' $HOME/airflow/airflow.cfg
sed -i '' 's/^remote_logging = .*/remote_logging = True/g' $HOME/airflow/airflow.cfg
sed -i '' 's/^remote_log_conn_id =.*/remote_log_conn_id = AirflowS3Logs/g' $HOME/airflow/airflow.cfg
sed -i '' 's/^remote_base_log_folder =.*/remote_base_log_folder = s3:\/\/airflow-logs\/logs/g' $HOME/airflow/airflow.cfg
sed -i '' 's/^web_server_port = .*/web_server_port = 8000/g' $HOME/airflow/airflow.cfg

Then run:

source venv/bin/activate && airflow db init && \
airflow connections add \
   --conn-type 'aws' \
   --conn-extra '{ "aws_access_key_id": "minio", "aws_secret_access_key": "miniominio", "host": "http://localhost:9000" }' \
   AirflowS3Logs &&\
airflow pools set spark 2 'spark on k8s'
# in one terminal run
source venv/bin/activate && PYTHONPATH=$PWD/airflow/dags airflow webserver

# in another terminal run the scheduler
source venv/bin/activate && PYTHONPATH=$PWD/airflow/dags airflow scheduler

Create an admin user:

airflow users create \
    --username admin \
    --password admin \
    --firstname admin \
    --lastname admin \
    --role Admin \
    --email [email protected]

Log in with the admin credentials and open the Airflow UI at http://localhost:8000 (the port set via sed above).

Then trigger the Pipeline DAG:

DAG Spark Operator

After the DAG completes, you can check the output in MinIO's datalake bucket at http://localhost:9001/buckets/datalake/browse:

Checkpoint and Data Tables in Minio
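
The same AWS CLI trick used for the artifacts bucket also works here if you want to inspect the Delta output without the console:

# list the checkpoints and Delta tables written by the DAG
AWS_ACCESS_KEY_ID=minio AWS_SECRET_ACCESS_KEY=miniominio \
  aws --endpoint-url http://localhost:9000 s3 ls s3://datalake/ --recursive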

TODO

  • Export the Spark UI with an Ingress or a reverse proxy
  • Deploy the Spark History Server
  • Export Spark UI event logs to a bucket for the Spark History Server
  • Export metrics to Prometheus
  • Documentation of SparkOperator
  • Support Spark jobs with .jar files
  • Create a lib and ship it as a .zip file with --pyFiles
  • Support environment variables
  • Support secrets
  • Create a new class with families of driver/executor configs (CPU/MEM)
  • Support JSON without Schema Registry
  • Support parsing Avro in both key and value from Kafka
  • Try to support Protobuf
  • Support dynamic 'pip install' of packages
  • Automatic compaction of files in Delta Lake
  • Automatic vacuum of files in Delta Lake
  • Create an automatic history in the _history folder to inspect metrics like the number of files for each version
  • Add Trino to connect to the Data Lake
  • Add a Hive metastore backed by PostgreSQL
  • Add support for Spark to use Hive, or for the Jibaro library to run SQL through Trino
  • Airflow on Kubernetes with the official Helm chart
  • Allow Airflow to send Spark jobs to Kubernetes
  • Compile the spark-operator binary for amd64 and arm64 to support Apple Silicon
  • Update to Spark 3.2.1
  • Update to Spark 3.3.x (after the Delta Lake release)
  • Have SparkOperator delete the Spark manifest after COMPLETED
  • Have SparkOperator surface SparkApplication error status in the Airflow UI
  • Rebuild the Confluent Kafka stack for ARM64 to support Apple Silicon (M1)
