Example Airflow DAG and Spark Job for Google Cloud Dataproc
This example demonstrates basic functionality within Airflow for managing Dataproc Spark clusters and Spark jobs. It also demonstrates usage of the BigQuery Spark Connector.
- Uses Airflow DataProcHook and Google Python API to check for existence of a Dataproc cluster
- Uses Airflow BranchPythonOperator to decide whether to create a Dataproc cluster
- Uses Airflow DataprocClusterCreateOperator to create a Dataproc cluster
- Uses Airflow DataProcSparkOperator to launch a Spark job
- Uses Airflow DataprocClusterDeleteOperator to delete the Dataproc cluster
- Creates a Spark Dataset from a BigQuery table using the BigQuery Spark Connector
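The existence check and branch (the first two bullets above) boil down to a callable that returns the `task_id` of the next task to run. A minimal sketch follows; the task ids (`create_cluster`, `run_spark_job`) and the cluster-listing helper are illustrative assumptions, not necessarily the names used in `dataproc_dag.py`:

```python
# Decision logic for a BranchPythonOperator: the callable returns the
# task_id of the downstream task that should execute next.
# Task ids below are assumptions for illustration.

def choose_cluster_branch(existing_cluster_names, cluster_name):
    """Skip cluster creation when the Dataproc cluster is already up."""
    if cluster_name in existing_cluster_names:
        return 'run_spark_job'   # cluster found: go straight to the Spark job
    return 'create_cluster'      # not found: branch to DataprocClusterCreateOperator

# In the DAG this would be wired up roughly as (list_clusters() is a
# hypothetical helper that queries the Dataproc API for cluster names):
#   BranchPythonOperator(
#       task_id='check_cluster',
#       python_callable=lambda: choose_cluster_branch(list_clusters(), 'example-cluster'),
#   )

if __name__ == '__main__':
    print(choose_cluster_branch(['example-cluster'], 'example-cluster'))  # run_spark_job
    print(choose_cluster_branch([], 'example-cluster'))                   # create_cluster
```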
- Run `sbt assembly` in `spark-example` to create an assembly jar
- Upload the assembly jar to GCS
- Add `dataproc_dag.py` to your dags directory (`/home/airflow/airflow/dags/` on your Airflow server, or your dags directory in GCS if using Cloud Composer)
- In the Airflow UI, set the following variables:
  - `project`: GCP project id
  - `region`: GCP region (e.g. `us-central1`)
  - `subnet`: VPC subnet id (the short id, not the full URI)
  - `bucket`: GCS bucket
  - `prefix`: GCS prefix (e.g. `/dataproc_example`)
  - `dataset`: BigQuery dataset
  - `table`: BigQuery table
  - `jarPrefix`: GCS prefix where you uploaded the assembly jar
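Instead of entering each variable by hand in the UI, the same set can be loaded in bulk from a JSON file with the Airflow CLI's variables import command. The values below are placeholders to replace with your own:

```json
{
  "project": "my-gcp-project",
  "region": "us-central1",
  "subnet": "default",
  "bucket": "my-bucket",
  "prefix": "/dataproc_example",
  "dataset": "my_dataset",
  "table": "my_table",
  "jarPrefix": "/dataproc_example/jars"
}
```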
Apache License 2.0