spark-glue-data-catalog

This project builds Apache Spark in a way that is compatible with the AWS Glue Data Catalog.

It was mostly inspired by awslabs' GitHub project awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore and by its various issues and user feedback.

⚠️ This is neither official nor officially supported: use at your own risk!

Usage prerequisites

AWS credentials

You must provide AWS credentials to the master/executor nodes via the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION so that Spark can access the AWS APIs.
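
As a minimal sketch (assuming you drive Spark from Python and exported the variables before launch; the assertion loop and app name are purely illustrative, not part of this project), you can verify the credentials and create a Glue-aware session like this:

import os
from pyspark.sql import SparkSession

# The Glue-compatible metastore client reads these from the environment;
# they must be present on the driver and on every executor node.
for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"):
    assert os.environ.get(var), f"{var} is not set"

# Hive support must be enabled so Spark talks to the Glue Data Catalog
# instead of its default in-memory catalog.
spark = (
    SparkSession.builder
    .appName("glue-catalog-example")
    .enableHiveSupport()
    .getOrCreate()
)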

IAM permissions

Here is an example set of Glue permissions that allows Spark to access a hypothetical db1.table1 table in the Glue Data Catalog:

{
  "Effect": "Allow",
  "Action": [
    "glue:*Database*",
    "glue:*Table*",
    "glue:*Partition*"
  ],
  "Resource": [
    "arn:aws:glue:us-west-2:123456789012:catalog",

    "arn:aws:glue:us-west-2:123456789012:database/db1",
    "arn:aws:glue:us-west-2:123456789012:table/db1/table1",

    "arn:aws:glue:us-west-2:123456789012:database/default",
    "arn:aws:glue:us-west-2:123456789012:database/global_temp",
    "arn:aws:glue:us-west-2:123456789012:database/parquet"
  ]
}

Note: the last 3 resources (the default, global_temp and parquet databases) are mandatory for the Glue-compatible Hive connector.

Don't forget to also add S3 IAM permissions so that Spark can fetch the table data!
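
Once the Glue and S3 permissions are in place, the table is reachable through the regular Spark SQL catalog; a short sketch, reusing the hypothetical db1.table1 from the policy above and the spark session from the earlier snippet:

# List the Glue databases visible to the session, then read the example table.
spark.sql("SHOW DATABASES").show()
df = spark.table("db1.table1")
df.printSchema()
df.limit(10).show()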

GCP BigQuery/GCS credentials

You must provide a valid path to a GCP service account key file via the GOOGLE_APPLICATION_CREDENTIALS environment variable. Otherwise, you have to manually set an access token after the Spark context is created, using

spark.conf.set("gcpAccessToken", "<access-token>")
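
If the bundled connector is the open-source spark-bigquery connector (an assumption, not stated by this project), a BigQuery table can then be read with the usual data source API; the project, dataset and table names below are placeholders:

# Hypothetical read through the spark-bigquery data source; requires either
# GOOGLE_APPLICATION_CREDENTIALS or the gcpAccessToken conf shown above.
bq_df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.my_table")
    .load()
)
bq_df.show()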

Current release

Miscellaneous

Build spark-glue-data-catalog locally

You need Docker and Docker Compose.

Just run make build. The Spark bundle artifact is produced in the dist/ directory.

Use in Jupyter notebook

To use this build of PySpark in Jupyter, you need to declare a new dedicated kernel.

We assume you installed Spark in the /opt directory and symlinked it to /opt/spark.

Create a kernel.json file somewhere with the following content:

{
  "display_name": "PySpark",
  "language": "python",
  "argv": [
    "/opt/conda/bin/python",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/opt/spark",
    "PYTHONPATH": "/opt/spark/python/:/opt/spark/python/lib/py4j-0.10.7-src.zip",
    "PYTHONSTARTUP": "/opt/spark/python/pyspark/shell.py",
    "PYSPARK_PYTHON": "/opt/conda/bin/python"
  }
}

Then run jupyter kernelspec install {path to the directory containing kernel.json}.
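
Because PYTHONSTARTUP points to pyspark/shell.py, a notebook started with this kernel already has a spark session available; a quick sanity check (the database names you see will depend on your Glue catalog) could be:

# Run inside a notebook using the PySpark kernel declared above.
print(spark.version)                  # Spark build shipped in /opt/spark
spark.sql("SHOW DATABASES").show()    # should list your Glue databases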

References
