This repository contains examples, exercises and tutorials for the Spark and Hadoop trainings given by dimajix. The latest version is always available on GitHub at
https://github.com/dimajix/spark-training
The repository contains different types of documents:
- Source Code for Spark/Scala
- Jupyter Notebooks for PySpark
- Zeppelin Notebooks for Spark/Scala
- Hive SQL scripts
- Pig scripts
- ...and much more
Some notebooks require test data, which dimajix provides on S3 at s3://dimajix-training/data/.
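As a rough sketch of how a PySpark notebook might address this data, the helper below builds a URI under the bucket named above and reads a CSV file from it. The function names, the file name `example.csv`, the `header` option and the default `s3a` scheme (the scheme used by Hadoop's S3A connector; some managed environments accept plain `s3` instead) are assumptions for illustration and are not taken from the repository.

```python
# Bucket and prefix as stated in this README.
BUCKET_ROOT = "dimajix-training/data"


def training_data_path(relative_path, scheme="s3a"):
    """Build a URI below the training data bucket.

    The default scheme "s3a" is an assumption; pass scheme="s3"
    if your environment expects that form.
    """
    return f"{scheme}://{BUCKET_ROOT}/{relative_path.lstrip('/')}"


def read_training_csv(spark, relative_path):
    """Read a CSV file from the training bucket.

    `spark` is an existing SparkSession (notebooks usually provide one).
    Treating the first line as a header is an assumption about the file.
    """
    path = training_data_path(relative_path)
    return spark.read.option("header", "true").csv(path)
```

In a notebook this might be called as `read_training_csv(spark, "example.csv")`, where `example.csv` is a placeholder and not a file known to exist in the bucket.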
The source code can be built using Maven, simply by running
mvn install
from the root directory.
Most code is provided either as interactive notebooks (Jupyter and/or Zeppelin) or as compilable programs. Programs that build jar files always come with start scripts, which set the required environment variables and Spark configuration properties.
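To illustrate what such a start script typically does, here is a minimal sketch in Python that assembles a `spark-submit` command line and an environment. It is not one of the repository's actual start scripts: the function names, jar path, class name and configuration values are hypothetical. Only the `spark-submit` options `--class`, `--master` and `--conf` are standard.

```python
import os


def build_spark_submit(jar, main_class, master="local[*]", conf=None, app_args=()):
    """Assemble a spark-submit command as a list of arguments.

    All values passed in are placeholders chosen by the caller.
    """
    cmd = ["spark-submit", "--class", main_class, "--master", master]
    # Each Spark configuration property becomes one --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(jar)
    # Remaining arguments are passed through to the application itself.
    cmd.extend(app_args)
    return cmd


def build_environment(extra=None):
    """Copy the current environment and overlay additional variables."""
    env = dict(os.environ)
    env.update(extra or {})
    return env
```

The resulting list and environment could then be handed to `subprocess.run(cmd, env=env, check=True)`. Building the command as a list rather than a single string avoids shell quoting problems with values such as `local[*]`.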