All the files necessary to use this project are available on OneDrive.
To set up this repository on your computer, do the following (a sample command sequence is shown after the list):
- Clone the repository to your computer
- Download the prepared volumes from OneDrive, and extract the zip file
- From the root of the project, run import/import.cmd
- Use the location of the extracted volumes as input, such as C:/Users/Name/Desktop/Volumes
- Start the cluster using start.cmd
- To stop the cluster, we recommend using stop.cmd to avoid corrupting data in HBase
- Wait for HDFS to exit safemode and HBase to initialize. This might take a few minutes
- You can now view different visualizations on localhost
- Go to the admin panel
- Ensure that all files necessary for the job are uploaded under "Upload" with the type "Spark application"
- If using the volumes provided on OneDrive, this has already been done
- If running jobs from scratch, upload the driver for each Python job to HDFS as a .py file
- The code for all Python files must be uploaded to the same directory as a single zip archive named files
- Jar libraries on which the Spark applications depend must also be uploaded to that directory; these jars are available on OneDrive (a sketch of these uploads is given after this list)
- Under "Submit Spark application", write the name of the job to execute, such as incident_aggregator and press "Submit"
- The status of the job is most easily tracked in Livy or via "Spark job status" on the admin page; a direct query against the Livy REST API is also sketched below
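For reference, a full setup in a Windows command prompt might look roughly like the following; the repository URL, folder names, and the safemode check are placeholders or assumptions, not exact project values:

```cmd
:: Clone the repository and move into it (URL is a placeholder)
git clone https://github.com/<your-fork>/<this-repo>.git
cd <this-repo>

:: Import the prepared volumes, giving the folder where the OneDrive zip was extracted as input
import\import.cmd
:: e.g. C:\Users\Name\Desktop\Volumes

:: Start the cluster
start.cmd

:: Optional: check whether HDFS has left safemode (requires an hdfs client, e.g. inside the namenode container)
hdfs dfsadmin -safemode get

:: Stop the cluster cleanly when you are done, to avoid corrupting data in HBase
stop.cmd
```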
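If you are preparing the job files yourself, the uploads described above can also be done with plain HDFS commands. In the sketch below, the /jobs directory, the driver and jar names, and the files.zip archive name are assumptions; use the directory that the admin panel's "Upload" feature targets and the actual jars from OneDrive:

```cmd
:: Upload the driver for a Python job as a .py file (directory and file name are examples)
hdfs dfs -mkdir -p /jobs
hdfs dfs -put incident_aggregator.py /jobs/

:: Upload the code for all Python files as a single zip archive named files (files.zip assumed)
hdfs dfs -put files.zip /jobs/

:: Upload the jar libraries the Spark applications depend on (available on OneDrive)
hdfs dfs -put shc-core.jar /jobs/
```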
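Livy can also be queried directly over its REST API; the example below assumes Livy is reachable on localhost at its default port 8998:

```cmd
:: List all batch sessions known to Livy
curl http://localhost:8998/batches

:: Check the state of a specific batch by its id (0 is just an example)
curl http://localhost:8998/batches/0/state
```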
For your convenience, all images have been uploaded to DockerHub.
If, for some reason, you wish to build images yourself, you must download the SHC connector from OneDrive and place it under pysparkApp/. This file is too large for GitHub, and our public fork does not permit the use of LFS.
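A possible local build flow, assuming pysparkApp/ contains a Dockerfile and the cluster is defined with Docker Compose (adjust names to the actual setup), is:

```cmd
:: Place the SHC connector jar from OneDrive under pysparkApp/ before building

:: Build all images locally instead of pulling them from DockerHub
docker-compose build

:: Or build only the PySpark application image (the tag is just an example)
docker build -t pysparkapp ./pysparkApp
```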