Note: if you just want to set up a running Spark virtual machine, you do not need this project. Use the ml-notebook project instead; it will download the packaged base box and launch the VM automatically. This project is for building the VM from scratch.
This project contains the files needed to generate a virtual machine for Machine Learning/Data Science tasks. When the virtual machine is provisioned, every required piece of software is downloaded from the Internet. To see what is included in the virtual machine and what has changed between versions, look at the ChangeLog file.
The VM is managed through Vagrant. Software requirements for the host are:
- Vagrant 2.1 or above (if possible, use the latest version available)
- VirtualBox 6.0 or above
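The host requirements above can be checked from a terminal. The following is a minimal sketch, not part of this project: `version_ge` and `check_tool` are hypothetical helper names, and the comparison assumes a `sort` that supports `-V` (version sort, as in GNU coreutils).

```shell
#!/bin/sh
# version_ge A B: succeeds when dotted version A is >= B.
# Assumes `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_tool NAME MINIMUM COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND, takes the first dotted number in its
# output as the installed version, and reports how it compares to MINIMUM.
check_tool() {
  name=$1; min=$2; shift 2
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "$name: not found"
    return 0
  fi
  found=$("$@" | grep -Eo '[0-9]+(\.[0-9]+)+' | head -n 1)
  if version_ge "$found" "$min"; then
    echo "$name $found: OK (>= $min)"
  else
    echo "$name $found: too old (need >= $min)"
  fi
}

check_tool Vagrant 2.1 vagrant --version
check_tool VirtualBox 6.0 VBoxManage --version
```

The script only reports; it does not install or upgrade anything.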
The project creates a "base" VM, with all the needed software but not fully configured to work. A separate subproject, ml-notebook (included here as a submodule), takes care of configuring the VM as a Spark system accessed through Jupyter Notebook, using its own Vagrantfile. That subproject uses the "base" VM as the Vagrant box to start from.
So the complete creation is a two-step process:
- The first step takes place here, and the produced spark-base64 box is manually uploaded to Vagrant Cloud.
- The second step is implemented in the Vagrantfile in ml-notebook; it downloads the spark-base64 box from the cloud and finalizes the configuration.

    starting box ---> base box ---> final VM
    [ubuntu 22.04]    [spark-base64]
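The two steps can be sketched with standard Vagrant commands. This is an illustration under assumptions, not the project's documented procedure: the `run` helper and the `DRY_RUN` variable are hypothetical, the output file name `spark-base64.box` is assumed, and each Vagrantfile is assumed to define a single machine. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Hypothetical helper: print each command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Step 1, run from this project's directory: provision the base VM and
# export it as a box file. The file name is an assumption.
build_base_box() {
  run vagrant up &&
  run vagrant package --output spark-base64.box
}

# Step 2, run from the ml-notebook directory: its Vagrantfile fetches the
# spark-base64 box and finalizes the configuration.
build_final_vm() {
  run vagrant up
}

build_base_box
```

Uploading the resulting box to Vagrant Cloud stays a manual action between the two steps, so it is left out of the sketch.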
There is an additional submodule, nbextensions, which contains the Jupyter Notebook extensions that will be copied to the base VM. Note that by default they are not configured to be included automatically in notebooks; this is again taken care of in the Vagrantfile of the ml-notebook subproject.
The base Vagrantfile in this project is self-contained (downloads everything needed from public repositories), with a few exceptions.