A 64-bit virtual machine for Machine Learning/Data Science tasks. Generated and provisioned with Vagrant.
This instance builds on the `spark-base64`
VM (which already provides all
the needed software packages, on an Ubuntu 22.04 box). On top of that, it
configures and launches a Jupyter Notebook process, exported as an HTTP service
to a local port. It allows creating notebooks with four different kernels:
- Python 3.10 (plain Python, with additional libraries such as NumPy, SciPy, Pandas, Matplotlib, Scikit-learn, etc),
- Pyspark (Python 3.10 + libraries + Spark),
- Scala 2.12 (with the capability of connecting to Spark)
- R (with SparkR available, though not loaded by default).
The repository also contains a number of small example notebooks.
The contents of the VM are:
- Apache Spark 3.4.1
- Python 3.10
- A virtualenv for Python 3.10 with a scientific Python stack (scipy, numpy, matplotlib, pandas, statsmodels, scikit-learn, gensim, xgboost, networkx, seaborn, pylucene and a few others) plus IPython 8 + Jupyter notebook
- R 4.3 with a few packages installed (rmarkdown, magrittr, dplyr, tidyr, data.table, ggplot2, caret, plus their dependencies). Plus SparkR & sparklyr for interaction with Spark.
- Spark notebook Kernels for Python 3.10, Scala (Almond) and R (IRKernel), in addition to the default "plain" (i.e. non-Spark capable) Python 3.10 kernel.
- A few small notebook extensions (only for the "classic" notebook interface)
- A notebook startup daemon script
- RStudio Server (as an optional provisioning)
Important: the default Python kernel for notebooks is not Spark-aware.
To develop notebooks in Python for Spark, the "Pyspark (Py 3)" kernel must be
explicitly selected. Hence Spark Python notebooks that were created elsewhere
(or with former versions of this VM) will not work initially.
They can be made to work by changing their kernel (use the option in the menubar)
to the "Pyspark" kernel. Once done, the change is stored in the notebook, so
saving it will make it work in future executions.
Note: this VM will not work on Apple Silicon (M1/M2) Mac computers; it needs an Intel-based computer.
- Hardware & OS: a computer with enough free RAM (at least 2 GB of free RAM for the guest OS is advisable) and around 10 GB of free hard disk space, running 64-bit Windows (7 or above), 64-bit Linux (Ubuntu, RedHat/CentOS, etc.) or Mac OS X.
- Software: the following must be installed on the computer (a quick way to check that they are present is shown below):
  - VirtualBox 6.0 or above (if possible, use the latest version available)
  - Vagrant 2.0 or above (if possible, use the latest version available)
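As a quick sanity check, both tools can be queried from a terminal on the host (the exact version numbers will vary):

```bash
# Check that both tools are installed and reachable from the command line
vagrant --version
VBoxManage --version
```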
- Copy the Vagrantfile + examples into the computer, either by cloning the repository or by downloading and extracting all files in the packaged ZIP file. Make sure to use a disk or partition with the mentioned 10 GB of free space. Also, on Windows it might be advisable to avoid using a folder name with spaces (since it sometimes causes problems).
- If desired, open the Vagrantfile with a text editor and customize the options at the top of the file; see the relevant comments. Especially interesting might be the amount of RAM/CPUs assigned to the Virtual Machine and, if access to a remote Spark cluster is sought, the IP address of the cluster. Another configurable value is the notebook access password (in the `vm_password` variable). Note that no customization is needed to make the VM work (i.e. it will happily work with no changes to the Vagrantfile).
- Open a console/terminal, move (`cd`) to the folder where the Vagrantfile is located and execute a `vagrant up` command (see the example sequence below).
- Vagrant should launch the process and download the base box from the public repository (this takes time, depending on your network bandwidth, but it is only done once).
- Then the VM will be started and provisioned. The process will print progress messages to the terminal.
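As an illustration, a typical first run looks like the sketch below (the folder name `ml-vm` is just an example; use wherever the Vagrantfile was copied):

```bash
# Move to the folder that contains the Vagrantfile ("ml-vm" is just an example name)
cd ml-vm
# First run: downloads the base box, then creates and provisions the VM
vagrant up
# Optional: confirm that the VM is running
vagrant status
```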
The base box (the one that was created by the base repository) must be downloadable when provisioning this VM. The default URL in the Vagrantfile points to a box publicly available in VagrantCloud, so there should be no problem as long as there is a working Internet connection.
Once installation finishes, a notebook server will be running on `http://localhost:8008`. Use a Web browser to access it. By default it is also accessible externally, i.e. other machines in the network can also connect to it (unless the host computer has a firewall that blocks port 8008).

Using the notebook interface will require the access password defined in the Vagrantfile (unless changed before installation, it will be `vmuser`).
The additional files in the ZIP create a layout for sharing content between the host and the VM:

- The `vmfiles` subfolder in the host is configured to be mounted inside the VM as the `/vagrant` directory, so anything in that subfolder can be accessed within the VM.
- Furthermore, the notebook server is configured to browse the files in the `vmfiles/IPNB` subdirectory (it appears under the notebook interface as the `host` folder), so to add notebooks place them in that subdirectory (see the example below). A few example mini-notebooks are already provided there.
- Finally, the `vmfiles/hive` subdirectory is the place configured in Spark SQL for its metastore & tables (so it should survive changes to the VM).
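For instance, a notebook created on the host can be made visible in the web interface simply by copying it into that shared subfolder (the notebook name below is just a placeholder):

```bash
# On the host, from the folder that contains the Vagrantfile:
# copy a notebook into the shared subfolder so it appears under "host"
cp my_analysis.ipynb vmfiles/IPNB/
```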
The Jupyter notebook server starts automatically. It can be managed (start/stop/restart) in a VM console session (see below) via:
`sudo systemctl (start | stop | restart) notebook`
For diagnostics, in addition to the messages appearing directly on notebooks,
logfiles are generated in `/var/log/ipnb` inside the VM:

- `/var/log/ipnb/jupyter-notebook.out` contains log messages from the Jupyter server
- `/var/log/ipnb/spark.log` contains log messages from Spark
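For example, from a VM console session the service can be restarted and both logfiles followed like this:

```bash
# Inside a VM console session: restart the notebook service and follow the logs
sudo systemctl restart notebook
sudo systemctl status notebook
tail -f /var/log/ipnb/jupyter-notebook.out /var/log/ipnb/spark.log
```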
In addition to creating, editing and executing Notebooks via the Web interface, there may also be the need to operate through a console session. There are three ways to obtain console access to the VM:
- By using the console option in the Notebook Web interface (use the right menu: New -> Terminal)
- By opening a text terminal on the host machine, changing to the directory where the Vagrantfile is located and executing `vagrant ssh`
- By using a standard SSH application (`ssh` in Linux, in Windows e.g. PuTTY); see the example command below. Connect to the following destination:
  - Host: 127.0.0.1 (localhost)
  - Port: 2222
  - User: `vagrant`
  - Password: `vagrant`
In the first two cases the user logged in the console session will be `vmuser` (or, if that was changed in the Vagrantfile, the username defined there). In the last case it will be `vagrant`.
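For the third option, the equivalent command from a Linux or macOS host would be (enter `vagrant` when prompted for the password):

```bash
# Connect to the VM through the forwarded SSH port (password: vagrant)
ssh -p 2222 vagrant@127.0.0.1
```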
- The `vmuser` user is intended to execute processing tasks (and is the one running the Jupyter Notebook server), including Spark command-line applications such as `spark-submit`, `spark-shell`, `pyspark`, etc., as well as Python commands (use either `ipython` or `python`, which point to the virtualenv where everything is installed).
- The `vagrant` user is intended for administrative tasks (and is the owner of all the installed Python & Spark stack).
Both users have `sudo` permissions, in case it is needed for system
administration tasks.
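As a minimal illustration, a console session as `vmuser` can check that both stacks are reachable (the import list is just an example):

```bash
# Spark command-line tools are on the PATH
spark-submit --version
# "python" points to the virtualenv containing the scientific stack
python -c "import numpy, pandas, sklearn; print('scientific stack OK')"
```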
Once we have a console session, it can also be used to edit the files inside
the VM. For this the VM includes the standard `vi` and `emacs` editors (for
text consoles), as well as `nano`, a lightweight editor. Note that files in the
`/vagrant` directory can be edited directly on the host, since it is a mounted
folder.
Inside the VM (i.e. as seen from within a console session), Spark is installed
in `/opt/spark/current/`. The two important Spark config files are:

- `/opt/spark/current/conf/spark-env.sh`: environment variables used by the Spark deployment inside the VM
- `/opt/spark/current/conf/spark-defaults.conf`: configuration properties used by Spark
The Spark kernel in Jupyter Notebook launches with the configuration defined by those two files. If they are changed, Spark kernels currently running will need to be restarted to make them read the new values.
Command-line Spark processes launched from a console session (through
`spark-submit`, `pyspark`, etc.) also use the same configuration, unless
overridden by command-line arguments.
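For example, a single run can override a property otherwise taken from `spark-defaults.conf` (the property value and script name below are illustrative):

```bash
# Override a single property for this run; everything else still comes
# from spark-defaults.conf (value and script name are illustrative)
spark-submit --conf spark.driver.memory=2g my_job.py
```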
An additional configuration file is `/opt/spark/current/conf/hive-site.xml`.
This one is used by Spark SQL to determine the behaviour related to Hive tables.
In particular, two important bits defined there are the location of the
metastore DB (the VM uses Derby as a local metastore) and the place where
system tables will be written. Both places are configured to subfolders of the
`./vmfiles/hive` directory in the host (mounted in the VM).
Note that due to Derby being a single-process DB, only one Spark SQL process may be active at any given moment (since it would collide over the use of the metastore). This means that there should be only one active Notebook (or Spark shell) making use of the Spark SQL engine.
To allow more than one Notebook at the same time, the
`javax.jdo.option.ConnectionURL` configuration property in `hive-site.xml`
should be commented out; this would revert Spark to the default behaviour of
creating the metastore in the local directory in which the Notebook is running
(hence there can be more than one active Notebook running Spark SQL, provided
that they are in different directories).
The VM has not been secured. The base box comes with the default Vagrant key, and this is overwritten with a newly generated private key when the VM is launched for the first time, so key-based SSH connections are reasonably secure. But there are a number of security holes, among them:

- Both the `root` and `vagrant` users use `vagrant` as password.
- The Jupyter notebook server listens on port 8008 with a very trivial password (if the host computer has no firewall, it can be accessed from anywhere).
There are a number of packages that have been defined in the Vagrantfile but
that are not automatically installed. Instead, they must be explicitly
provisioned with:

`vagrant provision --provision-with <name>`

where `<name>` is one of the following:
| name | package contents |
|---|---|
| graphframes | Activate/deactivate GraphFrames in Spark config |
| rstudio | RStudio Server (see note below) |
| nbc | Notebook convert (functionality for Notebook conversion to document formats: LaTeX & PDF) |
| nbc.es | Configure Notebook conversion to documents for Spanish |
| mvn | Maven build automation tool for Java |
| scala | Scala & SBT. Note that this is not needed to execute Scala code in the provided Jupyter kernel; it is for standalone Scala programs |
| nlp | Some additional Python packages for Natural Language Processing |
| dl | Deep Learning libraries (TensorFlow, PyTorch) |
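For example, to add the Deep Learning libraries to an already running VM:

```bash
# Run on the host, from the folder containing the Vagrantfile
vagrant provision --provision-with dl
```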
Note: for RStudio it is also necessary that Vagrant opens port 8787 (this is
defined in the Vagrantfile). Once open (and the VM reloaded), the user/password
combination to be used is `vmuser` & `vmuser`.