A simple Scala application to connect to a Spark cluster and provide a generic, robust API to tap into various Spark APIs. Furthermore, this project intends to provide the ability to send both packaged jars (standard jobs) and code snippets (with revision capability) for scenarios like IPython for dynamic updates. Finally, the kernel is written with the future plan to allow multiple applications to connect to a single kernel to take advantage of the same Spark context.
A Vagrantfile is provided to easily setup a development environment. You will
need to install Virtualbox 4.3.12+
and Vagrant 1.6.2+. Once Vagrant
and Virtualbox are installed, the environment can be created by running
vagrant up
. When the VM is running, you can get to the source code by
executing:
vagrant ssh
cd /src/spark-kernel
To build the kernel from source, you need to have sbt installed on your machine. Once it is on your path, you can compile the Spark Kernel by running the following from the root of the Spark Kernel directory:
sbt compile
The recommended configuration options for sbt are as follows:
-Xms1024M
-Xmx2048M
-Xss1M
-XX:+CMSClassUnloadingEnabled
-XX:MaxPermSize=1024M
The Spark Kernel uses ZeroMQ as the medium for communication. In order for the Spark Kernel to be able to use this protocol, the ZeroMQ library needs to be installed on the system where the kernel will be run. The Vagrant and Docker environments set this up for you.
For Mac OS X, you can use Homebrew to install the library:
brew install zeromq22
For aptitude-based systems (such as Ubuntu 14.04), you can install via apt-get:
apt-get install libzmq-dev
The Spark Kernel is provided as a series of jars, which can be executed on their own or as part of the launch process of an IPython notebook.
The following command line options are available:
- --profile - the file to load containing the ZeroMQ port information
- --help - displays the help menu detailing usage instructions
- --master - location of the Spark master (defaults to local[*])
Additionally, ZeroMQ configurations can be passed as command line arguments
- --ip
- --stdin-port
- --shell-port
- --iopub-port
- --control-port
- --heartbeat-port
Ports can also be specified as Environment variables:
- IP
- STDIN_PORT
- SHELL_PORT
- IOPUB_PORT
- CONTROL_PORT
- HB_PORT
We are utilizing xerial/sbt-pack to package and distribute the Spark Kernel.
You can run the following to package the kernel, which creates a directory in kernel/target/pack contains a Makefile that can be used to install the kernel on the current machine:
sbt kernel/pack
To install the kernel, run the following:
cd kernel/target/pack
make install
This will place the necessary jars in ~/local/kernel/current/lib and provides a convient script to start the kernel located at ~/local/kernel/current/bin/sparkkernel.
The Spark Kernel can be run in a docker container using Docker 1.0.0+. There is a Dockerfile included in the root of the project. You will need to compile and pack the Spark Kernel before the docker image can be built.
sbt compile
sbt pack
docker build -t spark-kernel .
After the image has been successfully created, you can run your container by executing the command:
docker run -d -e IP=0.0.0.0 spark-kernel
You must always include -e IP=0.0.0.0
to allow the kernel to bind to the
docker container's IP. The environment variables listed in the getting
started section can be used in the docker run command. This allows you to
explicitly set the ports for the kernel.
You must have SBT 0.13.5+ installed. From the command line, you can attempt
to run the project by executing sbt kernel/run <args>
from the root
directory of the project. You can run all tests using sbt test
(see
instructions below for more testing details).
For IntelliJ developers, you can attempt to create an IntelliJ project
structure using sbt gen-idea
.
There are four levels of test in this project:
-
Unit - tests that isolate a specific class/object/etc for its functionality
-
Integration - tests that illustrate functionality between multiple components
-
System - tests that demonstrate correctness across the entire system
-
Scratch - tests isolated in a local branch, used for quick sanity checks, not for actual inclusion into testing solution
To execute specific tests, run sbt with the following:
-
Unit -
sbt unit:test
-
Integration -
sbt integration:test
-
System -
sbt system:test
-
Scratch -
sbt scratch:test
To run all tests, use sbt test
!
The naming convention for tests is as follows:
-
Unit - test classes end with Spec e.g. CompleteRequestSpec
- Placed under com.ibm.spark
-
Integration - test classes end with SpecForIntegration e.g. InterpreterWithActorSpecForIntegration
- Placed under integration
-
System - test classes end with SpecForSystem e.g. InputToAddJarSpecForSystem
- Placed under system
-
Scratch
- Placed under scratch
It is also possible to run tests for a specific project by using the following syntax in sbt:
sbt <PROJECT>/test