
This document summarizes several architectures developed at Arimo for training Deep Learning models on top of Spark.

By:

  • Christopher Nguyen
  • Christopher Smith
  • Ushnish De
  • Vu Pham

1. Introduction

Deep Learning models are generally trained using Backpropagation, an algorithm for computing the error “signal” of each parameter in the model; this error signal is then used to adjust the parameters according to various rules generally known under the umbrella term Gradient Descent. The training process is inherently iterative: training “batches” are processed sequentially. A Deep Learning model typically has thousands or millions of parameters, depending on the architecture of the model and the training set. Training Deep Learning models on big data is therefore challenging due to the computational requirements.

The default implementation of Gradient Descent in Spark performs the gradient computation on the workers, while the error signals are accumulated at the driver, which also updates the model. The model parameters are generally stored in a shared variable, and each training iteration is done in one map-reduce step over a (relatively big) fraction of the whole training set.
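As a rough illustration of this synchronous data flow (not MLlib's actual code; `training_samples`, `num_features`, `num_iterations`, and `gradient` are assumed placeholders), one map-reduce step per model update looks like this:

```python
# Minimal sketch of the synchronous map-reduce pattern described above.
# `gradient(weights, sample)` stands in for a user-supplied gradient function.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="sync-gradient-descent")
data = sc.parallelize(training_samples)   # assumed list of (features, label) pairs
weights = np.zeros(num_features)          # assumed model size
learning_rate = 0.01

for step in range(num_iterations):
    bc_weights = sc.broadcast(weights)    # ship the current model to the workers
    # map: each worker computes gradients on its samples; reduce: sum at the driver
    grad_sum, count = (
        data.map(lambda s: (gradient(bc_weights.value, s), 1))
            .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    )
    weights -= learning_rate * grad_sum / count   # the driver updates the model
```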

While this approach is generic and works for a wide variety of problems, updating the weights only after each map-reduce step can be inefficient, because each error signal is computed over a large sample of the training set. It is well known in the literature that the model converges faster when the training batches used in Gradient Descent are small (typically several hundred samples per batch). However, using small training batches on Spark with the above algorithm is inefficient, because the computation time per batch is too small compared to the time spent communicating over the wire between the workers and the driver.

Moreover, the synchronous nature of the above algorithm, in which the model is only updated after each map-reduce job, is also not very efficient. We believe this can be improved with a slightly different approach.

2. Alluxio coprocessor

In most distributed systems, collaboration between machines is achieved either through some form of shared memory (to which all machines have access), through message-passing protocols, or through a mix of the two. Alluxio (née Tachyon) is a distributed storage system operating at memory speed, organized as a file system (think HDFS in memory). With this design, one can easily adopt Alluxio as shared memory between all the Spark workers.

However, we believe Alluxio can do more than pure storage. We built a prototype and ran experiments with a new system called the “Tachyon coprocessor”. Inspired by a similar technique in distributed database management systems, a coprocessor is a process that runs on the Tachyon workers and is triggered when certain events happen on specific files stored in Tachyon. These events include file creation, deletion, and update, and a coprocessor can be set to monitor a single file or a whole directory. Many coprocessors can be added to a Tachyon cluster.

Using the Tachyon coprocessor, we can train Deep Learning models more efficiently [1]. The whole system is depicted in the following figure.

The model parameters are stored as a file named model.bin in Alluxio, and each Spark worker reads this model at the beginning of its task. Working on its own data partition, each worker splits the data into multiple small batches (typically 100-200 training samples each), computes the error derivatives on each batch, and stores them back in a designated directory in Tachyon. To balance the costs of computation and communication, the workers only send the accumulated derivatives every nsend iterations, and fetch the (new) model parameters from Tachyon every nfetch iterations.
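The worker-side loop can be sketched as follows (a simplified illustration of the logic just described; `read_model`, `write_derivatives`, and `compute_gradients_on` are hypothetical placeholders for Alluxio I/O and the model code, and `n_send`/`n_fetch` correspond to nsend/nfetch above):

```python
# Sketch of the per-partition worker loop (illustrative only, not the production code).
import numpy as np

def minibatches(samples, batch_size):
    """Yield consecutive small batches from a worker's data partition."""
    for i in range(0, len(samples), batch_size):
        yield samples[i:i + batch_size]

def train_partition(partition, n_send, n_fetch, batch_size=200):
    weights = read_model("/alluxio/model.bin")            # initial fetch of the model
    accumulated = np.zeros_like(weights)
    for i, batch in enumerate(minibatches(partition, batch_size), start=1):
        accumulated += compute_gradients_on(weights, batch)
        if i % n_send == 0:
            # write accumulated derivatives into the directory the coprocessor watches
            write_derivatives("/alluxio/derivatives/", accumulated)
            accumulated[:] = 0.0
        if i % n_fetch == 0:
            weights = read_model("/alluxio/model.bin")     # pick up the latest parameters
```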

On the Alluxio side, a Coprocessor (called Observer in the diagram) monitors the directory of derivatives files. For every new file written to this directory, the Coprocessor adjusts the model and updates the model.bin file. The new parameters in model.bin are then picked up by the Spark workers the next time they fetch the model.
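The Observer's update step amounts to something like the following (a sketch only; the actual coprocessor API of our prototype is not shown, and `on_file_created`, `read_derivatives`, `read_model`, and `write_model` are hypothetical names):

```python
# Sketch of the model-update logic running inside the coprocessor (Observer).
import numpy as np

LEARNING_RATE = 0.01

def on_file_created(path):
    """Hypothetical callback fired when a worker writes a new derivatives file."""
    derivatives = read_derivatives(path)             # gradients from one worker
    weights = read_model("/alluxio/model.bin")
    weights -= LEARNING_RATE * derivatives           # plain SGD update as an example
    write_model("/alluxio/model.bin", weights)       # workers see this on their next fetch
```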

In practice, we very often deploy Alluxio on the same physical cluster as Spark, but Alluxio can of course be installed on a separate cluster.

With this approach, the Spark workers only perform the gradient computation, while the model update is delegated to the coprocessor running on Alluxio. Because the model is updated more frequently (after every nsend iterations on each worker), training converges faster.

Our contribution is two-fold:

  • For the first time, we demonstrated the concept of a coprocessor on Alluxio and showed that Alluxio can perform computation beyond pure storage. This can be helpful for use cases where users want to validate data as it is read or written, or run custom logic directly inside the storage system.
  • By off-loading some computation from Spark to Alluxio, we neatly solve the problem of sharing the model between the workers. In this Deep Learning use case, Alluxio acts as a Parameter Server, and the model is updated asynchronously, leading to a faster convergence rate.

3. Alluxio as a shared storage system

Alluxio can also simply be used as a shared storage system to which all the Spark workers and the driver write data and from which they read it.

In this architecture, before each map-reduce iteration, the driver writes the model to Alluxio, from which it is read by the workers. The workers fetch the model, compute the gradients, and perform a reduce phase to produce the accumulated derivatives, which are also stored on Alluxio. The driver then reads the accumulated derivatives, updates the model, and writes the new model to Alluxio for the next iteration.
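In pseudocode form, the driver loop looks roughly like this (a sketch only; `write_model`, `read_model`, `write_derivatives`, `read_derivatives`, and `compute_gradients_on` are hypothetical placeholders, `data` is the training RDD, and `num_features`/`num_iterations` are assumed constants):

```python
# Sketch of the driver loop in the shared-storage architecture (illustrative only).
import numpy as np

weights = np.zeros(num_features)          # assumed model size
learning_rate = 0.01

for step in range(num_iterations):
    write_model("/alluxio/model.bin", weights)        # driver publishes the current model
    # map: every worker reads model.bin and computes gradients on its own partition
    partial = data.mapPartitions(
        lambda part: [compute_gradients_on(read_model("/alluxio/model.bin"), list(part))]
    )
    # reduce: accumulate the per-partition derivatives; keep the result on Alluxio
    write_derivatives("/alluxio/derivatives.bin", partial.reduce(lambda a, b: a + b))
    # driver reads the accumulated derivatives back and updates the model
    weights -= learning_rate * read_derivatives("/alluxio/derivatives.bin")
```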

This is the traditional map-reduce approach to Gradient Descent on Spark. It is not much different from the default implementation of Gradient Descent in Spark MLlib, except that all the intermediate results are now written to Alluxio.

4. Training Tensorflow models on Spark

In this architecture [2], we use the Parameter Server design pattern, computing gradients on the Spark workers and exchanging them with a Parameter Server running on the Spark driver. Each map-reduce operation gives each worker a partition of the dataset. Each worker computes gradients on its partition in minibatches and sends them back to the Parameter Server over an out-of-band communication channel such as HTTP or WebSockets. The Parameter Server combines the gradients and applies them to its copy of the model, which is then shared with the workers via the same channel. This communication can be synchronous or asynchronous, each mode having certain advantages and disadvantages.

The relevant TensorFlow functions here are compute_gradients for the workers and apply_gradients for the parameter server. Essentially, we take the gradient descent method, split it into two steps, “Compute Gradients” followed by “Apply Gradients (Descent)”, and insert a network boundary between them.
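In TensorFlow 1.x terms, the split looks roughly like this (a sketch only; `loss` is an assumed model loss tensor, and the serialization and WebSocket transport between the two sides are omitted):

```python
# Sketch of splitting TensorFlow's optimizer step across a network boundary.
import tensorflow as tf

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)

# --- Worker side: compute gradients on a local minibatch ---------------------
grads_and_vars = optimizer.compute_gradients(loss)   # list of (gradient, variable) pairs
# ... serialize the gradient values and send them to the Parameter Server ...

# --- Parameter Server side: apply the received gradients to its model copy ---
train_op = optimizer.apply_gradients(grads_and_vars)
# ... run train_op, then ship the updated variable values back to the workers ...
```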

The driver and workers send and receive data via websockets.

References

[1]: First-ever scalable, distributed deep learning architecture using Spark and Tachyon, Strata+Hadoop World, Sept. 2015, New York.

[2]: Distributed TensorFlow on Spark: scaling Google's Deep Learning library, Spark Summit East, Feb. 2016, New York.