
Final Distributed Benchmarking Results for Sparse


Setup

  • We have launched a cluster of 16 machines using the CloudFormation template; you can find the template here.
  • Below we show experiments for different numbers of machines. Each machine runs one worker process and one server process. For example, a single-machine run has one worker and one server process on that machine; a two-machine run has a worker and a server process on each of the two machines.
  • The AMI used is ami-153b1b70, which is built on top of the Deep Learning AMI.
  • The mxnet repo sits on EFS, which is mounted on each of the machines (EC2 instances).
  • We run all the following experiments in the dist_async mode and measure the results. For every new experiment with a different dataset we used a different CFN cluster: after finishing the runs for 1, 2, 4, 8, 12 and 16 machines on one dataset, we terminate the cluster and launch a new one.

Model and Experiment Details

  • We are using sparse_end2end.py for the benchmark.
  • The script contains a linear regression model (without the bias term) with softmax output and uses the SGD optimizer to update the weights. The input is a CSR NDArray and the weight matrix is in row_sparse (RSP) format; the operator used is dot(csr, rsp). We run the training for 10 epochs and measure the results. A minimal sketch of this computation is shown after this list. An example usage of the script is as follows:
python <MXNET_DIR>/benchmark/python/sparse/sparse_end2end.py --output-dim 1024 --batch-size=32784 --dataset=avazu --enable-logging-for=0,1,2,3,4,5,6,7,8,9,10,11 --kvstore=dist_async --num-epoch=10

The above command should be launched through the launch.py script as follows:

python <MXNET_DIR>/tools/launch.py -n 12 -H /opt/deeplearning/workers <ABOVE_COMMAND>
  • The first command runs the sparse benchmark on the avazu dataset with output_dim (the number of columns of the weight matrix) set to 1024 and batch_size (the number of rows of the input batch) set to 32784.

  • It also enables logging for workers 0 through 11 and runs the training in a distributed setting with the dist_async kvstore mode.

  • The second command launches the benchmark with 12 worker and 12 server processes on the hosts listed in /opt/deeplearning/workers. Here are the contents of the hosts file for 12 workers:

deeplearning-worker0
deeplearning-worker1
deeplearning-worker2
deeplearning-worker3
deeplearning-worker4
deeplearning-worker5
deeplearning-worker6
deeplearning-worker7
deeplearning-worker8
deeplearning-worker9
deeplearning-worker10
deeplearning-worker11
  • launch.py therefore starts a server and a worker process on each of the hosts. We use the same setup for all the experiments, so every host participating in distributed training runs 1 worker and 1 server process.

  • The commit id used for the experiment is 065adb3702c110af7b537799be3ec9c16c27a72b.

  • The experiment has been run 6 times for each num_machines row and the runtime results for workers have been averaged.
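Below is a minimal sketch of the core computation described above. It is illustrative only, not the actual sparse_end2end.py code; the shapes and random data are made up, and it assumes an MXNet build with sparse NDArray support:

```python
# Minimal sketch (not sparse_end2end.py): a CSR input batch multiplied with a
# row_sparse weight matrix via dot(csr, rsp), followed by a softmax output.
import mxnet as mx

batch_size, num_features, output_dim = 32, 1000, 8   # made-up shapes

# Dense random arrays converted to the sparse storage types used in the benchmark.
data = mx.nd.random.uniform(shape=(batch_size, num_features)).tostype('csr')
weight = mx.nd.random.uniform(shape=(num_features, output_dim)).tostype('row_sparse')

# Linear model without a bias term: dot(csr, rsp) produces a dense output.
out = mx.nd.sparse.dot(data, weight)
prob = mx.nd.softmax(out)
print(prob.shape)   # (32, 8)
```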

Important Note

The experiments were performed on a commit before the following PR was merged: https://github.com/apache/incubator-mxnet/pull/8111. One factor contributing to the variance in the number of batches between runs on a single worker is the reset() call made at the start of training, before a full pass of the data iterator has finished, which leads to a different num_batches across runs. The large variance in worker runtimes across workers in distributed training is caused by the data partitioning, which is not guaranteed to be even (https://github.com/apache/incubator-mxnet/pull/8111/files#diff-92c0e227fde4ecdd4d0aa9ac2a5529f9R225). See the full dataset experiment for kdda below: we still see a lot of variance among workers because the partitioning is not even.
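As a simplified illustration of the partitioning issue (a hypothetical helper, not the MXNet iterator or launcher code), workers can end up with different numbers of batches whenever the number of examples does not divide evenly:

```python
# Hypothetical illustration of uneven partitioning: when the number of
# examples is not divisible by the number of workers, the per-worker shares,
# and hence the number of batches per worker, can differ.
def batches_per_worker(num_examples, num_workers, batch_size):
    base, remainder = divmod(num_examples, num_workers)
    shares = [base + (1 if i < remainder else 0) for i in range(num_workers)]
    # ceiling division: number of batches each worker runs for its share
    return [(share + batch_size - 1) // batch_size for share in shares]

# Made-up sizes: one worker gets an extra batch, so it runs longer.
print(batches_per_worker(num_examples=1001, num_workers=4, batch_size=50))  # [6, 5, 5, 5]
```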

Results

Terminology

  • NUM_MACHINES: The number of machines to run the experiment on
  • SLOWEST RUNTIME AMONG WORKERS: max over all i of the WORKERi end-to-end runtimes
  • WORKERi: The end-to-end runtime for WORKERi (for 10 epochs), averaged over multiple runs.
  • SPEEDUP for [num_machines=m]: (SLOWEST RUNTIME AMONG WORKERS for num_machines=1) / (SLOWEST RUNTIME AMONG WORKERS for num_machines=m); a worked example follows this list.
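As a small worked example of these definitions, using the avazu "slowest runtime among workers" values reported in the table below:

```python
# Worked example of the SPEEDUP definition, using the avazu slowest-runtime
# values (seconds) from the results table below.
slowest = {1: 287.2880573, 2: 174.1945976, 4: 89.76220971,
           8: 55.76332435, 16: 32.00695248}

speedup = {m: slowest[1] / slowest[m] for m in slowest}
print(round(speedup[2], 2))    # ~1.65
print(round(speedup[16], 2))   # ~8.98, reported as ~9x
```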

avazu test dataset, batch_size: 32784, output_dim: 1024

| NUM_MACHINES | SLOWEST RUNTIME AMONG WORKERS (seconds) | SPEEDUP | WORKER0 (seconds) | WORKER1 (seconds) | WORKER2 (seconds) | WORKER3 (seconds) | WORKER4 (seconds) | WORKER5 (seconds) | WORKER6 (seconds) | WORKER7 (seconds) | WORKER8 (seconds) | WORKER9 (seconds) | WORKER10 (seconds) | WORKER11 (seconds) | WORKER12 (seconds) | WORKER13 (seconds) | WORKER14 (seconds) | WORKER15 (seconds) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 287.2880573 | 1 | 287.2880573 | | | | | | | | | | | | | | | |
| 2 | 174.1945976 | 1.649236321 | 171.6125228 | 174.1945976 | | | | | | | | | | | | | | |
| 4 | 89.76220971 | 3.200545733 | 89.7212379 | 89.76220971 | 89.15910661 | 88.05660012 | | | | | | | | | | | | |
| 8 | 55.76332435 | 5.151917692 | 42.84291658 | 55.76332435 | 42.58375612 | 43.24533948 | 44.04461628 | 41.84724376 | 43.5917654 | 41.57216197 | | | | | | | | |
| 16 | 32.00695248 | 8.975801663 | 22.85826961 | 21.71531868 | 23.15168015 | 23.01040534 | 22.99628973 | 22.96320085 | 22.58721086 | 23.12547255 | 23.48303775 | 32.00695248 | 22.98324096 | 22.46027581 | 22.49633149 | 22.81446465 | 27.68898551 | 23.36792783 |

Summary for avazu results

On a single worker, training takes 287 seconds, while on 16 workers the slowest worker takes 32 seconds. There is some variance in the worker runtimes: the fastest worker finishes about 10 seconds before the slowest. The speedup is 9x on 16 workers. Some possible causes of the imperfect scaling:

  • the output dimension (1024) is large, leading to much more network traffic
  • the avazu dataset is much denser than the others
  • we compute the speedup from the slowest worker's runtime; using the median worker runtime instead gives a speedup of about 12.5x (see the calculation after this list)
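For reference, the median-based calculation mentioned in the last bullet, using the avazu 16-worker runtimes from the table above:

```python
import statistics

# Avazu 16-worker runtimes (seconds) from the table above.
worker_runtimes = [22.85826961, 21.71531868, 23.15168015, 23.01040534,
                   22.99628973, 22.96320085, 22.58721086, 23.12547255,
                   23.48303775, 32.00695248, 22.98324096, 22.46027581,
                   22.49633149, 22.81446465, 27.68898551, 23.36792783]
single_machine_runtime = 287.2880573

# Speedup against the median worker instead of the slowest one.
print(round(single_machine_runtime / statistics.median(worker_runtimes), 1))  # ~12.5
```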

criteo test dataset, batch_size: 4096, output_dim: 32

| NUM_MACHINES | SLOWEST RUNTIME AMONG WORKERS (seconds) | SPEEDUP | WORKER0 (seconds) | WORKER1 (seconds) | WORKER2 (seconds) | WORKER3 (seconds) | WORKER4 (seconds) | WORKER5 (seconds) | WORKER6 (seconds) | WORKER7 (seconds) | WORKER8 (seconds) | WORKER9 (seconds) | WORKER10 (seconds) | WORKER11 (seconds) | WORKER12 (seconds) | WORKER13 (seconds) | WORKER14 (seconds) | WORKER15 (seconds) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 320.4591443 | 1 | 320.4591443 | | | | | | | | | | | | | | | |
| 2 | 134.2140288 | 2.38767249 | 134.2140288 | 133.7781797 | | | | | | | | | | | | | | |
| 4 | 68.54574005 | 4.675113932 | 66.73764873 | 68.54574005 | 67.78707397 | 67.79657364 | | | | | | | | | | | | |
| 8 | 37.26959534 | 8.598406862 | 37.09379182 | 35.75921435 | 37.26959534 | 35.10171499 | 36.77896743 | 34.71712542 | 35.5227128 | 35.78200464 | | | | | | | | |
| 16 | 28.23276761 | 11.35061035 | 22.65901732 | 22.82429329 | 22.53910393 | 23.29732177 | 23.63758925 | 22.87697506 | 22.94470572 | 28.23276761 | 21.31139401 | 20.89695154 | 21.73933918 | 20.36472532 | 20.60589845 | 21.196763 | 20.7089105 | 21.11429569 |

Summary for criteo results

On a single worker, training takes 320 seconds, while on 16 workers the slowest worker takes 28 seconds. The difference between the fastest and slowest workers is close to 8 seconds. The speedup is 11.5x on 16 workers.

kdda test dataset, batch_size: 4096, output_dim: 8

| NUM_MACHINES | SLOWEST RUNTIME AMONG WORKERS (seconds) | SPEEDUP | WORKER0 (seconds) | WORKER1 (seconds) | WORKER2 (seconds) | WORKER3 (seconds) | WORKER4 (seconds) | WORKER5 (seconds) | WORKER6 (seconds) | WORKER7 (seconds) | WORKER8 (seconds) | WORKER9 (seconds) | WORKER10 (seconds) | WORKER11 (seconds) | WORKER12 (seconds) | WORKER13 (seconds) | WORKER14 (seconds) | WORKER15 (seconds) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 322.3493165 | 1 | 322.3493165 | | | | | | | | | | | | | | | |
| 2 | 152.2661681 | 2.117012075 | 152.2661681 | 56.0047285 | | | | | | | | | | | | | | |
| 4 | 78.78860804 | 4.091318841 | 78.78860804 | 31.11844766 | 29.74631359 | 30.14538327 | | | | | | | | | | | | |
| 8 | 38.6655634 | 8.336858128 | 38.6655634 | 17.77295431 | 16.99732262 | 17.93868786 | 17.35414038 | 16.68588871 | 17.0220588 | 16.74347107 | | | | | | | | |
| 16 | 20.69872398 | 15.57339075 | 20.69872398 | 16.74933181 | 11.99427773 | 12.32047569 | 12.54608685 | 12.17584972 | 12.66074888 | 12.0352035 | 11.99997986 | 11.96278872 | 11.72612387 | 11.86328058 | 12.02945746 | 11.76552521 | 11.72321774 | 12.07822296 |

Summary for kdda results

For kdda we get the best speedup of 15x for 16 workers. The difference between fastest and slowest workers for kdda is close to 9 seconds.

Full dataset experiment for kdda

  • WORKERi: The end to end runtime for WORKERi (for 10 epochs) averaged over multiple runs.
| WORKER0 (seconds) | WORKER1 (seconds) | WORKER2 (seconds) | WORKER3 (seconds) | WORKER4 (seconds) | WORKER5 (seconds) | WORKER6 (seconds) | WORKER7 (seconds) | WORKER8 (seconds) | WORKER9 (seconds) | WORKER10 (seconds) | WORKER11 (seconds) | WORKER12 (seconds) | WORKER13 (seconds) | WORKER14 (seconds) | WORKER15 (seconds) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 305.645803928 | 332.092835903 | 334.780326128 | 334.998679161 | 343.547188997 | 344.171431065 | 343.63071084 | 337.439732075 | 349.839564085 | 351.485404968 | 354.654353857 | 346.672199011 | 357.655499935 | 366.253351927 | 361.332320929 | 374.920572042 |

Summary for full dataset experiment

There is a huge variance among workers when training on the full kdda dataset: the runtimes of the slowest and fastest workers differ by about 70 seconds.