
Final Distributed Benchmarking Results for Sparse


Setup

  • We have launched a cluster of 16 machines using the CloudFormation template; you can find the template here.
  • Below we show experiments for different numbers of machines. Each machine runs one worker process and one server process. For example, a single-machine run has one worker and one server process on that machine; a two-machine run has a worker and a server process on each of the two machines.
  • The AMI used is ami-153b1b70, which is built on top of the Deep Learning AMI.
  • The mxnet repo sits on EFS, which is mounted on each of the machines (EC2 instances).
  • We run all the following experiments in the dist_async mode and measure the results. For every new experiment with a different dataset we used a different CFN cluster: after finishing the runs for 1, 2, 4, 8, 12 and 16 machines on one dataset, we terminate the cluster and launch a new one.

Model and Experiment Details

  • We are using sparse_end2end.py for the benchmark.
  • The script contains a linear regression model (without the bias term) with softmax output and uses the SGD optimizer to update the weights. The input is a CSR NDArray and the weight matrix is in row_sparse (RSP) format; the operator used is dot(csr, rsp). We run the training for 10 epochs and measure the results. A minimal sketch of this computation is shown after this list. An example usage of the script is as follows:
python <MXNET_DIR>/benchmark/python/sparse/sparse_end2end.py --output-dim 1024 --batch-size=32784 --dataset=avazu --enable-logging-for=0,1,2,3,4,5,6,7,8,9,10,11 --kvstore=dist_async --num-epoch=10

The above command should be launched through the launch.py script as follows:

python <MXNET_DIR>/tools/launch.py -n 12 -H /opt/deeplearning/workers <ABOVE_COMMAND>
  • The first command runs the sparse benchmark on the avazu dataset with output_dim (the number of columns of the weight matrix) set to 1024 and batch_size (the number of rows of the input batch) set to 32784.

  • It also enables logging for workers 0 through 11 and runs the training in a distributed setting with the dist_async kvstore mode.

  • The second command launches the benchmark with 12 worker and 12 server processes on the hosts listed in /opt/deeplearning/workers. Here are the contents of the hosts file for 12 workers:

deeplearning-worker0
deeplearning-worker1
deeplearning-worker2
deeplearning-worker3
deeplearning-worker4
deeplearning-worker5
deeplearning-worker6
deeplearning-worker7
deeplearning-worker8
deeplearning-worker9
deeplearning-worker10
deeplearning-worker11
  • launch.py therefore starts a server and a worker process on each of the hosts. We use the same setup for all the experiments, so every host participating in distributed training runs 1 worker and 1 server process.

  • The commit id used for the experiment is 065adb3702c110af7b537799be3ec9c16c27a72b.

  • The experiment has been run 6 times for each num_machines row and the runtime results for workers have been averaged.
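Below is a minimal sketch of the core computation described above. It is illustrative only, not the actual sparse_end2end.py code; the shapes and random data are made up, and it assumes an MXNet build with sparse NDArray support:

```python
# Minimal sketch (not sparse_end2end.py): a CSR input batch multiplied with a
# row_sparse weight matrix via dot(csr, rsp), followed by a softmax output.
import mxnet as mx

batch_size, num_features, output_dim = 32, 1000, 8   # made-up shapes

# Dense random arrays converted to the sparse storage types used in the benchmark.
data = mx.nd.random.uniform(shape=(batch_size, num_features)).tostype('csr')
weight = mx.nd.random.uniform(shape=(num_features, output_dim)).tostype('row_sparse')

# Linear model without a bias term: dot(csr, rsp) produces a dense output.
out = mx.nd.sparse.dot(data, weight)
prob = mx.nd.softmax(out)
print(prob.shape)   # (32, 8)
```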

Important Note

The experiments were performed on a commit before the following PR was merged: https://github.com/apache/incubator-mxnet/pull/8111. One factor contributing to the variance in the number of batches between runs on a single worker is the reset() call made at the start of training, before a full pass of the data iterator has finished, which leads to a different num_batches across runs. The large variance in worker runtimes across workers in distributed training is caused by the data partitioning, which is not guaranteed to be even (https://github.com/apache/incubator-mxnet/pull/8111/files#diff-92c0e227fde4ecdd4d0aa9ac2a5529f9R225). See the full dataset experiment for kdda below: we still see a lot of variance among workers because the partitioning is not even.
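As a simplified illustration of the partitioning issue (a hypothetical helper, not the MXNet iterator or launcher code), workers can end up with different numbers of batches whenever the number of examples does not divide evenly:

```python
# Hypothetical illustration of uneven partitioning: when the number of
# examples is not divisible by the number of workers, the per-worker shares,
# and hence the number of batches per worker, can differ.
def batches_per_worker(num_examples, num_workers, batch_size):
    base, remainder = divmod(num_examples, num_workers)
    shares = [base + (1 if i < remainder else 0) for i in range(num_workers)]
    # ceiling division: number of batches each worker runs for its share
    return [(share + batch_size - 1) // batch_size for share in shares]

# Made-up sizes: one worker gets an extra batch, so it runs longer.
print(batches_per_worker(num_examples=1001, num_workers=4, batch_size=50))  # [6, 5, 5, 5]
```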

Results

Terminology

  • NUM_MACHINES: The number of machines to run the experiment on
  • SLOWEST RUNTIME AMONG WORKERS: max over all i of the WORKERi end-to-end runtimes
  • WORKERi: The end-to-end runtime for WORKERi (for 10 epochs), averaged over multiple runs.
  • SPEEDUP for [num_machines=m]: (SLOWEST RUNTIME AMONG WORKERS for num_machines=1) / (SLOWEST RUNTIME AMONG WORKERS for num_machines=m); a worked example follows this list.
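As a small worked example of these definitions, using the avazu "slowest runtime among workers" values reported in the table below:

```python
# Worked example of the SPEEDUP definition, using the avazu slowest-runtime
# values (seconds) from the results table below.
slowest = {1: 287.2880573, 2: 174.1945976, 4: 89.76220971,
           8: 55.76332435, 16: 32.00695248}

speedup = {m: slowest[1] / slowest[m] for m in slowest}
print(round(speedup[2], 2))    # ~1.65
print(round(speedup[16], 2))   # ~8.98, reported as ~9x
```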

avazu test dataset, batch_size: 32784, output_dim: 1024

| NUM_MACHINES | SLOWEST RUNTIME AMONG WORKERS (seconds) | SPEEDUP | WORKER0 (seconds) | WORKER1 (seconds) | WORKER2 (seconds) | WORKER3 (seconds) | WORKER4 (seconds) | WORKER5 (seconds) | WORKER6 (seconds) | WORKER7 (seconds) | WORKER8 (seconds) | WORKER9 (seconds) | WORKER10 (seconds) | WORKER11 (seconds) | WORKER12 (seconds) | WORKER13 (seconds) | WORKER14 (seconds) | WORKER15 (seconds) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 287.2880573 | 1 | 287.2880573 | | | | | | | | | | | | | | | |
| 2 | 174.1945976 | 1.649236321 | 171.6125228 | 174.1945976 | | | | | | | | | | | | | | |
| 4 | 89.76220971 | 3.200545733 | 89.7212379 | 89.76220971 | 89.15910661 | 88.05660012 | | | | | | | | | | | | |
| 8 | 55.76332435 | 5.151917692 | 42.84291658 | 55.76332435 | 42.58375612 | 43.24533948 | 44.04461628 | 41.84724376 | 43.5917654 | 41.57216197 | | | | | | | | |
| 16 | 32.00695248 | 8.975801663 | 22.85826961 | 21.71531868 | 23.15168015 | 23.01040534 | 22.99628973 | 22.96320085 | 22.58721086 | 23.12547255 | 23.48303775 | 32.00695248 | 22.98324096 | 22.46027581 | 22.49633149 | 22.81446465 | 27.68898551 | 23.36792783 |

Summary for avazu results

On a single worker, training takes 287 seconds, while on 16 workers the slowest worker takes 32 seconds. There is some variance in the worker runtimes: the fastest worker finishes about 10 seconds before the slowest. The speedup is 9x on 16 workers. Some possible causes of the imperfect scaling:

  • the output dimension (1024) is large, leading to much more network traffic
  • the avazu dataset is much denser than the others
  • we compute the speedup from the slowest worker's runtime; using the median worker runtime instead gives a speedup of about 12.5x (see the calculation after this list)
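For reference, the median-based calculation mentioned in the last bullet, using the avazu 16-worker runtimes from the table above:

```python
import statistics

# Avazu 16-worker runtimes (seconds) from the table above.
worker_runtimes = [22.85826961, 21.71531868, 23.15168015, 23.01040534,
                   22.99628973, 22.96320085, 22.58721086, 23.12547255,
                   23.48303775, 32.00695248, 22.98324096, 22.46027581,
                   22.49633149, 22.81446465, 27.68898551, 23.36792783]
single_machine_runtime = 287.2880573

# Speedup against the median worker instead of the slowest one.
print(round(single_machine_runtime / statistics.median(worker_runtimes), 1))  # ~12.5
```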

criteo test dataset, batch_size: 4096, output_dim: 32

| NUM_MACHINES | SLOWEST RUNTIME AMONG WORKERS (seconds) | SPEEDUP | WORKER0 (seconds) | WORKER1 (seconds) | WORKER2 (seconds) | WORKER3 (seconds) | WORKER4 (seconds) | WORKER5 (seconds) | WORKER6 (seconds) | WORKER7 (seconds) | WORKER8 (seconds) | WORKER9 (seconds) | WORKER10 (seconds) | WORKER11 (seconds) | WORKER12 (seconds) | WORKER13 (seconds) | WORKER14 (seconds) | WORKER15 (seconds) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 320.4591443 | 1 | 320.4591443 | | | | | | | | | | | | | | | |
| 2 | 134.2140288 | 2.38767249 | 134.2140288 | 133.7781797 | | | | | | | | | | | | | | |
| 4 | 68.54574005 | 4.675113932 | 66.73764873 | 68.54574005 | 67.78707397 | 67.79657364 | | | | | | | | | | | | |
| 8 | 37.26959534 | 8.598406862 | 37.09379182 | 35.75921435 | 37.26959534 | 35.10171499 | 36.77896743 | 34.71712542 | 35.5227128 | 35.78200464 | | | | | | | | |
| 16 | 28.23276761 | 11.35061035 | 22.65901732 | 22.82429329 | 22.53910393 | 23.29732177 | 23.63758925 | 22.87697506 | 22.94470572 | 28.23276761 | 21.31139401 | 20.89695154 | 21.73933918 | 20.36472532 | 20.60589845 | 21.196763 | 20.7089105 | 21.11429569 |

Summary for criteo results

On a single worker, training takes 320 seconds, while on 16 workers the slowest worker takes 28 seconds. The difference between the fastest and slowest workers is close to 8 seconds. The speedup is 11.5x on 16 workers.

kdda test dataset, batch_size: 4096, output_dim: 8

| NUM_MACHINES | SLOWEST RUNTIME AMONG WORKERS (seconds) | SPEEDUP | WORKER0 (seconds) | WORKER1 (seconds) | WORKER2 (seconds) | WORKER3 (seconds) | WORKER4 (seconds) | WORKER5 (seconds) | WORKER6 (seconds) | WORKER7 (seconds) | WORKER8 (seconds) | WORKER9 (seconds) | WORKER10 (seconds) | WORKER11 (seconds) | WORKER12 (seconds) | WORKER13 (seconds) | WORKER14 (seconds) | WORKER15 (seconds) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 322.3493165 | 1 | 322.3493165 | | | | | | | | | | | | | | | |
| 2 | 152.2661681 | 2.117012075 | 152.2661681 | 56.0047285 | | | | | | | | | | | | | | |
| 4 | 78.78860804 | 4.091318841 | 78.78860804 | 31.11844766 | 29.74631359 | 30.14538327 | | | | | | | | | | | | |
| 8 | 38.6655634 | 8.336858128 | 38.6655634 | 17.77295431 | 16.99732262 | 17.93868786 | 17.35414038 | 16.68588871 | 17.0220588 | 16.74347107 | | | | | | | | |
| 16 | 20.69872398 | 15.57339075 | 20.69872398 | 16.74933181 | 11.99427773 | 12.32047569 | 12.54608685 | 12.17584972 | 12.66074888 | 12.0352035 | 11.99997986 | 11.96278872 | 11.72612387 | 11.86328058 | 12.02945746 | 11.76552521 | 11.72321774 | 12.07822296 |

Summary for kdda results

For kdda we get the best speedup of 15x for 16 workers. The difference between fastest and slowest workers for kdda is close to 9 seconds.

Full dataset experiment for kdda

  • WORKERi: The end to end runtime for WORKERi (for 10 epochs) averaged over multiple runs.
| WORKER0 (seconds) | WORKER1 (seconds) | WORKER2 (seconds) | WORKER3 (seconds) | WORKER4 (seconds) | WORKER5 (seconds) | WORKER6 (seconds) | WORKER7 (seconds) | WORKER8 (seconds) | WORKER9 (seconds) | WORKER10 (seconds) | WORKER11 (seconds) | WORKER12 (seconds) | WORKER13 (seconds) | WORKER14 (seconds) | WORKER15 (seconds) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 305.645803928 | 332.092835903 | 334.780326128 | 334.998679161 | 343.547188997 | 344.171431065 | 343.63071084 | 337.439732075 | 349.839564085 | 351.485404968 | 354.654353857 | 346.672199011 | 357.655499935 | 366.253351927 | 361.332320929 | 374.920572042 |

Summary for full dataset experiment

There is a huge variance among workers when training on the full kdda dataset: the runtimes of the slowest and fastest workers differ by about 70 seconds.