This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

MXNet 1.0.0 - marginal performance improvement on Titan V (Volta) with half precision, CUDA 9.0 + cuDNN 7.0.5 #9087

Closed
elabeca opened this issue Dec 15, 2017 · 11 comments

Comments

@elabeca

elabeca commented Dec 15, 2017

Description

Running the script example/image-classification/train_cifar10.py on MXNet v1.0.0 shows only marginal performance improvements on Titan V (Volta) cards with half precision enabled, using CUDA 9.0 / cuDNN 7.0.5.

Experimenting with and without half precision, we are seeing only a marginal difference; in fact, performance is worse when half precision (dtype = float16) is set. We also tried a Titan X (Pascal): we did not expect half precision to work on the Pascal architecture, but it did, and the relative gain there was much better than on Volta.

This was tested with release 1.0.0

Running on a machine with CUDA 9.0 + CUDNN 7.0.5

To reproduce, run one epoch of resnet with the CIFAR-10 script:

time python2 train_cifar10.py --dtype float16 --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 0

for Titan V (Volta) we’re getting:

~2700 samples/sec with half precision on, and ~2900 samples/sec with it off, which I believe should be the opposite, if anything.

We are also not seeing a large speed difference between the Titan X (Pascal) and the Titan V (Volta).

for Titan X (Pascal) we’re getting:

~2600 samples/sec with half precision on, and ~2228 samples/sec when off.

The relative improvement from half precision is much better on the Titan X (Pascal) than on the Titan V.
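To separate raw GPU compute throughput from the CIFAR-10 input pipeline, a minimal micro-benchmark along the following lines could be used to time a large matrix multiply in float32 vs float16 (a sketch only, assuming mx.nd.dot supports float16 on the GPU):

import time
import mxnet as mx

# Hypothetical sketch: compare raw GEMM throughput in float32 vs float16 on GPU 0.
ctx = mx.gpu(0)
for dt in ('float32', 'float16'):
    a = mx.nd.ones((4096, 4096), ctx=ctx, dtype=dt)
    b = mx.nd.ones((4096, 4096), ctx=ctx, dtype=dt)
    mx.nd.dot(a, b)           # warm-up
    mx.nd.waitall()
    start = time.time()
    for _ in range(50):
        c = mx.nd.dot(a, b)   # queued asynchronously on the GPU
    mx.nd.waitall()           # block until all 50 multiplies have finished
    print('%s: %.3f s for 50 (4096 x 4096) matmuls' % (dt, time.time() - start))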

Environment info (Required)

----------Python Info----------
('Version :', '2.7.12')
('Compiler :', 'GCC 5.4.0 20160609')
('Build :', ('default', 'Nov 20 2017 18:23:56'))
('Arch :', ('64bit', 'ELF'))
------------Pip Info-----------
('Version :', '9.0.1')
('Directory :', '/home/elie/.local/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
('Version :', '1.0.0')
('Directory :', '/home/elie/mxnet/python/mxnet')
Hashtag not found. Not installed from pre-built package.
----------System Info----------
('Platform :', 'Linux-4.10.0-42-generic-x86_64-with-Ubuntu-16.04-xenial')
('system :', 'Linux')
('node :', 'zeus')
('release :', '4.10.0-42-generic')
('version :', '#46~16.04.1-Ubuntu SMP Mon Dec 4 15:57:59 UTC 2017')
----------Hardware Info----------
('machine :', 'x86_64')
('processor :', 'x86_64')
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
Stepping: 2
CPU MHz: 1397.308
CPU max MHz: 4100.0000
CPU min MHz: 1200.0000
BogoMIPS: 6999.98
Virtualisation: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-11
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb intel_ppin tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0608 sec, LOAD: 0.9193 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0531 sec, LOAD: 0.3470 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0671 sec, LOAD: 0.4028 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0546 sec, LOAD: 0.2149 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.3649 sec, LOAD: 0.3953 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.3955 sec, LOAD: 0.8753 sec.

Package used (Python/R/Scala/Julia):
Python 2.7

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio):
gcc

MXNet commit hash:
25720d0

Build config:
#-------------------------------------------------------------------------------
# Template configuration for compiling mxnet
#
# If you want to change the configuration, please use the following
# steps. Assume you are on the root directory of mxnet. First copy the this
# file so that any local changes will be ignored by git
#
# $ cp make/config.mk .
#
# Next modify the according entries, and then compile by
#
# $ make
#
# or build in parallel with 8 threads
#
# $ make -j8
#-------------------------------------------------------------------------------

#---------------------
# choice of compiler
#--------------------

export CC = gcc
export CXX = g++
export NVCC = nvcc

# whether compile with options for MXNet developer
DEV = 0

# whether compile with debug
DEBUG = 0

# whether compile with profiler
USE_PROFILER =

# whether to turn on signal handler (e.g. segfault logger)
USE_SIGNAL_HANDLER =

# the additional link flags you want to add
ADD_LDFLAGS =

# the additional compile flags you want to add
ADD_CFLAGS =

#---------------------------------------------
# matrix computation libraries for CPU/GPU
#---------------------------------------------

# whether use CUDA during compile
USE_CUDA = 1

# add the path to CUDA library to link and compile flag
# if you have already add them to environment variable, leave it as NONE
# USE_CUDA_PATH = /usr/local/cuda
USE_CUDA_PATH = /usr/local/cuda

# whether use CuDNN R3 library
USE_CUDNN = 1

#whether to use NCCL library
USE_NCCL = 0
#add the path to NCCL library
USE_NCCL_PATH = NONE

# whether use opencv during compilation
# you can disable it, however, you will not able to use
# imbin iterator
USE_OPENCV = 1

#whether use libjpeg-turbo for image decode without OpenCV wrapper
USE_LIBJPEG_TURBO = 0
#add the path to libjpeg-turbo library
USE_LIBJPEG_TURBO_PATH = NONE

# use openmp for parallelization
USE_OPENMP = 1

# MKL ML Library for Intel CPU/Xeon Phi
# Please refer to MKL_README.md for details

# MKL ML Library folder, need to be root for /usr/local
# Change to User Home directory for standard user
# For USE_BLAS!=mkl only
MKLML_ROOT=/usr/local

# whether use MKL2017 library
USE_MKL2017 = 0

# whether use MKL2017 experimental feature for high performance
# Prerequisite USE_MKL2017=1
USE_MKL2017_EXPERIMENTAL = 0

# whether use NNPACK library
USE_NNPACK = 0

# choose the version of blas you want to use
# can be: mkl, blas, atlas, openblas
# in default use atlas for linux while apple for osx
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S), Darwin)
USE_BLAS = apple
else
USE_BLAS = atlas
endif

# whether use lapack during compilation
# only effective when compiled with blas versions openblas/apple/atlas/mkl
USE_LAPACK = 1

# path to lapack library in case of a non-standard installation
USE_LAPACK_PATH =

# by default, disable lapack when using MKL
# switch on when there is a full installation of MKL available (not just MKL2017/MKL_ML)
ifeq ($(USE_BLAS), mkl)
USE_LAPACK = 0
endif

# add path to intel library, you may need it for MKL, if you did not add the path
# to environment variable
USE_INTEL_PATH = NONE

# If use MKL only for BLAS, choose static link automatically to allow python wrapper
ifeq ($(USE_MKL2017), 0)
ifeq ($(USE_BLAS), mkl)
USE_STATIC_MKL = 1
endif
else
USE_STATIC_MKL = NONE
endif

#----------------------------
# Settings for power and arm arch
#----------------------------
ARCH := $(shell uname -a)
ifneq (,$(filter $(ARCH), armv6l armv7l powerpc64le ppc64le aarch64))
USE_SSE=0
else
USE_SSE=1
endif

#----------------------------
# distributed computing
#----------------------------

# whether or not to enable multi-machine supporting
USE_DIST_KVSTORE = 0

# whether or not allow to read and write HDFS directly. If yes, then hadoop is
# required
USE_HDFS = 0

# path to libjvm.so. required if USE_HDFS=1
LIBJVM=$(JAVA_HOME)/jre/lib/amd64/server

# whether or not allow to read and write AWS S3 directly. If yes, then
# libcurl4-openssl-dev is required, it can be installed on Ubuntu by
# sudo apt-get install -y libcurl4-openssl-dev
USE_S3 = 0

#----------------------------
# performance settings
#----------------------------

# Use operator tuning
USE_OPERATOR_TUNING = 1

# Use gperftools if found
USE_GPERFTOOLS = 1

# Use JEMalloc if found, and not using gperftools
USE_JEMALLOC = 1

#----------------------------
# additional operators
#----------------------------

# path to folders containing projects specific operators that you don't want to put in src/operators
EXTRA_OPERATORS =

#----------------------------
# other features
#----------------------------

# Create C++ interface package
USE_CPP_PACKAGE = 0

#----------------------------
# plugins
#----------------------------

# whether to use caffe integration. This requires installing caffe.
# You also need to add CAFFE_PATH/build/lib to your LD_LIBRARY_PATH
# CAFFE_PATH = $(HOME)/caffe
# MXNET_PLUGINS += plugin/caffe/caffe.mk

# whether to use torch integration. This requires installing torch.
# You also need to add TORCH_PATH/install/lib to your LD_LIBRARY_PATH
# TORCH_PATH = $(HOME)/torch
# MXNET_PLUGINS += plugin/torch/torch.mk

# WARPCTC_PATH = $(HOME)/warp-ctc
# MXNET_PLUGINS += plugin/warpctc/warpctc.mk

# whether to use sframe integration. This requires build sframe
# git@github.com:dato-code/SFrame.git
# SFRAME_PATH = $(HOME)/SFrame
# MXNET_PLUGINS += plugin/sframe/plugin.mk

Error Message:

None

Minimum reproducible example

time python2 train_cifar10.py --dtype float16 --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 0

vs

time python2 train_cifar10.py --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 0

Steps to reproduce

time python2 train_cifar10.py --dtype float16 --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 0

vs

time python2 train_cifar10.py --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 0

What have you tried to solve it?

Compared results between a Titan V (Volta) card and a Titan X (Pascal) card. Tried with and without half precision for the train_cifar10.py example on resnet, with one epoch, 110 layers, and a batch size of 512.

Results for Volta (Titan V) with dtype float16 flag set:

INFO:root:start with arguments Namespace(batch_size=512, benchmark=0, data_nthreads=4, data_train='data/cifar10_train.rec', data_train_idx='', data_val='data/cifar10_val.rec', data_val_idx='', disp_batches=20, dtype='float16', gc_threshold=0.5, gc_type='none', gpus='0', image_shape='3,28,28', kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='200,250', max_random_aspect_ratio=0, max_random_h=36, max_random_l=50, max_random_rotate_angle=0, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=10, num_epochs=1, num_examples=50000, num_layers=110, optimizer='sgd', pad_size=4, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
[11:33:43] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_train.rec, use 4 threads for decoding..
[11:33:47] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_val.rec, use 4 threads for decoding..
[11:33:48] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [20] Speed: 2734.21 samples/sec accuracy=0.142020
INFO:root:Epoch[0] Batch [40] Speed: 2709.23 samples/sec accuracy=0.202832
INFO:root:Epoch[0] Batch [60] Speed: 2724.39 samples/sec accuracy=0.233984
INFO:root:Epoch[0] Batch [80] Speed: 2751.87 samples/sec accuracy=0.268652
INFO:root:Epoch[0] Train-accuracy=0.303653
INFO:root:Epoch[0] Time cost=18.777
INFO:root:Epoch[0] Validation-accuracy=0.314453

real 0m26.451s
user 0m36.516s
sys 0m9.708s

Results for Volta (Titan V) without half-precision flag set:

time python2 train_cifar10.py --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 0
INFO:root:start with arguments Namespace(batch_size=512, benchmark=0, data_nthreads=4, data_train='data/cifar10_train.rec', data_train_idx='', data_val='data/cifar10_val.rec', data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0', image_shape='3,28,28', kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='200,250', max_random_aspect_ratio=0, max_random_h=36, max_random_l=50, max_random_rotate_angle=0, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=10, num_epochs=1, num_examples=50000, num_layers=110, optimizer='sgd', pad_size=4, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
[11:30:53] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_train.rec, use 4 threads for decoding..
[11:30:56] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_val.rec, use 4 threads for decoding..
[11:30:58] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [20] Speed: 2855.89 samples/sec accuracy=0.121931
INFO:root:Epoch[0] Batch [40] Speed: 2933.23 samples/sec accuracy=0.191406
INFO:root:Epoch[0] Batch [60] Speed: 2944.27 samples/sec accuracy=0.239551
INFO:root:Epoch[0] Batch [80] Speed: 2871.48 samples/sec accuracy=0.271289
INFO:root:Epoch[0] Train-accuracy=0.301356
INFO:root:Epoch[0] Time cost=17.768
INFO:root:Epoch[0] Validation-accuracy=0.340820

real 0m25.560s
user 0m34.052s
sys 0m9.416s

Results for Pascal (Titan X) with dtype float16 flag set:

time python2 train_cifar10.py --dtype float16 --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 0
INFO:root:start with arguments Namespace(batch_size=512, benchmark=0, data_nthreads=4, data_train='data/cifar10_train.rec', data_train_idx='', data_val='data/cifar10_val.rec', data_val_idx='', disp_batches=20, dtype='float16', gc_threshold=0.5, gc_type='none', gpus='0', image_shape='3,28,28', kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='200,250', max_random_aspect_ratio=0, max_random_h=36, max_random_l=50, max_random_rotate_angle=0, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=10, num_epochs=1, num_examples=50000, num_layers=110, optimizer='sgd', pad_size=4, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
[11:33:43] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_train.rec, use 4 threads for decoding..
[11:33:47] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_val.rec, use 4 threads for decoding..
[11:33:48] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [20] Speed: 2734.21 samples/sec accuracy=0.142020
INFO:root:Epoch[0] Batch [40] Speed: 2709.23 samples/sec accuracy=0.202832
INFO:root:Epoch[0] Batch [60] Speed: 2724.39 samples/sec accuracy=0.233984
INFO:root:Epoch[0] Batch [80] Speed: 2751.87 samples/sec accuracy=0.268652
INFO:root:Epoch[0] Train-accuracy=0.303653
INFO:root:Epoch[0] Time cost=18.777
INFO:root:Epoch[0] Validation-accuracy=0.314453

real 0m26.451s
user 0m36.516s
sys 0m9.708s

Results for Pascal (Titan X) without half-precision flag set:

time python2 train_cifar10.py --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 2
INFO:root:start with arguments Namespace(batch_size=512, benchmark=0, data_nthreads=4, data_train='data/cifar10_train.rec', data_train_idx='', data_val='data/cifar10_val.rec', data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='2', image_shape='3,28,28', kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='200,250', max_random_aspect_ratio=0, max_random_h=36, max_random_l=50, max_random_rotate_angle=0, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=10, num_epochs=1, num_examples=50000, num_layers=110, optimizer='sgd', pad_size=4, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
[11:32:37] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_train.rec, use 4 threads for decoding..
[11:32:41] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_val.rec, use 4 threads for decoding..
[11:32:42] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [20] Speed: 2228.10 samples/sec accuracy=0.141927
INFO:root:Epoch[0] Batch [40] Speed: 2234.42 samples/sec accuracy=0.199609
INFO:root:Epoch[0] Batch [60] Speed: 2258.77 samples/sec accuracy=0.235449
INFO:root:Epoch[0] Batch [80] Speed: 2237.78 samples/sec accuracy=0.266992
INFO:root:Epoch[0] Train-accuracy=0.286880
INFO:root:Epoch[0] Time cost=22.809
INFO:root:Epoch[0] Validation-accuracy=0.343164

real 0m31.823s
user 0m41.688s
sys 0m11.076s

@ptrendx
Member

ptrendx commented Dec 15, 2017

1. CIFAR has really small images, which do not really stress the GPU. For benchmarking purposes, ImageNet is much better.
2. Set the env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 2 - it is possible that MXNet does not choose the best cuDNN algo (TensorCore) for the Titan V in your case.

@elabeca
Author

elabeca commented Dec 16, 2017

Thanks for the suggestion. I'll test with imagenet and report back shortly.

The only options for MXNET_CUDNN_AUTOTUNE_DEFAULT are 1 (true) or 0 (false), I believe.

Here are the latest results with train_cifar10.py:
Titan X (Pascal) MXNET_CUDNN_AUTOTUNE_DEFAULT=0 --dtype float32 ==> ~2020 samples/second
Titan X (Pascal) MXNET_CUDNN_AUTOTUNE_DEFAULT=0 --dtype float16 ==> ~2330 samples/second
Titan X (Pascal) MXNET_CUDNN_AUTOTUNE_DEFAULT=1 --dtype float32 ==> ~2230 samples/second
Titan X (Pascal) MXNET_CUDNN_AUTOTUNE_DEFAULT=1 --dtype float16 ==> ~2600 samples/second

Titan V (Volta) MXNET_CUDNN_AUTOTUNE_DEFAULT=0 --dtype float32 ==> ~2200 samples/second
Titan V (Volta) MXNET_CUDNN_AUTOTUNE_DEFAULT=0 --dtype float16 ==> ~2400 samples/second
Titan V (Volta) MXNET_CUDNN_AUTOTUNE_DEFAULT=1 --dtype float32 ==> ~2900 samples/second
Titan V (Volta) MXNET_CUDNN_AUTOTUNE_DEFAULT=1 --dtype float16 ==> ~2740 samples/second

As you can see, when autotune is on, half-precision performance is worse than full precision on Volta. Also, my concern is that the improvement is marginal when comparing the Pascal card and the Volta card; as you suggested, MXNet is perhaps not utilizing the TensorCore feature of the Volta architecture.

Is there a way to enforce the use of the TensorCores?

Thanks

@elabeca
Author

elabeca commented Dec 16, 2017

Here are some further results, again against the CIFAR10 script, but this time with the MXNET_CUDA_ALLOW_TENSOR_CORE env variable:

MXNET_CUDNN_AUTOTUNE_DEFAULT=0 MXNET_CUDA_ALLOW_TENSOR_CORE=0 float32
Speed: ~2210 samples/sec

MXNET_CUDNN_AUTOTUNE_DEFAULT=1 MXNET_CUDA_ALLOW_TENSOR_CORE=0 float32
Speed: ~2920 samples/sec

MXNET_CUDNN_AUTOTUNE_DEFAULT=0 MXNET_CUDA_ALLOW_TENSOR_CORE=1 float32
Speed: ~2190 samples/sec

MXNET_CUDNN_AUTOTUNE_DEFAULT=1 MXNET_CUDA_ALLOW_TENSOR_CORE=1 float32
Speed: ~2870 samples/sec

MXNET_CUDNN_AUTOTUNE_DEFAULT=0 MXNET_CUDA_ALLOW_TENSOR_CORE=0 float16
Speed: ~2440 samples/sec

MXNET_CUDNN_AUTOTUNE_DEFAULT=1 MXNET_CUDA_ALLOW_TENSOR_CORE=0 float16
Speed: ~1790 samples/sec

MXNET_CUDNN_AUTOTUNE_DEFAULT=0 MXNET_CUDA_ALLOW_TENSOR_CORE=1 float16
Speed: ~2430 samples/sec

MXNET_CUDNN_AUTOTUNE_DEFAULT=1 MXNET_CUDA_ALLOW_TENSOR_CORE=1 float16
Speed: ~2720 samples/sec

Best performance seems to be with float32, autotune on, and TensorCore off. So FP16 is nowhere near performing as it should, nor leveraging TensorCores as it should.

@ptrendx
Member

ptrendx commented Dec 16, 2017

What problem do you have with the train_imagenet script? You can set the autotune env variable to 2. The options are as follows: 0 does no autotuning; 1 autotunes, but the chosen algorithm is limited by the workspace size; 2 chooses the fastest algo no matter the workspace size.
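For example, the suggestion amounts to something like this (reusing the CIFAR-10 command from above):

MXNET_CUDNN_AUTOTUNE_DEFAULT=2 python2 train_cifar10.py --dtype float16 --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 0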

@elabeca
Author

elabeca commented Dec 18, 2017

Thanks Przemyslaw, I'll try another set of runs with autotune set to 2, and I'll try ImageNet after I've resolved the ImageNet download script issue.

@yangjunpro

@elabeca
Hi elabeca, any update regarding your benchmark of the ImageNet data set with the Titan V?
We are planning to do some benchmarking and optimization work on the Volta architecture as well, so your experience may be helpful for us.

Thanks

@rahul003
Member

@yangjunpro With ImageNet it should be possible to see a 50-80% speedup. Let us know if you have more questions.
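As a rough sketch of how that comparison could be run without the full dataset (hypothetical commands, assuming train_imagenet.py in the same example directory accepts the same flags as train_cifar10.py, including --benchmark for synthetic input as suggested by the Namespace logged above):

time python2 train_imagenet.py --benchmark 1 --dtype float16 --network resnet --num-layers 50 --batch-size 256 --gpus 0

vs

time python2 train_imagenet.py --benchmark 1 --dtype float32 --network resnet --num-layers 50 --batch-size 256 --gpus 0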

@rahul003
Member

@cjolivier01 Please add the label Question. Thanks!

@cjolivier01
Member

cjolivier01 commented Mar 19, 2018 via email

@henripal

@yangjunpro I ran this with ImageNet on a Titan V; I am getting 500 samples/sec in fp16 vs 290 samples/sec in fp32.

@elabeca
Author

elabeca commented Apr 5, 2018

Sorry @yangjunpro - I didn't get around to doing further testing in the end. Based on @henripal's results I'm happy to close this unless someone else has anything to add. Thanks again.

@elabeca elabeca closed this as completed Apr 11, 2018