Recommendations to achieve best performance
To achieve the best performance with Intel® Distribution of Caffe* on Intel CPUs, try the following configurations. We strongly recommend tuning the configurations on your specific machine.
- Make sure that your hardware configuration includes a fast SSD (M.2) drive. If you observe "waiting for data" in the logs during training or scoring, install a faster SSD or reduce the batch size.
- BIOS configurations:
  - Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz:
    - Turbo Boost Technology: on
    - Hyper-threading (HT): off
    - NUMA: off
  - Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz:
    - Turbo Boost Technology: on
    - Hyper-threading (HT): on
    - NUMA: off
    - Memory Mode: cache
  - Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz:
    - Turbo Boost Technology: on
    - Hyper-threading (HT): on
    - NUMA: on
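The BIOS settings above can be verified from Linux without rebooting. A minimal read-only sketch (the intel_pstate file only exists on systems that use that driver):

```shell
# Read-only checks of the settings above; nothing is modified.
lscpu | grep -E 'Thread\(s\) per core|NUMA node\(s\)'   # HT on => 2 threads per core
# 0 below means Turbo Boost is enabled; the file exists only with the intel_pstate driver.
cat /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null || true
```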
- Optimize hardware in BIOS: set the CPU to its maximum frequency, set fan speed to 100%, and check the cooling system.
- For multinode runs on the Intel Xeon Phi™ product family over Intel® Omni-Path Architecture, use:
  - Processor C6 = Enabled
  - Snoop Holdoff Count = 9
- We recommend Linux CentOS 7.2 or newer for Intel® Distribution of Caffe*.
- We recommend the newest XPPSL software for the Intel Xeon Phi™ product family: https://software.intel.com/en-us/articles/xeon-phi-software#downloads
- For multinode runs on Intel Xeon and the Intel Xeon Phi™ product family over Intel® Omni-Path Architecture:
  - irqbalance needs to be installed and configured with the `--hintpolicy=exact` option.
  - The CPU frequency needs to be set via the intel_pstate driver:
    ```
    echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
    echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo
    cpupower frequency-set -g performance
    ```
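One way to apply these settings is sketched below. The irqbalance config file path and package names are assumptions for CentOS/RHEL 7 and vary by distro; `tee` is used so the redirections run with root privileges (a plain `sudo echo 0 > file` would not, because the redirect happens outside sudo):

```shell
# Sketch for CentOS/RHEL 7; package names and file locations are assumptions.
sudo yum install -y irqbalance kernel-tools
# Pass --hintpolicy=exact to the irqbalance daemon (path assumed for CentOS/RHEL):
echo 'IRQBALANCE_ARGS="--hintpolicy=exact"' | sudo tee -a /etc/sysconfig/irqbalance
sudo systemctl restart irqbalance
# Pin the CPU frequency via the intel_pstate driver:
echo 100 | sudo tee /sys/devices/system/cpu/intel_pstate/min_perf_pct
echo 0   | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
sudo cpupower frequency-set -g performance
```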
- Make sure that there are no unnecessary processes running during training and scoring. Intel® Distribution of Caffe* uses all available resources, and other processes (monitoring tools, Java processes, network traffic, etc.) might impact performance.
- We recommend compiling Intel® Distribution of Caffe* with gcc 4.8.5 (or newer) or with the Intel Compiler; see Build Caffe with Intel Compiler.
- We recommend compiling Intel® Distribution of Caffe* with MKLDNN engine, see Installation Guide.
- Clean up the cache with:
  ```
  sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
  ```
- With Intel Xeon Scalable processors (Skylake), we recommend the following configuration. After reboot, run:
  ```
  sudo cpupower frequency-set -g performance
  sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
  ```
  (The `sh -c` wrapper is needed because a plain `sudo echo 0 > file` performs the redirection without root privileges.) Run with numactl, for example:
  ```
  numactl -l $TARGET_CAFFE_BUILD_DIR/tools/caffe time -iterations 100 -model <modelfile> -engine=MKLDNN
  ```
- With all configurations set, you can try Install Intel Caffe and run a performance benchmark.
- We provide two sets of prototxt files with hyper-parameters and network topologies. The default set contains the standard topologies and configurations used by the community. The BKM (Best Known Method) set contains our internally developed configurations optimized for Intel MKLDNN and MKL2017 on Intel CPUs.
- When measuring performance or training, we recommend starting with the default sets to establish a baseline.
- Use the LMDB data layer (using the 'Images' layer as the data source results in suboptimal performance). We recommend a 95% compression ratio for LMDB; to measure the maximum theoretical performance, don't use any data layer.
- Change the batch size in the prototxt files. On some configurations, a higher batch size leads to better results.
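For reference, the batch size lives in the Data layer of the train prototxt. A minimal sketch (the layer names and LMDB path are placeholders, not taken from this page):

```
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  data_param {
    source: "examples/imagenet/ilsvrc12_train_lmdb"  # placeholder path
    backend: LMDB
    batch_size: 64   # try e.g. 128 or 256 and compare throughput
  }
}
```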
- The current implementation uses OpenMP threads. By default, the number of OpenMP threads is set to the number of CPU cores, and each thread is bound to a single core to achieve the best performance. You can provide your own configuration through the OpenMP environment variables KMP_AFFINITY, OMP_NUM_THREADS and GOMP_CPU_AFFINITY. For Intel Xeon Phi (Knights Landing) multi-node tests, we recommend OMP_NUM_THREADS = number_of_cores - 4. For Intel Xeon Scalable processors (Skylake), we recommend OMP_NUM_THREADS = number_of_cores - 2.
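As a sketch, the thread count recommended above can be derived from the physical core count; the KMP_AFFINITY value shown is a commonly used Intel OpenMP pinning setting, not one taken from this page:

```shell
# Count physical cores (hyperthread siblings share the same Core,Socket pair).
CORES=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
# Skylake recommendation above: cores - 2 (use cores - 4 for KNL multi-node).
export OMP_NUM_THREADS=$((CORES - 2))
export KMP_AFFINITY=granularity=fine,compact,1,0   # assumption: typical Intel OpenMP pinning
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```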
- For our recommended hyper-parameters, see models/intel_optimized_models.
- It is possible to speed up training by initializing convolution weights with Gabor filters.