OpenCL Backend NUMA Issues #8

Open
naibaf7 opened this issue Aug 17, 2015 · 6 comments

naibaf7 commented Aug 17, 2015

Excerpt from my current thesis:

An issue that came up while testing the OpenCL hybrid backend was that performance
did not scale as expected on systems with more than one CPU. Such systems have
non-uniform memory access (NUMA): the CPUs share one address space, but every
processor has its own cache and memory interface. Accessing data that resides in the
other CPU's memory comes with a large performance penalty. Compute kernels, such as
the matrix-matrix multiplication in the BLAS library or the custom OpenCL kernels,
cause the threads to work on adjacent data. This means a write operation on one CPU
is likely to invalidate cache lines on both CPUs. At this point, the synchronization
overhead seems to outweigh any speedup from having additional cores work on the
algorithms.

To get the expected speedup, the two (or more) processors need to be presented to the
Caffe library as separate devices, so that the library can be run as two individual
instances. As the OpenCL hybrid backend uses two separate parallelization mechanisms
(OpenCL kernels and a parallelized BLAS), the following solutions would need to
be applied:

  • The Caffe frontend needs to be tied to the cores of one CPU, so that the BLAS
    library does not run into NUMA issues.
  • The OpenCL backend needs to split the processor setup into sub-devices
    using device fission. The splitting rule needs to be that all cores belonging
    to one processor (determined by cache affinity) end up in the same sub-device,
    and only one sub-device is then used per Caffe instance. Device fission is
    already available, either as the cl_ext_device_fission extension or, since
    OpenCL 1.2, as the core clCreateSubDevices API (a sketch of this split
    follows after the list).
  • The cores used by the frontend and by the selected sub-device need to be
    the same.
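
A minimal sketch of what that sub-device split could look like with the core
OpenCL 1.2 clCreateSubDevices call, partitioning by NUMA affinity domain. This is
not the actual backend code; device selection and error handling are simplified:

```cpp
// Sketch only: split a multi-socket CPU device into one OpenCL sub-device per
// NUMA node. Error handling is omitted for brevity.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, NULL);

  cl_device_id cpu_device;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &cpu_device, NULL);

  // Partition along NUMA boundaries: each sub-device then only contains cores
  // that share one memory interface and cache hierarchy.
  const cl_device_partition_property props[] = {
      CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,
      CL_DEVICE_AFFINITY_DOMAIN_NUMA,
      0};

  cl_uint num_subdevices = 0;
  clCreateSubDevices(cpu_device, props, 0, NULL, &num_subdevices);

  std::vector<cl_device_id> subdevices(num_subdevices);
  clCreateSubDevices(cpu_device, props, num_subdevices, subdevices.data(), NULL);

  // Each Caffe instance would then create its context and command queue on
  // exactly one of these sub-devices instead of on the full multi-socket device.
  std::printf("Created %u NUMA sub-devices\n", num_subdevices);
  return 0;
}
```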

bhack commented Aug 18, 2015

naibaf7 commented Aug 18, 2015

@bhack
OK, maybe not directly related (this affects CPUs, while your article is about (multiple) Xeon Phi cards). But definitely an interesting read, thanks.
Sadly, I currently don't have such a card available for testing anyway. However, the Xeon Phi should not behave much differently from regular GPUs when used through OpenCL.

bhack commented Aug 18, 2015

Yes, it is not related to NUMA, but something to consider on some Intel devices. Also interesting: https://software.intel.com/en-us/articles/opencl-device-fission-for-cpu-performance

naibaf7 commented Aug 18, 2015

@bhack
Yup, this link is exactly what needs to be added to the device initialization in the OpenCL backend. :)
You see, this is also a reason why I want to have multi-device training in OpenCL just as with CUDA, since training on a multi-processor NUMA system would benefit heavily from it.
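
As a rough host-side companion to the device fission split (the first point in the
thesis excerpt above): pinning one Caffe instance's frontend and its BLAS threads to
a single socket on Linux. The core range 0–7 for socket 0 is an assumption made only
for illustration; real code would read the topology, e.g. via hwloc or
/sys/devices/system/cpu:

```cpp
// Sketch only: restrict the calling process to the cores of one socket so that
// the parallelized BLAS stays on a single NUMA node.
// Assumes Linux and that socket 0 owns cores 0..7 (made up for illustration).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>

bool PinToSocket0() {
  cpu_set_t mask;
  CPU_ZERO(&mask);
  for (int core = 0; core < 8; ++core) {
    CPU_SET(core, &mask);
  }
  // pid 0 = calling thread; threads created afterwards (e.g. by the BLAS)
  // inherit this affinity mask.
  return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}
```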

bhack commented Aug 18, 2015

naibaf7 commented Aug 18, 2015

And also APU systems like AMD HSA Kaveri and Intel Broadwell, as this PDF points out, yup.
