OpenCL Backend NUMA Issues #8

Open
naibaf7 opened this issue Aug 17, 2015 · 6 comments

naibaf7 commented Aug 17, 2015

Excerpt from my current thesis:

An issue that came up while testing the OpenCL hybrid backend was that performance
did not scale as expected on systems with more than one CPU. Such systems have
non-uniform memory access (NUMA): the CPUs share one address space, but every
processor has its own cache and memory interface. Accessing data that resides in the
other CPU's memory comes with a large performance penalty. Compute kernels, such as
the matrix-matrix multiplication in the BLAS library or the custom OpenCL kernels,
cause the threads to work on adjacent data. This means a write operation on one CPU
is likely to invalidate cache lines on both CPUs. At this point, the synchronization
overhead seems to outweigh any speedup from having additional cores work on the
algorithms.

To get the expected speedup, the two (or more) processors need to be presented to the
Caffe library as separate devices, so that the library can be run as two individual
instances. As the OpenCL hybrid backend uses two separate parallelization mechanisms
(OpenCL kernels and a parallelized BLAS), the following solutions would need to
be applied:

  • The Caffe frontend needs to be tied to the cores of one CPU, so that the BLAS
    library does not run into NUMA issues.
  • The OpenCL backend needs to split the processor setup into sub-devices
    using device fission. The splitting rule needs to be that all cores belonging
    to one processor (determined by cache affinity) end up in the same sub-device,
    and only one sub-device is then used per Caffe instance. Device fission is
    already available, either as the cl_ext_device_fission extension or, since
    OpenCL 1.2, as the core clCreateSubDevices API (a sketch of this split
    follows after the list).
  • The cores used by the frontend and by the selected sub-device need to be
    the same.
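
A minimal sketch of what that sub-device split could look like with the core
OpenCL 1.2 clCreateSubDevices call, partitioning by NUMA affinity domain. This is
not the actual backend code; device selection and error handling are simplified:

```cpp
// Sketch only: split a multi-socket CPU device into one OpenCL sub-device per
// NUMA node. Error handling is omitted for brevity.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, NULL);

  cl_device_id cpu_device;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &cpu_device, NULL);

  // Partition along NUMA boundaries: each sub-device then only contains cores
  // that share one memory interface and cache hierarchy.
  const cl_device_partition_property props[] = {
      CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,
      CL_DEVICE_AFFINITY_DOMAIN_NUMA,
      0};

  cl_uint num_subdevices = 0;
  clCreateSubDevices(cpu_device, props, 0, NULL, &num_subdevices);

  std::vector<cl_device_id> subdevices(num_subdevices);
  clCreateSubDevices(cpu_device, props, num_subdevices, subdevices.data(), NULL);

  // Each Caffe instance would then create its context and command queue on
  // exactly one of these sub-devices instead of on the full multi-socket device.
  std::printf("Created %u NUMA sub-devices\n", num_subdevices);
  return 0;
}
```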

bhack commented Aug 18, 2015

naibaf7 commented Aug 18, 2015

@bhack
OK, maybe not directly related (this affects CPUs, while your article is about (multiple) Xeon Phi cards). But definitely an interesting read, thanks.
Sadly, I currently don't have such a card available for testing anyway. However, the Xeon Phi should not behave much differently from regular GPUs when used through OpenCL.

bhack commented Aug 18, 2015

Yes, it is not related to NUMA, but something to consider on some Intel devices. Also interesting: https://software.intel.com/en-us/articles/opencl-device-fission-for-cpu-performance

naibaf7 commented Aug 18, 2015

@bhack
Yup, this link is exactly what needs to be added to the device initialization in the OpenCL backend. :)
You see, this is also a reason why I want to have multi-device training in OpenCL just as with CUDA, since training on a multi-processor NUMA system would benefit heavily from it.
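
As a rough host-side companion to the device fission split (the first point in the
thesis excerpt above): pinning one Caffe instance's frontend and its BLAS threads to
a single socket on Linux. The core range 0–7 for socket 0 is an assumption made only
for illustration; real code would read the topology, e.g. via hwloc or
/sys/devices/system/cpu:

```cpp
// Sketch only: restrict the calling process to the cores of one socket so that
// the parallelized BLAS stays on a single NUMA node.
// Assumes Linux and that socket 0 owns cores 0..7 (made up for illustration).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>

bool PinToSocket0() {
  cpu_set_t mask;
  CPU_ZERO(&mask);
  for (int core = 0; core < 8; ++core) {
    CPU_SET(core, &mask);
  }
  // pid 0 = calling thread; threads created afterwards (e.g. by the BLAS)
  // inherit this affinity mask.
  return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}
```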

bhack commented Aug 18, 2015

naibaf7 commented Aug 18, 2015

And also APU systems like AMD HSA Kaveri and Intel Broadwell, as this PDF points out, yup.
