
Default binding policy #4799

Closed · artpol84 opened this issue Feb 8, 2018 · 3 comments

artpol84 (Contributor) commented Feb 8, 2018

OMPI version: v2.1

I was recently investigating an issue with PMIx_Get latency when using dstore. I was running on a single node and observed the latency grow as the PPN count increased. I was using the default binding policy, assuming it defaults to bind-to core.
The bottleneck was attributed to the thread-shift part:
openpmix/openpmix#665 (comment).

Debugging showed that the scheduler assigned the PMIx service thread to a core different from the main thread's, which was causing the performance issues. You can see on the plot that starting from 4 procs the performance degrades noticeably. This is because, IIRC, mpirun binds to core for up to 2 processes and to socket beyond that.
perf confirmed that guess:

  • the CPU number is enclosed in brackets, e.g. [0004];
  • pmix_intra_perf[164802] is the main thread;
  • pmix_intra_perf[164807/164802] is the service thread.
$ perf sched timehist
...
  648540.416283 [0004]  pmix_intra_perf[164802]             0.005      0.000      0.005
  648540.416289 [0008]  pmix_intra_perf[164807/164802]      0.003      0.000      0.007
  648540.416294 [0004]  pmix_intra_perf[164802]             0.004      0.000      0.006
  648540.416299 [0008]  pmix_intra_perf[164807/164802]      0.003      0.000      0.006
...
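
Side note: perf sched timehist replays scheduler events that must be recorded first. The recording step is not shown above; a typical invocation (the exact flags I used may differ) is:

$ perf sched record -a -- sleep 10   # record scheduler events system-wide for ~10 s
$ perf sched timehist                # replay them as a per-task timeline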

In the 4 PPN case the processes remained on their CPUs the whole time (cpu4 and cpu8). But starting from 16 PPN they began to migrate actively, which caused even more rapid latency growth:

$ perf sched timehist
...
  649086.369911 [0019]  pmix_intra_perf[165820/165811]      0.004      0.001      0.016
  649086.369914 [0017]  pmix_intra_perf[165811]             0.012      0.000      0.006
  649086.369921 [0019]  pmix_intra_perf[165820/165811]      0.001      0.000      0.007
  649086.369925 [0017]  pmix_intra_perf[165811]             0.005      0.000      0.005
  649086.369933 [0019]  pmix_intra_perf[165820/165811]      0.003      0.000      0.008
  649086.369941 [0023]  pmix_intra_perf[165811]             0.006      0.000      0.009
  649086.369948 [0019]  pmix_intra_perf[165820/165811]      0.006      0.001      0.008
  649086.369953 [0023]  pmix_intra_perf[165811]             0.005      0.000      0.006
  649086.369961 [0019]  pmix_intra_perf[165820/165811]      0.004      0.001      0.008
  649086.369966 [0023]  pmix_intra_perf[165811]             0.005      0.000      0.007
  649086.369984 [0019]  pmix_intra_perf[165820/165811]      0.012      0.009      0.010
  649086.369994 [0027]  pmix_intra_perf[165811]             0.016      0.001      0.011
  649086.369999 [0019]  pmix_intra_perf[165820/165811]      0.008      0.000      0.007
  649086.370004 [0027]  pmix_intra_perf[165811]             0.004      0.000      0.006
  649086.370012 [0019]  pmix_intra_perf[165820/165811]      0.004      0.000      0.008
...

After forcing bind-to core, performance stabilized (yellow dashed curve):
openpmix/openpmix#665 (comment)
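
For completeness, the override amounts to something like the command below; the benchmark name is taken from the traces above, and --report-bindings prints each rank's binding so the pinning can be verified:

$ mpirun --bind-to core --report-bindings -np 16 ./pmix_intra_perf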

I see this as additional input on the impact the default binding policy may have. My suggestion is to consider this at the next OMPI dev meeting.

ggouaillardet (Contributor) commented

@artpol84 IIRC, the rationale for binding to socket (instead of core) is to be friendly to those who run hybrid MPI+OpenMP applications but fail to ask for n cpus per MPI task.
And the rationale for binding to core by default when there are no more than 2 MPI tasks is simply to get better out-of-the-box performance when comparing Open MPI to another MPI library.
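
For example (a sketch with a hypothetical binary, using the PE modifier of --map-by), a hybrid job that does ask for n cpus per task would look like:

$ export OMP_NUM_THREADS=4
$ mpirun -np 2 --map-by socket:PE=4 ./hybrid_app
  # each MPI task is mapped to its own socket and bound to 4 cores,
  # leaving room for its 4 OpenMP threads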

artpol84 (Contributor, Author) commented Feb 9, 2018

And when those decisions were made, the circumstances I highlighted here weren't taken into consideration.
I discussed this with @rhc54 in the context of PMIx performance, and he suggested that OMPI defaults might need to be revisited in light of these findings.

rhc54 (Contributor) commented Mar 20, 2018

This was discussed at the devel meeting, and the conclusion was that the need to adequately support multi-threaded applications overrides this issue. We don't know of any way to force the kernel to keep one thread local to another as they move around the socket.

For performance tests like the one you are running, you should override the default binding policy with bind-to core. For OMPI, we feel that the current defaults are the correct ones to use.
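
For reference, the override can be given per run, or set as a default via the corresponding MCA parameter (the environment-variable form assumes the usual OMPI_MCA_ prefix convention):

$ mpirun --bind-to core ./app                      # per run
$ export OMPI_MCA_hwloc_base_binding_policy=core   # default for subsequent runs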

rhc54 closed this as completed on Mar 20, 2018