
Fix multi-node #70

Open
tsirif wants to merge 19 commits into master from fix/multi-node
Conversation

@tsirif tsirif (Contributor) commented Aug 25, 2016

It seems that I can execute the tests on Helios, so I will be able to fix and test the multi-node code. However, the requirement that a multi-node/GPU job reserve full machines (i.e. 2 nodes with 8 GPUs each) means I can run a test experiment only about once a day.

I have also found another problem that I do not know how to solve.

From a controller's log file:

    WARNING! Using all compatible GPUs in gpu-k20-03.
    WARNING! Found 8 GPUs!
    ## On gpu-k20-03 using: cuda0 cuda1 cuda2 cuda3 cuda4 cuda5 cuda6 cuda7
    ## Running in multi-node mode.
    ## Starting worker on cuda0 ... Done
    ## Starting worker on cuda1 ... Done
    ## Starting worker on cuda2 ... Done
    ## Starting worker on cuda3 ... Done
    ## Starting worker on cuda4 ... Done
    ## Starting worker on cuda5 ... Done
    ## Starting worker on cuda6 ... Done
    ## Starting worker on cuda7 ... Done
    Caught signal 15. Killing workers and closing connections...
    Killing worker 171921...
    [gpu-k20-03:171924] [[9983,1],0]->[[9983,0],0] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 22]
    [gpu-k20-03:171924] [[9983,1],0]-[[9983,0],0] mca_oob_tcp_peer_send_handler: unable to send message ON SOCKET 22
    [gpu-k20-03:171927] [[9983,1],0]->[[9983,0],0] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 22]
    [gpu-k20-03:171927] [[9983,1],0]-[[9983,0],0] mca_oob_tcp_peer_send_handler: unable to send message ON SOCKET 22
    [gpu-k20-03:171925] [[9983,1],0]->[[9983,0],0] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 22]
    [gpu-k20-03:171925] [[9983,1],0]-[[9983,0],0] mca_oob_tcp_peer_send_handler: unable to send message ON SOCKET 22
    [gpu-k20-03:171922] [[9983,1],0]->[[9983,0],0] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 22]
    [gpu-k20-03:171922] [[9983,1],0]-[[9983,0],0] mca_oob_tcp_peer_send_handler: unable to send message ON SOCKET 22
    [gpu-k20-03:171923] [[9983,1],0]->[[9983,0],0] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 22]
    [gpu-k20-03:171923] [[9983,1],0]-[[9983,0],0] mca_oob_tcp_peer_send_handler: unable to send message ON SOCKET 22
    [gpu-k20-03:171921] [[9983,1],0]->[[9983,0],0] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 22]
    [gpu-k20-03:171921] [[9983,1],0]-[[9983,0],0] mca_oob_tcp_peer_send_handler: unable to send message ON SOCKET 22
    [gpu-k20-03:171928] [[9983,1],0]->[[9983,0],0] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 22]
    [gpu-k20-03:171928] [[9983,1],0]-[[9983,0],0] mca_oob_tcp_peer_send_handler: unable to send message ON SOCKET 22

Should I send the Helios helpdesk a message about this?

@tsirif tsirif changed the title from Fix/multi node to Fix multi-node on Aug 25, 2016
@nouiz nouiz (Contributor) commented Aug 25, 2016

I think it would be better to merge this PR once everything is working. Ping us when that is the case.

I know that on Helios you can request interactive sessions. Could you do that to reserve 2 nodes and run more experiments in the same day to speed up development?

@nouiz nouiz (Contributor) commented Sep 7, 2016

@tsirif is everything working? Do you have more tests to make sure it works well?

@tsirif tsirif (Contributor, Author) commented Sep 7, 2016

I have not managed this yet. I will rework some internals for the multi-node case, because I suspect that spawning subprocesses on a host from within MPI processes does not work well on the Helios infrastructure, due to some constraints that InfiniBand imposes.

e.g. https://www.open-mpi.org/faq/?category=openfabrics#ofa-fork
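To illustrate, the problematic pattern looks roughly like this; it is a hypothetical sketch with placeholder names (worker.py, the device list), not the actual Controller code. A process that is already an MPI rank starts its workers as plain OS subprocesses, and fork()/exec() from inside an MPI process is exactly what the Open MPI OpenFabrics FAQ flags as unreliable over InfiniBand:

    # Illustrative only, not the real Controller code.
    import subprocess
    import sys

    def start_worker_subprocesses(devices):
        """Fork one worker subprocess per GPU device, e.g. ["cuda0", ..., "cuda7"]."""
        procs = []
        for dev in devices:
            # Each Popen call fork()s from a process that is already an MPI rank;
            # over InfiniBand (OpenFabrics) this is what the FAQ entry warns about.
            procs.append(subprocess.Popen([sys.executable, "worker.py", dev]))
        return procs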

I am thinking of dynamically spawning MPI processes from each Controller process onto its own host. This would also enable a multi-node/CPU interface, which is currently not supported in the new interface (although spawning MPI Worker processes in the multi-node/GPU-only case seems redundant... but in order to be compatible with as many infrastructures as possible, I cannot think of another way).
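A rough sketch of what I have in mind, using mpi4py's dynamic process management; worker.py and n_workers are placeholders, not the actual Platoon interface:

    # Sketch: spawn Workers from a Controller via MPI_Comm_spawn
    # instead of fork()/subprocess.
    import socket
    import sys

    from mpi4py import MPI

    def spawn_local_workers(n_workers, worker_script="worker.py"):
        # Ask MPI to place the spawned processes on this Controller's host;
        # "host" is a reserved MPI_Info key for MPI_Comm_spawn.
        info = MPI.Info.Create()
        info.Set("host", socket.gethostname())
        # Returns an intercommunicator linking this Controller to its Workers.
        return MPI.COMM_SELF.Spawn(sys.executable,
                                   args=[worker_script],
                                   maxprocs=n_workers,
                                   info=info)

    # Inside worker.py, the spawned Workers would reach their Controller with:
    #   parent = MPI.Comm.Get_parent()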


@tsirif tsirif force-pushed the fix/multi-node branch 2 times, most recently from 54c8a0d to 73b63f6 on February 22, 2017