
Fix multi-node #70

Open
tsirif wants to merge 19 commits into master from fix/multi-node
Conversation

@tsirif tsirif (Contributor) commented Aug 25, 2016

It seems that I can execute the tests on Helios, so I will be able to fix and test the multi-node code. However, the requirement that a multi-node/GPU job reserve full machines (i.e. 2 nodes with 8 GPUs each) means I can run a test experiment only about once a day.

I have also found another problem that I do not know how to solve.

From a controller's log file:

    WARNING! Using all compatible GPUs in gpu-k20-03.
    WARNING! Found 8 GPUs!
    ## On gpu-k20-03 using: cuda0 cuda1 cuda2 cuda3 cuda4 cuda5 cuda6 cuda7
    ## Running in multi-node mode.
    ## Starting worker on cuda0 ... Done
    ## Starting worker on cuda1 ... Done
    ## Starting worker on cuda2 ... Done
    ## Starting worker on cuda3 ... Done
    ## Starting worker on cuda4 ... Done
    ## Starting worker on cuda5 ... Done
    ## Starting worker on cuda6 ... Done
    ## Starting worker on cuda7 ... Done
    Caught signal 15. Killing workers and closing connections...
    Killing worker 171921...
    [gpu-k20-03:171924] [[9983,1],0]->[[9983,0],0] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 22]
    [gpu-k20-03:171924] [[9983,1],0]-[[9983,0],0] mca_oob_tcp_peer_send_handler: unable to send message ON SOCKET 22
    [gpu-k20-03:171927] [[9983,1],0]->[[9983,0],0] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 22]
    [gpu-k20-03:171927] [[9983,1],0]-[[9983,0],0] mca_oob_tcp_peer_send_handler: unable to send message ON SOCKET 22
    [gpu-k20-03:171925] [[9983,1],0]->[[9983,0],0] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 22]
    [gpu-k20-03:171925] [[9983,1],0]-[[9983,0],0] mca_oob_tcp_peer_send_handler: unable to send message ON SOCKET 22
    [gpu-k20-03:171922] [[9983,1],0]->[[9983,0],0] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 22]
    [gpu-k20-03:171922] [[9983,1],0]-[[9983,0],0] mca_oob_tcp_peer_send_handler: unable to send message ON SOCKET 22
    [gpu-k20-03:171923] [[9983,1],0]->[[9983,0],0] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 22]
    [gpu-k20-03:171923] [[9983,1],0]-[[9983,0],0] mca_oob_tcp_peer_send_handler: unable to send message ON SOCKET 22
    [gpu-k20-03:171921] [[9983,1],0]->[[9983,0],0] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 22]
    [gpu-k20-03:171921] [[9983,1],0]-[[9983,0],0] mca_oob_tcp_peer_send_handler: unable to send message ON SOCKET 22
    [gpu-k20-03:171928] [[9983,1],0]->[[9983,0],0] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 22]
    [gpu-k20-03:171928] [[9983,1],0]-[[9983,0],0] mca_oob_tcp_peer_send_handler: unable to send message ON SOCKET 22

Should I send the Helios helpdesk a message about this?

@tsirif tsirif changed the title from Fix/multi node to Fix multi-node on Aug 25, 2016
@nouiz nouiz (Contributor) commented Aug 25, 2016

I think it would be better to merge this PR once everything is working. Ping us when that is the case.

I know that on Helios you can request interactive sessions. Could you do that to reserve 2 nodes and run more experiments in the same day to speed up development?

@nouiz nouiz (Contributor) commented Sep 7, 2016

@tsirif is everything working? Do you have more tests to make sure it works well?

@tsirif tsirif (Contributor, Author) commented Sep 7, 2016

I have not managed this yet. I will rework some internals for the multi-node case, because I suspect that spawning subprocesses on a host from within MPI processes does not work well on the Helios infrastructure, due to some constraints that InfiniBand imposes.

e.g. https://www.open-mpi.org/faq/?category=openfabrics#ofa-fork
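To illustrate, the problematic pattern looks roughly like this; it is a hypothetical sketch with placeholder names (worker.py, the device list), not the actual Controller code. A process that is already an MPI rank starts its workers as plain OS subprocesses, and fork()/exec() from inside an MPI process is exactly what the Open MPI OpenFabrics FAQ flags as unreliable over InfiniBand:

    # Illustrative only, not the real Controller code.
    import subprocess
    import sys

    def start_worker_subprocesses(devices):
        """Fork one worker subprocess per GPU device, e.g. ["cuda0", ..., "cuda7"]."""
        procs = []
        for dev in devices:
            # Each Popen call fork()s from a process that is already an MPI rank;
            # over InfiniBand (OpenFabrics) this is what the FAQ entry warns about.
            procs.append(subprocess.Popen([sys.executable, "worker.py", dev]))
        return procs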

I am thinking of dynamically spawning MPI processes from each Controller process onto its own host. This would also enable a multi-node/CPU interface, which is currently not supported in the new interface (although spawning MPI Worker processes in the multi-node/GPU-only case seems redundant... but in order to be compatible with as many infrastructures as possible, I cannot think of another way).
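A rough sketch of what I have in mind, using mpi4py's dynamic process management; worker.py and n_workers are placeholders, not the actual Platoon interface:

    # Sketch: spawn Workers from a Controller via MPI_Comm_spawn
    # instead of fork()/subprocess.
    import socket
    import sys

    from mpi4py import MPI

    def spawn_local_workers(n_workers, worker_script="worker.py"):
        # Ask MPI to place the spawned processes on this Controller's host;
        # "host" is a reserved MPI_Info key for MPI_Comm_spawn.
        info = MPI.Info.Create()
        info.Set("host", socket.gethostname())
        # Returns an intercommunicator linking this Controller to its Workers.
        return MPI.COMM_SELF.Spawn(sys.executable,
                                   args=[worker_script],
                                   maxprocs=n_workers,
                                   info=info)

    # Inside worker.py, the spawned Workers would reach their Controller with:
    #   parent = MPI.Comm.Get_parent()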


@tsirif tsirif force-pushed the fix/multi-node branch 2 times, most recently from 54c8a0d to 73b63f6 on February 22, 2017