Fix multi-node #70
I think it would be better to merge this PR once everything is working; ping us when it is. I know that on Helios you can request interactive sessions. Could you use one to reserve 2 nodes and run more experiments in the same day to speed up development?
@tsirif is everything working? Do you have more tests to make sure it works well?
I have not managed this yet. I will rework some internals, as we run into issues such as https://www.open-mpi.org/faq/?category=openfabrics#ofa-fork. I am thinking of dynamically spawning MPI processes from each Controller process. (In reply to Frédéric Bastien, Wed, Sep 7, 2016 at 9:41 PM.)
Force-pushed from 54c8a0d to 73b63f6.
* platoon-related bug fix
* platoon-launcher fix
- Change the TCP address in Controller to point to localhost
- Remove imports of Controller and Worker from __init__ files
- Rename mpi_convert to mpi_util
- Remove backwards compatibility in importing (channel module)
- Before any import of Theano, use device to update THEANO_FLAGS
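
The last item in the list above relies on Theano reading the `THEANO_FLAGS` environment variable at import time, so the device has to be set before the first `import theano`. A minimal sketch of that idea follows; the helper name `set_device_flag` and the device value `cuda0` are hypothetical and are not taken from this PR.

```python
import os


def set_device_flag(device, environ=None):
    """Set `device=<device>` in THEANO_FLAGS, replacing any existing device entry.

    Hypothetical helper for illustration; the PR's actual code may differ.
    `environ` defaults to os.environ; a plain dict can be passed for testing.
    """
    environ = os.environ if environ is None else environ
    # THEANO_FLAGS is a comma-separated list of key=value pairs.
    flags = [f.strip() for f in environ.get("THEANO_FLAGS", "").split(",")]
    # Drop empty entries and any previous device setting.
    flags = [f for f in flags if f and not f.startswith("device=")]
    flags.append("device=" + device)
    environ["THEANO_FLAGS"] = ",".join(flags)
    return environ["THEANO_FLAGS"]


# Demonstration on a plain dict, so the real environment is left untouched.
demo_env = {"THEANO_FLAGS": "floatX=float32,device=cpu"}
print(set_device_flag("cuda0", demo_env))

# In a worker process one would call set_device_flag(device) on os.environ
# first, and only afterwards run `import theano`.
```

Calling the helper after Theano has already been imported would have no effect on the selected device, which is why the ordering matters.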
It seems that I can run the tests on Helios, so I will be able to fix and test the multi-node code. However, multi-node/GPU jobs must reserve full machines (i.e. 2 nodes with 8 GPUs each), which means I can run a test experiment only once a day.
I found another problem that I do not know how to solve.
From a controller's log file:
Should I send the Helios helpdesk a message about this?