[feature request] Support elastic #278

Open
gaocegege opened this issue Jul 7, 2020 · 11 comments

Comments

@gaocegege
Member

https://github.com/horovod/horovod/blob/master/docs/elastic.rst

It would be good to support Horovod elastic training.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/feature 0.98

Please mark this comment with 👍 or 👎 to give our bot feedback!

@gaocegege
Member Author

/cc @zw0610 @carmark

@terrytangyuan
Member

That's interesting. Have any of you tried it out yet? We'll need to refactor the launcher logic and then support horovodrun, which seems pretty similar to mpirun:

horovodrun -np 8 --host-discovery-script discover_hosts.sh --slots 4 python train.py

@gaocegege
Member Author

I tried it locally, but not on k8s. If we want to support it, we need to provide discover_hosts.sh for it.

@zw0610
Member

zw0610 commented Jul 14, 2020

A simple idea for discover_hosts.sh would be to open a server on mpi-operator that exposes the status of the corresponding mpi-job and lets pods query that status. However, I'm not sure whether this pattern is common in Kubeflow or Kubernetes.

Are there any shortcuts we can exploit from the StatefulSet features so no pod-operator communication is needed?


I believe we discussed using a ConfigMap to store and update the status of all pods in a StatefulSet. The concern is the latency of ConfigMap updates.
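
As a rough illustration of what a discovery script backed by such pod status would need to produce: Horovod's elastic mode expects the host discovery script to print one available host per line, in the form `<hostname>:<slots>`. The sketch below assumes the pod status has already been read (from a ConfigMap, a server, or elsewhere) into a list of dictionaries. The `name` and `phase` keys, the pod names, and the function name are hypothetical and not part of mpi-operator.

```python
from typing import Iterable, Mapping


def render_discovery_output(pods: Iterable[Mapping[str, str]], slots_per_host: int) -> str:
    """Return one `<hostname>:<slots>` line for each pod whose phase is Running.

    `pods` is an assumed, simplified view of pod status; a real implementation
    would have to obtain it from the cluster in some way.
    """
    lines = [
        f"{pod['name']}:{slots_per_host}"
        for pod in pods
        if pod.get("phase") == "Running"
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical worker pods; only the Running ones should be reported.
    example_pods = [
        {"name": "worker-0", "phase": "Running"},
        {"name": "worker-1", "phase": "Pending"},
        {"name": "worker-2", "phase": "Running"},
    ]
    print(render_discovery_output(example_pods, slots_per_host=4))
```

Whatever mechanism supplies the pod list, the latency concern above applies to how quickly that list reflects pods being added or removed.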

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/front-end 0.72
area/operator 0.54

Please mark this comment with 👍 or 👎 to give our bot feedback!

@qifengz

qifengz commented Jan 18, 2021

How can I pass horovodrun's parameters, such as --host-discovery-script and --min-np, when using the mpirun command?

@gaocegege
Member Author

Do you want to use it in MPIJob, or just in Horovod?

@qifengz

qifengz commented Jan 19, 2021

@gaocegege I want to use it in MPIJob. I tried it like this:

mpirun --allow-run-as-root -np 1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python tensorflow2_mnist_elastic.py --host-discovery-script ./discover_hosts.sh --min-np 1

It did not fail, but --host-discovery-script and --min-np had no effect.
BTW, does MPIJob v1/v1alpha2 support only Horovod jobs?
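
One likely reason the flags had no effect in the command above: mpirun treats everything after the program name as arguments to that program, so --host-discovery-script and --min-np end up as arguments to the Python script rather than to a launcher that understands them. The sketch below only splits the command line to show where each token lands; the helper name is hypothetical, and splitting at the first `python` token is a simplification that works for this command.

```python
import shlex
from typing import List, Tuple

COMMAND = (
    "mpirun --allow-run-as-root -np 1 -bind-to none -map-by slot "
    "-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib "
    "python tensorflow2_mnist_elastic.py --host-discovery-script ./discover_hosts.sh --min-np 1"
)


def split_at_program(command: str, program: str = "python") -> Tuple[List[str], List[str]]:
    """Split a launcher command into (launcher tokens, program and its arguments).

    Simplification: the program is assumed to be the first token equal to `program`.
    """
    tokens = shlex.split(command)
    index = tokens.index(program)
    return tokens[:index], tokens[index:]


if __name__ == "__main__":
    launcher_part, program_part = split_at_program(COMMAND)
    print("seen by mpirun:", launcher_part)
    print("passed to the program:", program_part)
```

Unless the training script parses those flags itself, they are simply unused.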

@gaocegege
Member Author

#332 is working on this issue.

@gaocegege
Member Author

TODO list:

  • Unit test for elastic

4 participants