Distributed hyper parameter optimization for dask. #6525
Thank you for taking this forward.

We are also seriously looking at Ray Distributed, because of its native integration with Kubernetes (https://docs.ray.io/en/master/cluster/kubernetes.html) and its success at Ant Financial in China (https://docs.ray.io/en/master/cluster/kubernetes.html).

My thought is that you won't be able to make it generic enough to fit all frameworks, and it is probably engineering overkill to do so. The one thing I would strongly request you to align towards is Kubernetes compatibility, because it is the management framework with a massive amount of support. It is for this reason that Torch chose to write a thin wrapper over Kubernetes for its distributed training library:
https://pytorch.org/elastic/0.2.1/index.html
https://aws.amazon.com/blogs/containers/fault-tolerant-distributed-machine-learning-training-with-the-torchelastic-controller-for-kubernetes/

So I would suggest looking at this problem from a Kubernetes-first standpoint. Compatibility with every framework is going to take up more of your time than you might be willing to spend.

Sincerely,
Sandeep
@sandys k8s support is in 1.3.
@sandys Use https://kubernetes.dask.org/en/latest/ to create the cluster, and train xgboost dask as usual.
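To make that suggestion concrete for later readers, here is a minimal sketch built on my own assumptions rather than anything posted in this thread: `worker-spec.yml` is a hypothetical pod template, the arrays are synthetic stand-ins for real data, and the classic `dask_kubernetes.KubeCluster` API is assumed to be available.

```python
# Sketch: start a Dask cluster on Kubernetes, then use xgboost.dask as usual.
# Assumes dask-kubernetes and xgboost>=1.3 are installed; worker-spec.yml is a
# hypothetical pod template describing the worker containers.
import dask.array as da
import xgboost as xgb
from dask.distributed import Client
from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_yaml("worker-spec.yml")  # workers run as Kubernetes pods
cluster.scale(4)                                    # request four worker pods
client = Client(cluster)

# Synthetic data; in practice these would be dask arrays or dataframes read from storage.
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = da.random.random(100_000, chunks=10_000)

dtrain = xgb.dask.DaskDMatrix(client, X, y)
result = xgb.dask.train(
    client,
    {"objective": "reg:squarederror", "tree_method": "hist"},
    dtrain,
    num_boost_round=100,
)
booster = result["booster"]  # one model, trained across all worker pods
```

Once the `Client` is connected, `xgboost.dask` does not care whether the workers are Kubernetes pods or plain processes, which is what makes the "train as usual" part work.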
Yes, although dask has some independent issues with Kubernetes on the scaling side. Ray uses the Kubernetes scheduler, while Dask brings a centralised scheduler of its own.

In any case, I would strongly suggest XGBoost pick one framework and play well with Kubernetes, rather than expose bindings for many, many frameworks to work with XGBoost.

Just my $0.02.
@sandys Thanks for your feedback. k8s is definitely important to us.
Hi @trivialfis, thanks for your work on this. Just reading these threads now. I see in the other issue you recommend using. What would it take to, in your words, "design python wrappers that can glue them together"?
Hi @CarterFendley, there is ongoing work on HPO. Could you please take a look at https://github.com/coiled/dask-xgboost-nyctaxi and see if it helps?
Right now, integration between dask and various single-node machine learning libraries is implemented as standalone dask extensions like dask-ml and dask-optuna. These can be used with xgboost when xgboost is performing single-node training, that is, using `XGBRegressor` and friends with them instead of `xgboost.dask.DaskXGBRegressor`. If users want to train one model on the entire dataset, the dask interface is required. The underlying issue is that xgboost is itself a distributed learning library employing an MPI-like communication framework, while those extensions are designed to extend single-node libraries. To resolve this, we need to design Python wrappers that can glue them together.

Optuna is an exception: since it uses xgboost's callback mechanism, the xgboost.dask interface can be adapted to Optuna. I will submit some changes with demos later. Others, like grid search, are more difficult to implement.
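The demos mentioned above are not included in this thread; the sketch below is only one possible shape for them, on my own assumptions: a `LocalCluster` stands in for whatever cluster you actually run, the data is synthetic, and the hyperparameter ranges are arbitrary. Each Optuna trial launches one full distributed training run through `xgb.dask.train`, while the study itself stays on the client.

```python
# Sketch: drive distributed xgboost.dask training from an Optuna study.
# Assumes dask[distributed], optuna and xgboost>=1.3 are installed.
import dask.array as da
import optuna
import xgboost as xgb
from dask.distributed import Client, LocalCluster

client = Client(LocalCluster(n_workers=2))  # stand-in for a real cluster

# Synthetic data split into training and validation DaskDMatrix objects.
X = da.random.random((50_000, 20), chunks=(10_000, 20))
y = da.random.random(50_000, chunks=10_000)
dtrain = xgb.dask.DaskDMatrix(client, X[:40_000], y[:40_000])
dvalid = xgb.dask.DaskDMatrix(client, X[40_000:], y[40_000:])


def objective(trial):
    params = {
        "objective": "reg:squarederror",
        "tree_method": "hist",
        "eta": trial.suggest_float("eta", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    # One distributed training run per trial; every dask worker participates.
    result = xgb.dask.train(
        client, params, dtrain, num_boost_round=100, evals=[(dvalid, "valid")]
    )
    # xgb.dask.train returns {"booster": ..., "history": ...}; minimise final RMSE.
    return result["history"]["valid"]["rmse"][-1]


study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```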
Related: #5347
cc @pseudotensor @sandys
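For readers arriving from the HPO angle, here is a rough illustration (mine, not from this issue) of the split described above: a dask-based search extension driving the single-node `XGBRegressor`, versus the dask interface training one model on the whole distributed dataset. The dask-ml searcher is only one example of such an extension, and the data is synthetic.

```python
# Illustration of the two paths: distributed search over single-node fits,
# versus distributed training of a single model.  Assumes dask[distributed],
# dask-ml, scikit-learn and xgboost are installed.
import dask.array as da
import xgboost as xgb
from dask.distributed import Client, LocalCluster
from dask_ml.model_selection import GridSearchCV
from sklearn.datasets import make_regression

client = Client(LocalCluster(n_workers=2))  # stand-in for a real cluster

# Path 1: each candidate fits a plain XGBRegressor on in-memory data;
# dask only parallelises the search over candidates.
X, y = make_regression(n_samples=10_000, n_features=20, random_state=0)
search = GridSearchCV(
    xgb.XGBRegressor(tree_method="hist"),
    {"max_depth": [4, 6, 8], "learning_rate": [0.05, 0.1]},
)
search.fit(X, y)

# Path 2: the data is a dask collection partitioned across workers and a single
# booster is trained on all of it; generic single-node HPO wrappers cannot
# drive this path, which is the gluing problem described in this issue.
dX = da.random.random((1_000_000, 20), chunks=(100_000, 20))
dy = da.random.random(1_000_000, chunks=100_000)
model = xgb.dask.DaskXGBRegressor(tree_method="hist")
model.client = client
model.fit(dX, dy)
```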