
Distributed hyperparameter optimization for dask. #6525

Open
trivialfis opened this issue Dec 17, 2020 · 7 comments

Comments

@trivialfis
Member

Right now, integration between dask and the various single-node machine learning libraries is implemented through standalone dask extensions like dask-ml and dask-optuna. These can be used with xgboost only when xgboost performs single-node training, i.e. when XGBRegressor and friends are used with them instead of xgboost.dask.DaskXGBRegressor. If users want to train one model on the entire dataset, the dask interface is required. The underlying issue is that xgboost is itself a distributed learning library employing an MPI-like communication framework, while those extensions are designed to extend single-node libraries. To resolve this, we need to design Python wrappers that can glue them together.
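To make the distinction concrete, here is a minimal sketch of the two modes. The cluster setup, model sizes, and random data are illustrative assumptions, not part of any proposal:

```python
# Minimal sketch of the two modes; random data is for illustration only.
import numpy as np
import dask.array as da
import xgboost as xgb
from dask.distributed import Client
from dask_ml.model_selection import GridSearchCV

if __name__ == "__main__":
    client = Client()  # a local cluster stands in for a real deployment

    # Mode 1: dask-ml parallelizes the *search*; every candidate is an
    # ordinary single-node fit with the sklearn wrapper.
    X_small, y_small = np.random.rand(256, 8), np.random.rand(256)
    search = GridSearchCV(
        xgb.XGBRegressor(n_estimators=10),
        param_grid={"max_depth": [4, 6], "learning_rate": [0.1, 0.3]},
    )
    search.fit(X_small, y_small)

    # Mode 2: xgboost parallelizes the *training*; one model sees the whole
    # distributed dataset, but no HPO wrapper understands this estimator.
    X_big = da.random.random((10_000, 8), chunks=(2_500, 8))
    y_big = da.random.random(10_000, chunks=2_500)
    model = xgb.dask.DaskXGBRegressor(n_estimators=10)
    model.fit(X_big, y_big)
```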

Optuna is an exception, as it uses xgboost's callback functions, so the xgboost.dask interface can be adapted to optuna. I will submit some changes with demos later. Others, like grid search, are more difficult to implement.
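In the meantime, a minimal sketch of that adaptation (not the planned demo; parameter ranges and random data are assumptions): an Optuna objective can call xgboost.dask.train directly and read the final metric from the returned evaluation history.

```python
# Hedged sketch: drive distributed xgboost training from an Optuna study.
import optuna
import dask.array as da
import xgboost as xgb
from dask.distributed import Client

def objective(trial, client, dtrain):
    params = {
        "objective": "reg:squarederror",
        "max_depth": trial.suggest_int("max_depth", 3, 8),
        "eta": trial.suggest_float("eta", 0.01, 0.3, log=True),
    }
    output = xgb.dask.train(
        client, params, dtrain,
        num_boost_round=50,
        evals=[(dtrain, "train")],
    )
    # xgb.dask.train returns {"booster": ..., "history": ...};
    # rmse is the default metric for reg:squarederror.
    return output["history"]["train"]["rmse"][-1]

if __name__ == "__main__":
    client = Client()
    X = da.random.random((10_000, 8), chunks=(2_500, 8))
    y = da.random.random(10_000, chunks=2_500)
    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    study = optuna.create_study(direction="minimize")
    study.optimize(lambda t: objective(t, client, dtrain), n_trials=5)
```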

Related: #5347

cc @pseudotensor @sandys

@sandys

sandys commented Dec 17, 2020 via email

@trivialfis
Member Author

@sandys k8s support is in 1.3.

@trivialfis
Member Author

@sandys Use https://kubernetes.dask.org/en/latest/ to create the cluster, then train with xgboost.dask as usual.
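For anyone landing here, a minimal sketch of that combination, assuming dask-kubernetes is installed and a hypothetical worker-spec.yaml describes the worker pods (see the dask-kubernetes docs for the spec format):

```python
# Minimal sketch: create a dask cluster on k8s, then train as usual.
# "worker-spec.yaml" is a hypothetical pod spec file.
import dask.array as da
import xgboost as xgb
from dask.distributed import Client
from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_yaml("worker-spec.yaml")
cluster.scale(4)  # request four worker pods from kubernetes
client = Client(cluster)

# From here, training is identical to any other dask deployment.
X = da.random.random((100_000, 16), chunks=(10_000, 16))
y = da.random.random(100_000, chunks=10_000)
dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(
    client,
    {"objective": "reg:squarederror", "tree_method": "hist"},
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]
```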

@sandys

sandys commented Dec 17, 2020 via email

@trivialfis
Member Author

@sandys Thanks for your feedback. k8s is definitely important to us.

@CarterFendley

Hi @trivialfis, thanks for your work on this. Just reading these threads now.

I see in the other issue you recommend using sklearn optimizers for out-of-core datasets and dask_ml optimizers for datasets that can fit in memory. I am wondering if you can expand on why xgboost.dask is incompatible with dask_ml at the moment.

What would it take to, in your words, "design python wrappers that can glue them together"?

@trivialfis
Member Author

trivialfis commented Apr 13, 2023

Hi @CarterFendley, there is ongoing work on HPO. Could you please take a look at https://github.com/coiled/dask-xgboost-nyctaxi and see if it helps?
