Dask-XGBoost flow #13

Open
mrocklin opened this issue Feb 12, 2024 · 0 comments

I'd like us to show off using dask-xgboost to train models on large datasets. We would probably run training at some longer cadence and serve the resulting model from a FastAPI endpoint (we already have the serving components for a model trained on a smaller dataset).

I don't know exactly how best to go about this, but I suspect that it looks something like ...

  1. Find a good problem within our schema to answer. For example, looking at the Lineitem table, one might ask "What makes an item likely to be returned?" (see the ReturnFlag column).
  2. Select the set of tables we need in order to answer that (maybe lineitem and supplier merged together, if that's not too big?)
  3. Do whatever ML stuff one does (cross validation, etc.)
  4. Set that up as a flow with a cluster with appropriate hardware

But again, I don't know this space well, so whoever takes this on would have to take on the burden of figuring out exactly what makes sense and would be compelling. Mostly I just want people to see that dask-xgboost exists and works decently well.
