I'd like us to show off using dask-xgboost to train models on large datasets. We would probably run the training on some longer cadence and host the trained model behind a FastAPI endpoint (we already have the latter components for a model trained on a smaller dataset).
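For the serving side, something like the sketch below is what I have in mind, though the existing endpoint for the smaller model presumably already looks similar. The model filename, feature names, and payload shape here are placeholders, not a worked-out design:

```python
# Rough sketch of the serving piece; filename, feature names, and payload
# shape are placeholders, not what the existing endpoint actually uses.
import numpy as np
import xgboost as xgb
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
booster = xgb.Booster()
booster.load_model("return_model.json")  # artifact produced by the training flow

class LineItemFeatures(BaseModel):
    l_quantity: float
    l_extendedprice: float
    l_discount: float
    l_tax: float
    s_acctbal: float

@app.post("/predict")
def predict(features: LineItemFeatures):
    row = np.array([[features.l_quantity, features.l_extendedprice,
                     features.l_discount, features.l_tax, features.s_acctbal]])
    # With a binary:logistic objective the booster's prediction is a probability.
    prob = float(booster.predict(xgb.DMatrix(row))[0])
    return {"return_probability": prob}
```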
As for the training side, I don't know exactly how best to go about it, but I suspect it looks something like this:
1. Find a good problem within our schema to answer. For example, looking at the Lineitem table one might ask "What makes an item likely to be returned?" (see the ReturnFlag column).
2. Select the set of tables we need to answer that (maybe lineitem and supplier merged together, if that isn't too big).
3. Do whatever ML work one does (cross-validation, etc.).
4. Set that up as a flow on a cluster with appropriate hardware (a rough sketch of the training piece follows this list).
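To make the shape of this concrete, here's a rough sketch of the training piece, using the Dask API that now ships with xgboost itself (xgboost.dask, which superseded the standalone dask-xgboost package). The cluster address, Parquet paths, column names, and feature choices are all guesses at our schema, not a final design:

```python
import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder; the real flow would request a cluster

# Placeholder paths into our TPC-H-style data
lineitem = dd.read_parquet("s3://our-bucket/lineitem/")
supplier = dd.read_parquet("s3://our-bucket/supplier/")

df = lineitem.merge(supplier, left_on="l_suppkey", right_on="s_suppkey")

# Binary target: was the item returned?
y = (df["l_returnflag"] == "R").astype("int")
X = df[["l_quantity", "l_extendedprice", "l_discount", "l_tax", "s_acctbal"]]

# Distributed training across the Dask cluster
clf = xgb.dask.DaskXGBClassifier(n_estimators=100, tree_method="hist")
clf.fit(X, y)

# Save the booster so the FastAPI endpoint can pick it up
clf.get_booster().save_model("return_model.json")
```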
But again, I don't know this space well, so whoever picks this up would have to take on the burden of figuring out exactly what makes sense and would be compelling. Mostly I just want people to see that dask-xgboost exists and works decently well.