xgboost.dask.DaskXGBClassifier not working with >1 dask distributed worker in case of large datasets #5451
Hi, thanks for raising an issue. Here are some questions:
Thanks for the prompt reply.
I am trying the same thing on a sample dataset. It runs with workers=1, but with workers > 1 it still gives the same 'Label set is empty' warning.
How do I balance the data between workers? I get the same warning on the sample dataset too. Can you please debug the sample dataset script and post the changes required to balance the data between workers?
Hi guys, we have the same issue.
Hmm, sometimes it's the chunk size, sometimes it's other problems. XGBoost does not move data; it accepts whatever dask provides for each worker. Let me take a look.
In the sample script above posted by @harshit-2115, reducing the chunk size should balance the data enough to prevent starved workers:

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn import model_selection
import xgboost
from sklearn.preprocessing import QuantileTransformer
from sklearn.datasets import make_classification
import dask.array as da
from dask.distributed import LocalCluster, Client

if __name__ == '__main__':
    cluster = LocalCluster(n_workers=10, memory_limit='8GB')
    print(cluster.dashboard_link)
    client = Client(cluster)

    X, y = make_classification(n_features=244, n_samples=815414, n_classes=2)
    chunk_size = 1500
    DX = da.from_array(X, chunks=chunk_size)
    Dy = da.from_array(y, chunks=chunk_size)

    post_pipe = Pipeline([('qt', QuantileTransformer(n_quantiles=10))])
    pi = xgboost.dask.DaskXGBClassifier(tree_method='hist')
    pi.client = client

    param_grid = {
        'learning_rate': [0.1, 0.2],
        'n_estimators': [100],
        'reg_lambda': [0.7],
    }
    kfold = 5
    skf = model_selection.StratifiedKFold(n_splits=kfold,
                                          shuffle=True,
                                          random_state=30)
    clf = GridSearchCV(estimator=pi,
                       param_grid=param_grid,
                       verbose=5,
                       cv=skf,
                       iid=True,
                       return_train_score=True,
                       scoring='neg_mean_squared_error',
                       refit=False)

    pp = post_pipe.fit_transform(DX, Dy)
    clf.fit(da.from_array(pp, chunks=chunk_size), Dy)
```
Some simple testing with the above script: using 1500 as the chunk size, the data is distributed quite nicely for the first round. But somehow the data is moved in the following rounds; I will look further.
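For anyone hitting the same starved-worker warning, here is a minimal sketch of one way to check how the chunks actually land on the workers before training. It assumes a running distributed Client and the DX/Dy dask arrays from the script above; the helper name is made up for illustration.

```python
from dask.distributed import wait

def show_chunk_placement(client, *arrays):
    """Persist the arrays and print how many chunks each worker holds."""
    persisted = [a.persist() for a in arrays]  # materialise the chunks on the cluster
    wait(persisted)                            # block until placement has settled
    for worker, keys in client.has_what().items():
        print(f"{worker}: {len(keys)} chunks")
    return persisted

# Usage with the names from the script above:
# DX, Dy = show_chunk_placement(client, DX, Dy)
# client.rebalance()  # optionally even out memory usage across workers
```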
Thank you for debugging the script @trivialfis.
On a multi-core EC2 machine, which will give us more speed and better performance: xgboost (parallelized single-threaded fits) or dask xgboost (a single fit distributed across workers)?
Not surprising. Even within XGBoost, simply syncing gradients between workers has its overhead. On the dask end there are even more operations. There's always a trade-off.
skl functions are designed to work on local data. In our case, the estimator operates on distributed dask collections instead.
Again, a trade-off. That's something for dask to handle in the future. But https://docs.dask.org/en/latest/understanding-performance.html might help.
Accuracy performance? No. Computation performance? Probably; see the above link. But I believe there is some low-hanging fruit for computation performance, like switching the backend to something other than pandas. I (I work for NVIDIA) use cudf most of the time, but I believe there are other backends mentioned in dask's documentation.
Hard to say. It really depends on your data. For small data, just use normal single-node multi-threaded training. Your dataset can be trained with xgboost.train({'tree_method': 'hist'}, ...) in no time.
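A minimal sketch of that single-node, multi-threaded path, using the same synthetic data shape as the script above (parameter values are illustrative, not a recommendation):

```python
import xgboost
from sklearn.datasets import make_classification

# In-memory data, no dask involved.
X, y = make_classification(n_features=244, n_samples=100_000, n_classes=2)
dtrain = xgboost.DMatrix(X, label=y)

# Single-node multi-threaded training with the hist tree method.
booster = xgboost.train(
    {"tree_method": "hist", "objective": "binary:logistic"},
    dtrain,
    num_boost_round=100,
)
```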
@trivialfis that's a very interesting comment. Do you always do your training on one BIG machine with a GPU, or do you train distributed (with many machines), the way Sagemaker recommends? So the question becomes: is xgboost distributed training better on 100 machines, even though there are overheads? Another question: do you ever use GridSearchCV in general? How would you do k-fold cross-validation?
@sandys Your questions are asking for my personal opinion, so the following answers are from a personal perspective.
Both. I'm a developer.
Depends. Is your data worthy of 100 machines? For example, one normally doesn't put iris into a cluster. As commented, it's a trade-off. See the performance section in https://medium.com/rapids-ai/a-new-official-dask-api-for-xgboost-e8b10f3d1eb7 ; you can see the scaling behaviour of training on the HIGGS dataset there.
Yes, but right now only on a single machine. Rapidsai has a notebook for using dask-ml with single-node XGBoost if you are interested. Dask is still new here (and I'm no expert). As we talked about in #5347, there are issues we need to address. You can also try the Spark version of XGBoost, which has been around for a very long time (longer than I have been contributing to XGBoost).
The questions here are really more about dask than XGBoost. I would recommend: https://docs.dask.org/en/latest/best-practices.html
True, a single xgboost fit takes no time. But if I am using GridSearchCV for hyperparameter tuning with 5-fold CV, the total number of fits multiplies quickly. That is why we use multi-core machines to distribute the fits.
I trained my dataset using dask xgboost with a 2000 chunk size, a single grid-search candidate, and 5-fold CV. It took 28 minutes for the first fit to complete, but the remaining 4 fits took just 1-2 minutes. I don't understand this. What do you think?
So for the first fit, dask took extra time to read the csv into memory. For the subsequent fits, it just did the computation.
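One way to pay that loading cost up front, so that even the first fit is fast, is to persist the data in cluster memory before starting the grid search. A sketch, assuming the CSV is read with dask.dataframe and the target column is called 'target' (the path and column name are placeholders):

```python
import dask.dataframe as dd
from dask.distributed import wait

df = dd.read_csv("data.csv")  # lazily define the load; path is a placeholder
df = df.persist()             # trigger the read and keep the chunks in worker RAM
wait(df)                      # block until the data is actually in memory

X = df.drop(columns=["target"]).to_dask_array(lengths=True)
y = df["target"].to_dask_array(lengths=True)
# clf.fit(X, y) can now be called repeatedly without re-reading the CSV.
```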
Yeah, so do we need dask XGBoost only when we have a cluster of machines? We can just use the n_jobs param of XGBoost on a single machine to use all cores, right?
What can we do about this? @trivialfis
If you need distributed training, then you can use Dask or YARN/Spark, which are much more mature.
Correct.
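For the single-machine case, a sketch of the regular (non-dask) sklearn wrapper with n_jobs=-1, so each fit uses all cores while GridSearchCV runs the folds; the data and parameter values are illustrative only:

```python
import xgboost
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_features=244, n_samples=50_000, n_classes=2)

# One fit uses every core on the machine.
model = xgboost.XGBClassifier(tree_method="hist", n_jobs=-1)

grid = GridSearchCV(
    estimator=model,
    param_grid={"learning_rate": [0.1, 0.2], "n_estimators": [100]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=30),
    scoring="neg_mean_squared_error",
    refit=False,
)
grid.fit(X, y)
```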
Thank you for clarifying.
Yes, I will come back to this after the 1.1 release.
@harshit-2115 Can this issue be closed now?
I believe so. Feel free to reopen if there's an objection.
Hi XGBoost devs,
I am running this code on an EC2 machine with 32 threads and 128 GB of RAM. The size of the CSV being loaded is 800 MB.
It works if the model is trained using a subset of the features with workers=1.
Some cases where it fails:
With the same subset of features and workers > 1, it keeps running in the notebook with no result. The terminal shows:
WARNING: /home/conda/feedstock_root/build_artifacts/xgboost_1584539733809/work/src/objective/regression_obj.cu:58: Label set is empty.
Using all features with workers=1, it gives memory warnings in the terminal.
How can an 800 MB CSV file consume 118 GB of memory?
Also, there is no 'predict_proba' attribute in DaskXGBClassifier, so metrics like roc_auc give an error (see the sketch below).
Currently, we are using xgboost with sklearn grid search (to distribute the fits). With large datasets, hyperparameter tuning jobs with 4k-5k fits take days to complete on EC2 and SageMaker.
We are trying dask xgboost to reduce training time.
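Regarding the missing predict_proba mentioned above, here is a hedged workaround sketch: with the default binary:logistic objective, the raw predictions from the underlying booster are already positive-class probabilities, so roc_auc can be computed from them. This assumes a fitted DaskXGBClassifier `pi`, a Client `client`, and the dask arrays DX/Dy from the script above.

```python
import xgboost
from sklearn.metrics import roc_auc_score

# Predictions from the booster are P(y=1) under binary:logistic.
proba = xgboost.dask.predict(client, pi.get_booster(), DX)

# Pull both arrays to the local process to score with sklearn.
auc = roc_auc_score(Dy.compute(), proba.compute())
print(f"ROC AUC: {auc:.4f}")
```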