
DaskXGBoostClassifier Tree Method Hist NaN Value Training Bug #9271

Closed
blitchj opened this issue Jun 7, 2023 · 9 comments · Fixed by #9272

Comments

@blitchj
Contributor

blitchj commented Jun 7, 2023

Issue Description

When using the DaskXGBClassifier in Python with distributed Dask and tree_method="hist", the following scenario causes an error:

  • If there are no NaNs at all, or if every partition has at least one NaN in some column, training works fine.

  • If at least one partition contains a NaN in some column but at least one other partition contains none, training fails.

When it fails, one of two errors occurs: Allreduce failed or Poll timeout.

  • .../rabit/include/rabit/internal/utils.h:86: Allreduce failed coroutine
  • .../rabit/include/rabit/internal/socket.h:170: Poll timeout

How to Reproduce

This can be easily reproduced with n_partitions equal to n_workers (both greater than 1) and any dataset that contains at least 1 NaN but fewer NaNs than n_partitions.

Once this dataset is made and shown to fail, it can be modified to succeed again, either by removing the NaNs or by adding additional NaNs in any column and verifying that NaNs end up in every partition (a quick check for this is sketched after the repro code below).

Additionally, I tested forcing a NaN into the first partition in case any schema was being inferred; however, this didn't solve the issue.

The following code reproduces the issue with a dummy dataset:

from xgboost.dask import DaskXGBClassifier
import pandas as pd
import numpy as np
from dask.dataframe import from_pandas
from dask_gateway import GatewayCluster

xgb = DaskXGBClassifier(tree_method="hist")
X = pd.DataFrame({"a": range(10000), "b": range(10000, 0, -1)})
y = pd.Series([*[0] * 5000, *[1] * 5000])

# Put a handful of NaNs into column "a", all in the first half of the data,
# so they land in only some of the 8 partitions. Keep this line.
X.loc[:4999:1000, "a"] = np.nan
# Re-add this line to succeed: it spreads NaNs into the second half as well,
# so every partition ends up containing at least one NaN.
# X.loc[5000::1000, "b"] = np.nan

n_workers = 8
cluster = GatewayCluster(
    worker_cores=4,
    worker_memory=8,
    scheduler_cores=1,
    scheduler_memory=8,
)

cluster.scale(n_workers)
client = cluster.get_client()
client.wait_for_workers(n_workers=n_workers)

xgb.fit(
    from_pandas(X, npartitions=8),
    from_pandas(y, npartitions=8),
)
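
Not part of the original report, but a quick way to verify how the NaNs are spread across the partitions (this reuses the X built in the snippet above; the check itself is only a sketch):

import dask
from dask.dataframe import from_pandas

# X is the pandas DataFrame constructed in the repro above.
ddf = from_pandas(X, npartitions=8)

# One boolean per partition: does this partition contain any NaN at all?
has_nan = dask.compute(*[part.isna().any().any() for part in ddf.to_delayed()])
print(has_nan)  # the failing case is a mix of True and False values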
@blitchj
Contributor Author

blitchj commented Jun 9, 2023

Thank you for the quick fix!

@trivialfis
Member

Thank you for raising the issue!

@droctothorpe

We're seeing the exact same error messages on 1.7.3, but upgrading to 1.7.6 does not eliminate the problem. In addition, it's intermittent. Just to confirm: is 1.7.6 supposed to include the fix?

@trivialfis
Member

@droctothorpe The error message is not very useful, unfortunately: it only means that one of the workers dropped out during training, not why it dropped out. Could you please share the worker log and open a new issue? Also, please try the latest XGBoost.

@droctothorpe

Thanks, @trivialfis! Much appreciated. 🙏 Testing against 1.7.6 and 2.1.0. Will create a new issue if the problem resurfaces.

@droctothorpe

Turns out it was the cluster autoscaler trying to bin-pack workers, combined with the specific XGBoost computation (RFE) not being resilient to worker loss.

Adding spec["metadata"]["annotations"]["cluster-autoscaler.kubernetes.io/safe-to-evict"] = "false" resolved the issue. Thanks again, @trivialfis! Learning that the AllReduce error indicates that a worker dropped out really put me on the right track. Do you think it makes sense to expand that exception to include information about that?

@droctothorpe

@zazulam suggested working on a contribution to improve XGBoost's resilience in the face of Dask worker outages. Do you have a sense of the scale of that lift, @trivialfis? I'm guessing that if it weren't a giant PITA, it probably would have been solved by now 🙃.

@trivialfis
Member

trivialfis commented Jul 24, 2024

Thank you for bringing this up @droctothorpe @zazulam. We have been thinking about it for a while now and have been trying to engage with the Dask developers (dask/distributed#8624; see the related issues there as well). I think we can handle some exceptions in XGBoost, like OOM, without Dask intervening, but for more sophisticated errors like a killed worker there has to be some cooperation between the distributed framework and the application. Another option is simply to restart the training without the problematic worker (accepting the data loss) after a timeout or some other signal; I have been thinking about making these options a user-provided policy (callback).

In the latest XGBoost we raised the timeout threshold significantly due to imbalanced data sizes in some clusters, and we plan to introduce a user parameter in the next release.
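
To make the restart option above concrete, here is a purely hypothetical sketch (fit_with_retry is not an existing XGBoost or Dask API; it only illustrates the retry-on-failure policy idea):

# Hypothetical illustration of "restart training after a worker failure".
def fit_with_retry(make_classifier, X, y, max_retries=1):
    for attempt in range(max_retries + 1):
        clf = make_classifier()  # fresh DaskXGBClassifier for each attempt
        try:
            clf.fit(X, y)
            return clf
        except Exception:  # e.g. an allreduce failure after a worker dropped out
            if attempt == max_retries:
                raise
            # A real policy callback could decide here whether to wait for the
            # worker to return, continue without its data, or abort entirely.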

@trivialfis
Member

Having said that, we are open to suggestions. Feel free to raise a new issue for discussion.
