
DaskXGBoostClassifier Tree Method Hist NaN Value Training Bug #9271

Closed
blitchj opened this issue Jun 7, 2023 · 9 comments · Fixed by #9272

Comments

@blitchj
Contributor

blitchj commented Jun 7, 2023

Issue Description

When using the DaskXGBClassifier in Python with distributed Dask and tree_method="hist", the following scenario causes an error:

  • If there are no NaNs at all, or if every partition has at least one NaN in some column, training works fine.

  • If at least one partition contains a NaN in some column but at least one other partition contains none, training fails.

When it fails, one of two errors occurs: Allreduce failed or Poll timeout.

  • .../rabit/include/rabit/internal/utils.h:86: Allreduce failed coroutine
  • .../rabit/include/rabit/internal/socket.h:170: Poll timeout

How to Reproduce

This can be easily reproduced with n_partitions equal to n_workers (both greater than 1) and any dataset that contains at least 1 NaN but fewer NaNs than n_partitions.

Once this dataset is made and shown to fail, it can be modified to succeed again, either by removing the NaNs or by adding additional NaNs in any column and verifying that NaNs end up in every partition (a quick check for this is sketched after the repro code below).

Additionally, I tested forcing a NaN into the first partition in case any schema was being inferred; however, this didn't solve the issue.

The following code reproduces the issue with a dummy dataset:

from xgboost.dask import DaskXGBClassifier
import pandas as pd
import numpy as np
from dask.dataframe import from_pandas
from dask_gateway import GatewayCluster

xgb = DaskXGBClassifier(tree_method="hist")
X = pd.DataFrame({"a": range(10000), "b": range(10000, 0, -1)})
y = pd.Series([*[0] * 5000, *[1] * 5000])

# Put a handful of NaNs into column "a", all in the first half of the data,
# so they land in only some of the 8 partitions. Keep this line.
X.loc[:4999:1000, "a"] = np.nan
# Re-add this line to succeed: it spreads NaNs into the second half as well,
# so every partition ends up containing at least one NaN.
# X.loc[5000::1000, "b"] = np.nan

n_workers = 8
cluster = GatewayCluster(
    worker_cores=4,
    worker_memory=8,
    scheduler_cores=1,
    scheduler_memory=8,
)

cluster.scale(n_workers)
client = cluster.get_client()
client.wait_for_workers(n_workers=n_workers)

xgb.fit(
    from_pandas(X, npartitions=8),
    from_pandas(y, npartitions=8),
)
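
Not part of the original report, but a quick way to verify how the NaNs are spread across the partitions (this reuses the X built in the snippet above; the check itself is only a sketch):

import dask
from dask.dataframe import from_pandas

# X is the pandas DataFrame constructed in the repro above.
ddf = from_pandas(X, npartitions=8)

# One boolean per partition: does this partition contain any NaN at all?
has_nan = dask.compute(*[part.isna().any().any() for part in ddf.to_delayed()])
print(has_nan)  # the failing case is a mix of True and False values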
@blitchj
Contributor Author

blitchj commented Jun 9, 2023

Thank you for the quick fix!

@trivialfis
Member

Thank you for raising the issue!

@droctothorpe

We're seeing the exact same error messages on 1.7.3, but upgrading to 1.7.6 does not eliminate the problem. In addition, it's intermittent. Just to confirm: is 1.7.6 supposed to include the fix?

@trivialfis
Member

@droctothorpe The error message is not very useful, unfortunately: it only means that one of the workers dropped out during training, not why it dropped out. Could you please share the worker log and open a new issue? Also, please try the latest XGBoost.

@droctothorpe

Thanks, @trivialfis! Much appreciated. 🙏 Testing against 1.7.6 and 2.1.0. Will create a new issue if the problem resurfaces.

@droctothorpe

Turns out it was the cluster autoscaler trying to bin-pack workers, combined with the specific XGBoost computation (RFE) not being resilient to worker loss.

Adding spec["metadata"]["annotations"]["cluster-autoscaler.kubernetes.io/safe-to-evict"] = "false" resolved the issue. Thanks again, @trivialfis! Learning that the AllReduce error indicates that a worker dropped out really put me on the right track. Do you think it makes sense to expand that exception to include information about that?

@droctothorpe

@zazulam suggested working on a contribution to improve XGBoost's resilience in the face of Dask worker outages. Do you have a sense of the scale of that lift, @trivialfis? I'm guessing that if it weren't a giant PITA, it probably would have been solved by now 🙃.

@trivialfis
Member

trivialfis commented Jul 24, 2024

Thank you for bringing this up @droctothorpe @zazulam. We have been thinking about it for a while now and have been trying to engage with the Dask developers (dask/distributed#8624; see the related issues there as well). I think we can handle some exceptions in XGBoost, like OOM, without Dask intervening, but for more sophisticated errors like a killed worker there has to be some cooperation between the distributed framework and the application. Another option is simply to restart the training without the problematic worker (accepting the data loss) after a timeout or some other signal; I have been thinking about making these options a user-provided policy (callback).

In the latest XGBoost we raised the timeout threshold significantly due to imbalanced data sizes in some clusters, and we plan to introduce a user parameter in the next release.
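
To make the restart option above concrete, here is a purely hypothetical sketch (fit_with_retry is not an existing XGBoost or Dask API; it only illustrates the retry-on-failure policy idea):

# Hypothetical illustration of "restart training after a worker failure".
def fit_with_retry(make_classifier, X, y, max_retries=1):
    for attempt in range(max_retries + 1):
        clf = make_classifier()  # fresh DaskXGBClassifier for each attempt
        try:
            clf.fit(X, y)
            return clf
        except Exception:  # e.g. an allreduce failure after a worker dropped out
            if attempt == max_retries:
                raise
            # A real policy callback could decide here whether to wait for the
            # worker to return, continue without its data, or abort entirely.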

@trivialfis
Member

Having said that, we are open to suggestions. Feel free to raise a new issue for discussion.
