
Fix KNN error message. #4782

Merged: 1 commit merged into rapidsai:branch-22.08 on Jun 27, 2022

Conversation
Conversation

trivialfis (Member)

No description provided.

trivialfis added the bug (Something isn't working) and non-breaking (Non-breaking change) labels on Jun 22, 2022
trivialfis requested a review from a team as a code owner on Jun 22, 2022 17:49
github-actions bot added the Cython / Python (Cython or Python issue) label on Jun 22, 2022
trivialfis (Member, Author) commented Jun 22, 2022

Ah, the XGBoost error is still here. Is there anything I can help with?

11:40:50 - nothing provides requested xgboost 1.5.2dev.rapidsai22.08**

cjnolet (Member) commented Jun 22, 2022

@trivialfis I'm actually backing the xgboost package out to 1.5.2dev.rapids22.06 because the updated version seems to be causing some reproducible hangs in the pytests. I still haven't been able to figure out why, but the evidence I've gathered so far points toward a resource issue: thread usage, a deadlock somewhere, or an OS exception being hidden.

trivialfis (Member, Author)

I'm not an expert on containers, but starting with 1.6.x XGBoost honors the CFS thread limit from "/sys/fs/cgroup/cpu/cpu.cfs_quota_us" (dmlc/xgboost#7654). Could that be an issue?
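For context, a minimal sketch of how a process can derive its CPU budget from the cgroup v1 CFS quota file named above. The helper below is illustrative only (it is not XGBoost's actual implementation); the paths are the standard cgroup v1 locations and the fallback to `os.cpu_count()` is an assumption about how an unset quota would be handled:

```python
import os


def cfs_cpu_limit(
    quota_path="/sys/fs/cgroup/cpu/cpu.cfs_quota_us",
    period_path="/sys/fs/cgroup/cpu/cpu.cfs_period_us",
):
    """Return the container's CPU limit from the cgroup v1 CFS files,
    falling back to os.cpu_count() when no quota is set."""
    try:
        with open(quota_path) as f:
            quota = int(f.read().strip())
        with open(period_path) as f:
            period = int(f.read().strip())
    except OSError:
        # Files absent (e.g. cgroup v2 host or non-Linux): no CFS limit visible.
        return os.cpu_count()
    if quota <= 0 or period <= 0:
        # A quota of -1 means "no limit".
        return os.cpu_count()
    return max(1, quota // period)


if __name__ == "__main__":
    print(cfs_cpu_limit())
```

If a CI container sets a low quota, a library that starts honoring it will suddenly use far fewer threads than before, which is why this kind of change can interact with test runtime.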

cjnolet (Member) commented Jun 22, 2022

@trivialfis The weird thing is that only one of the hanging tests is related to FIL or xgboost; the other is FAISS's ivfpq. Let me try skipping all of the tests that run xgboost and see whether that helps at all.

cjnolet (Member) commented Jun 22, 2022

While I'm waiting: is it possible there's a rogue thread or xgboost process that isn't being cleaned up properly and ends up deadlocking the cuml pytests downstream? I'm not sure how this could happen in the Python layer if pytest is cleaning up properly, but I suppose the C++ layer could fork off a process that isn't cleaned up after the tests execute.

Both of the failing tests use OpenMP somewhere underneath, and my hunch is that either there aren't enough resources to schedule them or some of those resources aren't allowing preemption.
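One way to probe that hunch would be to cap the OpenMP pool for the suspect tests and see whether the hang goes away. The fixture below is a hypothetical diagnostic, not part of cuml's test suite, and it assumes the threadpoolctl package is available in the test environment:

```python
# conftest.py (hypothetical diagnostic fixture, not part of cuml's test suite)
import pytest
from threadpoolctl import threadpool_limits  # assumes threadpoolctl is installed


@pytest.fixture
def capped_openmp_threads():
    """Limit the OpenMP thread pool for a single test so the hang can be
    checked under reduced thread contention."""
    with threadpool_limits(limits=2, user_api="openmp"):
        yield
```

A suspect test could opt in by taking `capped_openmp_threads` as an argument; if the hang disappears under the cap, that points at thread oversubscription rather than a leaked process.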

trivialfis (Member, Author)

I don't think XGBoost itself can cause such an issue, as I'm not sure how an OpenMP thread could go rogue without causing the process to abort. You mentioned that the hang is reproducible; I can try it locally and attach gdb to the hanging processes if there's documentation or guidance on how to reproduce it.
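For reference, a minimal sketch of that kind of diagnostic: attaching gdb to a hung test process and dumping every thread's backtrace. The Python wrapper is illustrative; the gdb invocation uses its standard -p/-batch/-ex flags and needs ptrace permission on the target process:

```python
import subprocess
import sys


def dump_thread_backtraces(pid: int) -> str:
    """Attach gdb to the given process in batch mode and return the
    backtrace of every thread, which usually shows where a hang sits
    (e.g. inside an OpenMP barrier or a futex wait)."""
    result = subprocess.run(
        ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"],
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Usage: python dump_hang.py <pid of hung pytest process>
    print(dump_thread_backtraces(int(sys.argv[1])))
```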

cjnolet (Member) commented Jun 22, 2022

@trivialfis Unfortunately, I've only been able to reproduce it in CI; however, I haven't tried limiting the resources by specifying --cpus at runtime. I'm not sure that will be enough to cause the hang, but it's certainly worth a try. It's the lightgbm pytest in test_fil and the ivfpq test_pred() test in test_nearest_neighbors that are hanging in CI.

trivialfis (Member, Author) commented Jun 22, 2022

Got it. I will try the ci/local/build.sh script tomorrow if the issue persists.

trivialfis (Member, Author)

rerun tests

dantegd (Member) commented Jun 27, 2022

@gpucibot merge

rapids-bot (bot) merged commit b26fe7e into rapidsai:branch-22.08 on Jun 27, 2022
jakirkham pushed a commit to jakirkham/cuml that referenced this pull request Feb 27, 2023