[Core][Nightly] datasets_ingest_train_infer failing #26966
Comments
git log --oneline 8fe4399..bf97a69
8d7b865 is passing: https://console.anyscale.com/o/anyscale-internal/configurations/app-config-details/bld_pkgTMjHxMk6ZDnwDDg3Athfa
https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_tmHr67ueMvbrpFd3NTuwdn4s?command-history-section=command_history datasets_preprocess_ingest_1658788018 is failing with similar stack traces.
a012033 passed as well: https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_QvFbtJf8zRh2fYUErB9HxkGd
Actually, it failed after a retry.
I'm 95% confident that e19cf16 is causing the regression, most likely because we have too many objects.
Even after e19cf16 is addressed, I think we should still work on making GCS more robust in the same situation (as a P1 issue?).
Reopening for further investigation of the GCS disconnection. TL;DR: Raylets should never be disconnected like that, even under the extra application load that e19cf16 caused. We should investigate the root cause and make that part more robust.
Btw cc @iycheng I think this error
means our GCS death-detection logic probably has a bug. I think this check failure happens when GCS is actually alive. Looking at the code, some potential issues are
Additionally, I think we should also make ray/src/ray/rpc/gcs_server/gcs_rpc_client.h Line 546 in 60f3377
We talked in person and agreed that we need some tweaks to the detection logic and the gRPC config. Yi will look at it after the rc1 deadline.
@rkooo567 this is no longer a release blocker now, right?
What happened + What you expected to happen
failing test: https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_rFHQEVVAA3KZgi4apbnTAkcd?command-history-section=command_history
Versions / Dependencies
latest ray
Reproduction script
run nightly test
Issue Severity
High: It blocks me from completing my task.