[release][CI] dataset_shuffle_random_shuffle_1tb failed #29294
Comments
For this test, it seems the raylet gets overloaded and node failures are not uncommon; however, previously the test could handle these node failures (most likely mitigated by lineage reconstruction). |
@clarkzinzow I'm a bit overloaded right now (5 release blockers in total). Mind taking this one? |
Might be due to agent memory leak #29199 |
I've been looking into this, but taking a step back, it looks like this release test has not had a good success rate since August: https://b534fd88.us1a.app.preset.io/superset/dashboard/19/?native_filters_key=V5NMHssrZzj86978-asD6zeCWfwL9FF61U2d4tZ_PE3wDDt6-2hTI6MMw_UcRjIK |
The most recent failure looks to be due to failing to connect to the GCS, which makes it a bit more likely that the agent memory leak caused the head node to get OOMKilled: https://console.anyscale.com/o/anyscale-internal/projects/prj_FKRmeV5pA6X72aVscFALNC32/clusters/ses_GhqMXjBcR8nutFSsTCc8Z7yp?command-history-section=command_history
|
I think it is unlikely that this is related to the agent leak. The agent leak is not that fast: the test fails within 8 minutes, while the agent leaks about 1 GB/h. |
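To put rough numbers on that argument, here is a back-of-the-envelope sketch. It assumes the leak is roughly constant at the ~1 GB/h quoted above and that the failure happens at about the 8-minute mark; the function and constant names are made up for illustration and are not part of Ray.

```python
def leaked_gb(leak_rate_gb_per_hour: float, minutes: float) -> float:
    """Memory leaked after `minutes`, assuming a constant leak rate."""
    return leak_rate_gb_per_hour * minutes / 60.0


# Figures taken from the comment above; both are approximate.
LEAK_RATE_GB_PER_HOUR = 1.0
TIME_TO_FAILURE_MIN = 8.0

leaked = leaked_gb(LEAK_RATE_GB_PER_HOUR, TIME_TO_FAILURE_MIN)
print(f"~{leaked:.2f} GB leaked before the failure")
```

Under those assumptions the agent would have leaked only a small fraction of a gigabyte by the time the test fails, which is why the leak alone looks like an unlikely cause of an OOM kill on the head node.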
@rkooo567 agreed! |
This test looks flaky; I got a successful run on the release branch later - https://buildkite.com/ray-project/release-tests-branch/builds/1123#0183d941-4949-4929-b3c8-233d0ca816e1 . Removed as release blocker. |
Should we declare this unstable? cc @scv119 |
@scv119 I think it should be P0 for core in 2.2. |
Should we try the OOM killer if the node is timing out due to OOM / freezing? |
Can the data team take over this test? @c21 @matthewdeng |
Hi @scv119 - yes, I think the data team should own it. We should work with core to come up with action items. |
I feel we should try the OOM killer or use a bigger machine? |
@c21 do you mind taking the action item? It seems Clark already has another P0. |
@clarng - sure, I will take a look later. |
What happened + What you expected to happen
Build
Cluster
But it looks like this might be the root cause:
Versions / Dependencies
Release 2.1 master
Reproduction script
NA
Issue Severity
No response