[core][nightly] many_nodes_actor_test_on_v2.aws failed #34635
Comments
I ran two bisects: https://buildkite.com/ray-project/release-tests-bisect/builds/98#_ and https://buildkite.com/ray-project/release-tests-bisect/builds/101#_. Both blame 7c9da5c (committed 2 weeks ago). |
Thanks @can-anyscale. That's possible; I will look into the bisect results and the potential root-cause commit later next week. What's weird to me is that this failure itself doesn't seem to have generated any logs? |
The issue might be due to pubsub. The theory here is that,
Thus there is a leak in the end. Going to verify this theory. |
I think the theory is correct here. Basically, some subscriber got removed, but before the worker stopped, it sent another subscription to the node. I think the right way to fix this is to broadcast the worker failure only after the pid exits. |
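To make the proposed ordering concrete, here is a minimal C++ sketch (an illustration of the idea only, not Ray's actual code); `CleanupSubscriber` and `PublishWorkerFailure` are hypothetical placeholders for the real GCS-side operations:

```cpp
// build: g++ -std=c++17 ordering_sketch.cc -o ordering_sketch
#include <sys/wait.h>
#include <unistd.h>
#include <iostream>

// Hypothetical stand-ins for the real GCS-side operations.
void CleanupSubscriber(pid_t pid) { std::cout << "remove subscriber state for " << pid << "\n"; }
void PublishWorkerFailure(pid_t pid) { std::cout << "broadcast worker failure for " << pid << "\n"; }

int main() {
  pid_t worker = fork();
  if (worker == 0) {
    // Child: stands in for the worker process; it may still send messages here.
    usleep(100 * 1000);
    _exit(0);
  }
  // Parent: only after waitpid() confirms the worker pid has exited do we
  // clean up its subscriber state and broadcast the failure, so a dying
  // worker cannot re-subscribe after cleanup and leak an entry.
  int status = 0;
  waitpid(worker, &status, 0);
  CleanupSubscriber(worker);
  PublishWorkerFailure(worker);
  return 0;
}
```

With this ordering, any subscription a dying worker sends can only arrive before its state is cleaned up, so nothing is left behind once the failure is broadcast.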
The leak fix PR is merged. Started a new run: https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_ztbbc72u62tfzvrkypgbdqpq7c |
Looks like it's still failing :( |
The dashboard agent failed somehow because it failed to talk to GCS. I think not making the agent fate-share with the raylet is critical. @SongGuyang, does your team still have bandwidth for this? If not, maybe we should cover it, cc @rkooo567. Still checking what caused the regression. GCS seems OK; the raylet failure is because of the agent failure, and the agent failure is because it failed to talk to GCS. |
The PR got reverted; the revert of the revert is #35091. I'll test once the wheel is built. |
Seems like this test hasn't run for a while. We should follow up with @can-anyscale to verify why it hasn't run. |
Sorry for the late update. We have already restarted this work. The PR will be created in a few days. |
@rkooo567: this test is on a nightly-3x schedule; the last time it ran, I think it was still failing. |
Is it consistently failing? Maybe we should just reduce num_nodes to something like 1000 instead of letting it keep failing. |
@rkooo567: it has been failing for 3 weeks; we should prioritize this to avoid delaying the upcoming release. |
@iycheng if #35091 doesn't solve the issue (can you verify it by running the release test from the PR?), I think the best way is to reduce our scalability envelope and revisit. |
Can we give this a try, @iycheng, @rkooo567? Thanks |
@rkooo567 I think it's a regression. Shouldn't we fix the regression? The test had been running healthily for more than a month; then the file descriptor leak broke it. What's worse, during the regression there was another bug that prevented us from bisecting, which means fixing the FD leak alone is not enough. So there are two bugs. The goal should be to identify why it's broken and fix it in 2.5, not just to be able to run the test. Btw, the FD leak quick fix we discussed last time doesn't prevent GCS from leaking FDs; the root cause is still the same. |
After applying the FD fixes, the new logs show:
Seems the worker failure is not handled in GCS. The crash is in destruction: |
@iycheng: this test is passing on master now; can you confirm and close the issue if that's the case? Thanks |
@iycheng: this test is now failing again for a different reason: https://buildkite.com/ray-project/release-tests-branch/builds/1657#01882a60-6091-4c35-99ba-52dc33df0c93 |
It also broke one CI test, so it got reverted. Taking another look. |
The failure is because sometimes a worker failure fails to be reported. Investigating the root cause. |
When the test failed, other things had happened that increased the FD usage of a core worker. That's why my test passed previously but failed after it was merged. For the short term, we'll update the test: #35546. Successful run here:
I'll rerun the tests in the PR before merging. After that, I'll check which PR increased the FDs and add tests to prevent that from happening again. |
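For context, this is roughly how the FD usage of a process can be checked on Linux, by counting the entries under /proc/&lt;pid&gt;/fd; this is only a sketch, not the actual release test, and the threshold in it is made up:

```cpp
// build: g++ -std=c++17 fd_count.cc -o fd_count
#include <filesystem>
#include <iostream>
#include <string>
#include <unistd.h>

// Count entries under /proc/<pid>/fd, which on Linux is the number of file
// descriptors the process currently has open.
int CountOpenFds(pid_t pid) {
  namespace fs = std::filesystem;
  int count = 0;
  for (const auto &entry : fs::directory_iterator("/proc/" + std::to_string(pid) + "/fd")) {
    (void)entry;  // only the count matters here
    ++count;
  }
  return count;
}

int main() {
  const int fds = CountOpenFds(getpid());
  std::cout << "open fds: " << fds << "\n";
  // A regression test could assert an upper bound here; 1024 is only an
  // illustrative threshold, not the budget used by the release test.
  return fds > 1024 ? 1 : 0;
}
```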
BTW, @iycheng, the 2.5 release branch still has your previous fix that was reverted in master; do we need to revert it in the release branch as well? Thanks |
Since this is a release-blocker issue, please close it only after the cherry-pick fix is merged into the 2.5 release branch. Please add @ArturNiederfahrenhorst as one of the reviewers of the fix as well for tracking purposes. Thanks! |
@can-anyscale I double-checked; I don't think it's there. I previously created a cherry-pick, #35420, and I have already closed it. |
@iycheng got you, that's great, thank you |
After a day of checking, it turns out that the work to get rid of grpcio actually increased the number of sockets in GCS. So there are two options for this test:
I have two PRs, one for each option. Both passed the nightly tests. The fixing one has CI failures; I'll try to fix them, but if it can't be merged in time, we'll go with option 1. |
## Why are these changes needed? After the GCS client is moved to C++, the FD usage is increased by one: previously it was 2 and after the move it's 3. In the fix, we reuse the channel to make sure there are only 2 connections between GCS and CoreWorker. We still create 3 channels, but we use the same arguments to create them and depend on gRPC to reuse the TCP connections. The reason why it was previously 2 hasn't been figured out; maybe gRPC has some hidden mechanism that can reuse the connection in some way. ## Related issue number #34635
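For illustration only (this is a sketch of the idea in the PR description, not the actual patch), channels to the same target created with identical `ChannelArguments` give gRPC the chance to share the underlying TCP connection; the target address below is made up:

```cpp
// build: g++ -std=c++17 channel_reuse.cc -o channel_reuse $(pkg-config --cflags --libs grpc++)
#include <grpcpp/grpcpp.h>
#include <iostream>
#include <string>

int main() {
  const std::string target = "127.0.0.1:6379";  // illustrative GCS address, not a real default

  // Use identical channel arguments for every channel to the same target.
  // With gRPC's shared subchannel pool, channels whose target and arguments
  // match can reuse one underlying TCP connection instead of opening a new
  // socket (and hence a new FD) per channel.
  grpc::ChannelArguments args;
  args.SetInt(GRPC_ARG_MAX_RECEIVE_MESSAGE_LENGTH, -1);

  auto creds = grpc::InsecureChannelCredentials();
  auto channel_a = grpc::CreateCustomChannel(target, creds, args);
  auto channel_b = grpc::CreateCustomChannel(target, creds, args);
  auto channel_c = grpc::CreateCustomChannel(target, creds, args);

  std::cout << "created 3 channels to " << target << " with identical arguments\n";
  return 0;
}
```

Whether the connection is actually shared depends on gRPC's subchannel pooling; channels created with differing arguments get distinct subchannels and therefore distinct sockets, which appears consistent with the FD counts described above.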
@can-anyscale the fix has been merged. Feel free to verify it on your end once the master wheel is built. The successful run: https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_ywxup58wj76i8567e52l3uiijb |
@iycheng w00h00 thanks |
Triggered another run through Buildkite on the master branch, and it passed: https://buildkite.com/ray-project/release-tests-branch/builds/1692#0188500b-3c34-4fc8-8cbe-2566859c716c |
@iycheng , awesome, let's pick this! |
Nice! 🙂 |
What happened + What you expected to happen
The test failed with a timeout; it seems to be an infra issue to me:
Versions / Dependencies
master
Reproduction script
https://buildkite.com/ray-project/release-tests-branch/builds/1568#01879142-b94c-48b2-9af8-bac773cd78b0
Issue Severity
None