[core][nightly] many_nodes_actor_test_on_v2.aws failed #34635
Comments
I ran two bisects: https://buildkite.com/ray-project/release-tests-bisect/builds/98#_ and https://buildkite.com/ray-project/release-tests-bisect/builds/101#_. Both blame 7c9da5c (committed 2 weeks ago). |
Thanks @can-anyscale. That's possible; I will look into the bisect results and the potential root-cause commit later next week. What's weird to me is that this failure itself doesn't seem to have generated any logs? |
The issue might be due to pubsub. The theory here is that,
Thus there is a leak in the end. Going to verify this theory. |
I think the theory is correct here. Basically, some subscriber got removed, but before the worker stopped, it sent another subscription to the node. I think the right way to fix this is to broadcast the worker failure only after the pid exits. |
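To make the proposed ordering concrete, here is a minimal C++ sketch (an illustration of the idea only, not Ray's actual code); `CleanupSubscriber` and `PublishWorkerFailure` are hypothetical placeholders for the real GCS-side operations:

```cpp
// build: g++ -std=c++17 ordering_sketch.cc -o ordering_sketch
#include <sys/wait.h>
#include <unistd.h>
#include <iostream>

// Hypothetical stand-ins for the real GCS-side operations.
void CleanupSubscriber(pid_t pid) { std::cout << "remove subscriber state for " << pid << "\n"; }
void PublishWorkerFailure(pid_t pid) { std::cout << "broadcast worker failure for " << pid << "\n"; }

int main() {
  pid_t worker = fork();
  if (worker == 0) {
    // Child: stands in for the worker process; it may still send messages here.
    usleep(100 * 1000);
    _exit(0);
  }
  // Parent: only after waitpid() confirms the worker pid has exited do we
  // clean up its subscriber state and broadcast the failure, so a dying
  // worker cannot re-subscribe after cleanup and leak an entry.
  int status = 0;
  waitpid(worker, &status, 0);
  CleanupSubscriber(worker);
  PublishWorkerFailure(worker);
  return 0;
}
```

With this ordering, any subscription a dying worker sends can only arrive before its state is cleaned up, so nothing is left behind once the failure is broadcast.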
The leak fix PR is merged. Started a new run: https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_ztbbc72u62tfzvrkypgbdqpq7c |
Looks like it's still failing :( |
The dashboard agent failed somehow because it failed to talk to GCS. I think not making the agent fate-share with the raylet is critical. @SongGuyang, does your team still have bandwidth for this? If not, maybe we should cover it, cc @rkooo567. Still checking what caused the regression. GCS seems OK; the raylet failure is because of the agent failure, and the agent failure is because it failed to talk to GCS. |
The PR got reverted; the revert of the revert is #35091. I'll test once the wheel is built. |
Seems like this test hasn't run for a while. We should follow up with @can-anyscale to verify why it hasn't run. |
Sorry for the late update. We have already restarted this work. The PR will be created in a few days. |
@rkooo567: this test is on a nightly-3x schedule; the last time it ran, I think it was still failing. |
Is it consistently failing? Maybe we should just reduce num_nodes to something like 1000 instead of letting it keep failing. |
@rkooo567: it has been failing for 3 weeks; we should prioritize this to avoid delaying the upcoming release. |
@iycheng if #35091 doesn't solve the issue (can you verify it by running the release test from the PR?), I think the best way is to reduce our scalability envelope and revisit. |
Can we give this a try, @iycheng, @rkooo567? Thanks |
@rkooo567 I think it's a regression. Shouldn't we fix the regression? The test had been running healthily for more than a month; then the file descriptor leak broke it. What's worse, during the regression there was another bug that prevented us from bisecting, which means fixing the FD leak alone is not enough. So there are two bugs. The goal should be to identify why it's broken and fix it in 2.5, not just to be able to run the test. Btw, the FD leak quick fix we discussed last time doesn't prevent GCS from leaking FDs; the root cause is still the same. |
After applying the FD fixes, the new logs show:
Seems the worker failure is not handled in GCS. The crash is in destruction: |
@iycheng: this test is passing on master now; can you confirm and close the issue if that's the case? Thanks |
@iycheng: this test is now failing again for a different reason: https://buildkite.com/ray-project/release-tests-branch/builds/1657#01882a60-6091-4c35-99ba-52dc33df0c93 |
It also broke one CI test, so it got reverted. Taking another look. |
The failure is because sometimes a worker failure fails to be reported. Investigating the root cause. |
When the test failed, other things had happened that increased the FD usage of a core worker. That's why my test passed previously but failed after it was merged. For the short term, we'll update the test: #35546. Successful run here:
I'll rerun the tests in the PR before merging. After that, I'll check which PR increased the FDs and add tests to prevent that from happening again. |
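For context, this is roughly how the FD usage of a process can be checked on Linux, by counting the entries under /proc/&lt;pid&gt;/fd; this is only a sketch, not the actual release test, and the threshold in it is made up:

```cpp
// build: g++ -std=c++17 fd_count.cc -o fd_count
#include <filesystem>
#include <iostream>
#include <string>
#include <unistd.h>

// Count entries under /proc/<pid>/fd, which on Linux is the number of file
// descriptors the process currently has open.
int CountOpenFds(pid_t pid) {
  namespace fs = std::filesystem;
  int count = 0;
  for (const auto &entry : fs::directory_iterator("/proc/" + std::to_string(pid) + "/fd")) {
    (void)entry;  // only the count matters here
    ++count;
  }
  return count;
}

int main() {
  const int fds = CountOpenFds(getpid());
  std::cout << "open fds: " << fds << "\n";
  // A regression test could assert an upper bound here; 1024 is only an
  // illustrative threshold, not the budget used by the release test.
  return fds > 1024 ? 1 : 0;
}
```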
BTW, @iycheng, the 2.5 release branch still has your previous fix that was reverted in master; do we need to revert it in the release branch as well? Thanks |
Since this is a release-blocker issue, please close it only after the cherry-pick fix is merged into the 2.5 release branch. Please add @ArturNiederfahrenhorst as one of the reviewers of the fix as well for tracking purposes. Thanks! |
@can-anyscale I double-checked; I don't think it's there. I previously created a cherry-pick, #35420, and I have already closed it. |
@iycheng got you, that's great, thank you |
After a day of checking, it turns out that the work to get rid of grpcio actually increased the number of sockets in GCS. So there are two options for this test:
I have two PRs, one for each option. Both passed the nightly tests. The fixing one has CI failures; I'll try to fix them, but if it can't be merged in time, we'll go with option 1. |
## Why are these changes needed? After the GCS client is moved to C++, the FD usage is increased by one: previously it was 2 and after the move it's 3. In the fix, we reuse the channel to make sure there are only 2 connections between GCS and CoreWorker. We still create 3 channels, but we use the same arguments to create them and depend on gRPC to reuse the TCP connections. The reason why it was previously 2 hasn't been figured out; maybe gRPC has some hidden mechanism that can reuse the connection in some way. ## Related issue number #34635
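For illustration only (this is a sketch of the idea in the PR description, not the actual patch), channels to the same target created with identical `ChannelArguments` give gRPC the chance to share the underlying TCP connection; the target address below is made up:

```cpp
// build: g++ -std=c++17 channel_reuse.cc -o channel_reuse $(pkg-config --cflags --libs grpc++)
#include <grpcpp/grpcpp.h>
#include <iostream>
#include <string>

int main() {
  const std::string target = "127.0.0.1:6379";  // illustrative GCS address, not a real default

  // Use identical channel arguments for every channel to the same target.
  // With gRPC's shared subchannel pool, channels whose target and arguments
  // match can reuse one underlying TCP connection instead of opening a new
  // socket (and hence a new FD) per channel.
  grpc::ChannelArguments args;
  args.SetInt(GRPC_ARG_MAX_RECEIVE_MESSAGE_LENGTH, -1);

  auto creds = grpc::InsecureChannelCredentials();
  auto channel_a = grpc::CreateCustomChannel(target, creds, args);
  auto channel_b = grpc::CreateCustomChannel(target, creds, args);
  auto channel_c = grpc::CreateCustomChannel(target, creds, args);

  std::cout << "created 3 channels to " << target << " with identical arguments\n";
  return 0;
}
```

Whether the connection is actually shared depends on gRPC's subchannel pooling; channels created with differing arguments get distinct subchannels and therefore distinct sockets, which appears consistent with the FD counts described above.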
@can-anyscale the fix has been merged. Feel free to verify it on your end once the master wheel is built. The successful run: https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_ywxup58wj76i8567e52l3uiijb |
@iycheng w00h00 thanks |
Triggered another run through Buildkite on the master branch, and it passed: https://buildkite.com/ray-project/release-tests-branch/builds/1692#0188500b-3c34-4fc8-8cbe-2566859c716c |
@iycheng , awesome, let's pick this! |
Nice! 🙂 |
What happened + What you expected to happen
The test failed with a timeout; it seems to be an infra issue to me:
Versions / Dependencies
master
Reproduction script
https://buildkite.com/ray-project/release-tests-branch/builds/1568#01879142-b94c-48b2-9af8-bac773cd78b0
Issue Severity
None