[Core][State Observability][Log] gRPC max size limit when ray logs --tail 1000000 #27009

Closed
rkooo567 opened this issue Jul 26, 2022 · 2 comments · Fixed by #28188
Labels: bug (Something that is supposed to be working; but isn't), dashboard (Issues specific to the Ray Dashboard), P1 (Issue that should be fixed within a few weeks), Ray 2.4


@rkooo567
Contributor

What happened + What you expected to happen

Run non_streaming_shuffle_1tb_1000_partition with debug logging on.

Running ray logs raylet.out --tail 1000000 then hits the gRPC max message size limit. In theory this should never happen, since we stream logs.

(base) ray:~/e2e-tests% ray logs raylet.out --tail 1000000
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 2539, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 2120, in logs
    timeout=timeout,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/experimental/state/api.py", line 805, in get_log
    raise RayStateApiException(error_msg)
ray.experimental.state.exception.RayStateApiException: 0Closing HTTP stream due to internal server error.
<AioRpcError of RPC that terminated with:
        status = StatusCode.RESOURCE_EXHAUSTED
        details = "Received message larger than max (195781839 vs. 4194304)"
        debug_error_string = "{"created":"@1658831854.038323713","description":"Received message larger than max (195781839 vs. 4194304)","file":"src/core/ext/filters/message_size/message_size_filter.cc","file_line":205,"grpc_status":8}"
>

Versions / Dependencies

Master

Reproduction script

Run non_streaming_shuffle_1tb_1000_partition with debug logging on.

Run ray logs raylet.out --tail 1000000

Issue Severity

No response

@rkooo567 rkooo567 added bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks labels Jul 26, 2022
@rkooo567 rkooo567 added this to the Ray State Observability milestone Jul 26, 2022
@rkooo567
Contributor Author

cc @rickyyx this is the issue we talked about today. I think we should fix it after the branch cut. When I specified --tail -1, it worked.

@rickyyx
Contributor

rickyyx commented Jul 26, 2022

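Yeah, I think this is because we only do chunking when tail == -1 in the log agent when serving the stream. For any other tail value, the requested lines go out as a single message, which can exceed the gRPC size limit. I will see if I can get to it by the branch cut.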
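
To illustrate the chunking idea in this comment, here is a minimal Python sketch. It is not Ray's actual log agent code; the names `find_tail_offset`, `stream_tail`, and `max_chunk_bytes` are hypothetical. It assumes the fix amounts to locating where the last N lines start and then sending the bytes from there in pieces that each stay below the 4194304-byte limit from the error above, whatever the tail value is.

```python
import os
from typing import BinaryIO, Iterator

# Default gRPC receive limit reported in the traceback (4 MiB).
GRPC_DEFAULT_MAX_BYTES = 4 * 1024 * 1024


def find_tail_offset(f: BinaryIO, num_lines: int, block_size: int = 64 * 1024) -> int:
    """Return the byte offset at which the last `num_lines` lines begin.

    Assumption for this sketch: num_lines == -1 (any negative) means "whole file".
    """
    f.seek(0, os.SEEK_END)
    end = f.tell()
    if num_lines < 0:
        return 0
    if num_lines == 0 or end == 0:
        return end
    # A trailing newline ends the last line; it does not start a new one.
    f.seek(end - 1)
    pos = end - 1 if f.read(1) == b"\n" else end
    remaining = num_lines
    # Scan backwards block by block, counting newlines.
    while pos > 0:
        start = max(0, pos - block_size)
        f.seek(start)
        block = f.read(pos - start)
        idx = len(block)
        while True:
            idx = block.rfind(b"\n", 0, idx)
            if idx == -1:
                break
            remaining -= 1
            if remaining == 0:
                return start + idx + 1
        pos = start
    return 0  # The file has fewer lines than requested.


def stream_tail(
    path: str, num_lines: int, max_chunk_bytes: int = 1024 * 1024
) -> Iterator[bytes]:
    """Yield the last `num_lines` lines of `path` in chunks of bounded size."""
    if not 0 < max_chunk_bytes < GRPC_DEFAULT_MAX_BYTES:
        raise ValueError("max_chunk_bytes must be positive and below the gRPC limit")
    with open(path, "rb") as f:
        f.seek(find_tail_offset(f, num_lines))
        while True:
            chunk = f.read(max_chunk_bytes)
            if not chunk:
                break
            yield chunk
```

In a server-streaming handler, each yielded chunk would become one response message, so the size of a single message no longer depends on how many lines were requested.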

@rickyyx rickyyx self-assigned this Jul 26, 2022
rkooo567 pushed a commit that referenced this issue Jul 27, 2022
…ger (#27071)

Signed-off-by: rickyyx [email protected]

# Why are these changes needed?

This should make the state API more scalable, and somewhat lower the chance of issues like #27009 happening.
@richardliaw richardliaw added the core Issues that should be addressed in Ray Core label Oct 7, 2022
@scottsun94 scottsun94 added the observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling label Oct 13, 2022
@rkooo567 rkooo567 added dashboard Issues specific to the Ray Dashboard and removed observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Oct 30, 2022
@rickyyx rickyyx added Ray 2.4 and removed core Issues that should be addressed in Ray Core labels Feb 22, 2023