Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core][state] Task backend - Profile events capping #33321

Merged
merged 3 commits into from
Mar 20, 2023

Conversation

rickyyx
Copy link
Contributor

@rickyyx rickyyx commented Mar 15, 2023

Why are these changes needed?

  1. This PR restricts the number of profile events to be sent and aggregates task events from the same task attempt on the worker side to reduce the data sent to GCS.
  2. This PR also refactos the metrics tracking to reduce lock contention on the core worker.

Related issue number

Release tests:
https://buildkite.com/ray-project/release-tests-pr/builds/30968#0186e373-19ec-4335-a4af-b82d5e9e0e42

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: rickyyx <[email protected]>
@rickyyx rickyyx marked this pull request as ready for review March 15, 2023 06:36
Signed-off-by: rickyyx <[email protected]>
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have 2 questions!

  1. Is it common to have a task event with 100+ profile events? It seems extremely unlikely already
  2. When profile events > threshold, why do we drop the task event itself?

src/ray/common/ray_config_def.h Show resolved Hide resolved
src/ray/core_worker/task_event_buffer.cc Show resolved Hide resolved
@rickyyx
Copy link
Contributor Author

rickyyx commented Mar 15, 2023

Release test:

  • stress_test_state_api_scale: passes:
  • stress_test_many_tasks: stage 3 regression fixed from 2.7k to 2.5k,

Signed-off-by: rickyyx <[email protected]>
@rickyyx
Copy link
Contributor Author

rickyyx commented Mar 17, 2023

Added comments.

@rickyyx rickyyx added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Mar 17, 2023
@rickyyx
Copy link
Contributor Author

rickyyx commented Mar 17, 2023

@rkooo567 - this shoudl be ready to merge.

@rkooo567 rkooo567 merged commit fb7e9d6 into ray-project:master Mar 20, 2023
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
This PR restricts the number of profile events to be sent and aggregates task events from the same task attempt on the worker side to reduce the data sent to GCS.
This PR also refactos the metrics tracking to reduce lock contention on the core worker.

Signed-off-by: Edward Oakes <[email protected]>
clarng pushed a commit to clarng/ray that referenced this pull request Mar 23, 2023
This PR restricts the number of profile events to be sent and aggregates task events from the same task attempt on the worker side to reduce the data sent to GCS.
This PR also refactos the metrics tracking to reduce lock contention on the core worker.
chaowanggg pushed a commit to chaowanggg/ray-dev that referenced this pull request Apr 4, 2023
This PR restricts the number of profile events to be sent and aggregates task events from the same task attempt on the worker side to reduce the data sent to GCS.
This PR also refactos the metrics tracking to reduce lock contention on the core worker.

Signed-off-by: chaowang <[email protected]>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
This PR restricts the number of profile events to be sent and aggregates task events from the same task attempt on the worker side to reduce the data sent to GCS.
This PR also refactos the metrics tracking to reduce lock contention on the core worker.

Signed-off-by: elliottower <[email protected]>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
This PR restricts the number of profile events to be sent and aggregates task events from the same task attempt on the worker side to reduce the data sent to GCS.
This PR also refactos the metrics tracking to reduce lock contention on the core worker.

Signed-off-by: Jack He <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants