Cap the number of stats kept in StatsActor and purge in FIFO order if the limit exceeded #27964

Merged: 6 commits, Aug 18, 2022

@jianoaix jianoaix commented Aug 17, 2022

Signed-off-by: jianoaix [email protected]

Why are these changes needed?

StatsActor risks using too much memory, because its lifetime is the same as the cluster's lifetime.
This PR caps how many stats are kept and purges the oldest stats in FIFO order when the cap is exceeded.
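
A minimal sketch of the cap-and-purge idea described above, in plain Python without Ray. The class name `BoundedStats` and its fields are hypothetical and chosen for illustration; this is not the actual StatsActor code.

```python
import time
from collections import deque


class BoundedStats:
    """Hypothetical stand-in for a long-lived stats holder (not StatsActor)."""

    def __init__(self, max_stats=1000):
        self.max_stats = max_stats
        self.stats = {}            # uuid -> recorded stats
        self.fifo_queue = deque()  # uuids in insertion order, oldest first

    def record_start(self, uuid):
        # Only enqueue a uuid the first time it is seen.
        if uuid not in self.stats:
            self.fifo_queue.append(uuid)
        self.stats[uuid] = {"start_time": time.perf_counter()}
        # Purge the oldest entries once the cap is exceeded.
        while len(self.fifo_queue) > self.max_stats:
            oldest = self.fifo_queue.popleft()
            self.stats.pop(oldest, None)

    def get(self, uuid):
        # Purged or unknown uuids yield an empty result.
        return self.stats.get(uuid, {})
```

With `max_stats=3`, recording uuids 0 through 3 leaves three entries, and `get(0)` returns an empty dict because the oldest entry was purged first.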

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@ericl ericl left a comment

Why choose the more complex logic? There is nothing wrong with creating an admin actor: it consumes zero resources, and you can bound its state with a FIFO queue.

The recreate code is very scary and practically guaranteed to break in the future.

@jianoaix jianoaix changed the title from "Make stats actor per-job instead of per-cluster/detached" to "Cap the number of stats kept in StatsActor and purge in FIFO order if the limit exceeded" Aug 18, 2022
@jianoaix
Contributor Author

@ericl suggestion adopted. PTAL, thanks.


TODO(ekl) we should consider refactoring LazyBlockList so stats can be
extracted without using an out-of-band actor."""

- def __init__(self):
+ def __init__(self, max_stats=100 * 1000):

Contributor

A bit high?

Contributor

How about 1k or 10k?

Contributor Author

Reduced to 10k. IIRC, people have talked about block counts above 1000, so I set it to 10k for safety (it won't take a significant number of bytes anyway).

Contributor Author

Hm, it looks like the uuid is per Dataset, not per block, so I'm lowering it to 1k.

# Add the fourth stats to exceed the limit.
actor.record_start.remote(3)
# The first stats (with uuid=0) should have been purged.
assert ray.get(actor.get.remote(0))[0] == {}

Contributor

An even stronger test would be to query the internal dict sizes, to verify the deletion actually happened.

Contributor Author

Added a private method to the StatsActor to query the dict sizes.
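
For illustration, here is a sketch of what such a size check could look like. It uses a plain-Python stand-in rather than a Ray actor; `TinyStats`, its fields, and `_dict_sizes` are hypothetical names, not the ones added in this PR. The point is that a lookup returning `{}` only shows that one mapping lost the entry, while a size check can confirm that every per-uuid dict stays bounded.

```python
class TinyStats:
    """Hypothetical stand-in that keeps several per-uuid dicts."""

    def __init__(self, max_stats):
        self.max_stats = max_stats
        self.start_time = {}
        self.metadata = {}
        self.fifo_queue = []

    def record_start(self, uuid):
        self.start_time[uuid] = 0.0
        self.metadata[uuid] = {}
        self.fifo_queue.append(uuid)
        # Purge the oldest uuid from every dict once the cap is exceeded.
        if len(self.fifo_queue) > self.max_stats:
            oldest = self.fifo_queue.pop(0)
            self.start_time.pop(oldest, None)
            self.metadata.pop(oldest, None)

    def _dict_sizes(self):
        # Test-only helper (hypothetical name): sizes of the internal dicts.
        return len(self.start_time), len(self.metadata)


def check_purge_frees_state():
    stats = TinyStats(max_stats=3)
    for uuid in range(3):
        stats.record_start(uuid)
    assert stats._dict_sizes() == (3, 3)
    stats.record_start(3)  # the fourth entry exceeds the cap
    # Sizes stay at the cap, so the entry for uuid=0 was really deleted.
    assert stats._dict_sizes() == (3, 3)
    assert 0 not in stats.start_time and 0 not in stats.metadata
    return stats._dict_sizes()
```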

@ericl ericl self-assigned this Aug 18, 2022
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 18, 2022
@jianoaix jianoaix removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 18, 2022
Contributor

@clarkzinzow clarkzinzow left a comment

Nice!

@ericl ericl merged commit 440ae62 into ray-project:master Aug 18, 2022