Cap the number of stats kept in StatsActor and purge in FIFO order if the limit exceeded #27964
Conversation
Signed-off-by: jianoaix <[email protected]>
Why choose the more complex logic? There is nothing wrong with creating an admin actor: it consumes zero resources, and you can bound its state with a FIFO queue.
The recreate code is very scary and practically guaranteed to break in the future.
@ericl suggestion adopted. PTAL, thanks.
python/ray/data/_internal/stats.py (Outdated)

```diff
     TODO(ekl) we should consider refactoring LazyBlockList so stats can be
     extracted without using an out-of-band actor."""

-    def __init__(self):
+    def __init__(self, max_stats=100 * 1000):
```
A bit high?
How about 1 or 10k?
Reduced to 10k. IIRC there were people talking about block counts above 1000, so I set it to 10k for safety (it won't take a significant number of bytes anyway).
Hm, it looks like the uuid is per Dataset, not per block; lowering it to 1k.
```diff
+    # Add the fourth stats to exceed the limit.
+    actor.record_start.remote(3)
+    # The first stats (with uuid=0) should have been purged.
+    assert ray.get(actor.get.remote(0))[0] == {}
```
An even stronger test would be to query the internal dict sizes, to verify the deletion actually happened.
Added a private method to the StatsActor to query the dict sizes.
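A minimal sketch of what such a size-query helper could look like (the class and method names here are illustrative assumptions, not necessarily what the PR added to `StatsActor`):

```python
class StatsActorSketch:
    """Illustrative stand-in for StatsActor's internal bookkeeping.

    NOTE: hypothetical names; StatsActor tracks several per-uuid dicts,
    two of which are shown here.
    """

    def __init__(self):
        self.start_time = {}
        self.last_time = {}

    def _get_stats_dict_size(self):
        # Expose internal dict sizes so a test can assert that
        # purging actually removed entries, not just hid them.
        return len(self.start_time), len(self.last_time)
```

A test can then assert the returned sizes shrink after the cap is exceeded, rather than only checking that `get()` returns an empty result.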
Nice!
Signed-off-by: jianoaix [email protected]
Why are these changes needed?
There is a risk of using too much memory in StatsActor, because its lifetime is the same as the cluster's lifetime.
This puts a cap on how many stats to keep, and purges stats in FIFO order if the cap is exceeded.
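The cap-and-purge behavior described above can be sketched with an `OrderedDict`, whose insertion order gives FIFO eviction for free (a simplified model, assuming the `max_stats` parameter from the diff; the real StatsActor purges several internal dicts in lockstep):

```python
from collections import OrderedDict


class BoundedStats:
    """Keeps at most max_stats entries; purges the oldest-inserted first."""

    def __init__(self, max_stats=1000):
        self.max_stats = max_stats
        # Insertion order doubles as arrival order, so popitem(last=False)
        # removes the oldest entry, i.e. FIFO purging.
        self.stats = OrderedDict()

    def record_start(self, stats_uuid):
        self.stats[stats_uuid] = {}
        while len(self.stats) > self.max_stats:
            self.stats.popitem(last=False)

    def get(self, stats_uuid):
        # Purged (or unknown) uuids return an empty dict, mirroring the
        # behavior the PR's test asserts.
        return self.stats.get(stats_uuid, {})
```

With `max_stats=3`, recording a fourth uuid evicts uuid 0, which is exactly the scenario exercised in the PR's test.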
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.