[serve] prevent in memory metric store in handles from growing in memory #44877
Conversation
Force-pushed from 74e4301 to e5acb10
Force-pushed from 96a4a86 to 2f0a091
def prune_data(self, start_timestamp_s: float):
    """Prune keys that haven't had new data recorded after start_timestamp_s."""
    for key, datapoints in list(self.data.items()):
        if len(datapoints) == 0 or datapoints[-1].timestamp < start_timestamp_s:
            del self.data[key]
There's probably a better data structure we could use for this, e.g., numpy arrays instead of lists, but for now this is very unlikely to be a bottleneck.
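Roughly what the numpy suggestion could look like (an illustrative sketch only, not part of the PR): keeping each key's timestamps in a sorted numpy array makes trimming old datapoints a vectorized search plus slice instead of a Python-level scan.

```python
import numpy as np

# Sorted recording times and values for a single key (made-up numbers).
timestamps = np.array([1.0, 2.0, 3.5, 7.2, 9.9])
values = np.array([4.0, 5.0, 3.0, 6.0, 2.0])

# Trim everything older than the window start with a vectorized search + slice.
start_timestamp_s = 3.0
cutoff = np.searchsorted(timestamps, start_timestamp_s)  # first index >= window start
timestamps, values = timestamps[cutoff:], values[cutoff:]

# An empty result means the key has no recent data and could be dropped entirely.
```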
nit: name it "prune_keys" to make it clear that this does not prune datapoints that are outside of the window if a key is still active
def process_finished_request(self, replica_id: ReplicaID, *args):
def dec_num_running_requests_for_replica(self, replica_id: ReplicaID, *args):
nice!
[non blocking] It's a little worrisome to me that we now have two separate GC paths, one to prune outdated datapoints and one to prune keys with no datapoints. Is it possible to combine them into the same path w/o any performance degradation?
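For reference, a rough sketch of what a single combined pass might look like, assuming the same `self.data: Dict[key, List[TimeStampedValue]]` layout as the diff above (the method name and the bisect-based compaction here are illustrative, not necessarily what the PR ends up doing):

```python
import bisect


def prune_keys_and_compact_data(self, start_timestamp_s: float):
    """Drop keys with no data newer than start_timestamp_s; trim old datapoints otherwise."""
    for key, datapoints in list(self.data.items()):
        if len(datapoints) == 0 or datapoints[-1].timestamp < start_timestamp_s:
            # Stale key: nothing recorded inside the window, so remove it entirely.
            del self.data[key]
        else:
            # Active key: keep only datapoints inside the window. Datapoints are
            # appended in timestamp order, so bisect finds the cutoff index.
            timestamps = [d.timestamp for d in datapoints]
            idx = bisect.bisect_left(timestamps, start_timestamp_s)
            self.data[key] = datapoints[idx:]
```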
Force-pushed from 505f57e to 56834d0
Results from running:

Comparing results between master and the basic fix + combining the GC into a single path:
@edoakes I've combined them. It doesn't seem like there's any noticeable performance degradation.
Force-pushed from d9fb461 to 6ff3878
There are two potential sources of memory leak for the `InMemoryMetricsStore` in the handles that's used to record/report autoscaling metrics:

1. Old replica ID keys are never removed. We remove old replica keys from `num_queries_sent_to_replicas` when we get an updated list of running replicas from the long poll update, but we don't do any such cleaning for the in-memory metrics store. This means there is leftover, uncleaned data for replicas that are no longer running.
2. We don't delete data points recorded more than `look_back_period_s` ago, except during window average queries. This should mostly be solved once (1) is solved, because it should only be a problem for replicas that are no longer running.

This PR addresses (1) and (2) by periodically

* pruning keys that haven't had updated data points in the past `look_back_period_s`.
* compacting datapoints that are more than `look_back_period_s` old.

Main benchmarks picked from the full microbenchmark run posted below in the comments:

| metric | master | current changes | % change |
|--------|--------|-----------------|----------|
| http_p50_latency | 11.082594282925129 | 11.626139283180237 | 4.9044924534731305 |
| http_1mb_p50_latency | 11.81719359010458 | 12.776304967701435 | 8.116236484439 |
| http_10mb_p50_latency | 17.57313683629036 | 18.03796272724867 | 2.6450934473940757 |
| http_avg_rps | 204.2 | 195.04 | -4.48579823702252 |
| grpc_p50_latency | 7.965719327330589 | 8.844093419611454 | 11.026927465 |
| grpc_1mb_p50_latency | 17.652496695518494 | 19.921275787055492 | 12.852454418603475 |
| grpc_10mb_p50_latency | 142.39510521292686 | 153.88561598956585 | 8.069456291673038 |
| grpc_avg_rps | 203.35 | 211.01 | 3.766904352102296 |
| handle_p50_latency | 4.890996962785721 | 4.082906059920788 | -16.522007864929765 |
| handle_1mb_p50_latency | 11.582874692976475 | 10.905216448009014 | -5.8505186573275525 |
| handle_10mb_p50_latency | 65.54202642291784 | 67.52330902963877 | 3.0229193615962657 |
| handle_avg_rps | 394.57 | 404.85 | 2.6053678688192194 |

There is no performance degradation in latencies or throughput. All benchmarks were run with autoscaling turned on (instead of `num_replicas=1`, I just set `autoscaling_config={"min_replicas": 1, "max_replicas": 1}`).

Closes ray-project#44870.

Signed-off-by: Cindy Zhang <[email protected]>
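For context on the benchmark setup described above, a minimal sketch of a Serve deployment with autoscaling pinned to a single replica (the deployment body and names are placeholders, not taken from the benchmark code):

```python
from ray import serve


# Autoscaling is enabled but capped at one replica, mirroring the
# `autoscaling_config={"min_replicas": 1, "max_replicas": 1}` setting used for the runs.
@serve.deployment(autoscaling_config={"min_replicas": 1, "max_replicas": 1})
class Echo:
    async def __call__(self, request) -> str:
        return "hello"


app = Echo.bind()
# serve.run(app)  # deploy; requests will exercise the handle autoscaling metrics path
```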