Store: High memory usage on startup after upgrading to 0.31.0 #6251
A similar issue was reported in another ticket for the Receive component: #6176 (comment). Does removing the limit help?
@fpetkovski I just tried dropping that limit, but the memory spike still happens.
@fpetkovski any other ideas to try regarding this would be appreciated.
Unfortunately I am not aware of any other changes that could be contributing to the memory spike.
Hi. I've observed a change in behavior between v0.30.2 and v0.31.0 regarding the type of memory used. Both versions use the same total amount of memory, but v0.30.2 and earlier account for it as RSSFile (file-backed page cache), while v0.31.0 accounts for it as RSSAnon (anonymous memory). In my Kubernetes setup this change triggers OOMKills, since RSSAnon counts toward the container memory limit.
For v<=0.30.2:
For v0.31.0:
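For reference (not from the original report), the split can be inspected on Linux by reading the RSS breakdown from /proc; a minimal sketch assuming a single thanos process on the host:

```sh
# RssAnon vs RssFile vs RssShmem for the thanos process
# (these fields are available since Linux 4.5);
# pgrep -o picks the oldest matching process.
grep -E '^Rss(Anon|File|Shmem)' /proc/$(pgrep -o thanos)/status
```

This is relevant to the OOMKill behavior described above: file-backed pages can be reclaimed under memory pressure before the kernel resorts to OOM-killing, while anonymous pages (absent swap) cannot.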
This PR could have fixed the issue: #6509
Thanos, Prometheus and Golang version used:
Thanos:
goversion="go1.19.7", revision="50c464132c265eef64254a9fd063b1e2419e09b7", version="0.31.0"
Prometheus:
goversion="go1.19.2", revision="dcd6af9e0d56165c6f5c64ebbc1fae798d24933a", version="2.39.1"
Object Storage Provider:
GCP Storage and AWS S3
What happened:
Memory usage spike during startup after upgrading from 0.28.0 to 0.31.0.
After the memory spike I downgraded and then upgraded gradually from 0.28.0, one release at a time. The spike on startup appears only when going from 0.30.2 to 0.31.0, so a change introduced in 0.31.0 is the culprit.
What you expected to happen:
Memory usage stays roughly the same.
How to reproduce it (as minimally and precisely as possible):
We run Thanos on both GCP and AWS, and the issue reproduces on both cloud providers.
Pod args:
This store serves metrics older than ~30 days. Our retention is 2 years; the 30-days-to-2-years window is queried very rarely, which is why we delegate it to a single instance (see the sketch below).
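The exact pod args are not shown above; a minimal sketch of this kind of time partitioning, using Thanos Store's --min-time/--max-time flags (the objstore config path is hypothetical; the time values are inferred from the description):

```sh
# Serve only blocks in the rarely-queried 30d..2y window from object storage.
# /etc/thanos/objstore.yml is a hypothetical path.
thanos store \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --min-time=-2y \
  --max-time=-30d
```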
Full logs to relevant components:
There is nothing special in the logs, just a huge list of events like the one below:
Anything else we need to know: