
Store: High memory usage on startup after upgrading to 0.31.0 #6251

Open
anas-aso opened this issue Mar 31, 2023 · 8 comments

Comments

@anas-aso
Contributor

Thanos, Prometheus and Golang version used:
Thanos: goversion="go1.19.7", revision="50c464132c265eef64254a9fd063b1e2419e09b7", version="0.31.0"
Prometheus: goversion="go1.19.2", revision="dcd6af9e0d56165c6f5c64ebbc1fae798d24933a", version="2.39.1"

Object Storage Provider:
GCP Storage and AWS S3

What happened:
Memory usage spikes during startup after upgrading from 0.28.0 to 0.31.0.
After seeing the spike I downgraded and then upgraded gradually, one version at a time, starting from 0.28.0. The startup memory spike only appears when going from 0.30.2 to 0.31.0, so the changes introduced in 0.31.0 are the culprit.
[Screenshot 2023-03-31 at 12:26: memory usage graph showing the startup spike]

What you expected to happen:
Memory usage stays roughly the same.

How to reproduce it (as minimally and precisely as possible):
We run Thanos on both GCP and AWS, and the issue shows up on both cloud providers.

POD args

    spec:
      containers:
      - args:
        - store
        - --log.format=json
        - --data-dir=/var/thanos/store
        - --objstore.config-file=/thanos_config.yaml
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:19191
        - --consistency-delay=10m
        - --ignore-deletion-marks-delay=0s
        - --max-time=-719h
        - --store.grpc.series-max-concurrency=5
        - --store.grpc.series-sample-limit=50000000
        - --store.enable-index-header-lazy-reader
        image: thanosio/thanos:v0.31.0

This store serves metrics older than ~30 days. Our retention is 2 years; the 30-days-to-2-years range is queried very rarely, which is why we delegate it to a single instance.
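For anyone trying to reproduce, a quick way to watch per-container memory during the rollout is a sketch along these lines (it assumes metrics-server is installed; the namespace and label selector are placeholders, not taken from the manifest above):

    # Poll container memory every 15s while the store pod starts up.
    # Namespace and label selector are placeholders for the actual deployment.
    watch -n 15 kubectl -n monitoring top pod -l app.kubernetes.io/name=thanos-store --containers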

Full logs to relevant components:
There is nothing special in the logs, just a huge list of events like the one below:

Logs

{
    "@timestamp": "2023-03-31T10:15:18.234182290Z",
    "caller": "bucket.go:654",
    "elapsed": "5.849035528s",
    "id": "01FNAN7EDKBJ9762ZVSV0VDCSH",
    "level": "info",
    "msg": "loaded new block"
}
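If it helps with triage, the number of these events can be counted directly from the container logs (a sketch; the namespace and pod name are placeholders):

    # Count how many blocks were loaded during startup.
    kubectl -n monitoring logs thanos-store-0 | grep -c 'loaded new block'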

Anything else we need to know:

@fpetkovski
Contributor

A similar issue was reported in another ticket for the Receive component: #6176 (comment).

Does removing the --store.grpc.series-sample-limit=50000000 eliminate the spike?

@anas-aso
Contributor Author

@fpetkovski I just tried dropping that limit, but the memory spike still happens.

@anas-aso
Contributor Author

@fpetkovski any other ideas to try regarding this would be appreciated.

@fpetkovski
Contributor

Unfortunately I am not aware of any other changes that could be contributing to the memory spike.

@demikl

demikl commented Jul 20, 2023

Hi.

I've observed a change in behavior between v0.30.2 and v0.31.0, regarding the type of memory used.

Both versions use roughly the same total amount of memory, but v0.30.2 and earlier hold it mostly as RssFile (file cache?), while v0.31.0 holds it as RssAnon. In my Kubernetes setup this change triggers OOMKills, since RssAnon is counted toward the container memory limit.

For v<=0.30.2:

/ # cat /proc/1/status
[...]
VmPeak: 23497568 kB
VmSize: 23497568 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:  22818140 kB
VmRSS:  22818140 kB
RssAnon:     1500400 kB
RssFile:    21317740 kB
RssShmem:          0 kB
VmData:  1557020 kB
VmStk:       140 kB
VmExe:     24052 kB
VmLib:         8 kB
VmPTE:     44788 kB
VmSwap:        0 kB

For v0.31.0:

/ # cat /proc/1/status
[...]
VmPeak: 30499504 kB
VmSize: 30499504 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:  26583296 kB
VmRSS:  26568004 kB
RssAnon:    24831888 kB
RssFile:     1736116 kB
RssShmem:          0 kB
VmData: 26235004 kB
VmStk:       140 kB
VmExe:     27896 kB
VmLib:         8 kB
VmPTE:     53868 kB
VmSwap:        0 kB
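A simple way to see which counter grows during startup (a sketch; it assumes the Thanos process is PID 1 in the container, as in the dumps above):

    / # while sleep 10; do grep -E 'VmRSS|RssAnon|RssFile' /proc/1/status; done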

@fpetkovski
Contributor

This PR could have fixed the issue: #6509

@jpds
Contributor

jpds commented Aug 19, 2023

Upgraded a system from 0.28.0 to 0.32.0-rc.0 and this is still an issue:

[Graph: thanos-store-api-memory-basic, showing the memory spike persisting after the upgrade]

@yeya24
Contributor

yeya24 commented Sep 11, 2023

@jpds I believe the issue seen in 0.32.0-rc.0 has since been fixed. Please try v0.32.2 and see if it works for you.
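For anyone upgrading in place, a minimal sketch of the version bump (the namespace, statefulset name, and container name are assumptions about your setup):

    # Bump the store container image and wait for the rollout to finish.
    kubectl -n monitoring set image statefulset/thanos-store thanos-store=thanosio/thanos:v0.32.2
    kubectl -n monitoring rollout status statefulset/thanos-store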
