
Store: High memory usage on startup after upgrading to 0.31.0 #6251

Open
anas-aso opened this issue Mar 31, 2023 · 8 comments

Comments

@anas-aso
Contributor

Thanos, Prometheus and Golang version used:
Thanos: goversion="go1.19.7", revision="50c464132c265eef64254a9fd063b1e2419e09b7", version="0.31.0"
Prometheus: goversion="go1.19.2", revision="dcd6af9e0d56165c6f5c64ebbc1fae798d24933a", version="2.39.1"

Object Storage Provider:
GCP Storage and AWS S3

What happened:
Memory usage spikes during startup after upgrading from 0.28.0 to 0.31.0.
After seeing the spike I downgraded and then upgraded gradually, one version at a time, starting from 0.28.0. The startup memory spike only appears when going from 0.30.2 to 0.31.0, so the changes introduced in 0.31.0 are the culprit.
[Screenshot 2023-03-31 at 12:26: memory usage graph showing the startup spike]

What you expected to happen:
Memory usage stays roughly the same.

How to reproduce it (as minimally and precisely as possible):
We run Thanos on both GCP and AWS, and the issue shows up on both cloud providers.

POD args

    spec:
      containers:
      - args:
        - store
        - --log.format=json
        - --data-dir=/var/thanos/store
        - --objstore.config-file=/thanos_config.yaml
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:19191
        - --consistency-delay=10m
        - --ignore-deletion-marks-delay=0s
        - --max-time=-719h
        - --store.grpc.series-max-concurrency=5
        - --store.grpc.series-sample-limit=50000000
        - --store.enable-index-header-lazy-reader
        image: thanosio/thanos:v0.31.0

This store serves metrics older than ~30 days. Our retention is 2 years; the 30-days-to-2-years range is queried very rarely, which is why we delegate it to a single instance.
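For anyone trying to reproduce, a quick way to watch per-container memory during the rollout is a sketch along these lines (it assumes metrics-server is installed; the namespace and label selector are placeholders, not taken from the manifest above):

    # Poll container memory every 15s while the store pod starts up.
    # Namespace and label selector are placeholders for the actual deployment.
    watch -n 15 kubectl -n monitoring top pod -l app.kubernetes.io/name=thanos-store --containers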

Full logs to relevant components:
There is nothing special in the logs, just a huge list of events like the one below:

Logs

{
    "@timestamp": "2023-03-31T10:15:18.234182290Z",
    "caller": "bucket.go:654",
    "elapsed": "5.849035528s",
    "id": "01FNAN7EDKBJ9762ZVSV0VDCSH",
    "level": "info",
    "msg": "loaded new block"
}
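If it helps with triage, the number of these events can be counted directly from the container logs (a sketch; the namespace and pod name are placeholders):

    # Count how many blocks were loaded during startup.
    kubectl -n monitoring logs thanos-store-0 | grep -c 'loaded new block'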

Anything else we need to know:

@fpetkovski
Contributor

A similar issue was reported in another ticket for the Receive component: #6176 (comment).

Does removing the --store.grpc.series-sample-limit=50000000 eliminate the spike?

@anas-aso
Contributor Author

@fpetkovski I just tried dropping that limit, but the memory spike still happens.

@anas-aso
Contributor Author

@fpetkovski any other ideas to try regarding this would be appreciated.

@fpetkovski
Contributor

Unfortunately I am not aware of any other changes that could be contributing to the memory spike.

@demikl

demikl commented Jul 20, 2023

Hi.

I've observed a change in behavior between v0.30.2 and v0.31.0, regarding the type of memory used.

Both versions use roughly the same total amount of memory, but v0.30.2 and earlier hold it mostly as RssFile (file cache?), while v0.31.0 holds it as RssAnon. In my Kubernetes setup this change triggers OOMKills, since RssAnon is counted toward the container memory limit.

For v<=0.30.2:

/ # cat /proc/1/status
[...]
VmPeak: 23497568 kB
VmSize: 23497568 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:  22818140 kB
VmRSS:  22818140 kB
RssAnon:     1500400 kB
RssFile:    21317740 kB
RssShmem:          0 kB
VmData:  1557020 kB
VmStk:       140 kB
VmExe:     24052 kB
VmLib:         8 kB
VmPTE:     44788 kB
VmSwap:        0 kB

For v0.31.0:

/ # cat /proc/1/status
[...]
VmPeak: 30499504 kB
VmSize: 30499504 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:  26583296 kB
VmRSS:  26568004 kB
RssAnon:    24831888 kB
RssFile:     1736116 kB
RssShmem:          0 kB
VmData: 26235004 kB
VmStk:       140 kB
VmExe:     27896 kB
VmLib:         8 kB
VmPTE:     53868 kB
VmSwap:        0 kB
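A simple way to see which counter grows during startup (a sketch; it assumes the Thanos process is PID 1 in the container, as in the dumps above):

    / # while sleep 10; do grep -E 'VmRSS|RssAnon|RssFile' /proc/1/status; done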

@fpetkovski
Contributor

This PR could have fixed the issue: #6509

@jpds
Contributor

jpds commented Aug 19, 2023

Upgraded a system from 0.28.0 to 0.32.0-rc.0 and this is still an issue:

[Graph: thanos-store-api-memory-basic, showing the memory spike persisting after the upgrade]

@yeya24
Contributor

yeya24 commented Sep 11, 2023

@jpds I believe the issue seen in 0.32.0-rc.0 has since been fixed. Please try v0.32.2 and see if it works for you.
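For anyone upgrading in place, a minimal sketch of the version bump (the namespace, statefulset name, and container name are assumptions about your setup):

    # Bump the store container image and wait for the rollout to finish.
    kubectl -n monitoring set image statefulset/thanos-store thanos-store=thanosio/thanos:v0.32.2
    kubectl -n monitoring rollout status statefulset/thanos-store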
