High error rate and choking of receives during load test #5452

Closed
philipgough opened this issue Jun 29, 2022 · 7 comments

@philipgough
Contributor

Thanos, Prometheus and Golang version used:
The same behaviour was observed with both Thanos v0.27.0-rc.0 and v0.25.2.

What happened:

We noticed issues when load testing Thanos Receive with twenty million active series at 2 DPM (data points per minute).

Relevant config from receiver:

            - '--receive.replication-factor=3'
            - '--tsdb.retention=4d'
            - '--receive-forward-timeout=2m'

Note that the load test has had several successful runs with the replication factor set to 1, but the issue appears to be reproducible at will with the above configuration.
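For context on what this replication factor implies (if I read the replication logic correctly): with --receive.replication-factor=3, each remote-write request is fanned out to 3 receivers and needs a write quorum of 2 acknowledgements ((3 / 2) + 1), so a single slow replica can still hold a request open for up to --receive-forward-timeout. A rough PromQL sketch for watching the forward failure rate, assuming the thanos_receive_forward_requests_total metric and its result label (names may differ between versions) and that the receivers are scraped with a namespace label:

    sum by (result) (
      rate(thanos_receive_forward_requests_total{namespace="observatorium-metrics-testing"}[5m])
    )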

We have 6 receive replicas running on r5.2xlarge instances.
They are scheduled with memory requests of 55GiB and limits of 64GiB.
We don't observe any OOMKilled events during the period of high error rate.

Screenshot 2022-06-29 at 10 31 57

We burned through our error budget within minutes, at which point I killed the load test.

We have Jaeger running in the cluster, and I have taken screenshots of a sample of requests taking in excess of 2m (the forward timeout):

Screenshot 2022-06-29 at 10 15 53

Screenshot 2022-06-29 at 10 16 04

We can see from the spans that there are issues with writing to the TSDB.

Screenshot 2022-06-29 at 10 16 38

Memory usage appears well within our resource constraints:

Screenshot 2022-06-29 at 10 36 17

We see the receiver logs spammed with the following:

level=debug ts=2022-06-29T09:10:17.418878006Z caller=handler.go:688 component=receive component=receive-handler msg="request failed, but not needed to achieve quorum" err="forwarding request to endpoint observatorium-thanos-receive-default-4.observatorium-thanos-receive-default.observatorium-metrics-testing.svc.cluster.local:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded"

This is followed by a flood of:

level=debug ts=2022-06-29T09:10:27.850125745Z caller=writer.go:90 component=receive component=receive-writer tenant=0fc2b00e-201b-4c17-b9f2-19d91adc4fd2 msg="Out of order sample" lset="{__name__=\"k6_generated_metric_12667\", __replica__=\"replica_0\", cardinality_1e1=\"1266793\", cardinality_1e2=\"126679\", cardinality_1e3=\"12667\", cardinality_1e4=\"1266\", cardinality_1e5=\"126\", cardinality_1e6=\"12\", cardinality_1e7=\"1\", cardinality_1e8=\"0\", cardinality_1e9=\"0\", cluster=\"cluster_0\", series_id=\"12667936\"}" value=1.65649371e+09 timestamp=1656493710432
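To put a number on the out-of-order rejections rather than just eyeballing the debug logs, something along these lines should work (a sketch assuming the embedded TSDB exposes the usual prometheus_tsdb_out_of_order_samples_total counter and that receive wraps it with a tenant label, which may vary by version):

    sum by (tenant) (
      rate(prometheus_tsdb_out_of_order_samples_total{namespace="observatorium-metrics-testing"}[5m])
    )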

This all looks similar to what is reported in #4831, but here it does appear to be an effect of the replication factor.

Screenshot 2022-06-29 at 10 44 52

As I said, we can reproduce this at will, so let me know if there is anything else I can provide that would help the investigation.

@bwplotka
Member

Thanks for the report.

One thing worth checking is CPU usage. The CPU might simply be saturated, causing slowdowns. If not CPU, then lock contention might be the issue. In both cases profiles would be amazing.
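Something like the following would show per-pod CPU usage to compare against the limits (a sketch assuming cAdvisor metrics are available and that the pods match the observatorium-thanos-receive-default-* naming seen in the logs):

    sum by (pod) (
      rate(container_cpu_usage_seconds_total{namespace="observatorium-metrics-testing", pod=~"observatorium-thanos-receive-default.*", container!=""}[5m])
    )

For profiles, the receivers should expose the standard Go pprof endpoints on their HTTP port, so a CPU profile captured while the load test is running would be the most useful.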

@bwplotka
Member

Also, do I see correctly that we have 6 x 10 million head chunks? Perhaps our replication is too naive and picks series randomly (as @moadz suggested at some point)?

@bwplotka
Member

bwplotka commented Jun 29, 2022

Also, if we have any repro scripts, it would be amazing to link them here (:

@bill3tt
Contributor

bill3tt commented Jun 29, 2022

60 million head chunks is about right here, as the script is producing 20M active series and we have a replication factor of 3.
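In other words, 20M active series replicated 3 ways is 60M series cluster-wide, or roughly 10M per replica across the 6 receivers, which lines up with the graph. This can be cross-checked per pod with the TSDB head metrics, assuming the standard metric names are exposed:

    sum by (pod) (prometheus_tsdb_head_chunks{namespace="observatorium-metrics-testing"})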

@bwplotka
Member

Ack, makes sense. I thought we had 10M pushed.

@fpetkovski
Contributor

fpetkovski commented Jul 1, 2022

It is worth checking container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total to see if and how often receivers are getting CPU throttled.
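For example, a per-pod throttling ratio over the last few minutes (a sketch assuming the usual cAdvisor metric names; the exact selectors depend on how the pods are scraped):

    sum by (pod) (
      rate(container_cpu_cfs_throttled_periods_total{namespace="observatorium-metrics-testing", pod=~"observatorium-thanos-receive-default.*"}[5m])
    )
    /
    sum by (pod) (
      rate(container_cpu_cfs_periods_total{namespace="observatorium-metrics-testing", pod=~"observatorium-thanos-receive-default.*"}[5m])
    )

A ratio close to 1 means the containers spend most CFS periods throttled.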

@philipgough
Contributor Author

@fpetkovski thanks for the pointer. It looks like in this case there were some resourcing issues, and at least some of the slowness is taken care of by #5566.

I'm going to close this based on this comment and will reopen if I notice the issue again despite having additional resources.
