High error rate and choking of receives during load test #5452

Closed
philipgough opened this issue Jun 29, 2022 · 7 comments

@philipgough
Contributor

Thanos, Prometheus and Golang version used:
The same behaviour was observed with both Thanos v0.27.0-rc.0 and v0.25.2.

What happened:

We noticed issues when load testing Thanos Receive with twenty million active series at 2 DPM (data points per minute).

Relevant config from receiver:

            - '--receive.replication-factor=3'
            - '--tsdb.retention=4d'
            - '--receive-forward-timeout=2m'

Note that the load test has had several successful runs with the replication factor set to 1, but the issue appears to be reproducible at will with the above configuration.
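For context on what this replication factor implies (if I read the replication logic correctly): with --receive.replication-factor=3, each remote-write request is fanned out to 3 receivers and needs a write quorum of 2 acknowledgements ((3 / 2) + 1), so a single slow replica can still hold a request open for up to --receive-forward-timeout. A rough PromQL sketch for watching the forward failure rate, assuming the thanos_receive_forward_requests_total metric and its result label (names may differ between versions) and that the receivers are scraped with a namespace label:

    sum by (result) (
      rate(thanos_receive_forward_requests_total{namespace="observatorium-metrics-testing"}[5m])
    )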

We have 6 receive replicas running on r5.2xlarge instances.
They are scheduled with memory requests of 55GiB and limits of 64GiB.
We don't observe any OOMKilled events during the period of high error rate.

Screenshot 2022-06-29 at 10 31 57

We burned through our error budget within minutes, at which point I killed the load test.

We have Jaeger running in the cluster, and I have taken screenshots of a sample of requests taking in excess of 2m (the forward timeout):

Screenshot 2022-06-29 at 10 15 53

Screenshot 2022-06-29 at 10 16 04

We can see from the spans that there are issues with writing to the TSDB.

Screenshot 2022-06-29 at 10 16 38

Memory usage appears well within our resource constraints:

Screenshot 2022-06-29 at 10 36 17

We see the receiver logs spammed with the following:

level=debug ts=2022-06-29T09:10:17.418878006Z caller=handler.go:688 component=receive component=receive-handler msg="request failed, but not needed to achieve quorum" err="forwarding request to endpoint observatorium-thanos-receive-default-4.observatorium-thanos-receive-default.observatorium-metrics-testing.svc.cluster.local:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded"

This is followed by a flood of:

level=debug ts=2022-06-29T09:10:27.850125745Z caller=writer.go:90 component=receive component=receive-writer tenant=0fc2b00e-201b-4c17-b9f2-19d91adc4fd2 msg="Out of order sample" lset="{__name__=\"k6_generated_metric_12667\", __replica__=\"replica_0\", cardinality_1e1=\"1266793\", cardinality_1e2=\"126679\", cardinality_1e3=\"12667\", cardinality_1e4=\"1266\", cardinality_1e5=\"126\", cardinality_1e6=\"12\", cardinality_1e7=\"1\", cardinality_1e8=\"0\", cardinality_1e9=\"0\", cluster=\"cluster_0\", series_id=\"12667936\"}" value=1.65649371e+09 timestamp=1656493710432
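To put a number on the out-of-order rejections rather than just eyeballing the debug logs, something along these lines should work (a sketch assuming the embedded TSDB exposes the usual prometheus_tsdb_out_of_order_samples_total counter and that receive wraps it with a tenant label, which may vary by version):

    sum by (tenant) (
      rate(prometheus_tsdb_out_of_order_samples_total{namespace="observatorium-metrics-testing"}[5m])
    )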

This all looks similar to what is reported in #4831, but here it does appear to be an effect of the replication factor.

Screenshot 2022-06-29 at 10 44 52

As I said, we can reproduce this at will, so let me know if there is anything else I can provide that would help the investigation.

@bwplotka
Member

Thanks for the report.

One thing worth checking is CPU usage. The CPU might simply be saturated, causing slowdowns. If not CPU, then lock contention might be the issue. In both cases profiles would be amazing.
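Something like the following would show per-pod CPU usage to compare against the limits (a sketch assuming cAdvisor metrics are available and that the pods match the observatorium-thanos-receive-default-* naming seen in the logs):

    sum by (pod) (
      rate(container_cpu_usage_seconds_total{namespace="observatorium-metrics-testing", pod=~"observatorium-thanos-receive-default.*", container!=""}[5m])
    )

For profiles, the receivers should expose the standard Go pprof endpoints on their HTTP port, so a CPU profile captured while the load test is running would be the most useful.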

@bwplotka
Member

Also, do I see correctly that we have 6 x 10 million head chunks? Perhaps our replication is too naive and picks series randomly (as @moadz suggested at some point)?

@bwplotka
Member

bwplotka commented Jun 29, 2022

Also, if we have any repro scripts, it would be amazing to link them here (:

@bill3tt
Contributor

bill3tt commented Jun 29, 2022

60 million head chunks is about right here, as the script is producing 20M active series and we have a replication factor of 3.
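In other words, 20M active series replicated 3 ways is 60M series cluster-wide, or roughly 10M per replica across the 6 receivers, which lines up with the graph. This can be cross-checked per pod with the TSDB head metrics, assuming the standard metric names are exposed:

    sum by (pod) (prometheus_tsdb_head_chunks{namespace="observatorium-metrics-testing"})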

@bwplotka
Member

Ack, makes sense. I thought we had 10M pushed.

@fpetkovski
Contributor

fpetkovski commented Jul 1, 2022

It is worth checking container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total to see if and how often receivers are getting CPU throttled.
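For example, a per-pod throttling ratio over the last few minutes (a sketch assuming the usual cAdvisor metric names; the exact selectors depend on how the pods are scraped):

    sum by (pod) (
      rate(container_cpu_cfs_throttled_periods_total{namespace="observatorium-metrics-testing", pod=~"observatorium-thanos-receive-default.*"}[5m])
    )
    /
    sum by (pod) (
      rate(container_cpu_cfs_periods_total{namespace="observatorium-metrics-testing", pod=~"observatorium-thanos-receive-default.*"}[5m])
    )

A ratio close to 1 means the containers spend most CFS periods throttled.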

@philipgough
Contributor Author

@fpetkovski thanks for the pointer. It looks like in this case there were some resourcing issues, and at least some of the slowness is taken care of by #5566.

I'm going to close this based on this comment and will reopen if I notice the issue again despite having additional resources.
