Thanos load test / benchmark #346

Closed
bwplotka opened this issue May 22, 2018 · 11 comments

@bwplotka
Member

bwplotka commented May 22, 2018

Hi All,
We are planning to start an initiative for Thanos load testing / benchmarking, to check common metrics like query responsiveness and resource consumption during common operations on a heavily scaled setup.

We want your input before we start! Do you have any particular ideas:

  • what metrics to measure?
  • what tests to actually perform?

We would like to focus on Thanos features that we want to test, for example:

Query test

Setup:

  1. Spin up X (50?) Prometheuses (in different groups of 1, 2, or even 8 replicas?).
  2. Create some artificially fed K8s cluster (similar to the Prometheus benchmark) and make sure every Prometheus scrapes the same things. (Do we actually need the whole cluster? Maybe a single time series would be enough?)

Operations:
Perform a certain query for fresh data (what range? which metric? with dedup?)

What to measure, what is the goal?

  • Find the max number of Prometheus (scraper) instances a single Thanos query node can handle (a minimal probe sketch follows below).
    Measure the query latency and CPU/memory resources.
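
To make the latency measurement concrete, here is a minimal probe sketch (Go, standard library only) that times one instant query against a query node, once with and once without deduplication. The endpoint address, the example `up` query, and reliance on the `dedup` URL parameter of Thanos Query's Prometheus-compatible API are assumptions; adjust to the actual setup and version.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// timeQuery runs one instant query and returns the wall-clock time for the
// full request, including reading the response body.
func timeQuery(base, query string, dedup bool) (time.Duration, error) {
	params := url.Values{}
	params.Set("query", query)
	params.Set("dedup", fmt.Sprintf("%t", dedup)) // assumed Thanos-specific parameter; ignored by plain Prometheus
	start := time.Now()
	resp, err := http.Get(base + "/api/v1/query?" + params.Encode())
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // include the full response transfer in the timing
	return time.Since(start), nil
}

func main() {
	// Placeholder address; point this at the Thanos query node under test.
	const base = "http://thanos-query:9090"
	for _, dedup := range []bool{true, false} {
		d, err := timeQuery(base, `up`, dedup)
		if err != nil {
			fmt.Println("query failed:", err)
			continue
		}
		fmt.Printf("dedup=%t latency=%s\n", dedup, d)
	}
}
```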

What components/features it tests
Single thanos-query capabilities (global view scalability, deduplication).

Notes:
Usually, you can have 50 Prometheuses connected, but you will only ask for a metric filtered by some external labels (e.g. cluster or environment), i.e. from only some instances. This limits the fanout. For test purposes, we can mimic the case of a metric that is present and available on all 50 instances to test full fanout.

Historical data test

Setup

  1. Generate artificially old data (1 year?). It can be just a single time series. (Or maybe having lots of data will actually put more focus on how we index things with TSDB and perform smart fan-in queries.) See the generator sketch after this list.
  2. Compact some metrics, downsample some.
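
One possible way to get the artificial old data, sketched below under a few assumptions: write a year of samples for a single series as OpenMetrics text, then convert it into TSDB blocks with backfill tooling (for example `promtool tsdb create-blocks-from openmetrics` in recent Prometheus releases, or a custom block writer) and upload the blocks to the bucket. The metric name, scrape interval, and output path are placeholders.

```go
package main

import (
	"bufio"
	"fmt"
	"math/rand"
	"os"
	"time"
)

func main() {
	f, err := os.Create("old_data.om")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	w := bufio.NewWriter(f)
	defer w.Flush()

	end := time.Now()
	start := end.Add(-365 * 24 * time.Hour) // one year of history
	step := 15 * time.Second                // pretend scrape interval

	fmt.Fprintln(w, "# TYPE test_metric gauge")
	for ts := start; ts.Before(end); ts = ts.Add(step) {
		// OpenMetrics sample lines: <metric> <value> <timestamp in seconds>
		fmt.Fprintf(w, "test_metric{series=\"0\"} %g %d\n", rand.Float64(), ts.Unix())
	}
	fmt.Fprintln(w, "# EOF") // required terminator for the OpenMetrics format
}
```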

Operations
Query a single store against a chosen provider (do we really need thanos-query on top here? We could just use the gRPC API). Query old data that is compacted/not compacted and downsampled/not downsampled, for different time ranges.

What to measure, what is the goal?

  • How responsive are queries against S3 and GCS? Is there any difference?
  • How much does compaction actually help? Maybe the last 2w compaction level is too big / too small?
  • How much does downsampling actually help?
  • Which queries OOM the store gateway?
  • How can we prevent OOMs and alert instead?

Measure query latency and Mem/CPU consumption.

What components/features it tests
Historical data fetch, thanos store gateway.

Maybe some compactor tests as well?

Also, are there any useful tools for benchmarks? I can see:
https://github.com/prometheus/prombench
Prometheus 2 benchmark results: https://coreos.com/blog/prometheus-2.0-storage-layer-optimization

@bwplotka
Member Author

Preferred test case description format:

Test name

Setup
...

Operations
...

What to measure, what is the goal?
...

What components/features it tests
...

Notes
...

@povilasv
Member

povilasv commented May 22, 2018

Test name
Thanos vs Prometheus Federation

Setup

Launch 2 different Prometheis connected via Thanos sidecars and Thanos query.
Launch 1 Prometheus federating data from those Prometheus instances.

Load a bunch of data

Operations

Do a bunch of queries

What to measure, what is the goal?

Compare performance (is Prometheus faster than Thanos, and by how much?)
Compare resource use (how much CPU/memory overhead does Thanos add?)

What components/features it tests
Thanos sidecar, Thanos query.

Notes

I have heard from a couple of people who are worried about Thanos performance compared to federated Prometheus.

@TimSimmons
Contributor

Historical data test with many time series over many windows

Setup

Generate artificially old data; up to a year should be fine.

Create many time series with the same metric name (up to 100k) and many label permutations.
Compact/downsample the metrics.

Operations

Query the data via query nodes, touching various numbers of time series: start with 1, then work up through 100, 500, 1000, 10000, and 100000 time series, over periods of instant, 1h, 6h, 12h, 1d, 3d, 1w, 2w, 4w, 8w, 12w, 24w, 36w, and 52w (a sweep sketch follows below).
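
A sketch of what such a sweep could look like (Go, standard library only). The metric name, the `series_id` label used to select n series, and the endpoint address are assumptions made for illustration; the query step is scaled with the window so result sizes stay within typical point limits.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
	"time"
)

// rangeQuery runs one range query covering the given window ending now and
// returns how long the full request took.
func rangeQuery(base, q string, window time.Duration) (time.Duration, error) {
	end := time.Now()
	step := int(window.Seconds() / 240) // scale step with window size
	if step < 15 {
		step = 15
	}
	params := url.Values{}
	params.Set("query", q)
	params.Set("start", fmt.Sprintf("%d", end.Add(-window).Unix()))
	params.Set("end", fmt.Sprintf("%d", end.Unix()))
	params.Set("step", fmt.Sprintf("%d", step))
	start := time.Now()
	resp, err := http.Get(base + "/api/v1/query_range?" + params.Encode())
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // include the full response transfer in the timing
	return time.Since(start), nil
}

// seriesSelector matches series IDs 0..n-1 via a regex on a hypothetical
// "series_id" label, e.g. test_metric{series_id=~"0|1|2"} for n=3.
func seriesSelector(n int) string {
	ids := make([]string, n)
	for i := range ids {
		ids[i] = fmt.Sprint(i)
	}
	return fmt.Sprintf(`test_metric{series_id=~"%s"}`, strings.Join(ids, "|"))
}

func main() {
	const base = "http://thanos-query:9090" // placeholder address
	windows := []time.Duration{
		time.Hour, 6 * time.Hour, 24 * time.Hour,
		7 * 24 * time.Hour, 28 * 24 * time.Hour, 364 * 24 * time.Hour,
	}
	for _, n := range []int{1, 100, 1000, 10000} {
		q := seriesSelector(n)
		for _, w := range windows {
			d, err := rangeQuery(base, q, w)
			if err != nil {
				fmt.Printf("n=%d window=%s error=%v\n", n, w, err)
				continue
			}
			fmt.Printf("n=%d window=%s latency=%s\n", n, w, d)
		}
	}
}
```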

What to measure, what is the goal?

Measure query times for the various combinations and see where things become problematic. Perhaps the originally small query over a long time range takes longer because it loads a bunch of data. There should be plenty of insights to be gained. It might be a good way to find bottlenecks in the query nodes, or to see what the effects of scaling them might be. You could also go directly against the store gRPC API to see the difference in query times.

What components/features it tests

Historical data fetching, querying, high cardinality queries.
Thanos store gateway, query nodes, sidecars, compactor.

Notes

Some of these queries might not work (it's unreasonable to expect they would), but the idea here is to simulate what people will do to Thanos in the real world. Deploying Thanos internally at companies will mean dealing with sets of metrics that are not well designed but are "very important to the business", and people will want Thanos to work this way. Finding out where the limits are, and being able to give recommendations up front about what is possible, should be useful for Thanos developers and users.

@mihailgmihaylov

mihailgmihaylov commented May 30, 2018

Thanos Query performance test

Setup

Run a set of particularly heavy queries against the Thanos Query API and against the Prometheus API, and compare load times.

Operations

  • In both cases, the query period should be 2h, so that we are querying Prometheus's local data rather than going through Thanos Store.
  • The Prometheus cache should be taken into account: if the queries are executed within a short interval of each other, the ones after the first execution will be a lot faster, so the first query should be excluded from the test. The test should measure Thanos Query API latency, not cache efficiency, so I think querying data already loaded in memory is OK.
  • Perform 10 queries against the Thanos API and 10 against the Prometheus API. At the end, compare the means of the two (as in the sketch after this list).
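
A minimal sketch of this comparison (Go, standard library only), following the procedure above: 11 runs per endpoint, the first discarded as warm-up, and the means of the remaining 10 compared. The endpoints and the example query are placeholders.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// timeInstantQuery runs one instant query and times the full request,
// including reading the response body.
func timeInstantQuery(base, query string) (time.Duration, error) {
	params := url.Values{"query": {query}}
	start := time.Now()
	resp, err := http.Get(base + "/api/v1/query?" + params.Encode())
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
	return time.Since(start), nil
}

// meanLatency performs runs+1 queries, discards the first (warm-up) and
// returns the mean of the remaining runs.
func meanLatency(base, query string, runs int) (time.Duration, error) {
	var total time.Duration
	for i := 0; i <= runs; i++ {
		d, err := timeInstantQuery(base, query)
		if err != nil {
			return 0, err
		}
		if i == 0 {
			continue // skip the first run so warm-up/caching doesn't skew the mean
		}
		total += d
	}
	return total / time.Duration(runs), nil
}

func main() {
	const query = `sum(rate(http_requests_total[5m]))` // placeholder "heavy" query
	for _, base := range []string{"http://prometheus:9090", "http://thanos-query:9090"} {
		mean, err := meanLatency(base, query, 10)
		if err != nil {
			fmt.Println(base, "error:", err)
			continue
		}
		fmt.Printf("%s mean latency over 10 runs: %s\n", base, mean)
	}
}
```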

What to measure, what is the goal?
Thanos latency. Although Thanos Query is fast, it adds (or may add) latency to queries.
This is a small price to pay for the deduplication, high-availability, and long-term storage features that we get, but it would be handy to know exactly how much.

What components/features it tests
Thanos Query

Notes

This test would be handy for finding out whether adding Thanos Query nodes helps performance or hurts it.
It would also create a benchmark for future tests of Thanos Query changes, so that we can be sure performance is not degrading.

@asbjxrn

asbjxrn commented Jun 22, 2018

Compaction/downsampling performance

Setup

Generate artificial data from many Prometheus servers (200? 1000?).

The volume of data should be similar to what one might get from node-exporter for 10-1000 scraped servers per Prometheus server.

The data doesn't have to be very old; the workload should be what's expected between runs of thanos compact.

The data should be uploaded to the long-term datastore.

Operations

Run thanos compact and measure the time to do one pass of compaction/downsampling (see the timing sketch below).

Preferably run it through some proxy that can add various amounts of latency / bandwidth throttling.
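
A rough sketch of how one pass could be timed from a small harness (Go). The thanos compact flag names shown are assumptions that depend on the Thanos version and on how the object store is configured; substitute whatever the tested release expects.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

func main() {
	// Flag names below are assumptions; they vary between Thanos versions.
	cmd := exec.Command("thanos", "compact",
		"--data-dir=/tmp/thanos-compact",
		"--objstore.config-file=bucket.yml", // assumed object-store config file
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	start := time.Now()
	err := cmd.Run() // assuming the compactor does one pass and exits when no "wait" mode is enabled
	fmt.Printf("compaction pass took %s (err=%v)\n", time.Since(start), err)
}
```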

What to measure, what is the goal?

Time to do compaction, and how network limitations affect it.
Since "thanos compact" is a singleton, it may limit the amount of data that can be stored. How much data can be processed given a latency/bandwidth limitation to the store?

This is relevant for organisations that run on their own hardware but use S3/GCS for long-term storage.

What components/features it tests

thanos compact

Notes

@bwplotka
Member Author

Preferably run it through some proxy that can add various amounts of latency / bandwidth throttling.

Proxy between what things? (:

Since "thanos compact" is a singleton, it may limit the amount of data that's possible to stored. How much data can be processed given latency/bandwidth limitation to the store.

Well, you can always have a different bucket if that's an issue. (: But would be nice to know the answer, true.

@asbjxrn

asbjxrn commented Jun 25, 2018

Proxy between what things? (:

Proxy between thanos compact and S3/GCS. Just to introduce a delay/throttle to simulate distance from the store.
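
For illustration, a minimal sketch of such a latency-injecting TCP proxy (Go, standard library only). It only delays the start of each direction once per connection, which is a crude approximation; per-packet delay (e.g. via tc netem) would be more realistic. The listen/target addresses and the delay are placeholders, and pointing the compactor at the proxy (hosts overrides, TLS passthrough, bandwidth throttling) is left out.

```go
package main

import (
	"io"
	"log"
	"net"
	"time"
)

const (
	listenAddr = ":9999"
	targetAddr = "storage.googleapis.com:443" // assumed upstream object store endpoint
	delay      = 100 * time.Millisecond       // extra latency added per connection direction
)

func main() {
	ln, err := net.Listen("tcp", listenAddr)
	if err != nil {
		log.Fatal(err)
	}
	for {
		client, err := ln.Accept()
		if err != nil {
			log.Print(err)
			continue
		}
		go handle(client)
	}
}

func handle(client net.Conn) {
	defer client.Close()
	upstream, err := net.Dial("tcp", targetAddr)
	if err != nil {
		log.Print(err)
		return
	}
	defer upstream.Close()

	// Crudely delay each direction once per connection before piping bytes.
	go func() {
		time.Sleep(delay)
		io.Copy(upstream, client)
	}()
	time.Sleep(delay)
	io.Copy(client, upstream)
}
```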

Well, you can always have a different bucket if that's an issue. (: But would be nice to know the answer, true.

True, and it would be good for users to be aware of this up front. I only realised this might be an issue for us once I was running the compactor on my workstation, which had 100ms latency to the store.

@adamhosier
Contributor

Hey all,

A small update on the progress we're making on the load tests. We've put together a few tools to help us run some tests. These include a tool to spin up a minimal Thanos installation (Prometheus + sidecar, thanos-query & thanos-store), a tool to measure query performance of a Prometheus or Thanos endpoint, and a tool to generate historic TSDB blocks to simulate metrics in LTS.

Some early results are giving positive signs, showing we can query 1 year of metrics (a sum over 100 time series, taken at a 15s scrape interval, 210 million total samples) in about 30 seconds. This is using 2-week-long blocks with no downsampling. We have noticed that Thanos does add some overhead to regular queries, causing most queries to take about twice as long to run on thanos-query compared to vanilla Prometheus. We have observed this on short-running queries (e.g. just fetching 45k metrics took 0.055s on Thanos vs 0.022s on Prometheus) as well as longer-running queries (a rate over 4.5 million metrics took 5.79s on Thanos compared to 3.13s on Prometheus). This is more or less expected, as we hit the network twice when using thanos-query. We aim to get some further results on metric ingestion rate & performance of different queries soon, so keep an eye on this issue.

We do plan on releasing the tooling & testing framework we have built for these tests soon, and I'll update this issue when progress has been made on that.

@adamhosier
Contributor

Hi,

We’ve completed our first round of benchmarking, check out the results -> https://github.com/improbable-eng/thanos/tree/master/benchmark#results.

We’ve also released the tooling used to run the benchmarks, so if the results are not appropriate for your use case, feel free to clone & have a play. Enjoy 🙂

@bwplotka
Member Author

Quick idea: a deduplication benchmark might be interesting.

@bwplotka
Member Author

FYI: we removed the tool from the repo for now, as it was not very well maintained (it still used an old Thanos version) and it was a separate Go module that was causing trouble.

73dd0da

We might start something like prombench in a separate repo, but for now, it's recommended to just profile/benchmark Thanos directly on your deployment, as it will give the closest possible experience to what you will have in production.

If you have any different ideas about how this tool should look, let us know (:
