From 1d8dc4e50dba70c9e0b7d48edc0e4c2808748c99 Mon Sep 17 00:00:00 2001 From: Kemal Akkoyun Date: Fri, 18 Oct 2019 19:52:07 +0200 Subject: [PATCH] store: Start metric and status probe HTTP server as earlier as possible (#1656) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Start metric and status probe server as soon as possible Signed-off-by: Kemal Akkoyun * Update changelog Signed-off-by: Kemal Akkoyun * Schedule a separate goroutine to start server Signed-off-by: Kemal Akkoyun * Add InitSync to the rungroup Signed-off-by: Kemal Akkoyun * Fix linter pointed issues Signed-off-by: Kemal Akkoyun * Move InitSync to alreay existed run.Group Signed-off-by: Kemal Akkoyun * Remove unnecessary changes and update CHANGELOG Signed-off-by: Kemal Akkoyun * Add simple explanation for probes Signed-off-by: Kemal Akkoyun * Make requested changes Signed-off-by: Kemal Akkoyun * Update CHANGELOG.md Co-Authored-By: Martin Chodur Signed-off-by: Kemal Akkoyun Signed-off-by: Giedrius Statkevičius --- CHANGELOG.md | 16 +++++++---- cmd/thanos/store.go | 59 ++++++++++++++++++++++------------------ docs/components/store.md | 14 ++++++++-- 3 files changed, 54 insertions(+), 35 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index c7c45a703b1..eb96112711d 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -15,6 +15,10 @@ We use *breaking* word for marking changes that are not backward compatible (rel - [#1660](https://github.com/thanos-io/thanos/pull/1660) Add a new `--prometheus.ready_timeout` CLI option to the sidecar to set how long to wait until Prometheus starts up. +### Fixed + +- [#1656](https://github.com/thanos-io/thanos/pull/1656) Thanos Store now starts metric and status probe HTTP server earlier in its start-up sequence. `/-/healthy` endpoint now starts to respond with success earlier. `/metrics` endpoint starts serving metrics earlier as well. 
Make sure to point your readiness probes to the `/-/ready` endpoint rather than `/metrics`. + ## [v0.8.1](https://github.com/thanos-io/thanos/releases/tag/v0.8.1) - 2019.10.14 ### Fixed @@ -23,12 +27,12 @@ We use *breaking* word for marking changes that are not backward compatible (rel * NOTE: `thanos_store_nodes_grpc_connections` metric is now per `external_labels` and `store_type`. It is a recommended metric for Querier storeAPIs. `thanos_store_node_info` is marked as obsolete and will be removed in next release. * NOTE2: Store Gateway is now advertising artificial: `"@thanos_compatibility_store_type=store"` label. This is to have the current Store Gateway compatible with Querier pre v0.8.0. This label can be disabled by hidden `debug.advertise-compatibility-label=false` flag on Store Gateway. - + ## [v0.8.0](https://github.com/thanos-io/thanos/releases/tag/v0.8.0) - 2019.10.10 Lot's of improvements this release! Noteworthy items: - First Katacoda tutorial! 🐱 -- Fixed Deletion order causing Compactor to produce not needed 👻 blocks with missing random files. +- Fixed Deletion order causing Compactor to produce not needed 👻 blocks with missing random files. - Store GW memory improvements (more to come!). - Querier allows multiple deduplication labels. - Both Compactor and Store Gateway can be **sharded** within the same bucket using relabelling! @@ -42,7 +46,7 @@ both Prometheus and sidecar with Thanos: https://prometheus.io/blog/2019/10/10/r - [#1619](https://github.com/thanos-io/thanos/pull/1619) Thanos sidecar allows to limit min time range for data it exposes from Prometheus. - [#1583](https://github.com/thanos-io/thanos/pull/1583) Thanos sharding: - - Add relabel config (`--selector.relabel-config-file` and `selector.relabel-config`) into Thanos Store and Compact components. + - Add relabel config (`--selector.relabel-config-file` and `selector.relabel-config`) into Thanos Store and Compact components. 
Selecting blocks to serve depends on the result of block labels relabeling. - For store gateway, advertise labels from "approved" blocks. - [#1540](https://github.com/thanos-io/thanos/pull/1540) Thanos Downsample added `/-/ready` and `/-/healthy` endpoints. @@ -55,8 +59,8 @@ Selecting blocks to serve depends on the result of block labels relabeling. - [#1362](https://github.com/thanos-io/thanos/pull/1362) Optional `replicaLabels` param for `/query` and `/query_range` querier endpoints. When provided overwrite the `query.replica-label` cli flags. - [#1482](https://github.com/thanos-io/thanos/pull/1482) Thanos now supports Elastic APM as tracing provider. -- [#1612](https://github.com/thanos-io/thanos/pull/1612) Thanos Rule added `resendDelay` flag. -- [#1480](https://github.com/thanos-io/thanos/pull/1480) Thanos Receive flushes storage on hashring change. +- [#1612](https://github.com/thanos-io/thanos/pull/1612) Thanos Rule added `resendDelay` flag. +- [#1480](https://github.com/thanos-io/thanos/pull/1480) Thanos Receive flushes storage on hashring change. - [#1613](https://github.com/thanos-io/thanos/pull/1613) Thanos Receive now traces forwarded requests. ### Changed @@ -76,7 +80,7 @@ once for multiple deduplication labels like: `--query.replica-label=prometheus_r - [#1544](https://github.com/thanos-io/thanos/pull/1544) Iterating over object store is resilient to the edge case for some providers. - [#1469](https://github.com/thanos-io/thanos/pull/1469) Fixed Azure potential failures (EOF) when requesting more data then blob has. - [#1512](https://github.com/thanos-io/thanos/pull/1512) Thanos Store fixed memory leak for chunk pool. -- [#1488](https://github.com/thanos-io/thanos/pull/1488) Thanos Rule now now correctly links to query URL from rules and alerts. +- [#1488](https://github.com/thanos-io/thanos/pull/1488) Thanos Rule now now correctly links to query URL from rules and alerts. 
## [v0.7.0](https://github.com/thanos-io/thanos/releases/tag/v0.7.0) - 2019.09.02 diff --git a/cmd/thanos/store.go b/cmd/thanos/store.go index 139a7c8a748..70b213937ee 100644 --- a/cmd/thanos/store.go +++ b/cmd/thanos/store.go @@ -126,7 +126,11 @@ func runStore( selectorRelabelConf *extflag.PathOrContent, advertiseCompatibilityLabel bool, ) error { + // Initiate HTTP listener providing metrics endpoint and readiness/liveness probes. statusProber := prober.NewProber(component, logger, prometheus.WrapRegistererWithPrefix("thanos_", reg)) + if err := scheduleHTTPServer(g, logger, reg, statusProber, httpBindAddr, nil, component); err != nil { + return errors.Wrap(err, "schedule HTTP server") + } confContentYaml, err := objStoreConfig.Content() if err != nil { @@ -185,29 +189,35 @@ func runStore( return errors.Wrap(err, "create object storage store") } - begin := time.Now() - level.Debug(logger).Log("msg", "initializing bucket store") - if err := bs.InitialSync(context.Background()); err != nil { - return errors.Wrap(err, "bucket store initial sync") - } - level.Debug(logger).Log("msg", "bucket store ready", "init_duration", time.Since(begin).String()) - - ctx, cancel := context.WithCancel(context.Background()) - g.Add(func() error { - defer runutil.CloseWithLogOnErr(logger, bkt, "bucket client") - - err := runutil.Repeat(syncInterval, ctx.Done(), func() error { - if err := bs.SyncBlocks(ctx); err != nil { - level.Warn(logger).Log("msg", "syncing blocks failed", "err", err) + // bucketStoreReady signals when bucket store is ready. 
+	bucketStoreReady := make(chan struct{})
+	{
+		ctx, cancel := context.WithCancel(context.Background())
+		g.Add(func() error {
+			defer runutil.CloseWithLogOnErr(logger, bkt, "bucket client")
+
+			level.Info(logger).Log("msg", "initializing bucket store")
+			begin := time.Now()
+			if err := bs.InitialSync(ctx); err != nil {
+				close(bucketStoreReady)
+				return errors.Wrap(err, "bucket store initial sync")
 			}
-			return nil
+			level.Info(logger).Log("msg", "bucket store ready", "init_duration", time.Since(begin).String())
+			close(bucketStoreReady)
+
+			err := runutil.Repeat(syncInterval, ctx.Done(), func() error {
+				if err := bs.SyncBlocks(ctx); err != nil {
+					level.Warn(logger).Log("msg", "syncing blocks failed", "err", err)
+				}
+				return nil
+			})
+
+			runutil.CloseWithLogOnErr(logger, bs, "bucket store")
+			return err
+		}, func(error) {
+			cancel()
 		})
-
-		runutil.CloseWithLogOnErr(logger, bs, "bucket store")
-		return err
-	}, func(error) {
-		cancel()
-	})
+	}
 
 	l, err := net.Listen("tcp", grpcBindAddr)
 	if err != nil {
@@ -221,17 +231,14 @@ func runStore(
 	s := newStoreGRPCServer(logger, reg, tracer, bs, opts)
 
 	g.Add(func() error {
-		level.Info(logger).Log("msg", "Listening for StoreAPI gRPC", "address", grpcBindAddr)
+		<-bucketStoreReady
+		level.Info(logger).Log("msg", "listening for StoreAPI gRPC", "address", grpcBindAddr)
 		statusProber.SetReady()
 		return errors.Wrap(s.Serve(l), "serve gRPC")
 	}, func(error) {
 		s.Stop()
 	})
 
-	if err := scheduleHTTPServer(g, logger, reg, statusProber, httpBindAddr, nil, component); err != nil {
-		return errors.Wrap(err, "schedule HTTP server")
-	}
-
 	level.Info(logger).Log("msg", "starting store node")
 	return nil
 }
diff --git a/docs/components/store.md b/docs/components/store.md
index 5149fd17b45..9e5b7fe9433 100644
--- a/docs/components/store.md
+++ b/docs/components/store.md
@@ -122,11 +122,11 @@ Flags:
 
 ```
 
-## Time based partioning
+## Time based partitioning
 
 By default Thanos Store Gateway looks at all the data in Object Store and returns it based on
 query's time range.
 
-Thanos Store `--min-time`, `--max-time` flags allows you to shard Thanos Store based on constant time or time duration relative to current time. 
+Thanos Store `--min-time`, `--max-time` flags allows you to shard Thanos Store based on constant time or time duration relative to current time.
 
 For example setting: `--min-time=-6w` & `--max-time==-2w` will make Thanos Store Gateway return metrics that fall within `now - 6 weeks` up to `now - 2 weeks` time range.
 
@@ -136,6 +136,14 @@ Thanos Store Gateway might not get new blocks immediately, as Time partitioning
 
 We recommend having overlapping time ranges with Thanos Sidecar and other Thanos Store gateways as this will improve your resiliency to failures.
 
-Thanos Querier deals with overlapping time series by merging them together. 
+Thanos Querier deals with overlapping time series by merging them together.
 
 Filtering is done on a Chunk level, so Thanos Store might still return Samples which are outside of `--min-time` & `--max-time`.
+
+## Probes
+
+- Thanos Store exposes two endpoints for probing.
+  - `/-/healthy` starts responding as soon as the initial setup has completed.
+  - `/-/ready` starts responding after all bootstrapping has completed (e.g. initial index building) and the store is ready to serve traffic.
+
+> NOTE: The metrics endpoint starts immediately, so make sure you set up the readiness probe on the designated HTTP `/-/ready` path.