Compactor: HTTP socket is only opened after initial cleanup of compacted blocks is complete #3395

Closed
zygiss opened this issue Nov 3, 2020 · 5 comments · Fixed by #3532

@zygiss

zygiss commented Nov 3, 2020

Thanos, Prometheus and Golang version used:

Thanos Docker image: thanosio/thanos:master-2020-10-22-57076a58

Object Storage Provider:

GCP

What happened:

We wanted to make use of the new compactor functionality introduced in #3115, so we upgraded our compactor pods to version "master-2020-10-22-57076a58".

Thanos compactors run in K8s with liveness and readiness probes configured to probe /-/healthy and /-/ready respectively.

After the upgrade we noticed pods getting killed regularly because they were failing liveness probes.

This also means that we aren't able to collect any metrics from the affected compactor pods until the HTTP socket is open.

Compactor configuration:

    Command:
      /bin/thanos
      compact
    Args:
      --compact.concurrency=10
      --downsampling.disable
      --data-dir=/var/thanos/data/compact
      --log.level=debug
      --no-debug.halt-on-error
      --wait
      --delete-delay=30m
      --objstore.config=type: GCS
      config:
        bucket: "thanos-region01"
    Liveness:   http-get http://:metrics-thanos/-/healthy delay=5s timeout=3s period=5s #success=1 #failure=5
    Readiness:  http-get http://:metrics-thanos/-/ready delay=10s timeout=3s period=5s #success=1 #failure=5

What you expected to happen:

Thanos compactors operate without issues :)

How to reproduce it (as minimally and precisely as possible):

Deploy Thanos compactor version master-2020-10-22-57076a58 or later.

Observe that the compactor doesn't open the HTTP socket right away by running netstat in the pod:

kubectl exec thanos-compactor-region01-0 -- netstat -lnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State

Full logs to relevant components:

kubectl logs thanos-compactor-region01-0
ts=2020-11-03T11:01:21.535998817Z caller=level.go:63 level=debug msg="maxprocs: Updating GOMAXPROCS=[23]: determined from CPU quota"
ts=2020-11-03T11:01:21.536158425Z caller=level.go:63 level=info msg="Tracing will be disabled"
ts=2020-11-03T11:01:21.536338823Z caller=level.go:63 level=info msg="loading bucket configuration"
ts=2020-11-03T11:01:21.954057729Z caller=level.go:63 level=debug msg="block is too fresh for now" block=01EP6W6YX651D200WHWG3HF5DJ
ts=2020-11-03T11:01:21.954107448Z caller=level.go:63 level=debug msg="block is too fresh for now" block=01EP6W6YWTMVJJG14GSWYYZ358
ts=2020-11-03T11:01:23.632574527Z caller=level.go:63 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.095557456s cached=62 returned=60 partial=0
ts=2020-11-03T11:01:23.634966292Z caller=level.go:63 level=info msg="started cleaning of aborted partial uploads"
ts=2020-11-03T11:01:23.634997898Z caller=level.go:63 level=info msg="cleaning of aborted partial uploads done"
ts=2020-11-03T11:01:23.635005475Z caller=level.go:63 level=info msg="started cleaning of blocks marked for deletion"
ts=2020-11-03T11:01:23.635011578Z caller=level.go:63 level=info msg="cleaning of blocks marked for deletion done"
ts=2020-11-03T11:01:23.876462042Z caller=level.go:63 level=debug msg="block is too fresh for now" block=01EP6W6YX651D200WHWG3HF5DJ
ts=2020-11-03T11:01:23.876511887Z caller=level.go:63 level=debug msg="block is too fresh for now" block=01EP6W6YWTMVJJG14GSWYYZ358
ts=2020-11-03T11:01:25.718763288Z caller=level.go:63 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.083739741s cached=62 returned=60 partial=0
ts=2020-11-03T11:01:25.721962352Z caller=level.go:63 level=info msg="starting compact node"
ts=2020-11-03T11:01:25.721997011Z caller=level.go:63 level=info msg="changing probe status" status=ready
ts=2020-11-03T11:01:25.722158382Z caller=level.go:63 level=info msg="start sync of metas"
ts=2020-11-03T11:01:25.722177529Z caller=level.go:63 level=info msg="changing probe status" status=healthy
ts=2020-11-03T11:01:25.722241997Z caller=level.go:63 level=info service=http/server component=compact msg="listening for requests and metrics" address=0.0.0.0:10902

This is especially noticeable with buckets that receive a large number of incoming blocks: the initial bucket scan and cleanup can take over a minute.

@GiedriusS

GiedriusS commented Nov 3, 2020

...

Yes, this is expected because we want to reduce the number of objects if possible. Imagine a scenario where Thanos Compactor goes into a crash loop and not even the old files get deleted. What do you propose? Why doesn't increasing initialDelaySeconds work for you?

@zygiss

zygiss commented Nov 3, 2020

Thanks for triaging.

Sure, we could increase initialDelaySeconds on the healthchecks. However, I think there's a more general issue here, which I could've explained better in my initial report.

The problem is that, effectively, the compactor performs some work first, runs the initial cleanup (or so we hope), and only once that's done does it start serving metrics and healthchecks. This seems to be the wrong way around.

Granted, our particular setup makes this issue more visible than it would otherwise be (we have several huge buckets that receive > 500GB of metrics per day, so there's a lot of compaction and cleanup to do), but it's the behaviour itself that I think we should review.

What do you think?

@bwplotka

bwplotka commented Dec 2, 2020

What we did, indeed, is add a cleanup job before the compactor starts. I think the bug here is that the probe should say 200 OK sooner (: cc @GiedriusS WDYT?

@bwplotka

bwplotka commented Dec 2, 2020

Help wanted. To me, it's just a matter of enabling the probe early on the Compactor; that sounds valid to me.
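For illustration, here is a minimal Go sketch of the ordering proposed above. This is not Thanos' actual code: the HTTP socket is opened and /-/healthy answers 200 immediately, while /-/ready flips to 200 only after the startup work finishes; runInitialCleanup is a hypothetical stand-in for the compactor's initial bucket scan and cleanup.

    // Minimal sketch (not Thanos' code): open the HTTP socket before the
    // slow startup cleanup, and gate only readiness on that cleanup.
    package main

    import (
        "log"
        "net/http"
        "sync/atomic"
        "time"
    )

    func main() {
        var ready atomic.Bool

        mux := http.NewServeMux()

        // /-/healthy answers 200 as soon as the socket is open, so liveness
        // probes pass even while the initial cleanup is still running.
        mux.HandleFunc("/-/healthy", func(w http.ResponseWriter, _ *http.Request) {
            w.WriteHeader(http.StatusOK)
        })

        // /-/ready answers 200 only once startup work has finished.
        mux.HandleFunc("/-/ready", func(w http.ResponseWriter, _ *http.Request) {
            if ready.Load() {
                w.WriteHeader(http.StatusOK)
                return
            }
            http.Error(w, "initial cleanup still in progress", http.StatusServiceUnavailable)
        })

        // Open the HTTP socket first...
        go func() {
            log.Println("listening on :10902")
            if err := http.ListenAndServe(":10902", mux); err != nil {
                log.Fatal(err)
            }
        }()

        // ...then run the potentially slow startup cleanup, and flip
        // readiness only when it is done.
        runInitialCleanup()
        ready.Store(true)
        log.Println("changing probe status to ready")

        select {} // stand-in for the compactor's --wait loop
    }

    // runInitialCleanup is a hypothetical placeholder for the initial bucket
    // scan and cleanup, which can take over a minute on large buckets.
    func runInitialCleanup() {
        time.Sleep(2 * time.Second)
    }

With this ordering, the liveness probe from the report above would pass during startup, while the readiness probe would keep the pod unready until the cleanup is done.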

@GiedriusS

Fixed in v0.17.2 (#3532).
