compact: 0.22.0 never starts downsampling work #4531
Comments
So, you believe that it is stuck? Did I understand this correctly? |
Yes that's correct @GiedriusS, AFAICT it doesn't start the downsampling work at all (e.g. downloading blocks) |
Another data point that might be useful: I downgraded to 0.21.1 and thanos compact is working as expected (i.e. it started downsampling work) with the same command line |
I cannot reproduce this on 0.22.0. A big goroutine count indicates to me that something is going on and that it is not stuck. Could you please upload the full goroutine dump when Compactor runs into such a situation? |
thank you for following up @GiedriusS ! I've included all goroutines in the "details" section (from the debug endpoint), is that what you are referring to? unfortunately I had to resume downsampling with 0.21.1 and I don't know how easy it'll be for me to reproduce, but I can do it if needed, let me know! |
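(Aside: a full goroutine dump like the one discussed here can usually be captured from the compactor's HTTP debug endpoint. This is only a sketch; the port assumes the default --http-address of 0.0.0.0:10902 and would need adjusting otherwise.)

# Full goroutine dump (debug=2 includes a stack trace for every goroutine);
# assumes the compactor's HTTP port is the default 10902.
curl -s 'http://localhost:10902/debug/pprof/goroutine?debug=2' > goroutines.txt

# Total number of goroutines, plus a rough breakdown by goroutine state.
grep -c '^goroutine ' goroutines.txt
grep '^goroutine ' goroutines.txt | sed 's/^goroutine [0-9]* //' | sort | uniq -c | sort -rn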
Hello @filippog, thanks for reporting the issue. After checking the goroutines, I think the download process is still running, but just a little bit slow.
Related goroutines are listed above; you can see where they are blocked. Based on goroutine 1076598, the minio client is listing the objects in the bucket, and this is part of the block download process. If you want to make sure the compactor is downloading blocks, maybe you can check whether a new directory has been created, or we can add more debug logs for this. I have been using this feature for several weeks, but we are using concurrency 2–4 and I didn't see any issue. I will try concurrency 1 as well to see if I can reproduce the problem. |
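(Aside: a hedged sketch of the "check whether a new directory has been created" suggestion, assuming the compactor's --data-dir is /data/compact/ as in the config shared further down the thread; adjust paths and time windows as needed.)

# Block directories created or touched in the last hour.
find /data/compact -maxdepth 2 -type d -mmin -60

# meta.json is downloaded first, so its presence shows a download has at least started.
find /data/compact -maxdepth 3 -name meta.json -mmin -60

# Watch overall growth of the data dir to see whether bytes are still arriving.
watch -n 30 du -sh /data/compact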
I've been running this feature for 2 weeks as well, at concurrency 20+, without any issues. I can also confirm that the majority of the time the downsample goroutines are just waiting there. How long do you wait before declaring that downsampling isn't happening? In practice I've seen block downloads range from sub-second to >8h. Were you able to check the file system for partially downloaded blocks? Block download starts with meta.json, which should be pretty quick, so at the very least you should see that, if not actual chunks. We could add some debug logs in the upload and download methods in pkg/block.go to see operations on individual files, as a way of measuring progress too. |
Thank you folks for taking a look at this! It could certainly be a case of slow downloads, although I've seen the compactor not print anything for >20h (Aug 6th 13:20 was the last log line, and Aug 7th 9:33 is when I restarted compact; sadly I don't have the goroutine dump from that event, as afterwards I downgraded to 0.21.1). The host's network during that time was pretty much idle (public stats are available: host network for thanos-fe2001)
I'll run 0.22.0 again and report back on your suggestions including partially downloaded blocks on the filesystem, thank you again! |
I was able to reproduce this after leaving thanos-compact running for roughly a week; the last log line from
Though thanos-compact itself does keep logging and doing at least some operations, recent logs:
See below for the full goroutine dump and ss + connection age. It does look like there are a bunch of connections that are "stuck" and never time out. Something perhaps worth noting is that the address thanos-compact connects to is actually on the host's localhost interface. I can leave the process alone for a few days for further debugging, what do you think? Thanks!
Goroutine dump
ss output limited to thanos-compact pid
Creation dates for thanos-compact FDs
|
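(Aside: the ss output and FD creation dates above can be gathered roughly like this; a sketch only, assuming a single process whose binary is named thanos and that you have root access to its /proc entries.)

PID=$(pidof thanos)

# TCP connections owned by the compactor, with process and timer info.
ss -tnpo | grep "pid=$PID"

# Approximate age of each open file descriptor (sockets included), taken
# from the timestamps of the /proc/<pid>/fd entries.
ls -l --time-style=full-iso /proc/$PID/fd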
On the filesystem only
|
On the server side (swift + S3 API compat) I can see the index for that block being downloaded, though it stops at one of the multipart segments (segment
|
I have to take this back: I need to deploy changes to envoy and swift. This might unblock thanos-compact; I'll report back |
Restarting envoy and swift (but not thanos-compact) on the host obviously reset the connections but
|
From what you found here, the compactor was downloading the block, but the connection seems to have been terminated and there was no further progress. So the downsampling process did actually start, and the issue might be related to the server side (envoy + swift). If you switch back to Thanos v0.21.0, can you still reproduce the same bug? |
I'll be testing today with v0.21.0 and will report back. It might take a while; I've observed the bug when the "big" downsampling work comes around, every other week or so. Re: the server side, it could certainly be a problem there too, however only |
After a few days with
My wild guess: is it possible that in 0.22.0 the "unexpected EOF" error is still there but swallowed/ignored? Thank you! |
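(Aside: one cheap way to check that guess would be to compare how often the error shows up in each version's logs; the unit name below is an assumption based on the systemd setup described in this thread.)

# Count "unexpected EOF" occurrences in the compactor's journal over the last week.
journalctl -u thanos-compact --since "7 days ago" | grep -c "unexpected EOF"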
thanos version

thanos, version 0.23.0-rc.0 (branch: HEAD, revision: 81841aed4a9e3d6f6ed772fea287f04504d164f3)
build user: root@d1bae1e2c93c
build date: 20210908-14:46:03
go version: go1.16.7
platform: linux/amd64

thanos compact systemctl config

/data/app/thanos/bin/thanos compact -w \
--wait-interval=1m \
--compact.cleanup-interval=1m \
--compact.concurrency=4 \
--downsample.concurrency=4 \
--block-sync-concurrency=60 \
--tracing.config-file=/data/app/thanos/conf.d/trace.jaeger.yml \
--block-meta-fetch-concurrency=64 \
--objstore.config-file=/data/app/thanos/conf.d/object.json \
--data-dir=/data/compact/ \
--delete-delay=4h \
--retention.resolution-raw=5d \
--retention.resolution-5m=15d \
--retention.resolution-1h=0d

system log print

Sep 14 01:58:07 compact-thanos thanos: level=info ts=2021-09-13T17:58:07.430409827Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=54.535886944s duration_ms=54535 cached=27804 returned=27700 partial=1
Sep 14 01:58:07 compact-thanos thanos: level=info ts=2021-09-13T17:58:07.645774356Z caller=clean.go:33 msg="started cleaning of aborted partial uploads"
Sep 14 01:58:07 compact-thanos thanos: level=info ts=2021-09-13T17:58:07.64582784Z caller=clean.go:60 msg="cleaning of aborted partial uploads done"
Sep 14 01:58:07 compact-thanos thanos: level=info ts=2021-09-13T17:58:07.64583744Z caller=blocks_cleaner.go:43 msg="started cleaning of blocks marked for deletion"
Sep 14 01:58:07 compact-thanos thanos: level=info ts=2021-09-13T17:58:07.645886557Z caller=blocks_cleaner.go:57 msg="cleaning of blocks marked for deletion done"
Sep 14 01:58:43 compact-thanos thanos: level=info ts=2021-09-13T17:58:43.68316275Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=7.611941601s duration_ms=7611 cached=27804 returned=27804 partial=1
Sep 14 01:59:01 compact-thanos systemd: Started Session 550727 of user root.
Sep 14 01:59:06 compact-thanos thanos: level=info ts=2021-09-13T17:59:06.578244726Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=53.683732519s duration_ms=53683 cached=27804 returned=27700 partial=1
Sep 14 01:59:06 compact-thanos thanos: level=info ts=2021-09-13T17:59:06.771342122Z caller=clean.go:33 msg="started cleaning of aborted partial uploads"
Sep 14 01:59:06 compact-thanos thanos: level=info ts=2021-09-13T17:59:06.771481457Z caller=clean.go:60 msg="cleaning of aborted partial uploads done"
Sep 14 01:59:06 compact-thanos thanos: level=info ts=2021-09-13T17:59:06.771495597Z caller=blocks_cleaner.go:43 msg="started cleaning of blocks marked for deletion"
Sep 14 01:59:06 compact-thanos thanos: level=info ts=2021-09-13T17:59:06.771540635Z caller=blocks_cleaner.go:57 msg="cleaning of blocks marked for deletion done"
Sep 14 01:59:43 compact-thanos thanos: level=info ts=2021-09-13T17:59:43.871266182Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=7.804192955s duration_ms=7804 cached=27804 returned=27804 partial=1
Sep 14 02:00:01 compact-thanos systemd: Started Session 550728 of user root.
Sep 14 02:00:01 compact-thanos systemd: Started Session 550730 of user root.
Sep 14 02:00:01 compact-thanos systemd: Started Session 550729 of user root.
Sep 14 02:00:01 compact-thanos systemd: Started Session 550731 of user root.
Sep 14 02:00:01 compact-thanos systemd: Started Session 550732 of user root.
Sep 14 02:00:07 compact-thanos thanos: level=info ts=2021-09-13T18:00:07.42932569Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=54.534824185s duration_ms=54534 cached=27804 returned=27700 partial=1
Sep 14 02:00:07 compact-thanos thanos: level=info ts=2021-09-13T18:00:07.646884928Z caller=clean.go:33 msg="started cleaning of aborted partial uploads"
Sep 14 02:00:07 compact-thanos thanos: level=info ts=2021-09-13T18:00:07.646937279Z caller=clean.go:60 msg="cleaning of aborted partial uploads done"
Sep 14 02:00:07 compact-thanos thanos: level=info ts=2021-09-13T18:00:07.6469482Z caller=blocks_cleaner.go:43 msg="started cleaning of blocks marked for deletion"
Sep 14 02:00:07 compact-thanos thanos: level=info ts=2021-09-13T18:00:07.646992159Z caller=blocks_cleaner.go:57 msg="cleaning of blocks marked for deletion done"
Sep 14 02:00:43 compact-thanos thanos: level=info ts=2021-09-13T18:00:43.850182041Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=7.783070531s duration_ms=7783 cached=27804 returned=27804 partial=1 |
We recently discovered that compact is not downsampling either. If downsample is started standalone, everything works as expected. thanos downsample: the latter does what is expected, compact does not downsample. As in #4592, these also seem to be skipped. |
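(Aside: for completeness, a minimal sketch of running downsampling standalone as described above, reusing the object store config from the config earlier in the thread; the /data/downsample/ scratch directory is hypothetical and other flags are left at their defaults.)

# Runs downsampling independently of the compactor.
/data/app/thanos/bin/thanos downsample \
  --data-dir=/data/downsample/ \
  --objstore.config-file=/data/app/thanos/conf.d/object.json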
mentioned in #4592 (comment). Downsample starts after all compaction is done. |
I'd like to report back that I'm still seeing the occasional "unexpected EOF" error, still running with 0.21.1. Again, that's not a big deal in this case, because thanos-compact gets restarted and progress is made. Any thoughts on what could be going on with 0.22.0 @yeya24? Thank you! |
Hello 👋 Looks like there was no activity on this issue for the last two months. |
I think this has been fixed by @GiedriusS. Thanks for the amazing work. |
I'm running Thanos 0.22.0 binary from GH releases
Object Storage Provider: swift + s3api compat
What happened:
Thanos compact stops making progress after the "start first pass of downsampling" message, e.g. I don't see any "download block" log from processDownsampling. I'm running compact like this (i.e. compact.concurrency=1 and default downsample concurrency):
I'm seeing this in the logs and then nothing ever again from compact.go, though I'm quite sure there is downsampling work to do:
See below for the full goroutine dump; I suspect this is due to the recent downsample concurrency change, cc @yeya24.
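(Aside: the exact command line was not captured in this scrape; purely as an illustration of the setup described above, an invocation with compact.concurrency=1 and default downsample concurrency would look something like this, with all paths being placeholders.)

# Hypothetical invocation matching the description; paths are placeholders.
thanos compact --wait \
  --compact.concurrency=1 \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/objstore.yml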
What you expected to happen:
Compact making progress
How to reproduce it (as minimally and precisely as possible):
Wait for compact to start downsampling pass
Full logs to relevant components:
Anything else we need to know: