
Failed to query trace in s3 storage, and index.json.gz has not been updated for a long time. #3369

Closed
aaashen opened this issue Feb 6, 2024 · 8 comments
Labels
stale Used for stale issues / PRs

Comments

@aaashen

aaashen commented Feb 6, 2024

To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo using the helm chart (Tempo 2.3.0)
  2. Perform Operations (Read/Write/Others)

Expected behavior

Environment:

  • Infrastructure: Kubernetes 1.20
  • Deployment tool: helm

Additional Context
It looks like the v2.3.0 image includes the polling-improvements commit (#2652).
By running "./tempo-cli list blocks single-tenant -c tempo.yaml" and printing each blockId it returned, we found that it appeared to be stuck in an endless loop, and there were duplicate blockIds in the log.
After rolling back the image to v2.3.0-rc, everything went back to normal.
Is it possible there is a bug in the listBlocks method, or is something wrong with our environment or configuration?
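A quick way to confirm how stale the per-tenant index object is (a sketch, assuming the AWS CLI against an S3-compatible backend; <bucket> is a placeholder for your Tempo bucket):

# Shows size and last-modified time of the tenant index (the tenant in this issue is "single-tenant").
aws s3 ls s3://<bucket>/single-tenant/index.json.gz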

@joe-elliott
Member

It's hard to say. Let's start by reviewing your compactor logs. The compactor is the component responsible for updating the tenant index and it may contain some clues about why your index is so out of date.

cc @zalegrala
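
A sketch of pulling just the relevant compactor lines on Kubernetes (the namespace and label selector are placeholders for your own deployment):

# Filter recent compactor logs for tenant-index activity and errors.
kubectl logs -n <namespace> -l app.kubernetes.io/component=compactor --since=24h | grep -iE 'tenant index|error'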

@zalegrala
Contributor

It looks like #3224, which contains an important fix for the PR linked above, hasn't been released yet.

@aaashen aaashen closed this as completed Feb 7, 2024
@aaashen
Author

aaashen commented Feb 7, 2024

> It's hard to say. Let's start by reviewing your compactor logs. The compactor is the component responsible for updating the tenant index and it may contain some clues about why your index is so out of date.
>
> cc @zalegrala

Hi Joe, I am sorry that we did not save the complete logs. Here is a portion of the compactor logs.

level=info ts=2024-02-02T12:46:15.522406823Z caller=tempodb.go:428 msg="polling enabled" interval=5m0s concurrency=50
level=debug ts=2024-02-02T12:46:15.586398159Z caller=s3.go:273 msg="listing blocks" keypath=/ found=1 IsTruncated=true NextMarker=single-tenant/0078d61a-93a5-4290-a036-c37e2aad89b7/data.parquet
level=debug ts=2024-02-02T12:46:15.630589307Z caller=s3.go:273 msg="listing blocks" keypath=/ found=0 IsTruncated=false NextMarker
level=debug ts=2024-02-02T12:46:15.630975992Z caller=compactor.go:195 msg="checking hash" hash=build-tenant-index-0-single-tenant
level=debug ts=2024-02-02T12:46:15.631127599Z caller=compactor.go:214 msg="checking addresses" owning_addr=10.178.13.70:0 this_addr=10.178.13.70:0

There were no other error logs or logs like 'listing blocks complete'.

@aaashen
Author

aaashen commented Feb 7, 2024

> It looks like #3224, which contains an important fix for the PR linked above, hasn't been released yet.

Hi zalegrala, thanks for the reply. I checked #3224; it fixes the poller waitgroup handling in the pollUnknown method. But it seems that cmd-list-blocks does not call pollUnknown?

@aaashen aaashen reopened this Feb 7, 2024
@zalegrala
Contributor

Checking a little closer, it looks like the polling change is also unreleased. Are you overriding the image in your helm values? (compare: v2.3.1...main)

The fix above, as you mention, will help the compactor but not the ListBlocks() call. I tested tempo-cli locally to check for duplicates, but didn't see any in the output of the command below. This is from a main build.

./bin/linux/tempo-cli-amd64 list blocks ops -c tempo.yaml | awk '/| / {print $2 }' | sort | uniq -d
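
(The sort | uniq -d step prints only values that occur more than once, so any output from this pipeline means duplicate block IDs in the listing.)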

The tempo-cli that you are using is also from the v2.3.0 release, correct? A quick way to know if the polling change is in place is to include s3.list_blocks_concurrency in your config. This was introduced with the polling change.
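
For reference, a rough sketch of where that option would sit in tempo.yaml (the surrounding keys follow the usual Tempo storage block; the value shown is only illustrative):

storage:
  trace:
    backend: s3
    s3:
      # existing bucket/endpoint settings stay here
      # introduced with the polling change; a build without that change will not recognize this field
      list_blocks_concurrency: 3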

@mi5guided

Yup, this was a bummer. I used an image from Docker Hub that was a little bit after the 2.3.0 release and it had this issue. Since I was new to Tempo, I thought I was screwing up the configuration of the s3 backend. I finally pieced together that the index.json.gz file was missing and that the compactor was responsible for creating/updating that file. I deployed 2.3.0-rc0 from Docker Hub and things work great!

@zalegrala
Contributor

We keep release branches, so not all commits to main get released right away with the immediate next release. If you want to run only released images, stick to the tagged versions but drop the leading v, so just :2.3.0 for the image tag. I just tested grafana/tempo:2.3.0, for example. As usual, we encourage you to read the release notes when updating.
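
A rough sketch of pinning that tag in helm values (the exact keys depend on which chart you use, so treat these as placeholders for your chart's image settings):

# values.yaml
tempo:
  repository: grafana/tempo
  tag: "2.3.0"   # tagged release, no leading "v"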

The upcoming 2.4.0 release will have the polling change and the fix. We've been running it with good results for the last few months.

Was this the same issue as originally reported? Was the image you used from main? Please correct me if I misunderstood.

@github-actions bot

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply the keepalive label to exempt this issue.

@github-actions github-actions bot added the stale Used for stale issues / PRs label Apr 24, 2024
@github-actions github-actions bot closed this as not planned May 9, 2024