"Magic number error"s after changing compression format on the ingesters #937

Closed
mdisibio opened this issue Sep 7, 2021 · 0 comments · Fixed by #939

mdisibio commented Sep 7, 2021

Describe the bug
Tempo had been running stably on zstd compression for some time. We decided to investigate snappy compression, switched the storage/trace/block/encoding field to snappy, and restarted the ingesters. Afterwards, Tempo started encountering many "magic number" errors. The error below is from a compactor, but other components hit the same error:

level=error ts=2021-09-07T15:57:10.601766281Z caller=compactor.go:110 msg="error during compaction cycle" err="error iterating input blocks: error reading objects for records, blockID: 4644fe19-1bc4-495f-99da-be0d1831543a, err: invalid input: magic number mismatch"

Inspecting the block, it seems the data file was written with snappy as expected, but meta.json still says zstd. This lines up with the error message, which appears to come from the zstd reader checking for its magic number prefix and failing.

A couple more things to note about the CLI output below (a small cross-check sketch follows the meta.json output):
(1) Our ingesters cut blocks of fairly consistent size; on zstd they were ~350 MB, and this block is ~600 MB, which further suggests the data file really is snappy.
(2) The size field in meta.json is also wrong: it still shows the ~350 MB zstd size rather than the true data file size.

$ gsutil ls -l gs://<mybucket>/1/4644fe19-1bc4-495f-99da-be0d1831543a/
    100024  2021-09-07T14:38:40Z  gs://<mybucket>/1/4644fe19-1bc4-495f-99da-be0d1831543a/bloom-0
    100024  2021-09-07T14:38:40Z  gs://<mybucket>/1/4644fe19-1bc4-495f-99da-be0d1831543a/bloom-1
    100024  2021-09-07T14:38:40Z  gs://<mybucket>/1/4644fe19-1bc4-495f-99da-be0d1831543a/bloom-2
    100024  2021-09-07T14:38:41Z  gs://<mybucket>/1/4644fe19-1bc4-495f-99da-be0d1831543a/bloom-3
 611997809  2021-09-07T14:38:40Z  gs://<mybucket>/1/4644fe19-1bc4-495f-99da-be0d1831543a/data
    256000  2021-09-07T14:38:41Z  gs://<mybucket>/1/4644fe19-1bc4-495f-99da-be0d1831543a/index
       393  2021-09-07T14:38:41Z  gs://<mybucket>/1/4644fe19-1bc4-495f-99da-be0d1831543a/meta.json
TOTAL: 7 objects, 612654298 bytes (584.27 MiB)

$ gsutil cp gs://<mybucket>/1/4644fe19-1bc4-495f-99da-be0d1831543a/meta.json .
Copying gs://<mybucket>/1/4644fe19-1bc4-495f-99da-be0d1831543a/meta.json...
/ [1 files][  393.0 B/  393.0 B]                                                
Operation completed over 1 objects/393.0 B.       
                               
$ cat meta.json 
{"format":"v2","blockID":"4644fe19-1bc4-495f-99da-be0d1831543a","minID":"AAAAAAAAAAAAAAXhfU3CWw==","maxID":"7V8evRot612xwrcHbnhCcQ==",
"tenantID":"1","startTime":"2021-09-07T14:34:36.182817282Z","endTime":"2021-09-07T14:37:06.427198961Z",
"totalObjects":268962,"size":349737066,"compactionLevel":0,
"encoding":"zstd","indexPageSize":256000,"totalRecords":2329,"dataEncoding":"v1","bloomShards":4}

To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo r7
  2. Create blocks with zstd encoding
  3. Change the encoding in the configmap to snappy
  4. Deploy r8 (which includes an ingester update and restart)
  5. Perform queries or let compactors run

Expected behavior
It should be fine to switch an ingester's encoding between restarts. Blocks written before the change should keep the previous encoding, and new blocks should be written with the new encoding.
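
For illustration, here is a hypothetical sketch (names and types are not Tempo's) of the invariant we'd expect: the encoding is captured once when a block is cut, and both the data writer and the meta record derive from that single value, so a config change between restarts cannot leave meta.json saying zstd while the bytes are snappy.

```go
package main

import "fmt"

// block captures its encoding once, at creation time.
type block struct {
	id       string
	encoding string
}

// newBlock snapshots the encoding from whatever the config said when the
// block was cut; later config changes do not affect this block.
func newBlock(id, cfgEncoding string) *block {
	return &block{id: id, encoding: cfgEncoding}
}

// complete writes both the data file and the meta record from the block's
// own encoding, never from the live (possibly changed) config value.
func (b *block) complete() {
	fmt.Printf("writing data for %s with %s\n", b.id, b.encoding)
	fmt.Printf("writing meta.json for %s with encoding=%q\n", b.id, b.encoding)
}

func main() {
	// A block cut before the config change keeps zstd end to end...
	newBlock("older-block", "zstd").complete()
	// ...and one cut afterwards is snappy in both the data and the meta.
	newBlock("newer-block", "snappy").complete()
}
```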

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: jsonnet

Additional context
Here are the ingester logs for this block. After the restart, the ingester both replayed the WAL and rediscovered the already-completed local block; it doesn't seem like it should have done both, since only one is needed. Maybe the error is an edge case in this replay/rediscover logic (a sketch of the kind of guard that would avoid the double work follows the logs).

2021-09-07 14:37:06 | level=info ts=2021-09-07T14:37:06.427374977Z caller=flush.go:157 msg="head block cut. enqueueing flush op" userid=1 block=4644fe19-1bc4-495f-99da-be0d1831543a
2021-09-07 14:37:08 | level=info ts=2021-09-07T14:37:08.833570107Z caller=flush.go:232 msg="completing block" userid=1 blockID=4644fe19-1bc4-495f-99da-be0d1831543a
2021-09-07 14:38:17 | level=info ts=2021-09-07T14:38:17.626684172Z caller=wal.go:93 msg="beginning replay" file=4644fe19-1bc4-495f-99da-be0d1831543a:1:v2:snappy:v1 size=748082090
2021-09-07 14:38:21 | level=info ts=2021-09-07T14:38:21.974282518Z caller=wal.go:120 msg="replay complete" file=4644fe19-1bc4-495f-99da-be0d1831543a:1:v2:snappy:v1 duration=4.347593985s
2021-09-07 14:38:21 | level=info ts=2021-09-07T14:38:21.987396049Z caller=instance.go:595 msg="reloaded local block" tenantID=1 block=4644fe19-1bc4-495f-99da-be0d1831543a flushed=0001-01-01T00:00:00Z
2021-09-07 14:38:24 | level=info ts=2021-09-07T14:38:24.280352444Z caller=flush.go:232 msg="completing block" userid=1 blockID=4644fe19-1bc4-495f-99da-be0d1831543a
2021-09-07 14:38:31 | level=info ts=2021-09-07T14:38:31.560074186Z caller=flush.go:288 msg="flushing block" userid=1 block=4644fe19-1bc4-495f-99da-be0d1831543a
2021-09-07 14:38:39 | level=info ts=2021-09-07T14:38:39.082371529Z caller=flush.go:239 msg="block completed" userid=1 blockID=4644fe19-1bc4-495f-99da-be0d1831543a duration=14.802010543s
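
For what it's worth, here is a sketch (not Tempo's code; names are hypothetical) of the kind of guard meant above: if a block ID has already been rediscovered as a complete local block, skip re-completing it from the WAL.

```go
package main

import "fmt"

// replayWAL completes blocks from WAL files unless the same block ID was
// already rediscovered as a complete local block, in which case the WAL
// entry is skipped so the two startup paths never process the same block.
func replayWAL(walBlockIDs []string, localComplete map[string]bool) {
	for _, id := range walBlockIDs {
		if localComplete[id] {
			fmt.Printf("block %s already complete locally, skipping WAL replay\n", id)
			continue
		}
		fmt.Printf("replaying WAL and completing block %s\n", id)
	}
}

func main() {
	local := map[string]bool{
		"4644fe19-1bc4-495f-99da-be0d1831543a": true, // rediscovered at startup
	}
	replayWAL([]string{
		"4644fe19-1bc4-495f-99da-be0d1831543a", // also present in the WAL
		"another-block-id",
	}, local)
}
```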