
Grafana Tempo making recurring calls to ListBlobs and Delete Blob operations for each and every Tempo blob in Azure Storage Account #3264

Closed
SimranCode opened this issue Jan 2, 2024 · 4 comments

SimranCode commented Jan 2, 2024

Hi,

We are using Grafana Tempo with an Azure Storage Account (Blob) to store traces. For some reason, Grafana Tempo makes a ListBlobs operation call for each and every trace blob created, which is driving up our storage cost rapidly every day. Does anyone have any idea why this is happening and how we can avoid these recurring calls?

Below is the Tempo config for reference:

apiVersion: v1
data:
  overrides.yaml: |
    overrides:
      "*":
        ingestion_rate_strategy: "local"
        ingestion_rate_limit_bytes: 20000000
        ingestion_burst_size_bytes: 20000000
        max_traces_per_user: 10000
        max_global_traces_per_user: 0
        max_bytes_per_trace: 500000
        max_bytes_per_tag_values_query: 500000
        block_retention: 0s
  tempo.yaml: |
    server:
      http_listen_port: 3200

    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_http:
            grpc:
            thrift_binary:
            thrift_compact:
        otlp:
          protocols:
            http:
              endpoint: 0.0.0.0:4318
            grpc:
              endpoint: 0.0.0.0:4317
        opencensus:

    ingester:
      trace_idle_period: 10s
      max_block_bytes: 1_000_000
      max_block_duration: 5m

    query_frontend:
      search:
        max_duration: 720h

    compactor:
      compaction:
        compaction_window: 1h
        max_block_bytes: 100_000_000
        block_retention: 720h
        compacted_block_retention: 10m

    storage:
      trace:
        backend: azure
        azure:
          storage_account_key: yyy
          storage_account_name: xxx
          container_name: tempo
        pool:
          max_workers: 100
          queue_depth: 10000

    overrides:
      per_tenant_override_config: /conf/overrides.yaml

And here are sample logs generated from the storage account:

02T07:50:37.7394122Z;ListBlobs;Success;200;4;3;authenticated;xxx;xxx;blob;"https://xxx.blob.core.windows.net:443/tempo?comp=list&prefix=7ee2028d-8e63-43b3-bfd4-a1e4cc60c91f%2F&restype=container&timeout=61";"/xxx/tempo";ea7c2889-401e-0070-2050-3d06f0000000;0;10.22.163.5:42296;2020-10-02;466;0;210;2948;0;;;;;;"Tempo Azure-Storage/0.15 (go1.21.3; linux)";;"d5fcd8d4-b901-4a51-5245-00e1d0352ece";;;;;;;;
2.0;2024-01-02T07:50:37.7643947Z;ListBlobs;Success;200;3;2;authenticated;xxx;xxx;blob;"https://xxx.blob.core.windows.net:443/tempo?comp=list&prefix=7f5cd117-254f-4f7e-84d2-d65594dc7461%2F&restype=container&timeout=61";"/xxx/tempo";ea7c2892-401e-0070-2950-3d06f0000000;0;10.22.163.5:42296;2020-10-02;466;0;210;2948;0;;;;;;"Tempo Azure-Storage/0.15 (go1.21.3; linux)";;"0cc2d2bb-d6d2-4a30-7276-99bc68c6532e";;;;;;;;
2.0;2024-01-02T07:50:37.7883283Z;ListBlobs;Success;200;4;3;authenticated;xxx;xxx;blob;"https://xxx.blob.core.windows.net:443/tempo?comp=list&prefix=7f62a2f3-c1e6-4561-b206-0383efc1e9d0%2F&restype=container&timeout=61";"/xxx/tempo";ea7c28a7-401e-0070-3c50-3d06f0000000;0;10.22.163.5:42296;2020-10-02;466;0;210;2948;0;;;;;;"Tempo Azure-Storage/0.15 (go1.21.3; linux)";;"8a5cea7c-d374-4bce-4f29-616314b6f4d5";;;;;;;;
2.0;2024-01-02T07:50:37.8133698Z;ListBlobs;Success;200;4;3;authenticated;xxx;xxx;blob;"https://xxx.blob.core.windows.net:443/tempo?comp=list&prefix=7f653737-ce16-4420-bd9a-f13dd8fecb08%2F&restype=container&timeout=61";"/xxx/tempo";ea7c28bb-401e-0070-4d50-3d06f0000000;0;10.22.163.5:42296;2020-10-02;466;0;210;2948;0;;;;;;"Tempo Azure-Storage/0.15 (go1.21.3; linux)";;"4153eb3a-b77d-4726-5e6c-a9d945b614af";;;;;;;;

Is there any issue with the above config? Any help would be appreciated.

Thanks in advance!

joe-elliott (Member) commented:

This is potentially either due to polling or block deletion.

Polling
We have recently made some very nice improvements to reduce backend calls due to polling. Unfortunately these are not yet in a release, but if you try the tip of main you may see an improvement.

An option you can try now is to reduce blocklist_poll_tenant_index_builders to 1 (default is 2). This will cut the number of compactors that maintain the blocklist in half. Other options here:
https://grafana.com/docs/tempo/latest/configuration/polling/
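
For reference, here is a rough sketch of where those polling options live in the Tempo config; the values are only illustrative, and the full list of options is in the docs linked above:

storage:
  trace:
    # fewer compactors maintaining the per-tenant blocklist index (default is 2)
    blocklist_poll_tenant_index_builders: 1
    # how often the backend blocklist is polled; larger values mean fewer ListBlobs calls
    blocklist_poll: 5m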

Deletion
Since you mention both list and delete in the issue title I'm suspicious of excessive block deletion. The clear block logic first lists all objects beneath the block prefix and then calls delete on each one.

I've noticed that your max block size in the ingester is 1MB but 100MB in the backend. This might result in excessive block compaction, which requires a lot of block deletion. Perhaps try raising max_block_bytes and max_block_duration in the ingester and lowering max_block_bytes in the compactor?
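
As a rough illustration of that direction (these numbers are examples only, not a tuned recommendation; the right values depend on your ingest volume):

ingester:
  max_block_bytes: 50_000_000    # raise from 1_000_000 so ingesters cut fewer, larger blocks
  max_block_duration: 30m        # raise from 5m for the same reason

compactor:
  compaction:
    max_block_bytes: 50_000_000  # lower from 100_000_000, per the suggestion above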

SimranCode (Author) commented:

Hi @joe-elliott,

Thanks for your response.

Below are the config changes I made in the ingester, compactor, and storage sections:

ingester:
  trace_idle_period: 1h
  max_block_bytes: 50_000_000
  max_block_duration: 1h
  complete_block_timeout: 24h

query_frontend:
  search:
    max_duration: 336h

compactor:
  compaction:
    compaction_window: 12h
    max_block_bytes: 50_000_000
    block_retention: 720h
    compacted_block_retention: 336h

storage:
  trace:
    backend: azure
    azure:
      storage_account_key: ${STORAGE_KEY}
      storage_account_name: xxx
      container_name: tempo
    blocklist_poll: 12h
    blocklist_poll_concurrency: 150
    blocklist_poll_tenant_index_builders: 1
    pool:
      max_workers: 100
      queue_depth: 10000

Now that the polling frequency is reduced, I do not see any issues so far while querying the data in the Grafana Tempo UI.
I am skeptical about the blocklist poll frequency (12h), though; I'm not sure whether I may face data-loss issues in the future. Could you please suggest whether this config looks good?

Thanks
Simran

joe-elliott (Member) commented:

I'm seeing some concerning settings in that config that I'll point out.

trace_idle_period: 1h - This will keep traces in memory for 1 hour after the last span is received. This will inflate Tempo memory usage and they will not be searchable with TraceQL until they are flushed. They will be retrievable by ID while in memory.

complete_block_timeout: 24h
blocklist_poll: 12h
compacted_block_retention: 336h - I think these settings will work, but they are quite extreme and we have never operated Tempo with such large polling times.

complete_block_timeout at 24h will also increase the amount of disk a Tempo ingester needs because it will keep complete blocks on disk for much longer, but it is necessary with a poll timeout of 12h.

compacted_block_retention: 336h is overkill even for these settings. Tempo would keep blocks in object storage for 14 days with this setting. You can bring this down to 24h with your other settings.

compaction_window: 12h - This will consider any blocks in a 12h window for compaction. It might be fine on your system, but in a high volume Tempo install we keep this at 5m or less to allow more compactors to participate on the most recent blocks.
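
Putting those suggestions together, an adjusted sketch might look roughly like the following; the values are illustrative, not a tested recommendation:

ingester:
  trace_idle_period: 10s            # short idle period so traces flush and become searchable quickly
  complete_block_timeout: 24h       # still needed while blocklist_poll is 12h

compactor:
  compaction:
    compaction_window: 5m           # small window so more compactors can work on the newest blocks
    compacted_block_retention: 24h  # 336h would keep compacted blocks in object storage for 14 days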

SimranCode (Author) commented:

Thanks for the response. This is my final working config:

ingester:
  trace_idle_period: 10s
  max_block_bytes: 5_000_000
  max_block_duration: 5m
  complete_block_timeout: 12h

query_frontend:
  search:
    max_duration: 336h

compactor:
  compaction:
    compaction_window: 1h
    max_block_bytes: 5_000_000
    block_retention: 720h
    compacted_block_retention: 12h

storage:
  trace:
    backend: azure
    azure:
      storage_account_key: ${STORAGE_KEY}
      storage_account_name: xxx
      container_name: tempo
    blocklist_poll: 6h
    blocklist_poll_concurrency: 100
    blocklist_poll_tenant_index_builders: 1
    blocklist_poll_jitter_ms: 500

Thanks again for the help!
