
[Connector/Servicegraph] Servicegraph connector generating high volume of metrics #34843

Open
VijayPatil872 opened this issue Aug 26, 2024 · 2 comments
Labels: bug, connector/servicegraph

Comments

VijayPatil872 commented Aug 26, 2024

Component(s)

connector/servicegraph

What happened?

Description

We are observing a high volume of metrics generated by the servicegraph connector, which we use to build a service graph. In front of the trace Collectors that run the spanmetrics and servicegraph connector processing, we have deployed a layer of Collectors containing the load-balancing exporter; it hashes the trace ID consistently to determine which Collector backend should receive the spans for that trace. The servicegraph metrics are exported to VictoriaMetrics with the prometheusremotewrite exporter. To illustrate the issue: at a received span rate of approximately 6.95K (mean), servicegraph produces close to 18K metric points.
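
For reference, the load-balancing layer looks roughly like the sketch below (the resolver hostname and port are placeholders, not our actual values); routing_key: traceID makes the exporter hash on the trace ID so that all spans of a trace land on the same backend Collector:

  exporters:
    loadbalancing:
      routing_key: traceID
      protocol:
        otlp:
          tls:
            insecure: true
      resolver:
        dns:
          hostname: traces-collectors.example.svc.cluster.local
          port: 4317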

Steps to Reproduce

Expected Result

The number of generated metrics should be lower, i.e., in line with what the servicegraph connector would reasonably be expected to produce for the received span rate.

Actual Result

Span rate received: (screenshot)
Metric point rate: (screenshot)

Collector version

0.104.0

Environment information

No response

OpenTelemetry Collector configuration

config:
  exporters:
    prometheusremotewrite/mimir-default-processor-spanmetrics:
      endpoint: 
      headers:
        x-scope-orgid: 
      resource_to_telemetry_conversion:
        enabled: true
      timeout: 30s
      tls:
        insecure: true
      remote_write_queue:
        enabled: true
        queue_size: 100000
        num_consumers: 500        

    prometheusremotewrite/mimir-default-servicegraph:
      endpoint: 
      headers:
        x-scope-orgid: 
      resource_to_telemetry_conversion:
        enabled: true
      timeout: 30s  
      tls:
        insecure: true
      remote_write_queue:
        enabled: true
        queue_size: 100000
        num_consumers: 500

  connectors:
    spanmetrics:
      histogram:
        explicit:
          buckets: [100ms, 500ms, 2s, 5s, 10s, 20s, 30s]
      aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
      metrics_flush_interval: 15s
      metrics_expiration: 5m
      exemplars:
        enabled: false
      dimensions:
        - name: http.method
        - name: http.status_code
        - name: cluster
        - name: collector.hostname
      events:
        enabled: true
        dimensions:
          - name: exception.type
      resource_metrics_key_attributes:
        - service.name
        - telemetry.sdk.language
        - telemetry.sdk.name
    servicegraph:
      latency_histogram_buckets: [100ms, 250ms, 1s, 5s, 10s]
      store:
        ttl: 2s
        max_items: 10

  receivers:
    otlp:
      protocols:
        http:
          endpoint: ${env:MY_POD_IP}:4318
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
  service:
    pipelines:
      traces/connector-pipeline:
        exporters:
          - otlphttp/tempo-processor-default
          - spanmetrics
          - servicegraph
        processors:
          - batch          
          - memory_limiter
        receivers:
          - otlp
     
      metrics/spanmetrics:
        exporters:
          - debug
          - prometheusremotewrite/mimir-default-processor-spanmetrics
        processors:
          - batch          
          - memory_limiter
        receivers:
          - spanmetrics

      metrics/servicegraph:
        exporters:
          - debug
          - prometheusremotewrite/mimir-default-servicegraph
        processors:
          - batch          
          - memory_limiter
        receivers:
          - servicegraph
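
The debug exporter referenced in the pipelines above is not shown in the exporters section and presumably runs with defaults; to inspect which series dominate the metric volume, its verbosity could be raised (a sketch, assuming the standard debug exporter options):

  exporters:
    debug:
      verbosity: detailed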

Log output

No response

Additional context

No response

@VijayPatil872 added the bug and needs triage labels on Aug 26, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@JaredTan95

I think it's possible that your span names are causing a high-cardinality issue; you need to check the specific metric data.
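
For example (an illustrative sketch, not tested against this setup), span names that embed variable parts such as IDs could be normalized with the transform processor before the spans reach the connectors, collapsing them into a bounded set of names:

  processors:
    transform/normalize-span-names:
      trace_statements:
        - context: span
          statements:
            - replace_pattern(name, "/users/[0-9]+", "/users/{id}")

The processor would then need to be added to the traces/connector-pipeline processors list to take effect.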
