
[Connector/Servicegraph] Servicegraph connector generating high volume of metrics #34843

Open
VijayPatil872 opened this issue Aug 26, 2024 · 2 comments
Labels: bug, connector/servicegraph

Comments

VijayPatil872 commented Aug 26, 2024

Component(s)

connector/servicegraph

What happened?

Description

We are observing a high volume of metrics generated by the servicegraph connector, which we use to build a service graph. In front of the trace Collectors that run the spanmetrics and servicegraph connector processing, we have deployed a layer of Collectors containing the load-balancing exporter; it hashes the trace ID consistently to determine which Collector backend should receive the spans for that trace. The servicegraph metrics are exported to VictoriaMetrics with the prometheusremotewrite exporter. To illustrate the issue: at a received span rate of approximately 6.95K (mean), servicegraph produces close to 18K metric points.
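
For reference, the load-balancing layer looks roughly like the sketch below (the resolver hostname and port are placeholders, not our actual values); routing_key: traceID makes the exporter hash on the trace ID so that all spans of a trace land on the same backend Collector:

  exporters:
    loadbalancing:
      routing_key: traceID
      protocol:
        otlp:
          tls:
            insecure: true
      resolver:
        dns:
          hostname: traces-collectors.example.svc.cluster.local
          port: 4317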

Steps to Reproduce

Expected Result

The number of generated metrics should be lower, i.e., in line with what the servicegraph connector would reasonably be expected to produce for the received span rate.

Actual Result

Span rate received: (screenshot)
Metric point rate: (screenshot)

Collector version

0.104.0

Environment information

No response

OpenTelemetry Collector configuration

config:
  exporters:
    prometheusremotewrite/mimir-default-processor-spanmetrics:
      endpoint: 
      headers:
        x-scope-orgid: 
      resource_to_telemetry_conversion:
        enabled: true
      timeout: 30s
      tls:
        insecure: true
      remote_write_queue:
        enabled: true
        queue_size: 100000
        num_consumers: 500        

    prometheusremotewrite/mimir-default-servicegraph:
      endpoint: 
      headers:
        x-scope-orgid: 
      resource_to_telemetry_conversion:
        enabled: true
      timeout: 30s  
      tls:
        insecure: true
      remote_write_queue:
        enabled: true
        queue_size: 100000
        num_consumers: 500

  connectors:
    spanmetrics:
      histogram:
        explicit:
          buckets: [100ms, 500ms, 2s, 5s, 10s, 20s, 30s]
      aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
      metrics_flush_interval: 15s
      metrics_expiration: 5m
      exemplars:
        enabled: false
      dimensions:
        - name: http.method
        - name: http.status_code
        - name: cluster
        - name: collector.hostname
      events:
        enabled: true
        dimensions:
          - name: exception.type
      resource_metrics_key_attributes:
        - service.name
        - telemetry.sdk.language
        - telemetry.sdk.name
    servicegraph:
      latency_histogram_buckets: [100ms, 250ms, 1s, 5s, 10s]
      store:
        ttl: 2s
        max_items: 10

  receivers:
    otlp:
      protocols:
        http:
          endpoint: ${env:MY_POD_IP}:4318
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
  service:
    pipelines:
      traces/connector-pipeline:
        exporters:
          - otlphttp/tempo-processor-default
          - spanmetrics
          - servicegraph
        processors:
          - batch          
          - memory_limiter
        receivers:
          - otlp
     
      metrics/spanmetrics:
        exporters:
          - debug
          - prometheusremotewrite/mimir-default-processor-spanmetrics
        processors:
          - batch          
          - memory_limiter
        receivers:
          - spanmetrics

      metrics/servicegraph:
        exporters:
          - debug
          - prometheusremotewrite/mimir-default-servicegraph
        processors:
          - batch          
          - memory_limiter
        receivers:
          - servicegraph
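
The debug exporter referenced in the pipelines above is not shown in the exporters section and presumably runs with defaults; to inspect which series dominate the metric volume, its verbosity could be raised (a sketch, assuming the standard debug exporter options):

  exporters:
    debug:
      verbosity: detailed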

Log output

No response

Additional context

No response

@VijayPatil872 added the bug and needs triage labels on Aug 26, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@JaredTan95

I think it's possible that your span names are causing a high-cardinality issue; you need to check the specific metric data.
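
For example (an illustrative sketch, not tested against this setup), span names that embed variable parts such as IDs could be normalized with the transform processor before the spans reach the connectors, collapsing them into a bounded set of names:

  processors:
    transform/normalize-span-names:
      trace_statements:
        - context: span
          statements:
            - replace_pattern(name, "/users/[0-9]+", "/users/{id}")

The processor would then need to be added to the traces/connector-pipeline processors list to take effect.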
