
[otel-targetallocator] metric datapoints attributed to two different ServiceMonitor jobs #2922

Closed
diranged opened this issue May 2, 2024 · 7 comments
Labels: bug, needs triage

@diranged

diranged commented May 2, 2024

Component(s)

collector, receiver/prometheus

What happened?

Description

We're setting up a single collector pod in a StatefulSet configured to monitor "all the rest" of our OTEL components. This collector is called otel-collector-cluster-agent, and it uses the TargetAllocator to discover ServiceMonitor jobs that carry specific labels. We currently have two ServiceMonitors: one for collecting otelcol.* metrics from the collectors, and one for collecting opentelemetry_.* metrics from the Target Allocators.

We are seeing metrics reported from the TargetAllocator pods duplicated into two DataPoints, one attributed to the correct ServiceMonitor job and one to the wrong one:

Metric #14
Descriptor:
     -> Name: opentelemetry_allocator_targets
     -> Description: Number of targets discovered.
     -> Unit: 
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> container: Str(ta-container)
     -> endpoint: Str(targetallocation)
     -> job_name: Str(serviceMonitor/otel/otel-collector-collectors/0)  <<<< WRONG
     -> namespace: Str(otel)
     -> pod: Str(otel-collector-cluster-agent-targetallocator-bdf46447d-6stx7)
     -> service: Str(otel-collector-cluster-agent-targetallocator)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-05-02 16:16:00.917 +0000 UTC
Value: 36.000000
NumberDataPoints #1
Data point attributes:
     -> container: Str(ta-container)
     -> endpoint: Str(targetallocation)
     -> job_name: Str(serviceMonitor/otel/otel-collector-target-allocators/0)   <<<< RIGHT
     -> namespace: Str(otel)
     -> pod: Str(otel-collector-cluster-agent-targetallocator-bdf46447d-6stx7)
     -> service: Str(otel-collector-cluster-agent-targetallocator)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-05-02 16:16:00.917 +0000 UTC

When we delete the otel-collector-collectors ServiceMonitor, the behavior does not change... which is wild. However, if we delete the entire stack and namespace and then re-create it without the second ServiceMonitor, the data is correct... until we create the second ServiceMonitor again, at which point it goes bad again.
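
To double-check what the allocator has discovered as ServiceMonitors are added and removed, its HTTP API can be queried directly (presumably how the /scrape_configs dump under Steps to Reproduce was captured). The commands below are only a sketch: the Service name and the "targetallocation" port name are taken from the resources in this issue, and the exact port mapping may differ by operator version.

# Service name and "targetallocation" port name taken from the data point attributes
# and ServiceMonitor in this issue; adjust for your environment.
kubectl -n otel port-forward svc/otel-collector-cluster-agent-targetallocator 8080:targetallocation

# In a second shell:
curl --silent localhost:8080/jobs | jq .             # jobs the allocator currently knows about
curl --silent localhost:8080/scrape_configs | jq .   # the generated Prometheus scrape configs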

Steps to Reproduce

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector-cluster-agent
  namespace: otel
spec:
  args:
    feature-gates: +processor.resourcedetection.hostCPUSteppingAsString
  config: |-
    exporters:
      debug:
        sampling_initial: 15
        sampling_thereafter: 60
      debug/verbose:
        sampling_initial: 15
        sampling_thereafter: 60
        verbosity: detailed
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      pprof:
        endpoint: :1777
    receivers:
      prometheus:
        config:
          scrape_configs: []
        report_extra_scrape_metrics: true
    service:
      extensions:
      - health_check
      - pprof
      pipelines:
        metrics/debug:
          exporters:
          - debug/verbose
          receivers:
          - prometheus
      telemetry:
        logs:
          level: 'info'
  deploymentUpdateStrategy: {}
  env:
    - name: KUBE_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: OTEL_RESOURCE_ATTRIBUTES
      value: node.name=$(KUBE_NODE_NAME)
  image: otel/opentelemetry-collector-contrib:0.98.0
  ingress:
    route: {}
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 60
    periodSeconds: 30
  managementState: managed
  mode: statefulset
  nodeSelector:
    kubernetes.io/os: linux
  observability:
    metrics: {}
  podDisruptionBudget:
    maxUnavailable: 1
  priorityClassName: otel-collector
  replicas: 1
  resources:
    limits:
      memory: 4Gi
    requests:
      cpu: "1"
      memory: 2Gi
  targetAllocator:
    allocationStrategy: consistent-hashing
    enabled: true
    filterStrategy: relabel-config
    image: otel/target-allocator:0.100.0-beta@sha256:fdd7e7f5f8f3903a3de229132ac0cf98c8857b7bdaf3451a764f550f7b037c26
    observability:
      metrics: {}
    podDisruptionBudget:
      maxUnavailable: 1
    prometheusCR:
      enabled: true
      podMonitorSelector:
        monitoring.xx.com/isolated: otel-cluster-agent
      scrapeInterval: 1m0s
      serviceMonitorSelector:
        monitoring.xx.com/isolated: otel-cluster-agent
    replicas: 1
    resources:
      limits:
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 256Mi
  updateStrategy: {}
  upgradeStrategy: automatic
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/instance: otel-collector
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: otel-collector
    app.kubernetes.io/version: ""
    monitoring.xx.com/isolated: otel-cluster-agent
  name: otel-collector-collectors
  namespace: otel
spec:
  endpoints:
  - interval: 15s
    metricRelabelings:
    - action: keep
      regex: otelcol.*
      sourceLabels:
      - __name__
    port: monitoring
    scrapeTimeout: 5s
  namespaceSelector:
    matchNames:
    - otel
  selector:
    matchLabels:
      app.kubernetes.io/component: opentelemetry-collector
      app.kubernetes.io/managed-by: opentelemetry-operator
      operator.opentelemetry.io/collector-monitoring-service: Exists
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/instance: otel-collector
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: otel-collector
    app.kubernetes.io/version: ""
    monitoring.xx.com/isolated: otel-cluster-agent
  name: otel-collector-target-allocators
  namespace: otel
spec:
  endpoints:
  - interval: 15s
    metricRelabelings:
    - action: keep
      regex: opentelemetry.*
      sourceLabels:
      - __name__
    port: targetallocation
    scrapeTimeout: 5s
  namespaceSelector:
    matchNames:
    - otel
  selector:
    matchLabels:
      app.kubernetes.io/component: opentelemetry-targetallocator
      app.kubernetes.io/managed-by: opentelemetry-operator

Together, these create a /scrape_configs response that looks like this:

% curl --silent localhost:8080/scrape_configs | jq .
{
  "serviceMonitor/otel/otel-collector-collectors/0": {
    "enable_compression": true,
    "enable_http2": true,
    "follow_redirects": true,
    "honor_timestamps": true,
    "job_name": "serviceMonitor/otel/otel-collector-collectors/0",
    "kubernetes_sd_configs": [
      {
        "enable_http2": true,
        "follow_redirects": true,
        "kubeconfig_file": "",
        "namespaces": {
          "names": [
            "otel"
          ],
          "own_namespace": false
        },
        "role": "endpointslice"
      }
    ],
    "metric_relabel_configs": [
      {
        "action": "keep",
        "regex": "otelcol.*",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__name__"
        ]
      }
    ],
    "metrics_path": "/metrics",
    "relabel_configs": [
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "job"
        ],
        "target_label": "__tmp_prometheus_job_name"
      },
      {
        "action": "keep",
        "regex": "(opentelemetry-collector);true",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_label_app_kubernetes_io_component",
          "__meta_kubernetes_service_labelpresent_app_kubernetes_io_component"
        ]
      },
      {
        "action": "keep",
        "regex": "(opentelemetry-operator);true",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_label_app_kubernetes_io_managed_by",
          "__meta_kubernetes_service_labelpresent_app_kubernetes_io_managed_by"
        ]
      },
      {
        "action": "keep",
        "regex": "(Exists);true",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_label_operator_opentelemetry_io_collector_monitoring_service",
          "__meta_kubernetes_service_labelpresent_operator_opentelemetry_io_collector_monitoring_service"
        ]
      },
      {
        "action": "keep",
        "regex": "monitoring",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_endpointslice_port_name"
        ]
      },
      {
        "action": "replace",
        "regex": "Node;(.*)",
        "replacement": "${1}",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_endpointslice_address_target_kind",
          "__meta_kubernetes_endpointslice_address_target_name"
        ],
        "target_label": "node"
      },
      {
        "action": "replace",
        "regex": "Pod;(.*)",
        "replacement": "${1}",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_endpointslice_address_target_kind",
          "__meta_kubernetes_endpointslice_address_target_name"
        ],
        "target_label": "pod"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_namespace"
        ],
        "target_label": "namespace"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_name"
        ],
        "target_label": "service"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_pod_name"
        ],
        "target_label": "pod"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_pod_container_name"
        ],
        "target_label": "container"
      },
      {
        "action": "drop",
        "regex": "(Failed|Succeeded)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_pod_phase"
        ]
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "${1}",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_name"
        ],
        "target_label": "job"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "monitoring",
        "separator": ";",
        "target_label": "endpoint"
      },
      {
        "action": "hashmod",
        "modulus": 1,
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__address__"
        ],
        "target_label": "__tmp_hash"
      },
      {
        "action": "keep",
        "regex": "$(SHARD)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__tmp_hash"
        ]
      }
    ],
    "scheme": "http",
    "scrape_interval": "15s",
    "scrape_protocols": [
      "OpenMetricsText1.0.0",
      "OpenMetricsText0.0.1",
      "PrometheusText0.0.4"
    ],
    "scrape_timeout": "5s",
    "track_timestamps_staleness": false
  },
  "serviceMonitor/otel/otel-collector-target-allocators/0": {
    "enable_compression": true,
    "enable_http2": true,
    "follow_redirects": true,
    "honor_timestamps": true,
    "job_name": "serviceMonitor/otel/otel-collector-target-allocators/0",
    "kubernetes_sd_configs": [
      {
        "enable_http2": true,
        "follow_redirects": true,
        "kubeconfig_file": "",
        "namespaces": {
          "names": [
            "otel"
          ],
          "own_namespace": false
        },
        "role": "endpointslice"
      }
    ],
    "metric_relabel_configs": [
      {
        "action": "keep",
        "regex": "opentelemetry.*",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__name__"
        ]
      }
    ],
    "metrics_path": "/metrics",
    "relabel_configs": [
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "job"
        ],
        "target_label": "__tmp_prometheus_job_name"
      },
      {
        "action": "keep",
        "regex": "(opentelemetry-targetallocator);true",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_label_app_kubernetes_io_component",
          "__meta_kubernetes_service_labelpresent_app_kubernetes_io_component"
        ]
      },
      {
        "action": "keep",
        "regex": "(opentelemetry-operator);true",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_label_app_kubernetes_io_managed_by",
          "__meta_kubernetes_service_labelpresent_app_kubernetes_io_managed_by"
        ]
      },
      {
        "action": "keep",
        "regex": "targetallocation",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_endpointslice_port_name"
        ]
      },
      {
        "action": "replace",
        "regex": "Node;(.*)",
        "replacement": "${1}",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_endpointslice_address_target_kind",
          "__meta_kubernetes_endpointslice_address_target_name"
        ],
        "target_label": "node"
      },
      {
        "action": "replace",
        "regex": "Pod;(.*)",
        "replacement": "${1}",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_endpointslice_address_target_kind",
          "__meta_kubernetes_endpointslice_address_target_name"
        ],
        "target_label": "pod"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_namespace"
        ],
        "target_label": "namespace"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_name"
        ],
        "target_label": "service"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_pod_name"
        ],
        "target_label": "pod"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_pod_container_name"
        ],
        "target_label": "container"
      },
      {
        "action": "drop",
        "regex": "(Failed|Succeeded)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_pod_phase"
        ]
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "${1}",
        "separator": ";",
        "source_labels": [
          "__meta_kubernetes_service_name"
        ],
        "target_label": "job"
      },
      {
        "action": "replace",
        "regex": "(.*)",
        "replacement": "targetallocation",
        "separator": ";",
        "target_label": "endpoint"
      },
      {
        "action": "hashmod",
        "modulus": 1,
        "regex": "(.*)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__address__"
        ],
        "target_label": "__tmp_hash"
      },
      {
        "action": "keep",
        "regex": "$(SHARD)",
        "replacement": "$1",
        "separator": ";",
        "source_labels": [
          "__tmp_hash"
        ]
      }
    ],
    "scheme": "http",
    "scrape_interval": "15s",
    "scrape_protocols": [
      "OpenMetricsText1.0.0",
      "OpenMetricsText0.0.1",
      "PrometheusText0.0.4"
    ],
    "scrape_timeout": "5s",
    "track_timestamps_staleness": false
  }
}

Expected Result

We should see datapoints for opentelemetry_.* metrics that come only from the target allocator pods and are attributed once, meaning one DataPoint per target pod:

Metric #14
Descriptor:
     -> Name: opentelemetry_allocator_targets
     -> Description: Number of targets discovered.
     -> Unit: 
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> container: Str(ta-container)
     -> endpoint: Str(targetallocation)
     -> job_name: Str(serviceMonitor/otel/otel-collector-target-allocators/0)   <<<< RIGHT
     -> namespace: Str(otel)
     -> pod: Str(otel-collector-cluster-agent-targetallocator-bdf46447d-6stx7)
     -> service: Str(otel-collector-cluster-agent-targetallocator)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-05-02 16:16:00.917 +0000 UTC
Value: 36.000000

Collector version

0.98.0

Environment information

Environment

OS: BottleRocket 1.19.2

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

Kubernetes Version

1.28

Operator version

0.98.0


diranged added the bug and needs triage labels on May 2, 2024
@diranged
Author

diranged commented May 2, 2024

This is a duplicate of open-telemetry/opentelemetry-collector-contrib#32828 - but opened in the operator project in case this is an operator/allocator issue.

@jaronoff97
Contributor

@diranged either I or @swiatekm-sumo hope to take a look at this next week; we currently have other priorities with getting the releases out.

@jaronoff97
Contributor

We've both had difficult weeks and haven't gotten to this yet - it's still on our list!!

@diranged
Author

Thanks @jaronoff97 - no worries... I appreciate you looking at it when you have time!

swiatekm self-assigned this on Jun 18, 2024
swiatekm removed the discuss-at-sig label on Jun 18, 2024
@swiatekm
Contributor

swiatekm commented Jun 19, 2024

Finally got around to investigating this, thanks a lot for your patience. In essence, this is working as intended. This only happens for the opentelemetry_allocator_targets metric, because job_name is simply a label on this metric - each time series shows how many targets were allocated for a given job. Here is what the raw output from the Prometheus endpoint looks like:

# HELP opentelemetry_allocator_targets Number of targets discovered.
# TYPE opentelemetry_allocator_targets gauge
opentelemetry_allocator_targets{job_name="serviceMonitor/otel/otel-collector-collectors/0"} 2
opentelemetry_allocator_targets{job_name="serviceMonitor/otel/otel-collector-target-allocators/0"} 2

In this case, job_name refers to the job the metric is about. If you want the job that scraped it, you should look at the service.name resource attribute.
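
If the overlap is still confusing downstream, one option (a sketch only, not something tested as part of this issue) is to rename the allocator's own job_name metric label at scrape time so it cannot be mistaken for the scraping job; allocator_job_name below is an arbitrary label name chosen for illustration, and the rest of the fragment mirrors the otel-collector-target-allocators ServiceMonitor shown above.

  endpoints:
  - interval: 15s
    port: targetallocation
    scrapeTimeout: 5s
    metricRelabelings:
    - action: keep
      regex: opentelemetry.*
      sourceLabels:
      - __name__
    # Copy the allocator's own job_name label into a differently named label...
    - action: replace
      sourceLabels:
      - job_name
      targetLabel: allocator_job_name
    # ...then drop the original so it cannot be confused with the scraping job.
    - action: labeldrop
      regex: job_name

Either way, the scraping job itself remains available via the service.name resource attribute, so this renaming only changes how the data point attributes read.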

@diranged
Author

Thank you for digging in - I will go back and re-test my setup and see if this aligns with what we're seeing. It'll be a few weeks, as I'm out on vacation right now. Thanks again though!

@jaronoff97
Contributor

@diranged i'm going to close this for now, feel free to reopen if this is still an issue!
