
huge variations in Memory consumption by different pods -- Alloy daemonset #1992

Open
dili-pk opened this issue Oct 29, 2024 · 0 comments
Labels
bug Something isn't working

Comments


dili-pk commented Oct 29, 2024

What's wrong?

We are in a testing phase with Grafana Alloy on a very large infrastructure.
We currently have Grafana Alloy deployed on a cluster running 30 nodes (including 3 control planes).
We noticed strange behavior where only 1-3 pods consume significantly more memory than the others.
For example, most pods stay below 600Mi, while a few (1-3 pods) use more than 1Gi.
We are unable to find a valid explanation for this.
We also deployed the same setup on a smaller 4-node cluster and saw the same behavior: a single pod consumed considerably more memory (more than 50% more) than the others.
This makes it difficult for us to set requests and limits for the Alloy DaemonSet, since there is such a large difference in memory utilization between a few pods.
Clustering itself appears to be working perfectly fine.
There are no error logs that point to the actual issue.

Is this expected, or a bug?

grafana-alloy-5gmwj                             16m          540Mi
grafana-alloy-5wwhb                             20m          405Mi
grafana-alloy-6x4cw                             22m          518Mi
grafana-alloy-76s6r                             18m          423Mi
grafana-alloy-7jj4k                             19m          497Mi
grafana-alloy-8xl47                             21m          468Mi
grafana-alloy-bxbjd                             19m          586Mi
grafana-alloy-cjrb5                             24m          1116Mi
grafana-alloy-ckrqs                             19m          396Mi
grafana-alloy-cwwbd                             20m          589Mi

Only grafana-alloy-cjrb5 is using substantially more memory than the other pods.
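For context on why this can happen: Alloy's clustering mode distributes scrape targets across instances via consistent hashing, which is only *approximately* even, so one instance can end up owning more (or heavier) targets and therefore more series in memory. A standalone Python sketch of the kind of imbalance hashing alone produces (this is an illustration with rendezvous hashing, not Alloy's actual implementation):

```python
import hashlib
from collections import Counter

def owner(target: str, nodes: list[str]) -> str:
    # Rendezvous (highest-random-weight) hashing: every node scores the
    # target and the highest score wins. This mimics deterministic,
    # coordination-free sharding of scrape targets across a cluster.
    def score(node: str) -> int:
        h = hashlib.sha256(f"{node}/{target}".encode()).hexdigest()
        return int(h, 16)
    return max(nodes, key=score)

# 10 hypothetical collector instances and 1,000 synthetic scrape targets.
nodes = [f"alloy-{i}" for i in range(10)]
targets = [f"10.0.{i // 256}.{i % 256}:9100" for i in range(1000)]

counts = Counter(owner(t, nodes) for t in targets)
print(sorted(counts.values()))
# The per-instance share is not exactly 100; some instances own
# noticeably more targets than others purely from hashing variance.
```

If one pod happens to own a few expensive targets (e.g. the kube-apiserver endpoint, which is dropped only *after* scraping in this config), the memory skew gets even larger than target counts alone suggest.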

Steps to reproduce

Deploy the Grafana Alloy DaemonSet in clustering mode.

System information

Kubernetes v1.29

Software version

Grafana alloy v1.4

Configuration

  config.alloy: |-
    logging {
      level  = "info"
      format = "logfmt"
    }
    prometheus.remote_write "amp" {
      endpoint {
        url = "remotewriteurl"
        proxy_url = "XXXXXXXXXX"
        sigv4 {
          region = "XXXXXXXX"
          access_key = "XXXXXXXXX"
          secret_key = "XXXXXXXXX"
        }
      }
    }
    // Node Exporter section
    prometheus.exporter.unix "nodeexporter" {
      set_collectors = ["cpu", "diskstats", "filesystem", "meminfo", "mountstats", "netclass", "netdev", "netstat", "pressure", "processes", "systemd"]
      enable_collectors = ["cpu", "diskstats", "filesystem", "meminfo", "mountstats", "netclass", "netdev", "netstat", "pressure", "processes", "systemd"]
      procfs_path = "/host/proc"
      rootfs_path = "/host"
      sysfs_path = "/host/sys"
      udev_data_path = "/host/run/udev/data"
      systemd {
        unit_include = "^(.+\\.service|var-lib\\.mount|var-log\\.mount)$"
      }
      ethtool {
        metrics_include = ".+_allowance_.+"
      }
      cpu {
        flags_include = ".+"
      }
      textfile {
        directory = "/text-metrics"
      }
    }
    prometheus.scrape "node_exporter" {
      targets = prometheus.exporter.unix.nodeexporter.targets
      scrape_interval = "60s"
      clustering {
        enabled = true
      }
      forward_to = [prometheus.relabel.add_cluster_labels.receiver]
    }
    
    // service endpoints section
    discovery.kubernetes "endpoints" {
      role = "endpoints"
    }
    discovery.relabel "honor_service_annotations" {
      targets = discovery.kubernetes.endpoints.targets
      rule {
        source_labels = ["__meta_kubernetes_service_annotation_prometheus_io_scrape"]
        regex = "(true)"
        action = "keep"
      }
      rule {
        source_labels = ["__meta_kubernetes_service_annotation_prometheus_io_path"]
        target_label = "__metrics_path__"
      }
      rule {
        source_labels = ["__address__", "__meta_kubernetes_service_annotation_prometheus_io_port"]
        regex = "([^:]+)(?::\\d+)?;(\\d+)"
        target_label = "__address__"
        replacement = "$1:$2"
      }
      rule {
        source_labels = ["__meta_kubernetes_service_annotation_prometheus_io_scheme"]
        regex = "(https?)"
        target_label = "__scheme__"
        action = "replace"
      }
      rule {
        source_labels = ["__meta_kubernetes_namespace"]
        regex         = "system"
        action        = "drop"
      }
    }
    prometheus.scrape "service_metrics" {
      targets = discovery.relabel.honor_service_annotations.output
      authorization {
        type = "Bearer"
        credentials_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
      }
      tls_config {
        ca_file = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
      }
      clustering {
        enabled = true
      }
      scrape_interval = "60s"
      forward_to = [prometheus.relabel.drop_unused_metrics.receiver]
    }

    prometheus.relabel "drop_unused_metrics" {
      // kube apiserver
      rule {
        source_labels = ["__name__"]
        regex = "^apiserver_.*_(total|bucket|count|sum)$"
        action = "drop"
      }
      // kube apiserver
      rule {
        source_labels = ["__name__"]
        regex = "etcd_request_duration_seconds_bucket|container_tasks_state"
        action = "drop"
      }
      // kube apiserver
      rule {
        source_labels = ["__name__"]
        regex = "rest_client_request_(duration|latency)_seconds_(bucket|count|sum)"
        action = "drop"
      }
      // etcd
      rule {
        source_labels = ["__name__"]
        regex = "grpc_server_handled_total"
        action = "drop"
      }
      rule {
        regex  = "beta_kubernetes_io_.*|failure_domain_beta_kubernetes_io_.*|topology_.*|node_kubernetes_io_.*|kubernetes_io_arch|kubernetes_io_os"
        action = "labeldrop"
      }
      forward_to = [prometheus.relabel.add_cluster_labels.receiver]
    }
   
    //Node Section
    // discover nodes
    discovery.kubernetes "nodes" {
      role = "node"
    }
    // scrape nodes
    prometheus.scrape "node_metrics" {
      targets = discovery.kubernetes.nodes.targets
      authorization {
        type = "Bearer"
        credentials_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
      }
      tls_config {
        ca_file = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
      }
      clustering {
        enabled = true
      }
      scheme = "https"
      scrape_interval = "60s"
      forward_to = [prometheus.relabel.add_cluster_labels.receiver]
    }
   
    //Blackbox Exporter Section
    // blackbox Targets
    prometheus.exporter.blackbox "probes" {
      config_file = "/etc/blackbox-exporter/config.yaml"

      target {
        name = "openstackmetadataservice" 
        address = "http://xxx.xxx.xxx.xxx"
        module = "http_2xx"
      }
      target { 
        name = "registry"
        address = "{{ .Values.alloy.registry }}"
        module = "http_2xx"
      }
      target {
        name = "internaldnsresolution"
        address = "kube-dns.kube-system.svc"
        module = "cluster_dns"
      }
      target {
        name = "externaldnsresolution"
        address = "xx.xx.xxxx.xx"
        module = "external_dns"
      }
    }
    // scrape blackbox targets
    prometheus.scrape "blackbox_targets" {
      targets = prometheus.exporter.blackbox.probes.targets
      scrape_interval = "60s"
      clustering {
        enabled = true
      }
      forward_to = [prometheus.relabel.add_cluster_labels.receiver]
    }

    //Pod Section
    //Discover pods
    discovery.kubernetes "pods" {
      role = "pod"
    }
    // To discover pods with required annotaions set on the pods
    discovery.relabel "honor_pod_scrape_annotations" {
      targets = discovery.kubernetes.pods.targets
      rule {
        source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"]
        regex = "(true)"
        action = "keep"
      }
      rule {
        source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_path"]
        target_label = "__metrics_path__"
        regex = "(.+)"
        replacement = "$1"
      }
      rule {
        source_labels = ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"]
        target_label = "__address__"
        regex = "([^:]+)(?::\\d+)?;(\\d+)"
        replacement = "$1:$2"
      }
      rule {
        source_labels = ["__meta_kubernetes_pod_annotation_prometheus_io_scheme"]
        target_label = "__scheme__"
        regex = "(https?)"
        action = "replace"
      }
    }
    discovery.relabel "honor_pod_annotations" {
      targets = discovery.relabel.honor_pod_scrape_annotations.output
      rule {
        action = "replace"
        source_labels = ["__meta_kubernetes_pod_node_name"]
        target_label = "node_name"
      }
      rule {
        source_labels = ["__meta_kubernetes_namespace"]
        regex = "^c8.*"
        action = "drop"
      }
      rule {
        source_labels = ["__meta_kubernetes_namespace"]
        regex         = "system"
        action        = "drop"
      }
      rule {
        action = "replace"
        source_labels = ["__meta_kubernetes_pod_host_ip"]
        target_label = "node_ip"
      }
      rule {
        action = "replace"
        source_labels = ["__meta_kubernetes_namespace"]
        target_label = "Kubernetes_namespace"
      }
      rule {
        action = "replace"
        source_labels = ["__meta_kubernetes_exported_namespace"]
        target_label = "namespace"
      }
      rule {
        action = "replace"
        source_labels = ["__meta_kubernetes_pod_name"]
        target_label = "pod_name"
      }
      rule {
        action = "replace"
        source_labels = ["__meta_kubernetes_pod_ip"]
        target_label = "pod_ip"
      }
      rule {
        action = "replace"
        source_labels = ["__meta_kubernetes_pod_container_name"]
        target_label = "container_name"
      }
      rule {
        action = "replace"
        source_labels = ["__meta_kubernetes_pod_container_port_number"]
        target_label = "container_port"
      }
      rule {
        action = "replace"
        source_labels = ["__meta_kubernetes_pod_label_app"]
        target_label = "app"
      }
    }
    prometheus.scrape "pods" {
      targets    = discovery.relabel.honor_pod_annotations.output
      clustering {
        enabled = true
      }
      scrape_interval = "60s"
      forward_to = [prometheus.relabel.drop_unused_pod_metrics.receiver]
    }

    prometheus.relabel "drop_unused_pod_metrics" {
      rule {
        source_labels = ["app"]
        action = "drop"
        regex = "node-exporter"
      }
      rule {
        source_labels = ["app"]
        action = "drop"
        regex = "blackbox-exporter"
      }
      rule {
        source_labels = ["__name__"]
        action = "drop"
        regex = "goldpinger_peers_response_time_s.*"
      }
      rule {
        source_labels = ["__name__"]
        regex         = "apiserver_request_duration_seconds_bucket|apiserver_admission_controller_admission_latencies_seconds_bucket|rest_client_request_duration_seconds_bucket"
        action        = "drop"
      }
      rule {
        source_labels = ["__name__"]
        regex         = "etcd_request_duration_seconds_bucket|container_tasks_state"
        action        = "drop"
      }
      rule {
        source_labels = ["__name__"]
        regex         = "rest_client_rate_limiter_duration_seconds_.*"
        action        = "drop"
      }
      forward_to = [prometheus.relabel.add_cluster_labels.receiver]
    }
    
    //cadvisor section
    discovery.relabel "metrics_cadvisor" {
        targets = discovery.kubernetes.nodes.targets
        rule {
          action = "replace"
          target_label = "__address__"
          replacement = "kubernetes.default.svc.cluster.local:443"
        }
        rule {
          source_labels = ["__meta_kubernetes_node_name"]
          regex = "(.+)"
          action = "replace"
          replacement = "/api/v1/nodes/${1}/proxy/metrics/cadvisor"
          target_label = "__metrics_path__"
        }
    }

    prometheus.scrape "cadvisor" {
      scheme = "https"
      tls_config {
          server_name = "kubernetes"
          ca_file = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
          insecure_skip_verify = false
      }
      bearer_token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
      targets = discovery.relabel.metrics_cadvisor.output
      scrape_interval = "60s"
      clustering {
        enabled = true
      }
      forward_to = [prometheus.relabel.add_cluster_labels.receiver]
    }
    
    prometheus.relabel "add_cluster_labels" {
      forward_to = [prometheus.remote_write.amp.receiver]
      rule {
        source_labels =  ["__address__"]
        target_label = "cluster"
        replacement = "{{ .Values.alloy.cluster }}"
        action = "replace"
      }
      rule {
        source_labels =  ["__address__"]
        target_label = "availability_zone"
        replacement = "{{ .Values.alloy.availaibilty_zone }}"
        action = "replace"
      }
      rule {
        source_labels =  ["__address__"]
        target_label = "stage"
        replacement = "{{ .Values.alloy.stage }}"
        action = "replace"
      }

      rule {
        source_labels =  ["__address__"]
        target_label = "version"
        replacement = "v1.29-0"
        action = "replace"
      }
      rule {
        source_labels =  ["__address__"]
        target_label = "business_criticality"
        replacement = "{{ .Values.alloy.business_criticality }}"
        action = "replace"
      }
    }
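One way to confirm whether the memory skew tracks target distribution would be to stamp every series with the pod that scraped it, then count series per collector on the AMP side. A hedged sketch of an extra rule for the existing `add_cluster_labels` component (this assumes the Alloy stdlib `sys.env` function and that the pod's `HOSTNAME` environment variable holds the pod name, which is the Kubernetes default):

```river
// Hypothetical additional rule inside prometheus.relabel "add_cluster_labels":
// tag each sample with the Alloy pod that collected it.
rule {
  source_labels = ["__address__"]
  target_label  = "collector_pod"
  replacement   = sys.env("HOSTNAME")
  action        = "replace"
}
```

With that label in place, a query such as `count by (collector_pod) ({__name__!=""})` against the remote-write destination should show how many series each DaemonSet pod is shipping, which can then be compared against the per-pod memory numbers above.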

Logs

None.

@dili-pk dili-pk added the bug Something isn't working label Oct 29, 2024