Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Setting Prometheus Remote_write Period default value to 1m #38553

Merged
merged 25 commits into from
Apr 10, 2024

Conversation

gizas
Copy link
Contributor

@gizas gizas commented Mar 22, 2024

  • Bug

Proposed commit message

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  1. Install a local kind cluster
  2. Install prometheus in remote_write configuration in your k8s cluster
  3. Checkout this branch
  4. Navigate to beats/x-pack/metricbeat
  5. Build a new image with command PLATFORMS=linux/arm64 PACKAGES=docker mage package
  6. Load your image in your k8s cluster with
    kind load docker-image docker.elastic.co/beats/metricbeat:8.14.0 --name kind-cluster
  7. Install following metricbeat manifest
metricbeat manifest for prometheus
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metricbeat
  namespace: kube-system
  labels:
    k8s-app: metricbeat
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metricbeat
  labels:
    k8s-app: metricbeat
rules:
- apiGroups: [""]
  resources:
  - nodes
  - namespaces
  - events
  - pods
  - services
  - persistentvolumes
  - persistentvolumeclaims
  verbs: ["get", "list", "watch"]
# Enable this rule only if planing to use Kubernetes keystore
#- apiGroups: [""]
#  resources:
#  - secrets
#  verbs: ["get"]
- apiGroups: ["extensions"]
  resources:
  - replicasets
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources:
  - statefulsets
  - deployments
  - replicasets
  - daemonsets
  verbs: ["get", "list", "watch"]
- apiGroups: ["batch"]
  resources:
  - jobs
  - cronjobs
  verbs: ["get", "list", "watch"]
- apiGroups: ["storage.k8s.io"]
  resources:
    - storageclasses
  verbs: ["get", "list", "watch"]
- apiGroups:
  - ""
  resources:
  - nodes/stats
  verbs:
  - get
- nonResourceURLs:
  - "/metrics"
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: metricbeat
  # should be the namespace where metricbeat is running
  namespace: kube-system
  labels:
    k8s-app: metricbeat
rules:
  - apiGroups:
      - coordination.k8s.io
    resources:
      - leases
    verbs: ["get", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: metricbeat-kubeadm-config
  namespace: kube-system
  labels:
    k8s-app: metricbeat
rules:
  - apiGroups: [""]
    resources:
      - configmaps
    resourceNames:
      - kubeadm-config
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metricbeat
subjects:
- kind: ServiceAccount
  name: metricbeat
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: metricbeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: metricbeat
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: metricbeat
    namespace: kube-system
roleRef:
  kind: Role
  name: metricbeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: metricbeat-kubeadm-config
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: metricbeat
    namespace: kube-system
roleRef:
  kind: Role
  name: metricbeat-kubeadm-config
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: metricbeat-daemonset-config
  namespace: kube-system
  labels:
    k8s-app: metricbeat
data:
  metricbeat.yml: |-
    metricbeat.config.modules:
      # Mounted `metricbeat-daemonset-modules` configmap:
      path: ${path.config}/modules.d/*.yml
      # Reload module configs as they change:
      reload.enabled: false

    processors:
      - add_cloud_metadata:

    cloud.id: ${ELASTIC_CLOUD_ID}
    cloud.auth: ${ELASTIC_CLOUD_AUTH}

    logging.level: debug

    output.elasticsearch:
      hosts: ['https://elasticsearch:9200']
      ssl.verification_mode: none
      username: ${ELASTICSEARCH_USERNAME}
      password: ${ELASTICSEARCH_PASSWORD}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: metricbeat-daemonset-modules
  namespace: kube-system
  labels:
    k8s-app: metricbeat
data:
  prometheus.yml: |-
    - module: prometheus
      metricsets: ["remote_write"]
      host: "0.0.0.0"
      port: "9201"
      use_types: true
      rate_counters: true
---
# Deploy a Metricbeat instance per node for node metrics retrieval
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: metricbeat
  namespace: kube-system
  labels:
    k8s-app: metricbeat
spec:
  selector:
    matchLabels:
      k8s-app: metricbeat
  template:
    metadata:
      labels:
        k8s-app: metricbeat
    spec:
      serviceAccountName: metricbeat
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: metricbeat
        #image: docker.elastic.co/beats/metricbeat:8.14.0-SNAPSHOT
        image: docker.elastic.co/beats/metricbeat:8.14.0
        imagePullPolicy: Never
        args: [
          "-c", "/etc/metricbeat.yml",
          "-e",
          "-system.hostfs=/hostfs",
        ]
        env:
        - name: ELASTICSEARCH_HOST
          value: elasticsearch
        - name: ELASTICSEARCH_PORT
          value: "9200"
        - name: ELASTICSEARCH_USERNAME
          value: elastic
        - name: ELASTICSEARCH_PASSWORD
          value: changeme
        - name: ELASTIC_CLOUD_ID
          value:
        - name: ELASTIC_CLOUD_AUTH
          value:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          runAsUser: 0
        resources:
          limits:
            memory: 1500Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - name: config
          mountPath: /etc/metricbeat.yml
          readOnly: true
          subPath: metricbeat.yml
        - name: data
          mountPath: /usr/share/metricbeat/data
        - name: modules
          mountPath: /usr/share/metricbeat/modules.d
          readOnly: true
        - name: proc
          mountPath: /hostfs/proc
          readOnly: true
        - name: cgroup
          mountPath: /hostfs/sys/fs/cgroup
          readOnly: true
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: cgroup
        hostPath:
          path: /sys/fs/cgroup
      - name: config
        configMap:
          defaultMode: 0640
          name: metricbeat-daemonset-config
      - name: modules
        configMap:
          defaultMode: 0640
          name: metricbeat-daemonset-modules
      - name: data
        hostPath:
          # When metricbeat runs as non-root user, this directory needs to be writable by group (g+w)
          path: /var/lib/metricbeat-data
          type: DirectoryOrCreate
---

Note: In case of dropping metrics you might need to increase mapping limit from inside Dev Console:

PUT .ds-metricbeat-8.14.0-2024.03.28-000001/_settings
{
  "index.mapping.total_fields.limit": 99999
}

Related issues

Screenshots

Screenshot 2024-03-28 at 2 43 15 PM

Logs

With no period set: (default applies that is period:60s)

k logs -n kube-system metricbeat-nmq8d -f | grep -i 'Period for counter'
{"log.level":"info","@timestamp":"2024-03-28T12:38:39.557Z","log.origin":{"function":"github.com/elastic/beats/v7/x-pack/metricbeat/module/prometheus/remote_write.remoteWriteEventsGeneratorFactory","file.name":"remote_write/data.go","file.line":52},"message":”Period for counter: 1m0s","service.name":"metricbeat","ecs.version":"1.6.0"}

With period: 108s

k logs -n kube-system metricbeat-nr68p -f | grep -i 'Period for counter'
{"log.level":"info","@timestamp":"2024-03-28T12:41:45.434Z","log.origin":{"function":"github.com/elastic/beats/v7/x-pack/metricbeat/module/prometheus/remote_write.remoteWriteEventsGeneratorFactory","file.name":"remote_write/data.go","file.line":52},"message":”Period for counter: 1m48s","service.name":"metricbeat","ecs.version":"1.6.0"}

With period: 35m

k logs -n kube-system metricbeat-k5ssl -f | grep -i 'Period for counter'
{"log.level":"info","@timestamp":"2024-03-28T12:39:36.342Z","log.origin":{"function":"github.com/elastic/beats/v7/x-pack/metricbeat/module/prometheus/remote_write.remoteWriteEventsGeneratorFactory","file.name":"remote_write/data.go","file.line":52},"message":”Period for counter: 35m0s","service.name":"metricbeat","ecs.version":"1.6.0"}

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Mar 22, 2024
@mergify mergify bot assigned gizas Mar 22, 2024
Copy link
Contributor

mergify bot commented Mar 22, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @gizas? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 22, 2024

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Duration: 74 min 33 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @gizas

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @gizas

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @gizas

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @gizas

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @gizas

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @gizas

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @gizas

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

cc @gizas

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

cc @gizas

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

cc @gizas

CHANGELOG.next.asciidoc Outdated Show resolved Hide resolved
x-pack/metricbeat/metricbeat.reference.yml Outdated Show resolved Hide resolved
x-pack/metricbeat/module/prometheus/remote_write/data.go Outdated Show resolved Hide resolved
if duration < 60*time.Second || err != nil {
duration = time.Second * 60
}
logp.Debug("Period for counter cache for remote_write", duration.String())
Copy link
Contributor

@tetianakravchenko tetianakravchenko Mar 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now the log is:

{"log.level":"debug","@timestamp":"2024-03-28T14:49:07.238Z","log.logger":"Period for counter cache for remote_write","log.origin":{"function":"github.com/elastic/beats/v7/x-pack/metricbeat/module/prometheus/remote_write.remoteWriteEventsGeneratorFactory","file.name":"remote_write/data.go","file.line":52},"message":"1m0s","service.name":"metricbeat","ecs.version":"1.6.0"}
Suggested change
logp.Debug("Period for counter cache for remote_write", duration.String())
logp.Debug("prometheus.remote_write.cache", "Period for counter cache for remote_write: %v", duration.String())

I used the same logger as in https://github.com/elastic/beats/blob/main/x-pack/metricbeat/module/prometheus/remote_write/data.go#L89

this code is executed twice, we need to check if it is expected

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the logger !

For the number of calls now: I see that our function is called here in the init.

I dont see a second place for call. So seems that init() is called twice? Still looking

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I found it:

First call happens here and second call happens here . So we read config initially and then after reload period we check again.

So it is something that happens only on config load time.

Logs to prove above:
{"log.level":"info","@timestamp":"2024-03-29T09:11:56.648Z","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cmd/instance.(*Beat).launch","file.name":"instance/beat.go","file.line":520},"message":"metricbeat start running.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-03-29T09:11:56.648Z","log.logger":"cfgfile","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cfgfile.(*Reloader).Check","file.name":"cfgfile/reload.go","file.line":131},"message":"Checking module configs from: /usr/share/metricbeat/modules.d/*.yml","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-03-29T09:11:56.649Z","log.logger":"cfgfile","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cfgfile.LoadList","file.name":"cfgfile/cfgfile.go","file.line":204},"message":"Load config from file: /usr/share/metricbeat/modules.d/prometheus.yml","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-03-29T09:11:56.649Z","log.logger":"cfgfile","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cfgfile.(*Reloader).Check","file.name":"cfgfile/reload.go","file.line":145},"message":"Number of module configs found: 1","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-03-29T09:11:56.649Z","log.logger":"prometheus.remote_write.cache","log.origin":{"function":"github.com/elastic/beats/v7/x-pack/metricbeat/module/prometheus/remote_write.remoteWriteEventsGeneratorFactory","file.name":"remote_write/data.go","file.line":47},"message":"Period for counter cache for remote_write: 3m0s","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-29T09:11:56.649Z","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cfgfile.(*Reloader).Run","file.name":"cfgfile/reload.go","file.line":163},"message":"Config reloader started","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-03-29T09:11:56.649Z","log.logger":"cfgfile","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cfgfile.(*Reloader).Run","file.name":"cfgfile/reload.go","file.line":193},"message":"Scan for new config files","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-03-29T09:11:56.649Z","log.logger":"cfgfile","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cfgfile.LoadList","file.name":"cfgfile/cfgfile.go","file.line":204},"message":"Load config from file: /usr/share/metricbeat/modules.d/prometheus.yml","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-03-29T09:11:56.650Z","log.logger":"cfgfile","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cfgfile.(*Reloader).Run","file.name":"cfgfile/reload.go","file.line":212},"message":"Number of module configs found: 1","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-03-29T09:11:56.650Z","log.logger":"reload","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cfgfile.(*RunnerList).Reload","file.name":"cfgfile/list.go","file.line":92},"message":"Starting reload procedure, current runners: 0","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-03-29T09:11:56.650Z","log.logger":"reload","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cfgfile.(*RunnerList).Reload","file.name":"cfgfile/list.go","file.line":110},"message":"Start list: 1, Stop list: 0","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-03-29T09:11:56.650Z","log.logger":"prometheus.remote_write.cache","log.origin":{"function":"github.com/elastic/beats/v7/x-pack/metricbeat/module/prometheus/remote_write.remoteWriteEventsGeneratorFactory","file.name":"remote_write/data.go","file.line":47},"message":"Period for counter cache for remote_write: 3m0s","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-03-29T09:11:56.650Z","log.logger":"reload","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cfgfile.(*RunnerList).Reload","file.name":"cfgfile/list.go","file.line":154},"message":"Starting runner: RunnerGroup{prometheus [metricsets=1]}","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-03-29T09:11:56.650Z","log.logger":"module","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/mb/module.(*Wrapper).Start","file.name":"module/wrapper.go","file.line":129},"message":"Starting Wrapper[name=prometheus, len(metricSetWrappers)=1]","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-03-29T09:11:56.650Z","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cfgfile.(*Reloader).Run","file.name":"cfgfile/reload.go","file.line":223},"message":"Loading of config files completed.","service.name":"metricbeat","ecs.version":"1.6.0"}

(I will leave this comment as unresolved just for reference)

@@ -21,6 +25,7 @@ var defaultConfig = config{
TypesPatterns: TypesPatterns{
CounterPatterns: nil,
HistogramPatterns: nil},
Period: time.Second * 60,
}

func (c *config) Validate() error {
Copy link
Contributor

@tetianakravchenko tetianakravchenko Mar 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this function be a better fit for the period validation? check can be moved here with the info log message regarding the min value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this to the Validate() 4cfcf01

@gizas gizas requested a review from a team as a code owner March 29, 2024 10:16
return err
} else if duration < 60*time.Second {
// by default prometheus push data with the interval 60s, in order to calculate counter rate we are setting Period to 60secs accordingly
c.Period = time.Second * 60
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we log a warning saying the period needs to be at least 60s?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say for now that debug log here and the comment here are enough. Afterall this is an internal cache so not many users will want to change it

@gizas
Copy link
Contributor Author

gizas commented Apr 9, 2024

@rdner , @fearful-symmetry can you please have a review here please?

@rdner
Copy link
Member

rdner commented Apr 9, 2024

@gizas looks like you still need to run make update to regenerate docs.

@gizas gizas merged commit e798bb1 into main Apr 10, 2024
64 of 65 checks passed
@gizas gizas deleted the remote_write_ratefix branch April 10, 2024 08:56
mergify bot pushed a commit that referenced this pull request Apr 10, 2024
* Correcting period for Prometheus remote_write

* Update CHANGELOG.next.asciidoc

(cherry picked from commit e798bb1)
gizas added a commit that referenced this pull request Apr 10, 2024
…38811)

* Correcting period for Prometheus remote_write

* Update CHANGELOG.next.asciidoc

(cherry picked from commit e798bb1)

Co-authored-by: Andrew Gizas <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.13.0 Automated backport with mergify Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants