
failed to unregister all metrics from previous promtail. THIS IS A BUG #2754

Closed
kuberxy opened this issue Jan 16, 2023 · 4 comments · Fixed by grafana/loki#11882 or #6971
Labels
bug (Something isn't working) · frozen-due-to-age (Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.) · variant/operator (Related to Grafana Agent Static Operator.) · variant/static (Related to Grafana Agent Static.)

Comments

@kuberxy

kuberxy commented Jan 16, 2023

I want to use the metrics pipeline stage to count the number of error lines in my logs. Below is my configuration:

apiVersion: monitoring.grafana.com/v1alpha1
kind: PodLogs
metadata:
  labels:
    app.kubernetes.io/instance: loki
    app.kubernetes.io/name: loki
  name: mysql
  namespace: loki
spec:
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      app: mysql
  relabelings:
  - action: replace
    sourceLabels:
    - __meta_kubernetes_pod_node_name
    targetLabel: __host__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - action: replace
    replacement: mysql
    targetLabel: job
  - action: replace
    sourceLabels:
    - __meta_kubernetes_pod_container_name
    targetLabel: container
  - action: replace
    replacement: mysql
    targetLabel: cluster
  pipelineStages:
  - docker: {}
  - match:
      selector: '{container="mysql"} |~ ".*[ERROR].*"'
      stages: |
        - regex:
            expression: ".*(?P<error>[ERROR]).*"
        - metrics:
            error_log_lines_total:
              type: Counter
              description: The total number of error log lines
              source: error
              action: inc

After the loki-logs pod restarts, I get an error

$ kubectl -n loki logs -f loki-logs-2snl9 grafana-agent 
ts=2023-01-12T02:03:50.543250808Z caller=server.go:195 level=info msg="server listening on addresses" http=[::]:8080 grpc=127.0.0.1:12346 http_tls_enabled=false grpc_tls_enabled=false
ts=2023-01-12T02:03:50.579742291Z caller=node.go:85 level=info agent=prometheus component=cluster msg="applying config"
ts=2023-01-12T02:03:50.579889027Z caller=remote.go:180 level=info agent=prometheus component=cluster msg="not watching the KV, none set"
ts=2023-01-12T02:03:50.616783411Z caller=kubernetes.go:313 level=info component=logs logs_config=loki/loki component=discovery discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-01-12T02:03:50.973238948Z caller=kubernetes.go:313 level=info component=logs logs_config=loki/loki component=discovery discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-01-12T02:03:50Z level=info caller=traces/traces.go:143 msg="Traces Logger Initialized" component=traces
ts=2023-01-12T02:03:51.151720301Z caller=integrations.go:138 level=warn msg="integrations-next is enabled. integrations-next is subject to change"
ts=2023-01-12T02:03:51.220365818Z caller=reporter.go:107 level=info msg="running usage stats reporter"
ts=2023-01-12T02:03:55.336275999Z caller=entrypoint.go:246 level=info msg="reload of config file requested"
ts=2023-01-12T02:03:55.520499461Z caller=entrypoint.go:161 level=error msg="failed to update loki" err="failed to unregister all metrics from previous promtail. THIS IS A BUG"
ts=2023-01-12T02:03:55.520617191Z caller=entrypoint.go:257 level=error msg="failed to reload config file" err="changes did not apply successfully"
ts=2023-01-12T02:04:00.336698444Z caller=entrypoint.go:246 level=info msg="reload of config file requested"

When I remove the metrics pipeline stage, everything works normally. I deploy with Helm:

helm install --namespace=loki --values values.yaml loki loki-3.8.0.tgz

Please help me with this. Thanks.

@kuberxy kuberxy changed the title from "failed to update loki" err="failed to unregister all metrics from previous promtail. THIS IS A BUG" to "failed to unregister all metrics from previous promtail. THIS IS A BUG" Jan 16, 2023
@tpaschalis tpaschalis added the operator (Grafana Agent Operator related), type/operator, and bug (Something isn't working) labels Jan 16, 2023
@rfratto rfratto added the type/core label and removed the operator (Grafana Agent Operator related) and type/operator labels Jan 17, 2023
@jcreixell
Contributor

@kuberxy could you please try v0.31.0 (which was released today) and let us know if the issue is gone for you?

@kuberxy
Author

kuberxy commented Feb 2, 2023

@kuberxy could you please try v0.31.0 (which was released today) and let us know if the issue is gone for you?

The issue is not gone:

$ kubectl -n loki logs loki-logs-sbfp8 grafana-agent
ts=2023-02-02T07:51:04.895353776Z caller=server.go:191 level=info msg="server listening on addresses" http=[::]:8080 grpc=127.0.0.1:12346 http_tls_enabled=false grpc_tls_enabled=false
ts=2023-02-02T07:51:04.896050123Z caller=node.go:85 level=info agent=prometheus component=cluster msg="applying config"
ts=2023-02-02T07:51:04.896177889Z caller=remote.go:180 level=info agent=prometheus component=cluster msg="not watching the KV, none set"
ts=2023-02-02T07:51:04.906062725Z caller=promtail.go:123 level=info component=logs logs_config=loki/loki msg="Reloading configuration file" md5sum=6e7026368fd78b062c8c20e5ce9e9b8a
ts=2023-02-02T07:51:04.907204839Z caller=kubernetes.go:326 level=info component=logs logs_config=loki/loki component=discovery discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-02-02T07:51:04.908589284Z caller=zapadapter.go:78 level=info component=traces msg="Traces Logger Initialized"
ts=2023-02-02T07:51:04.908651471Z caller=integrations.go:156 level=warn msg="integrations-next is enabled. integrations-next is subject to change"
ts=2023-02-02T07:51:04.911183301Z caller=reporter.go:103 level=info msg="running usage stats reporter"
ts=2023-02-02T07:51:08.808643551Z caller=entrypoint.go:337 level=info msg="reload of config file requested"
ts=2023-02-02T07:51:08.846228489Z caller=entrypoint.go:167 level=error msg="failed to update loki" err="failed to unregister all metrics from previous promtail. THIS IS A BUG"
ts=2023-02-02T07:51:08.846331922Z caller=entrypoint.go:348 level=error msg="failed to reload config file" err="changes did not apply successfully"
ts=2023-02-02T07:51:13.805653552Z caller=entrypoint.go:337 level=info msg="reload of config file requested"

$ kubectl -n loki describe pod loki-grafana-agent-operator-5df4cf9c46-ttf6r
...
Containers:
  grafana-agent-operator:
    Container ID:  docker://517a2247c825b95d8e67fb0e79d8f5ceac8b44817c2aba9da4951e9cd1c703c5
    Image:         docker.io/grafana/agent-operator:v0.31.0
    Image ID:      docker-pullable://grafana/agent-operator@sha256:01aace5fa7943dfaf1edc55e6562cd12e6888fa569ebb5a45962d3fef81407ac
    Port:          <none>
    Host Port:     <none>
    Args:
      --kubelet-service=default/kubelet
    State:          Running
      Started:      Thu, 02 Feb 2023 15:30:00 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wgzs8 (ro)
Conditions:
...

@tpaschalis tpaschalis added this to the v0.34.0 milestone Feb 28, 2023
@rfratto rfratto modified the milestones: v0.34.0, v0.35.0 Jun 12, 2023
@mattdurham mattdurham removed this from the v0.35.0 milestone Jul 25, 2023
@tpaschalis tpaschalis added this to the v0.37.0 milestone Sep 5, 2023
@tpaschalis tpaschalis modified the milestones: v0.37.0, v0.38.0 Oct 6, 2023
@rfratto rfratto removed this from the v0.38.0 milestone Nov 2, 2023
@ptodev ptodev self-assigned this Jan 24, 2024
@ptodev
Contributor

ptodev commented Jan 25, 2024

I had a look at this today. Apparently Prometheus has a feature called "unchecked collectors". It is mentioned in their docs:

Alternatively, you could return no Desc at all, which will mark the Collector “unchecked”. No checks are performed at registration time, but metric consistency will still be ensured at scrape time, i.e. any inconsistencies will lead to scrape errors. Thus, with unchecked Collectors, the responsibility to not collect metrics that lead to inconsistencies in the total scrape result lies with the implementer of the Collector.

The underlying loki library uses unchecked collectors because its Describe implementation doesn't send any Desc. One problem with unchecked collectors is that even though Register() doesn't return an error for them, Unregister() fails, because the registry cannot match the collector by its descriptors.
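
For illustration, here is a minimal, self-contained sketch (not taken from the loki code; the collector and metric names are made up) of how an unchecked collector behaves: Register() accepts it without any checks, but Unregister() cannot match it by its descriptors and reports failure, which the Agent then surfaces as the error above.

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// uncheckedCollector sends no *Desc from Describe, which is what marks a
// collector as "unchecked" in the Prometheus client library.
type uncheckedCollector struct {
	counter prometheus.Counter
}

func (u *uncheckedCollector) Describe(chan<- *prometheus.Desc) {} // no Desc => unchecked

func (u *uncheckedCollector) Collect(ch chan<- prometheus.Metric) {
	ch <- u.counter
}

func main() {
	reg := prometheus.NewRegistry()
	c := &uncheckedCollector{
		counter: prometheus.NewCounter(prometheus.CounterOpts{Name: "error_log_lines_total"}),
	}

	// No checks are performed at registration time, so this succeeds.
	fmt.Println("register:", reg.Register(c)) // <nil>

	// Unregister cannot identify the collector by its (empty) set of Descs
	// and reports false, which is what gets turned into the
	// "failed to unregister all metrics" error.
	fmt.Println("unregister:", reg.Unregister(c)) // false
}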

I tried to enhance agent/pkg/util/unregisterer.go with a function like this, so that it doesn't unregister unchecked collectors:

// isUncheckedCollector reports whether a collector sends no Desc from
// Describe, i.e. whether Prometheus treats it as "unchecked".
func isUncheckedCollector(c prometheus.Collector) bool {
	descChan := make(chan *prometheus.Desc, 10)
	go func() {
		c.Describe(descChan)
		close(descChan)
	}()

	// Count the descriptions the collector describes itself with.
	i := 0
	for range descChan {
		i++
	}

	return i == 0
}
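
For context, here is a hedged sketch of how such a check could be wired into an UnregisterAll-style cleanup. The struct and its fields (wrapped, cs) are illustrative assumptions, not necessarily what agent/pkg/util/unregisterer.go actually looks like:

// Sketch only: the wrapped registerer and the tracked-collector set are
// assumptions about how the Unregisterer could be structured.
type Unregisterer struct {
	wrapped prometheus.Registerer
	cs      map[prometheus.Collector]struct{}
}

// UnregisterAll unregisters every tracked collector and reports whether all
// of them could actually be removed from the wrapped registerer.
func (u *Unregisterer) UnregisterAll() bool {
	ok := true
	for c := range u.cs {
		// Skip unchecked collectors: prometheus.Registry.Unregister can never
		// match them by Desc, so trying would always report failure.
		if isUncheckedCollector(c) {
			continue
		}
		if !u.wrapped.Unregister(c) {
			ok = false
		}
	}
	u.cs = map[prometheus.Collector]struct{}{}
	return ok
}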

This got rid of the error, but unfortunately the metric stopped working properly. It was still visible on /metrics, but it no longer incremented when new log lines came in. I suspect this is because the Prometheus registry is still holding on to the old unchecked collector and uses it in Gather instead of the new one.

I suspect the reason unchecked collectors are used here is that, due to the max_idle_duration config entry, a metric might disappear during the runtime of the Agent. A Describe implementation should always return the same set of descriptions, so it wouldn't work well in this case.

I would like to try to get rid of the unchecked collectors in the loki code, and instead switch to normal collectors. There are several reasons:

  • It will make it much easier to wipe out all old metrics during a config reload.
  • It will be safer, because we will be able to ensure that an unchecked metric doesn't clash with a normal one.
  • There are some edge cases with max_idle_duration. What happens if the metric is almost at the end of its max_idle_duration, but then the config is reloaded? It would be logical to reset our max_idle_duration. This might be harder to do when we use unchecked collectors.
  • It's really not clear to me how to make Prometheus forget about the old unchecked Collector. The only solution I can think of is to somehow make it a no-op in our code when the config is refreshed, but this sounds really messy.

@ptodev
Contributor

ptodev commented Feb 1, 2024

I tried using normal collectors such as prometheus.Counter, but it wasn't possible. Apparently, Prometheus registries don't allow registering the same metric multiple times if it has different const labels¹. I will keep using unchecked collectors, but I'll add a function like this to the metricVec:

// DeleteAll drops every stored metric so that subsequent Collect calls
// report nothing for this vector.
func (c *metricVec) DeleteAll() {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	c.metrics = map[model.Fingerprint]prometheus.Metric{}
}

This will cause Collect to collect nothing, because all the metrics will have been deleted.
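
For context, a sketch of the Collect side that pairs with this; the real metricVec in loki's stages package may differ in detail (for example, it also handles max_idle_duration pruning):

func (c *metricVec) Collect(ch chan<- prometheus.Metric) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	// After DeleteAll() this map is empty, so nothing is emitted and the
	// stale series disappear from /metrics on the next scrape.
	for _, m := range c.metrics {
		ch <- m
	}
}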

I just need to find a good enough place to call DeleteAll(). Using the Stage interface's Run(chan Entry) chan Entry function is not enough, because Run could be called multiple times over the lifetime of the Pipeline. In fact, it has to be, because the output is only made available in the return statement of Run. A better design would have been to have Run accept input and output channels, run it once asynchronously for the process lifetime, and have it clean up its state when the input channel is closed.

We need to persist the metrics between calls to Run. My plan for now is to:

  1. Add a new Cleanup function to the Stage interface:
    type Stage interface {
        Name() string
        Run(chan Entry) chan Entry
        Cleanup()
    }
    
  2. Add a new Cleanup function to the Pipeline struct in loki/clients/pkg/logentry/stages/pipeline.go (a sketch of what this could look like follows this list).
  3. Call the Pipeline Cleanup function whenever we create a new loki.NewEntryHandler in the Agent repo². This will then be called when we call promtail.Shutdown().
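
A minimal sketch of what steps 1–2 could look like, assuming the Pipeline keeps its ordered stages in a stages []Stage field (the actual field name in pipeline.go may differ):

// Cleanup asks every stage to drop its per-config state; the metrics stage,
// for example, would call metricVec.DeleteAll() from its own Cleanup.
func (p *Pipeline) Cleanup() {
	for _, s := range p.stages {
		s.Cleanup()
	}
}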

Also, I found another bug with the way these metrics are collected. Apparently, Collect needs to be called in order for the max_idle_duration cleanup to kick in. All metrics are collected, even the ones which were meant to expire a long time ago. So if your scrape interval is long compared to max_idle_duration, you will still end up collecting (once) all those metrics that were supposed to be deleted. I will fix this as well, but probably in a separate PR.

Footnotes

  1. I could have sworn that I've been able to register identical metrics with differing const labels in the past. But now I realise that back then the registry was returning errors, and I hadn't caught the error at first.

  2. I'm not sure if I will also update Promtail with this fix. While I was browsing the Promtail code, I think I came across a feature for reloading the config, so it might be an issue for Promtail too. For now I might just fix the Agent.

@github-actions github-actions bot added the frozen-due-to-age (Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.) label Aug 15, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2024