Unable to get any kafka_consumergroup metrics from Kafka exporter #409

Closed
Mallikarjunradware opened this issue Sep 16, 2023 · 6 comments · Fixed by #441

Comments

@Mallikarjunradware

We are running the Kafka exporter and prometheus-to-sd containers in a single pod on GKE. Until 13 Sep the exporter was working fine, then it suddenly stopped exporting consumer-related metrics (consumer lag, consumer members, etc.).

Results without passing the argument "--group.filter='.+'":
[image]

Results after adding the argument "--group.filter='.+'" (the output still has no consumer metrics):
[image]

Please find the exporter details.

  • Kafka exporter version: 1.6.0 (tried 1.7.0 also)
  • Kafka version: 1.1.1

Please note that the same exporter works on 3 other Kafka clusters and reports lag details there.

So far I have taken the following actions:

  • Added the argument "--group.filter='.+'" (without it I get the error "was collected before with the same name and label values")
  • Ran a preferred replica election
  • Repartitioned the consumer_offset topic

Please help if you have come across the same issue and managed to resolve it.
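
One thing that may be worth ruling out for the "--group.filter='.+'" run: if the container args are passed without a shell, the single quotes are not stripped and become part of the pattern. The sketch below only shows how Go's regexp package treats the two spellings. It assumes the filter is compiled as a Go regular expression and matched unanchored against the group name, which the thread does not confirm, and matchesFilter is a hypothetical helper, not exporter code.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesFilter is a hypothetical helper: it compiles pattern as a Go
// regular expression and reports whether it matches anywhere in group.
func matchesFilter(pattern, group string) bool {
	return regexp.MustCompile(pattern).MatchString(group)
}

func main() {
	// The first pattern is what a shell would pass after stripping quotes;
	// the second keeps the quotes as literal characters in the pattern.
	for _, pattern := range []string{`.+`, `'.+'`} {
		fmt.Printf("pattern %q matches test-group: %v\n",
			pattern, matchesFilter(pattern, "test-group"))
	}
}
```

With literal quotes in the pattern, a group name only matches if it contains a quoted substring itself, so every ordinary group would be filtered out without any error being reported.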

@hellorill

We have the same problem. Everything worked correctly until the partitions of most topics were rebalanced. It looks like the exporter does not handle all of the responses correctly.

ch <- prometheus.MustNewConstMetric(
	consumergroupMembers, prometheus.GaugeValue, float64(len(group.Members)), group.GroupId,
)

If you print group.GroupId, group.Members, and the whole group value, you can see that they differ from one response to the next. Some of the values in our case:

...
Group id: test-group
Group members: map[]
Full group: &{0 kafka server: Request was for a consumer group that is not coordinated by this broker 16 test-group    map[] 0}
...
Group id: test-group
Group members: map[]
Full group: &{0 kafka server: Request was for a consumer group that is not coordinated by this broker 16 test-group    map[] 0}
...
Group id: test-group
Group members: map[<some valid data>]
Full group: &{0 kafka server: Not an error, why are you printing me? <some valid data>}
...

Most likely this leads to the following errors:

An error has occurred while serving metrics:

2 error(s) occurred:
* collected metric "kafka_consumergroup_members" { label:{name:"consumergroup" value:"test-group"} gauge:{value:8}} was collected before with the same name and label values
* collected metric "kafka_consumergroup_members" { label:{name:"consumergroup" value:"test-group"} gauge:{value:0}} was collected before with the same name and label values
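
To make the collision concrete, here is a minimal sketch of how three responses for the same group turn into repeated name/label combinations. The sample type and duplicateKeys function are hypothetical and only mimic the uniqueness rule the error message describes; they are not part of the exporter or of the Prometheus client library.

```go
package main

import "fmt"

// sample is a hypothetical, simplified stand-in for one exported gauge.
type sample struct {
	Name  string
	Group string
	Value float64
}

// duplicateKeys returns every name/label combination that shows up more
// than once, in the order the first repeat is seen.
func duplicateKeys(samples []sample) []string {
	seen := make(map[string]int)
	var dups []string
	for _, s := range samples {
		key := s.Name + "{consumergroup=\"" + s.Group + "\"}"
		seen[key]++
		if seen[key] == 2 {
			dups = append(dups, key)
		}
	}
	return dups
}

func main() {
	// Two error responses (empty member map) and one valid response,
	// all for the same group ID, as in the debug output above.
	samples := []sample{
		{"kafka_consumergroup_members", "test-group", 0},
		{"kafka_consumergroup_members", "test-group", 0},
		{"kafka_consumergroup_members", "test-group", 8},
	}
	fmt.Println(duplicateKeys(samples))
}
```

The values differ (0 and 8), but the key is built only from the metric name and labels, so the second and third samples repeat a combination that was already collected.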

@hellorill

I looked into this in more detail. The GroupDescription structure has Err/ErrorCode fields, but the exporter does not check them for errors in the Kafka response. As a result, the exporter always assumes the response is valid, which sometimes leads to metric collisions.

for _, group := range describeGroups.Groups {
	offsetFetchRequest := sarama.OffsetFetchRequest{ConsumerGroup: group.GroupId, Version: 1}
	if e.offsetShowAll {
		for topic, partitions := range offset {
			for partition := range partitions {
				offsetFetchRequest.AddPartition(topic, partition)
			}
		}
	} else {
		for _, member := range group.Members {
			assignment, err := member.GetMemberAssignment()
			if err != nil {
				klog.Errorf("Cannot get GetMemberAssignment of group member %v : %v", member, err)
				return
			}
			for topic, partions := range assignment.Topics {
				for _, partition := range partions {
					offsetFetchRequest.AddPartition(topic, partition)
				}
			}
		}
	}
	ch <- prometheus.MustNewConstMetric(
		consumergroupMembers, prometheus.GaugeValue, float64(len(group.Members)), group.GroupId,
	)

// Err contains the describe error as the KError type.
Err KError
// ErrorCode contains the describe error, or 0 if there was no error.
ErrorCode int16

@hellorill

Actually, adding the following check resolves the metric error.

...
		for _, group := range describeGroups.Groups {
			if group.Err != 0 {
				continue
			}

			offsetFetchRequest := sarama.OffsetFetchRequest{ConsumerGroup: group.GroupId, Version: 1}
			if e.offsetShowAll {
				for topic, partitions := range offset {
...
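
For anyone who wants to see the effect of that check in isolation, here is a self-contained sketch that does not depend on sarama. The groupDescription type is a hypothetical, trimmed-down stand-in for the per-group entry in the describe response, and memberCounts is illustrative only. Error code 16 is the value printed in the debug output earlier in the thread.

```go
package main

import "fmt"

// groupDescription is a hypothetical, trimmed-down stand-in for one
// per-group entry in a describe-groups response.
type groupDescription struct {
	ErrCode int16 // 0 means no error
	GroupID string
	Members []string
}

// memberCounts skips every description that carries an error code and
// records the member count of the remaining ones, keyed by group ID.
func memberCounts(groups []groupDescription) map[string]int {
	counts := make(map[string]int)
	for _, g := range groups {
		if g.ErrCode != 0 {
			// For example, a broker that is not the coordinator of this group.
			continue
		}
		counts[g.GroupID] = len(g.Members)
	}
	return counts
}

func main() {
	groups := []groupDescription{
		{ErrCode: 16, GroupID: "test-group"},
		{ErrCode: 16, GroupID: "test-group"},
		{ErrCode: 0, GroupID: "test-group", Members: []string{"m1", "m2"}},
	}
	fmt.Println(memberCounts(groups))
}
```

Only the error-free description contributes a value, so each group ID ends up with a single member count instead of one per response.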

@danielqsj
Owner

Closed by #441

@PedroOrona

We had the same problem, and even after we upgraded the exporter to version 1.8.0 (which includes the fix from PR #441) we still did not get the kafka_consumergroup metrics. The temporary fix was to restart the Kafka cluster.

Can someone help identify the real problem here?

@glerma

glerma commented Oct 18, 2024

Same issue for me... Getting

An error has occurred while serving metrics:

209 error(s) occurred:
* collected metric "kafka_consumergroup_members" { label:{name:"consumergroup"  value:"gid_sys_qc_086_du_818"}  gauge:{value:1}} was collected before with the same name and label values
* collected metric "kafka_consumergroup_members" { label:{name:"consumergroup"  value:"gid_ad_qc_051_du_526"}  gauge:{value:0}} was collected before with the same name and label values
* collected metric "kafka_consumergroup_members" { label:{name:"consumergroup"  value:"gid_ad_qc_196_du_963"}  gauge:{value:1}} was collected before with the same name and label values
* collected metric "kafka_consumergroup_members" { label:{name:"consumergroup"  value:"gid_sys_qc_454_du_1946"}  gauge:{value:0}} was collected before with the same name and label values
* collected metric "kafka_consumergroup_current_offset" { label:{name:"consumergroup"  value:"gid_sys_qc_454_du_1946"}  label:{name:"partition"  value:"2"}  label:{name:"topic"  value:"sys_qc_454_du_1946"}  gauge:{value:7.0516547e+07}} was collected before with the same name and label values
* collected metric "kafka_consumergroup_lag" { label:{name:"consumergroup"  value:"gid_sys_qc_454_du_1946"}  label:{name:"partition"  value:"2"}  label:{name:"topic"  value:"sys_qc_454_du_1946"}  gauge:{value:265389}} was collected before with the same name and label values
* collected metric "kafka_consumergroup_current_offset" { label:{name:"consumergroup"  value:"gid_sys_qc_454_du_1946"}  label:{name:"partition"  value:"0"}  label:{name:"topic"  value:"sys_qc_454_du_1946"}  gauge:{value:5.0154756e+07}} was collected before with the same name and label values
* collected metric "kafka_consumergroup_lag" { label:{name:"consumergroup"  value:"gid_sys_qc_454_du_1946"}  label:{name:"partition"  value:"0"}  label:{name:"topic"  value:"sys_qc_454_du_1946"}  gauge:{value:273863}} was collected before with the same name and label values
* collected metric "kafka_consumergroup_current_offset" { label:{name:"consumergroup"  value:"gid_sys_qc_454_du_1946"}  label:{name:"partition"  value:"8"}  label:{name:"topic"  value:"sys_qc_454_du_1946"}  gauge:{value:9.8748326e+07}} was collected before with the same name and label values
* collected metric "kafka_consumergroup_lag" { label:{name:"consumergroup"  value:"gid_sys_qc_454_du_1946"}  label:{name:"partition"  value:"8"}  label:{name:"topic"  value:"sys_qc_454_du_1946"}  gauge:{value:264952}} was collected before with the same name and label values

Running v1.8.0
Kafka 2.0.0
