Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Prometheus metric types parser #39743

Merged

Conversation

gpop63
Copy link
Contributor

@gpop63 gpop63 commented May 27, 2024

Overview

When processing a group of metrics of this format, the metric type variable is set to the latest metric type in the group, even if their types might differ. This happens because the parser thinks this is only 1 metric. This change stores the metric types for each base metric name to avoid this issue.

# HELP nginx_sts_server_bytes_total The request/response bytes
# TYPE nginx_sts_server_bytes_total counter
# HELP nginx_sts_server_connects_total The connects counter
# TYPE nginx_sts_server_connects_total counter
# HELP nginx_sts_server_session_seconds_total The session duration time
# TYPE nginx_sts_server_session_seconds_total counter
# HELP nginx_sts_server_session_seconds The average of session duration time in seconds
# TYPE nginx_sts_server_session_seconds gauge
# HELP nginx_sts_server_session_duration_seconds The histogram of session duration in seconds
# TYPE nginx_sts_server_session_duration_seconds histogram
nginx_sts_server_bytes_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",direction="in"} 0
nginx_sts_server_bytes_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",direction="out"} 0
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="1xx"} 0
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="2xx"} 0
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="3xx"} 0
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="4xx"} 0
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="5xx"} 171
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="total"} 171
nginx_sts_server_session_seconds_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP"} 0.016
nginx_sts_server_session_seconds{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP"} 0.000

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 27, 2024
@mergify mergify bot assigned gpop63 May 27, 2024
Copy link
Contributor

mergify bot commented May 27, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @gpop63? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@gpop63
Copy link
Contributor Author

gpop63 commented May 27, 2024

/test

@ycombinator ycombinator added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label May 28, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label May 28, 2024
@gpop63
Copy link
Contributor Author

gpop63 commented May 29, 2024

/test

@gpop63 gpop63 marked this pull request as ready for review June 11, 2024 11:18
@gpop63 gpop63 requested a review from a team as a code owner June 11, 2024 11:18
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@gpop63
Copy link
Contributor Author

gpop63 commented Jun 11, 2024

/test


result, err := ParseMetricFamilies([]byte(input), ContentTypeTextFormat, time.Now(), nil)
if err != nil {
t.Fatalf("ParseMetricFamilies for content type %s returned an error.", ContentTypeTextFormat)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Blocker]
Please log the error as well. Otherwise if this test fail it'll be difficult to understand why it failed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpop63 could you please apply the requested change here?

nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="5xx"} 171
nginx_sts_server_connects_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP",code="total"} 171
nginx_sts_server_session_seconds_total{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP"} 0.016
nginx_sts_server_session_seconds{listen="TCP:8091:127.0.0.1",port="8091",protocol="TCP"} 0.000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion]

2 things:

  • from this the metric name of nginx_sts_server_session_seconds_total would be nginx_sts_server_session_seconds and the metric name of nginx_sts_server_session_seconds would be nginx_sts_server_session. Just checking, is my understand correct and is it the expected behaviour?
  • wouldn't be good to add 2 metrics with the same base name, like in your example summary_metric_count and summary_metric_sum?

Copy link
Contributor Author

@gpop63 gpop63 Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My example was for the metric family below (we have a test for this). The metric family is called summary_metric and it includes various metrics such as summary_metric_sum, summary_metric_count, etc. We only want to store the metric type for the metric family summary_metric since all its metrics share the same type, which is summary.

# TYPE summary_metric summary
summary_metric{quantile="0.5"} 29735
summary_metric{quantile="0.9"} 47103
summary_metric{quantile="0.99"} 50681
summary_metric{noquantile="0.2"} 50681
summary_metric_sum 234892394
summary_metric_count 44000
summary_metric_impossible 123
# EOF

buf, t := parser.Type()
s := string(buf)
fam, ok = metricFamiliesByName[s]
if !ok {
fam = &MetricFamily{Name: &s, Type: t}
metricFamiliesByName[s] = fam
} else {
fam.Type = t
}
// Store the metric type for each base metric name.
metricTypes[s] = t

s represents the metric family name (e.g., summary_metric). Only the metric family name is stored in the map, without its individual metrics.

For the examples you mentioned, nginx_sts_server_session_seconds_total and nginx_sts_server_session_seconds, these metrics are already the metric family names (they don't have any child metrics) and thus would not trigger the !ok condition. The !ok condition was added as an edge case for families that have multiple child metrics with different suffixes.

@pierrehilbert pierrehilbert requested review from VihasMakwana, fearful-symmetry and AndersonQ and removed request for leehinman June 24, 2024 15:35
@pierrehilbert
Copy link
Collaborator

@fearful-symmetry / @VihasMakwana could you please review this PR?

Copy link
Contributor

@fearful-symmetry fearful-symmetry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't have much context for Prometheus specifically, but the code seems fine.

@pierrehilbert
Copy link
Collaborator

@gpop63 are we expecting to backport this in 8.14?

@pierrehilbert
Copy link
Collaborator

@gpop63 could you have a look here please?

@gpop63
Copy link
Contributor Author

gpop63 commented Jul 2, 2024

@pierrehilbert Sorry for the late reply, we could backport it to 8.14, yes.

@gpop63 gpop63 added the backport-v8.14.0 Automated backport with mergify label Jul 2, 2024
@pierrehilbert pierrehilbert merged commit 35158d8 into elastic:main Jul 2, 2024
19 checks passed
mergify bot pushed a commit that referenced this pull request Jul 2, 2024
* store metric types in a map

* add test case

* include error in fatal log

* add changelog entry

(cherry picked from commit 35158d8)
pierrehilbert pushed a commit that referenced this pull request Jul 2, 2024
* store metric types in a map

* add test case

* include error in fatal log

* add changelog entry

(cherry picked from commit 35158d8)

Co-authored-by: Gabriel Pop <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.14.0 Automated backport with mergify Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prometheus parser incorrectly classifies counters as histograms
7 participants