-
Notifications
You must be signed in to change notification settings - Fork 103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for untyped metrics from Ops Agent #668
Conversation
Can we put this behind an alpha feature gate? Otherwise, I would considering it a breaking change when we remove the behavior |
Codecov Report
@@ Coverage Diff @@
## main #668 +/- ##
==========================================
+ Coverage 68.56% 68.99% +0.42%
==========================================
Files 36 36
Lines 4559 4647 +88
==========================================
+ Hits 3126 3206 +80
- Misses 1280 1288 +8
Partials 153 153
... and 3 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
We can, but the mechanism (the attribute) isn't user-facing, so I think changing it wouldn't be breaking. Of course there's nothing to stop users from adding this attribute themselves anyway and hacking it. |
dc07395
to
678526b
Compare
Should this be specific to the GMP Exporter? I'm OK with this version being in the googlecloud exporter, but I think we want it in GMP parts eventually. |
678526b
to
d8d80d0
Compare
exporter/collector/metrics.go
Outdated
Value: value, | ||
}}, | ||
Metric: &metricpb.Metric{ | ||
Type: t, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is the GMP suffixing handled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, it looks like GMP uses a special unknown
suffix when it duplicates the metric. Maybe we should change our metric name function to check for this label
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that it has different suffixes for the gauge portion of the unknown metric (unknown
and the counter portion of the unknown metric (unknown:counter
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm going to refactor this, a better approach would be to copy the untyped Gauge pmetric entry into an extra Sum metric, and let the existing path after that handle everything. Then GetMetricName()
can handle both.
I think using an actual Sum from pdata will also make it easier to handle @ridwanmsharif's question #668 (comment). @dashpole wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me, but I'm not convinced the logic should live in the base exporter. If someone used the prometheus receiver with our base exporter, they would get duplicate timeseries errors. We need the GMP logic of adding suffixes to prevent duplicate timeseries problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can at least move the gate to the GMP exporter so the logic is only active there. Not sure if we could move the logic itself. Maybe through the "extra metrics" approach we did for target and scope info metrics actually
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to move everything using the extra metrics approach. Ptal, thanks!
exporter/collector/metrics.go
Outdated
ValueType: valueType, | ||
Points: []*monitoringpb.Point{{ | ||
Interval: &monitoringpb.TimeInterval{ | ||
EndTime: timestamppb.New(point.Timestamp().AsTime()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wanted to confirm this, but not setting the start time will make this normalize the same way GMP does with these counters?
That is, when the value goes down, treat it as a reset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this, but it's still doing basically the same thing in the new code so to answer your question: yes it should. This is the normalization function:
opentelemetry-operations-go/exporter/collector/internal/normalization/standard_normalizer.go
Line 261 in a19d9ea
func (s *standardNormalizer) NormalizeNumberDataPoint(point pmetric.NumberDataPoint, identifier string) (pmetric.NumberDataPoint, bool) { |
You can see this now in the fixtures, where I had to add 2 new untyped input data points just to get one output unknown cumulative. This is because the first data point gets normalized as a reset point and dropped by our exporter.
As a side note, I had to update our fixture generators to test this because they currently normalize all timestamps to be rightNow() (in order to work with testing against the actual GCM api). But I needed to bypass that for this test to get the cumulative normalization to show up.
|
}, | ||
{ | ||
"metric": { | ||
"type": "prometheus.googleapis.com/fake_untyped_metric_total/unknown:counter", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here is the expected counter... of note:
_total
is appended (I think this comes from the upstream prom normalization that's currently on be default)- 2 data points for
fake_untyped_metric
as a gauge, because there are 2 input points (first Sum is dropped for normalization)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comes from the upstream prom normalization that's currently on be default
Should we update our collector dependency? I don't think that is supposed to be the case anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we can leave it how it is, and update later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a test bump on this branch and re-ran the fixture and the _total
/units suffixes are dropped in the new version. I'll open a separate bump PR after I merge this
"type": "workload.googleapis.com/fake_untyped_metric", | ||
"labels": { | ||
"ex_com_lemons_untyped": "13", | ||
"prometheus_untyped_metric": "true", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the fixture for testing the featuregate. this label would be dropped in the gmp exporter if the gate was enabled
Final update now that all tests are green: using 2 gauge entries to get the normalized cumulative didn't work with the actual GCM integration test, because the 2nd gauge caused a duplicate timeseries error (see failure here). Added 7830343 to disable cumulative normalization, use 1 gauge entry, and remove the "skip timestamp update" option I added earlier. However the lack of startTimestamp on the input gauge caused it to be initialized to the 0 time by this line, which caused an invalid timeseries error from GCM (see that failure here). To fix that I pushed 505a1cf to add a startTimestamp to the input gauge and more importantly copy that startTimestamp to the new point. I think this is valid for use cases where untyped metrics might want to manually register a reset point. Now that this is green, I'm merging. Will open another issue to check on our other cumulative tests to make sure we're not dropping points in those. |
This adds functionality to automatically parse the attribute set by the Ops Agent's prometheus receiver to indicate untyped prometheus metrics (GoogleCloudPlatform/opentelemetry-operations-collector#170). When this attribute is found in a Gauge metric, the time series will be double-exported as a Gauge and a Cumulative. The special attribute will be dropped from metric labels.
This behavior is
on by defaultbehind a featuregategcp.untyped_double_export
and is only meant for internal use until an upstream approach to untyped metrics is implemented.