Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Otlp Delta Aggregation with support for max and Histogram. #3749

Merged
merged 9 commits into from
Apr 28, 2023

Conversation

lenin-jaganathan
Copy link
Contributor

@lenin-jaganathan lenin-jaganathan commented Apr 10, 2023

This PR is a follow-up PR for #3625 where the capability to have Delta Aggregation Temporality was introduced. This aims to change some of the behaviors of the OTLP Delta Registry and try to stick the meters to the standards mentioned here(https://opentelemetry.io/docs/reference/specification/metrics/data-model/#metric-points)

Changes introduced in this PR

Core:

  • Abstract Timer can accept a custom histogram implementation via the constructor. (This can be applied to other registry implementations that have custom Histogram implementation and avoid duplicate code. But I have not included that as I want this to focus only on OTLP.)
  • Move basic DistributionStatisticConfig validations to the Builder. (Only basic validations such as non-negatives, and percentile ranges are validated.)
  • Extract FixedBoundaryHistogram to a package-private class under distribution. (this was an inner class in TimeWindowFixedBoundaryHistogram earlier.)
  • Introduce StepMax which reports the max for the step.
  • Introduce StepBucketHistogram which records the bucket values for the step and reset on the next step boundary.

OTLP:

  • OtlpStepTimer and OtlpStepDistributionSummary now extend AbstractTimer. So, that it can have custom max implementation.
  • Apply MeterRegistryCompatibilityKit to both DELTA and CUMULATIVE variants of OtlpMeterRegistry.
  • Abstract some of the common test cases of OtlpMeter Registry into a base test class.

Known Issues

MeterRegistryCompatibilityKit tests have some known failures that I want to solve in this PR discussion,

  • There was a deprecated test in MeterRegistryCompatibilityKit that validates histogram counts. Since OTLP uses Step Histogram it will return 0 for the uncompleted step which fails.

TODO

  • Add unit tests for StepMax, StepBucketHistogram, and any other classes that lack unit testing
  • Probably add a "Recordable" (or similar) interface that can be extended by any class that supports record() and poll() operations. The objective behind this is to have a High-level interface that defines an object that supports and recording and polling of data. This can be used to construct NOOP instances where redundant null checks can be avoided, and future support for custom recordings (min is a valid statistic in OTLP).

Notes:

  • Would have loved to extract the refactoring and features as multiple commits for a better reviewing experience but somehow I messed up the git history

Closes gh-3772
Closes gh-3771

@sonatype-lift
Copy link
Contributor

sonatype-lift bot commented Apr 10, 2023

🛠 Lift Auto-fix

Some of the Lift findings in this PR can be automatically fixed. You can download and apply these changes in your local project directory of your branch to review the suggestions before committing.1

# Download the patch
curl https://lift.sonatype.com/api/patch/github.com/micrometer-metrics/micrometer/3749.diff -o lift-autofixes.diff

# Apply the patch with git
git apply lift-autofixes.diff

# Review the changes
git diff

Want it all in a single command? Open a terminal in your project's directory and copy and paste the following command:

curl https://lift.sonatype.com/api/patch/github.com/micrometer-metrics/micrometer/3749.diff | git apply

Once you're satisfied, commit and push your changes in your project.

Footnotes

  1. You can preview the patch by opening the patch URL in the browser.

@lenin-jaganathan lenin-jaganathan force-pushed the improve_otlp_delta branch 2 times, most recently from ba901e5 to 3ede90f Compare April 11, 2023 04:48
@lenin-jaganathan
Copy link
Contributor Author

@shakuzen / @jonatan-ivanov Can you guys have a look at this? Really appreciate any feedback on this. It would be good to have these get in time for 1.11.0

@shakuzen
Copy link
Member

  • CUMULATIVE aggregation does not support max but the earlier version still paid the cost of recording max. This PR doesn't record max so it will return 0.0 if at all queried that makes the test fail

It feels a bit wider scope for this pull request than necessary. See some previous discussion in #3144. I think we should not change that behavior here. It's generally part of the contract of a timer/summary that max in some form be supported. That we don't export it currently is somewhat of a tangential issue. I would probably opt for publishing our TimeWindowMax as a separate gauge in the case of cumulative temporality. A cumulative max as specified by OTLP is not generally useful, as far as I can tell. But let's tackle any such change separately so we don't block other things and get distracted.

@shakuzen
Copy link
Member

  • There was a deprecated test in MeterRegistryCompatibilityKit that validates histogram counts. Since OTLP uses Step Histogram it will return 0 for the uncompleted step which fails.

I'm not immediately sure what we should do about the tests, but the @Deprecated annotation was, I believe, a bad alternative to suppressing deprecation warnings for using deprecated APIs in the tests, rather than an indication the tests themselves were deprecated. We should probably rewrite the tests avoiding deprecated API if we can, but there's going to be a timing issue to get the test to pass with OTLP delta and our other Step registries.

Copy link
Member

@shakuzen shakuzen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pull request, as always. I'm leaving some initial thoughts. I'll do a more thorough review tomorrow.

@shakuzen
Copy link
Member

StepMaxTest, OtlpCumulativeMeterRegistryCompatibilityTest, and OtlpDeltaMeterRegistryCompatibilityTest have test failures now. I guess we still need to update the TCK code for the last one, but the first two should be passing, right?

@lenin-jaganathan
Copy link
Contributor Author

I fixed StepMaxTest. OtlpCumulativeMeterRegistryCompatibilityTest fails for unavailability of max on the meter when it is cumulative which I am going to fix. But before that I wanted to see what approach we take for having CumulativeTIme and DeltaTimer vs a single Abstract timer behaving based on aggregation temporality

@shakuzen
Copy link
Member

I updated the TCK code with a bit of a hack so it works with both time window and step histograms.


@Override
protected CountAtBucket[] noValue() {
if (buckets == null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buckets might be empty but it is never null. I wonder if it would be worth it to store an instance field with the zero'd CountAtBucket array versus making it each time noValue() is called. It depends how often noValue() will be called in practice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In an ideal world, I don't expect "noValue" to be called during the app's lifecycle except during the start-up time. That's the reason I decided against adding an additional long-lived (actually an idle) object in there. This might quickly get concerning when there are 1000's timers with ~50 buckets.

Another thing I considered for noValue is to return an empty histogram which is already a static variable but that might not be good since the bucket information gets dropped in that case.

Copy link
Contributor Author

@lenin-jaganathan lenin-jaganathan Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buckets might be empty but it is never null

That's true but except for the fact that the StepValue calls noValue() during object creation by which point buckets is not yet initialized.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out. This isn't ideal. I left // TODO comments in the tests since we can't check the buckets that we expect to be there until a step has passed. We can try to figure this out post merge if it is worth fixing and we can come up with a solution.

Copy link
Member

@shakuzen shakuzen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some tests for the StepHistogram. There are some things I will probably polish post-merge, but I think this is functionally in a good state. Thank you for all of the work on this.

@shakuzen
Copy link
Member

Regarding this comment and the unresolved part of the comment thread, I've left it as something to consider outside of this pull request so we can get this merged and at least make progress to a better state overall.

@shakuzen shakuzen merged commit aaa6fc2 into micrometer-metrics:main Apr 28, 2023
@lenin-jaganathan
Copy link
Contributor Author

lenin-jaganathan commented Apr 28, 2023

@shakuzen Also, it is important to make the OtlpStepTimer to rotate count, total, max and histogram on reading any of these, which will be fairly less costly as we will do this only on rotation and repeated call in the same step has almost nil effect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants