-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Submit histogram aggregates globally when "veneurglobalonly" is set #390
Submit histogram aggregates globally when "veneurglobalonly" is set #390
Conversation
I added benchmarks for the serialization and deserialization of Histo's, so that we can measure the performance impact of serializing JSON rather than just the digest.
The other types don't have any exported variables relating to their value (Guage, Counter), which I think is better encapsulated. I just changed all of the aggregates and the tDigest to be private rather than public.
Histo's now serialize as a JSON value including it's other data, rather than just the tdigest. This definitely comes with a performance penalty, and I'm trying to measure now exactly what's causing the slowdown.
After reseraching a little more, it seems like gob is the best option for serializing the Histo value over the wire. The data is specifically just being passed to another Go program, and its effectively just an RPC within the project. Not to mention the tdigest type already has Gob encoding built-in, so there are really minimal changes.
The worker now includes both `globalHistograms` and `globalTimers` maps, in addition to the others. These maps are read from the global Veneur's only while flushing. I also made appropriate changes to the `generateInterMetrics` function so that global Veneur's wont emit aggregate metrics for normal histograms. TODO: Correctly forward global histograms and timers.
Global histograms and timers are now correctly forwarded by local instances. I also added some type aliases for the maps of metric keys to metric types, so that I could genericise forwarding both normal and global histograms. Some integrations tests are still failing, which I need to look at and fix.
This adds some basic tests that the new implementation of `combine` works as expected. It also tests an edge case, where one of the histograms doesn't have any data.
Since the change to the format of the Histogram value introduces a breaking change, it won't be possible to roll this out without downtime. To work around this, I just added back the logic to parse the value as a tdigest if it couldn't parse the new format. I added a test for this behavior, and an accompanying benchmark. On my machine, there doesn't seem to be any significant slowdown if local Veneur's are still sending the old wire format.
This fixes a test case that forwards through to a test server, and then checks that the output metric is correct. I mostly just had to fix the test case to now expect a `samplers.HistoValue` rather than just the encoded tdigest.
This adds a couple test cases for the following behaviors, and fixes the Worker such that they pass: * Test that when a local, mixed, and global histogram are sent to a local Veneur as UDPMetrics, the correct metrics are flushed and forwarded. * Test that when a local, mixed, and global histogram are sent to a global Veneur over HTTP (imported), the correct metrics are flushed.
At the time of the initial deployment the Veneur proxies and local collectors won't be sending the `Scope` parameter with JSONMetric. This change brings back the logic that handles global Counters and Gauges.
This modifies the previously added test cases to also cover timers, and fixes bugs with them to bring them to feature parity with histograms.
While testing some other changes I realized that these benchmarks could easily fail and return incorrect results. To minimize CPU time, I used just `if` statements to do this rather than the `assert` package, which I figure probably does some slow reflection.
After looking back at the type aliases for the various metric map types, I'm really not sure that they make the code any clearer to read. I think I'm just going to back them back out for now, since the alternative is still pretty readable.
The handling of global counters and gauges needs to be backwards compatible with older local Veneur's to ensure a zero-downtime upgrade. This just adds a clarifying comment so it's clear when that branch can be removed.
0e112a4
to
478be40
Compare
Gerald Rule: Copy Observability on Veneur and Unilog pull requestscc @stripe/observability |
I just rebased this onto master to fix conflicts. |
We recently tested this branch specifically, and found that attempting to do gob-decoding multiple times caused serious performance issues. Our Veneur setup is currently processing about 1.1 million metrics per minute (across all hosts) on 6 c4.2xlarge's, and after canary'ing this branch we saw that certain worker threads would slow down significantly. From looking at the profiler, it appeared that they were getting stuck decoding histogram gob-encoded values, as with this change it first needs to try to decode the new format and then fall back to the older one. The stuck worker threads would then cause We merged in my changes to serialize the histogram values in Unfortunately I think the protobuf serialization might be pretty important to include as well with this change, if that makes sense to you guys 😄 I'm happy to merge it into this PR if you'd like, or I can keep it separate since this one is already pretty large. |
Hey @noahgoldman! Indeed, you've found a snafu that I want to correct. The gob encoding is lame and we should remove it. Specifically, i'd love to add a new grpc endpoint to replace the rather kludgy `/import'… (We've discussed what you've got here internally, but still haven't reviewed closely.) What do you think of adding a new grpc import endpoint and some way to run them side by side for a version? That seem reasonable in your world? |
That sounds good to me! It had occurred to me before that grpc would be a great fit for Really appreciate you guys taking a look at this! |
No problem! Performance aside, there've been some other changes we'd like to make that this would unlock. Would you like to take a stab at a new endpoint? |
Sure, I'm happy to give it a shot! Thanks again 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks pretty reasonable to me @noahgoldman. Are you still interested in landing this independent of #407? All it needs is a rebase. (The test failures appear to be a flake which was fixed in #421.)
@@ -355,9 +355,6 @@ func (td *MergingDigest) Merge(other *MergingDigest) { | |||
} | |||
} | |||
|
|||
var _ gob.GobEncoder = &MergingDigest{} | |||
var _ gob.GobDecoder = &MergingDigest{} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason these lines are being removed? this is a compile-time assertion that MergingDigest satisfies the GobEncoder/GobDecoder interfaces, which is still relevant since gob is still being used (at least, as of this PR)
Value *tdigest.MergingDigest | ||
Name string | ||
Tags []string | ||
tDigest *tdigest.MergingDigest | ||
// these values are computed from only the samples that came through this | ||
// veneur instance, ignoring any histograms merged from elsewhere | ||
// we separate them because they're easy to aggregate on the backend without | ||
// loss of granularity, and having host-local information on them might be | ||
// useful |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like you renamed the variables to reflect the fact that they're not always local-only, comment should probably be updated as well?
Hey @sjung-stripe! I think it makes more sense to wait on this until #407 is merged, as we had some pretty big performance issues trying to deploy this change previously (versioning the gob-encoded I do have a set of commits that convert this PR to only support the new gRPC format. I can pull those in, but unfortunately it'll probably make the diff look insane until #407 makes it into master 😦 Let me know what you think! |
ah you mean the double Decode in Histo.Combine? Fair enough, I think we can wait. I can think of some gross ways to hack around the double decode (eg putting a magic byte at the start of the buffer) but there's no point going to a lot of effort if grpc forwarding will solve the problem. |
Yeah, exactly. If you guys are okay with waiting, I think implementing this with gRPC would be the safest bet. The magic byte would've been a pretty funny way to go about this though 😄 |
we wound up implementing this ourselves #569 but thanks again for the grpc contributions @noahgoldman |
Summary
This implements #155, making histogram aggregates submit from a global Veneur rather than flushing locally to Datadog when
veneurglobalonly
is set. There are a couple different important changes that I made here to get this to work. I tried to steer clear of introducing performance regressions, and in some cases tried to reduce some code duplication where possible as well.globalHistograms
andglobalTimers
inWorker
The
Worker
now includesglobalHistograms
andglobalTimers
maps. The main purpose of these is to denote to the Global and Local Veneur's if or if not they should be flushing aggregates. For both of these maps, the (indended 😄) behavior is as follows:Additionally, these code changes modify the behavior of the existing
histograms
andtimers
maps as well to explicitly disable them flushing aggregates. Currently, Global Veneur's never flush aggregates becauseHistogram
fields such asmin
andmax
are always set to their default values. Now they will be explicitly disabled, with the following behavior.One potential bug now present is that intermediate Veneur's (where a histogram is forwarded to, but also forwards again) will now output aggregates (rolled up from multiple imports). I'm not sure if this is actually bad behavior, and I'm interested in your thoughts there.
Adding a
Scope
field toJSONMetric
I added the
Scope
field to make it possible for a Global Veneur to know if a histogram was originally inteded to beMixedScope
orGlobalOnly
. This previously wasn't necessary for Counters and Gauges, as the fact that they were being forwarded indicated that they should be global. I added new logic to initialize this field to the proper value as necessary.New serialization format for Histogram values
Currently, the
Value
field of aJSONMetric
for histograms is just a gob-encodedMergingDigest
. In order to perform the aggregations, I created a newHistogramValue
struct that will be gob-encoded as well, and contains aMergingDigest
. I liked this option the best (compared to a couple others, including directly serializing theHisto
) as it clearly defines the content of the(JSONMetric).Value
. I'm definitely open to suggestions on other options for this.Profiling Histogram serilization
I was a little worried about the performance of doing even more Gob-decoding, as profiling our global Veneur's shows that gob-decoding is more or less the hottest area of the code (we don't use SSF). I added some benchmarks for the
(Histo).Combine
and(Histo).Export
functions, and ran them both with the new and old serialization formats. See the results in this Gist, but it's pretty clear that the new format is slower.I've also been playing around with converting the value to be protobuf-encoded, and the results look pretty good performance-wise. Serialization is more or less an order of magnitude faster, and deserialization is much faster for Histograms with less than 100 samples (and slightly faster for larger histograms as well). I decided to leave these changes out of this PR to make this one a little easier to review, but I figured that since the Histogram value is undergoing a backwards-incompatible change already it might make sense to at least prevent performance regressions! These changes are in my histo-value-protobuf branch, which I'm happy to pull in as well if there's interest!
Reducing duplication
I made some pretty big changes to the
(Worker).flushForward
function, to get rid of a ton of repetitive code handling all of the different maps of metrics. The diff looks pretty ugly, but I think the result is a whole lot more readable.Motivation
This change will allow clients that know they don't need host-local histogram aggregations to reduce the number of custom metrics in Datadog (the metric cardinality) by setting the histograms as global. I also personally believe that this behavior is more intuitive considering the behavior of the "veneurglobalonly" flag with other metric types.
Test plan
Rollout/monitoring/revert plan
An ordered rollout will be required for these changes. It should be as follows:
(JSONMetric).Scope
parameter and decode the new encoding for theValue
field for histograms. I tried to be careful about retaining backwards compatibility with older local Veneurs once the new version is deployed.I'm not sure if there's a good rollback plan unfortunately, local Veneur's with these changes will be incompatible with older Global instances.