
Submit histogram aggregates globally when "veneurglobalonly" is set #390

Conversation

noahgoldman
Contributor

Summary

This implements #155, making histogram aggregates submit from a global Veneur rather than flushing locally to Datadog when veneurglobalonly is set. I made a couple of important changes here to get this to work. I tried to steer clear of introducing performance regressions, and in some cases reduced code duplication where possible as well.

globalHistograms and globalTimers in Worker

The Worker now includes globalHistograms and globalTimers maps. Their main purpose is to tell global and local Veneurs whether or not they should flush aggregates. For both of these maps, the (intended 😄) behavior is as follows:

  • Local: Forward and don't flush.
  • Global: Flush all percentiles and aggregates.

Additionally, these changes modify the behavior of the existing histograms and timers maps to explicitly disable flushing aggregates. Currently, global Veneurs never flush aggregates only because Histogram fields such as min and max are always set to their default values. Now this will be explicitly disabled, with the following behavior:

  • Local: Flush aggregates but not percentiles, and forward.
  • Global: Flush percentiles and explicitly disable flushing aggregates.
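The intended scope behavior above can be sketched as a small decision function. Names like `planFor` and `flushPlan` are illustrative stand-ins, not Veneur's actual API:

```go
package main

import "fmt"

// MetricScope mirrors the scopes discussed in this PR; the names here
// are illustrative, not Veneur's actual types.
type MetricScope int

const (
	MixedScope MetricScope = iota
	GlobalOnly
)

// flushPlan records what an instance should do with a histogram/timer:
// emit percentiles, emit aggregates (min/max/count/...), and/or forward
// the digest to a global instance.
type flushPlan struct {
	percentiles bool
	aggregates  bool
	forward     bool
}

// planFor encodes the intended behavior described above.
func planFor(scope MetricScope, globalInstance bool) flushPlan {
	if scope == GlobalOnly {
		if globalInstance {
			// Global: flush all percentiles and aggregates.
			return flushPlan{percentiles: true, aggregates: true}
		}
		// Local: forward and don't flush.
		return flushPlan{forward: true}
	}
	// MixedScope:
	if globalInstance {
		// Global: flush percentiles; aggregates explicitly disabled.
		return flushPlan{percentiles: true}
	}
	// Local: flush aggregates but not percentiles, and forward.
	return flushPlan{aggregates: true, forward: true}
}

func main() {
	fmt.Println(planFor(GlobalOnly, false)) // {false false true}
	fmt.Println(planFor(MixedScope, true))  // {true false false}
}
```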

One potential bug now present is that intermediate Veneurs (which receive forwarded histograms but forward them on again) will now output aggregates (rolled up from multiple imports). I'm not sure if this is actually bad behavior, and I'm interested in your thoughts there.

Adding a Scope field to JSONMetric

I added the Scope field to make it possible for a global Veneur to know whether a histogram was originally intended to be MixedScope or GlobalOnly. This previously wasn't necessary for Counters and Gauges, as the fact that they were being forwarded at all indicated that they should be global. I added new logic to initialize this field to the proper value as necessary.

New serialization format for Histogram values

Currently, the Value field of a JSONMetric for histograms is just a gob-encoded MergingDigest. In order to perform the aggregations, I created a new HistogramValue struct that will be gob-encoded as well, and contains a MergingDigest. I liked this option the best (compared to a couple others, including directly serializing the Histo) as it clearly defines the content of the (JSONMetric).Value. I'm definitely open to suggestions on other options for this.
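As a sketch of the idea, with the digest stubbed out as raw bytes so the example stays self-contained (the real struct contains a tdigest.MergingDigest, and the field names here are assumptions):

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// HistogramValue sketches the new wrapper: the digest plus the aggregate
// fields that need to survive forwarding.
type HistogramValue struct {
	TDigest []byte // gob-encoded MergingDigest in the real code
	Min     float64
	Max     float64
	Sum     float64
	Count   int64
}

// encode gob-encodes the wrapper, as would go into (JSONMetric).Value.
func encode(v HistogramValue) ([]byte, error) {
	var buf bytes.Buffer
	err := gob.NewEncoder(&buf).Encode(v)
	return buf.Bytes(), err
}

// decode recovers the wrapper on the global side.
func decode(b []byte) (HistogramValue, error) {
	var v HistogramValue
	err := gob.NewDecoder(bytes.NewReader(b)).Decode(&v)
	return v, err
}

func main() {
	in := HistogramValue{Min: 0.5, Max: 9.5, Sum: 20, Count: 4}
	b, _ := encode(in)
	out, _ := decode(b)
	fmt.Println(out.Min, out.Max, out.Count) // 0.5 9.5 4
}
```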

Profiling Histogram serialization

I was a little worried about the performance of doing even more gob-decoding, as profiling our global Veneurs shows that gob-decoding is more or less the hottest area of the code (we don't use SSF). I added benchmarks for the (Histo).Combine and (Histo).Export functions, and ran them with both the new and old serialization formats. See the results in this Gist; it's pretty clear that the new format is slower.

I've also been playing around with converting the value to be protobuf-encoded, and the results look pretty good performance-wise. Serialization is more or less an order of magnitude faster, and deserialization is much faster for Histograms with fewer than 100 samples (and slightly faster for larger histograms as well). I decided to leave these changes out of this PR to make it a little easier to review, but I figured that since the Histogram value is undergoing a backwards-incompatible change already, it might make sense to at least prevent performance regressions! These changes are in my histo-value-protobuf branch, which I'm happy to pull in as well if there's interest!

Reducing duplication

I made some pretty big changes to the (Worker).flushForward function, to get rid of a ton of repetitive code handling all of the different maps of metrics. The diff looks pretty ugly, but I think the result is a whole lot more readable.

Motivation

This change will allow clients that know they don't need host-local histogram aggregations to reduce the number of custom metrics in Datadog (the metric cardinality) by setting the histograms as global. I also personally believe that this behavior is more intuitive considering the behavior of the "veneurglobalonly" flag with other metric types.

Test plan

  • I wrote a pair of large integration-style test cases that check the forwarding and flushing behavior of histograms with different scope, for both local and global Veneurs. Existing tests seem to already cover a lot of the code paths I touched.
  • I ran some small-scale tests on a running set of global, proxy, and local Veneurs, and plan to do more early next week.

Rollout/monitoring/revert plan

An ordered rollout will be required for these changes. It should be as follows:

  1. Proxy and Global Veneurs
    • They need to be able to accept the (JSONMetric).Scope parameter and decode the new encoding for the Value field for histograms. I tried to be careful about retaining backwards compatibility with older local Veneurs once the new version is deployed.
  2. Local Veneurs

Unfortunately I'm not sure there's a good rollback plan: local Veneurs with these changes will be incompatible with older global instances.

I added benchmarks for the serialization and deserialization of Histos,
so that we can measure the performance impact of serializing JSON rather
than just the digest.
The other types (Gauge, Counter) don't have any exported variables
relating to their value, which I think is better encapsulated.  I just
changed all of the aggregates and the tDigest to be private rather than
public.
Histos now serialize as a JSON value including their other data, rather
than just the tdigest.  This definitely comes with a performance
penalty, and I'm trying to measure exactly what's causing the
slowdown.
After researching a little more, it seems like gob is the best option
for serializing the Histo value over the wire.  The data is specifically
just being passed to another Go program, and it's effectively just an
RPC within the project.  Not to mention the tdigest type already has gob
encoding built in, so the changes are really minimal.
The worker now includes both `globalHistograms` and `globalTimers` maps,
in addition to the others.  These maps are read only by global Veneurs
while flushing.  I also made appropriate changes to the
`generateInterMetrics` function so that global Veneurs won't emit
aggregate metrics for normal histograms.

TODO: Correctly forward global histograms and timers.
Global histograms and timers are now correctly forwarded by local
instances.  I also added some type aliases for the maps of metric keys
to metric types, so that I could genericise forwarding both normal and
global histograms.

Some integrations tests are still failing, which I need to look at and
fix.
This adds some basic tests that the new implementation of `combine`
works as expected.  It also tests an edge case, where one of the
histograms doesn't have any data.
Since the change to the format of the Histogram value introduces a
breaking change, it won't be possible to roll this out without downtime.
To work around this, I just added back the logic to parse the value as a
tdigest if it couldn't parse the new format.

I added a test for this behavior, and an accompanying benchmark.  On my
machine, there doesn't seem to be any significant slowdown if local
Veneurs are still sending the old wire format.
This fixes a test case that forwards through to a test server, and then
checks that the output metric is correct. I mostly just had to fix the
test case to now expect a `samplers.HistoValue` rather than just the
encoded tdigest.
This adds a couple test cases for the following behaviors, and fixes the
Worker such that they pass:
* Test that when a local, mixed, and global histogram are sent to a
local Veneur as UDPMetrics, the correct metrics are flushed and
forwarded.
* Test that when a local, mixed, and global histogram are sent to a
global Veneur over HTTP (imported), the correct metrics are flushed.
At the time of the initial deployment the Veneur proxies and local
collectors won't be sending the `Scope` parameter with JSONMetric.  This
change brings back the logic that handles global Counters and Gauges.
This modifies the previously added test cases to also cover timers, and
fixes bugs with them to bring them to feature parity with histograms.
While testing some other changes I realized that these benchmarks could
easily fail and return incorrect results.  To minimize CPU time, I used
plain `if` statements to validate rather than the `assert` package,
which I figure probably does some slow reflection.
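The pattern being described, sketched with a toy workload (the real benchmarks exercise (Histo).Combine and (Histo).Export): validate inside the loop with cheap if checks and fail fast via b.Fatal, since a reflection-based assert comparison would add per-iteration cost and distort the measured time.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

// BenchmarkParse shows the validation pattern: plain if checks inside
// the loop instead of a reflection-heavy assert package. The workload
// here is a toy stand-in for histogram decoding.
func BenchmarkParse(b *testing.B) {
	for i := 0; i < b.N; i++ {
		v, err := strconv.ParseFloat("3.14", 64)
		if err != nil {
			b.Fatal(err)
		}
		if v != 3.14 {
			b.Fatalf("unexpected value %v", v)
		}
	}
}

func main() {
	// testing.Benchmark lets this run outside a _test.go file.
	r := testing.Benchmark(BenchmarkParse)
	fmt.Println(r.N > 0) // true
}
```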
After looking back at the type aliases for the various metric map types,
I'm really not sure that they make the code any clearer to read.  I
think I'm just going to back them out for now, since the
alternative is still pretty readable.
The handling of global counters and gauges needs to be backwards
compatible with older local Veneurs to ensure a zero-downtime upgrade.
This just adds a clarifying comment so it's clear when that branch can
be removed.
@noahgoldman force-pushed the feature/global-histogram-aggregations branch from 0e112a4 to 478be40 on February 22, 2018 at 19:29
@stripe-ci

Gerald Rule: Copy Observability on Veneur and Unilog pull requests

cc @stripe/observability
cc @stripe/observability-stripe

@noahgoldman
Contributor Author

I just rebased this onto master to fix conflicts.

@noahgoldman
Contributor Author

noahgoldman commented Feb 26, 2018

We recently tested this branch specifically, and found that attempting gob-decoding multiple times caused serious performance issues. Our Veneur setup currently processes about 1.1 million metrics per minute (across all hosts) on 6 c4.2xlarge instances, and after canarying this branch we saw that certain worker threads would slow down significantly. From the profiler, it appeared that they were getting stuck decoding gob-encoded histogram values, since with this change a worker first needs to try to decode the new format and then fall back to the older one. The stuck worker threads would then cause /import requests to block for long periods of time, eventually resulting in the process building up memory until it was killed.

We merged in my changes to serialize the histogram values in JSONMetric as protobufs (the histo-value-protobuf branch), and the issues went away. This makes sense given the benchmarks I performed previously.

Unfortunately I think the protobuf serialization might be pretty important to include as well with this change, if that makes sense to you guys 😄 I'm happy to merge it into this PR if you'd like, or I can keep it separate since this one is already pretty large.

@cory-stripe
Contributor

Hey @noahgoldman! Indeed, you've found a snafu that I want to correct. The gob encoding is lame and we should remove it. Specifically, I'd love to add a new gRPC endpoint to replace the rather kludgy `/import`…

(We've discussed what you've got here internally, but still haven't reviewed closely.)

What do you think of adding a new gRPC import endpoint and some way to run them side by side for a version? Does that seem reasonable in your world?

@noahgoldman
Contributor Author

That sounds good to me! It had occurred to me before that grpc would be a great fit for /import, especially performance-wise from looking at the profiler (we'd love to run Veneur for less 😄). Adding a separate endpoint also seems a lot cleaner than what I'm doing here!

Really appreciate you guys taking a look at this!

@cory-stripe
Contributor

No problem! Performance aside, there've been some other changes we'd like to make that this would unlock. Would you like to take a stab at a new endpoint?

@noahgoldman
Contributor Author

Sure, I'm happy to give it a shot! Thanks again 😄

@noahgoldman mentioned this pull request Mar 8, 2018
@sjung-stripe self-assigned this Mar 26, 2018
@sjung-stripe (Contributor) left a comment

this looks pretty reasonable to me @noahgoldman. Are you still interested in landing this independent of #407? All it needs is a rebase. (The test failures appear to be a flake which was fixed in #421.)

@@ -355,9 +355,6 @@ func (td *MergingDigest) Merge(other *MergingDigest) {
}
}

var _ gob.GobEncoder = &MergingDigest{}
var _ gob.GobDecoder = &MergingDigest{}

Contributor

any reason these lines are being removed? this is a compile-time assertion that MergingDigest satisfies the GobEncoder/GobDecoder interfaces, which is still relevant since gob is still being used (at least, as of this PR)

Value *tdigest.MergingDigest
Name string
Tags []string
tDigest *tdigest.MergingDigest
// these values are computed from only the samples that came through this
// veneur instance, ignoring any histograms merged from elsewhere
// we separate them because they're easy to aggregate on the backend without
// loss of granularity, and having host-local information on them might be
// useful
Contributor

looks like you renamed the variables to reflect the fact that they're not always local-only, comment should probably be updated as well?

@noahgoldman
Contributor Author

Hey @sjung-stripe! I think it makes more sense to wait on this until #407 is merged, as we had some pretty big performance issues trying to deploy this change previously (versioning the gob-encoded Histo was really slow).

I do have a set of commits that convert this PR to only support the new gRPC format. I can pull those in, but unfortunately it'll probably make the diff look insane until #407 makes it into master 😦 Let me know what you think!

@sjung-stripe
Contributor

ah you mean the double Decode in Histo.Combine? Fair enough, I think we can wait. I can think of some gross ways to hack around the double decode (e.g. putting a magic byte at the start of the buffer) but there's no point going to a lot of effort if gRPC forwarding will solve the problem.
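For what it's worth, the magic-byte hack would look something like this sketch (never implemented here; the sentinel value is arbitrary, and an old-format gob payload could in principle start with the same byte, which is part of what makes it gross):

```go
package main

import "fmt"

// histoValueMagic is an arbitrary sentinel prefixed onto new-format
// payloads so the decoder can dispatch on the first byte instead of
// paying a failed gob decode. Illustrative only.
const histoValueMagic byte = 0x9D

// encodeWithMagic tags a new-format payload with the sentinel.
func encodeWithMagic(payload []byte) []byte {
	return append([]byte{histoValueMagic}, payload...)
}

// splitFormat reports whether b is the new format and returns the
// payload with the sentinel stripped; old-format bytes pass through.
func splitFormat(b []byte) (isNew bool, payload []byte) {
	if len(b) > 0 && b[0] == histoValueMagic {
		return true, b[1:]
	}
	return false, b
}

func main() {
	wire := encodeWithMagic([]byte{1, 2, 3})
	isNew, payload := splitFormat(wire)
	fmt.Println(isNew, len(payload)) // true 3
}
```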

@noahgoldman
Contributor Author

Yeah, exactly. If you guys are okay with waiting, I think implementing this with gRPC would be the safest bet.

The magic byte would've been a pretty funny way to go about this though 😄

@sjung-stripe
Contributor

We wound up implementing this ourselves in #569, but thanks again for the gRPC contributions @noahgoldman
