Implement aggregated datacenter metrics #153
Conversation
@leklund Would you be able to do a preliminary review, please?

@mikelorant Sorry this has been sitting for so long. I'll take a look at this early next week.
@mikelorant -- apologies for letting this sit so long. Thank you for the submission! Just FYI I'll be out of the office for the next 10 days but I will check back in when I return.
I've left a few comments with some minor suggestions, including a naming change. I'm not convinced that `-without-datacenter-dimension` is the best name for the flag. Since it's removing the label, maybe it could be `-without-datacenter-label`? I'm open to suggestions!

This will also need to be added to `domain/(metrics|process).go` and `origin/(metrics|process).go` as well so that the flag affects all the subsystems. Once the implementation is complete, this can be added to origins/domains.
@leklund Thanks for all the feedback! All your points make sense and I am happy to make all the changes you requested. I expect to have your requested changes implemented by the time you get back.

Breaking this pull request up into smaller pieces.
@mikelorant Thanks for the new PR! I spent some time looking at this today and had some thoughts/ideas. Left a summary at the top because it got long.

**Summary**

I had a realization that instead of adding complexity to remove the datacenter label, you could instead just use a constant for the datacenter, and it will have the same desired effect of reducing the /metrics response size. It could also skip the datacenter iteration completely and just use the

**Details**

I was curious about performance with all the extra map allocations. I wrote a simple benchmark for Process():

I haven't tested out the real-world implications, but it could be noticeable if you happen to use the exporter with a large number of services. I did a quick pass to use slices instead of the map and it's better:
That last benchmark used this code:
I also tried a version where I used the
It looked like this:
While exploring ways to optimize this I had a realization. This code is still iterating over the per-datacenter metrics and incrementing counters. Instead I think it could just use the
or using
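The benchmark code referenced above was not preserved in this thread. As a rough illustration of the kind of comparison being described, here is a minimal sketch of map-based versus slice-based accumulation with Go benchmarks; the `datacenterStats` type, field names, and sizes are illustrative assumptions, not the exporter's actual `Process()` code:

```go
package main

import (
	"fmt"
	"testing"
)

// datacenterStats is an illustrative stand-in for the per-datacenter
// payload the exporter iterates over; the real types differ.
type datacenterStats struct {
	Name     string
	Requests uint64
}

// sumWithMap accumulates totals into a map keyed by datacenter name,
// paying for hashing and map growth on every iteration.
func sumWithMap(dcs []datacenterStats) map[string]uint64 {
	totals := make(map[string]uint64, len(dcs))
	for _, dc := range dcs {
		totals[dc.Name] += dc.Requests
	}
	return totals
}

// sumWithSlice accumulates into a pre-sized slice instead, avoiding
// map overhead when positions are known up front.
func sumWithSlice(dcs []datacenterStats) []uint64 {
	totals := make([]uint64, len(dcs))
	for i, dc := range dcs {
		totals[i] = dc.Requests
	}
	return totals
}

func makeInput(n int) []datacenterStats {
	dcs := make([]datacenterStats, n)
	for i := range dcs {
		dcs[i] = datacenterStats{Name: "dc", Requests: 1}
	}
	return dcs
}

func BenchmarkSumWithMap(b *testing.B) {
	dcs := makeInput(100)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sumWithMap(dcs)
	}
}

func BenchmarkSumWithSlice(b *testing.B) {
	dcs := makeInput(100)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sumWithSlice(dcs)
	}
}

func main() {
	// Quick sanity check that both strategies see the same data.
	dcs := makeInput(4)
	fmt.Println(sumWithMap(dcs)["dc"]) // 4
}
```

Run with `go test -bench=.` to compare allocations and ns/op between the two strategies.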
Since my last comment was very long and rambling, I thought it might help to put my idea into code: https://github.com/fastly/fastly-exporter/compare/agg-only?expand=1
On first pass, I totally agree with you on the new approach. My initial implementation was about just setting the datacenter to

How much refinement do we need to your branch to turn it into something functional? I assume we need to make it work for domain and origin metrics?

I really am hoping to convince you, based on the metrics, that defaulting to per-datacenter is a bad idea. The cardinality explosion is so major that if you have hundreds of active services you are going to swamp yourself in metrics. Add this to the cost of using the domain and origin inspector and you are multiplying out yet again.

This is the type of breaking change that I feel would be beneficial to all users. Make them opt in to datacenter metrics if they need that detail. I doubt most realise how costly it is, and it likely doesn't provide tangible benefits.
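To make the cardinality argument concrete, here is a back-of-the-envelope calculation; the counts (services, POPs, metrics per service) are illustrative assumptions, not measured figures from the exporter:

```go
package main

import "fmt"

func main() {
	// Illustrative numbers only: assume a fleet with hundreds of
	// services, on the order of 100 datacenters (POPs), and dozens
	// of distinct metrics exported per service.
	services := 200
	datacenters := 100
	metricsPerService := 50

	// With a datacenter label, each metric fans out into one time
	// series per (service, datacenter) pair.
	perDC := services * datacenters * metricsPerService

	// With a single aggregate label value, the datacenter dimension
	// collapses to 1.
	aggregated := services * 1 * metricsPerService

	fmt.Println(perDC)      // 1000000
	fmt.Println(aggregated) // 10000
}
```

Under these assumptions the datacenter label multiplies the series count a hundredfold, which is the "multiplying out" effect described above.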
While it does seem a little off to use "aggregated" as the datacenter, I do like the simplicity of that implementation. I also like that it doesn't break existing dashboards that might rely on the datacenter label. I agree that the ideal case would probably be to remove the label completely, but I try to be pragmatic about these things.

We need to update the README/documentation, add support for domains/origins, and add any test cases I missed.

It's definitely a high-cardinality set of metrics, and I'm sure there are other users like yourself that would benefit from the smaller set of aggregate-only metrics. But I also know users that rely on the per-datacenter information, so I am reluctant to make this breaking change. Even if this is not the default, I think we can add a prominent note in the README on using the new flag to reduce cardinality.
Sounds like this is the path forward then. Happy to help with copying the implementation and adding the same code to both origin and domain.
Thank you, that would be great! Feel free to copy/modify what I quickly wrote and add it to origins/domains.
Have created the initial pull requests to prepare for the feature: I was then able to take your branch and extend it to add support for the domain and origin systems. I did some minor refactoring to reduce indentation and remove the global. You can see the results on my feature/aggregate-only branch. Be aware, this branch by itself does not pass unit tests until we merge #168.

My plan, once both #168 and #169 are merged, is to update this existing pull request with the code from that branch and NOT open a new pull request.
This is now the new implementation as we discussed. Note that there is no documentation yet; if you are happy with this approach, I will add it.
Outstanding tasks:
Preliminary documentation (a modified version of what we previously had) was added. I think it would be wise to educate users about the benefit of enabling this feature and how much of an impact it can have on reducing the number of time series emitted.
pkg/rt/subscriber.go

```diff
@@ -71,6 +72,13 @@ func WithPostprocess(f func()) SubscriberOption {
 	return func(s *Subscriber) { s.postprocess = f }
 }

+// WithAggregateOnly sets whether aggregate metrics are emitted instead of
```
@leklund Is this wording acceptable? Feel free to provide an update if you feel this can be explained better (I am not a wordsmith 😄).
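For context, the new option follows the same functional-option pattern as the `WithPostprocess` option visible in the diff above. A simplified sketch of the general shape (the `Subscriber` struct here is a stand-in with only the fields needed for illustration, not the exporter's real type):

```go
package main

import "fmt"

// Subscriber is a simplified stand-in for the exporter's rt.Subscriber.
type Subscriber struct {
	aggregateOnly bool
	postprocess   func()
}

// SubscriberOption mutates a Subscriber during construction.
type SubscriberOption func(*Subscriber)

// WithAggregateOnly sets whether aggregate metrics are emitted instead
// of per-datacenter metrics.
func WithAggregateOnly(aggregateOnly bool) SubscriberOption {
	return func(s *Subscriber) { s.aggregateOnly = aggregateOnly }
}

// NewSubscriber applies each option in order to a zero-value Subscriber.
func NewSubscriber(opts ...SubscriberOption) *Subscriber {
	s := &Subscriber{}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

func main() {
	s := NewSubscriber(WithAggregateOnly(true))
	fmt.Println(s.aggregateOnly) // true
}
```

The pattern keeps the constructor signature stable as flags are added, which is why a new boolean can be threaded through without touching existing call sites.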
README.md

```diff
@@ -119,6 +119,11 @@ Different flags (for the same filter target) combine with AND semantics. For
 example, `-metric-allowlist 'bytes_total$' -metric-blocklist imgopto` would only
 export metrics whose names ended in bytes_total, but didn't include imgopto.

+### Aggregate Datacenter
```
@leklund Recommendations for the type of content we want mentioned here?
Updated to your revision.
A couple of suggested wording changes to the docs, but this looks good.
Add flag to use aggregated datacenter metrics instead of per-datacenter.

Closes: #152
Co-authored-by: Lukas Eklund <[email protected]>
Helped-by: Matt Hope <[email protected]>
Signed-off-by: Michael Lorant <[email protected]>
Wording has been changed to your recommendations. It's been a long road to get this pull request to where we want it, but it seems we finally have something worthy of merging!
@mikelorant Thanks for your patience and all the work you did in getting this merged! I'll cut a new release ASAP.
Thanks so much for your efforts in getting this implemented. It's been a great learning process, and the solution implemented will really benefit users with a large number of services.