Implement aggregated datacenter metrics #153

Merged: 1 commit into fastly:main from feature/aggregate-datacenter-combined on May 29, 2024

Conversation

@mikelorant (Contributor) commented Nov 11, 2023

Add flag to use aggregated datacenter metrics instead of per-datacenter.

Closes: #152

@mikelorant (Contributor Author)

@leklund Would you be able to do a preliminary review, please?

@leklund leklund self-assigned this Jan 11, 2024
@leklund (Member) commented Jan 11, 2024

@mikelorant Sorry this has been sitting for so long. I'll take a look at this early next week.

@leklund (Member) left a comment

@mikelorant -- apologies for letting this sit so long. Thank you for the submission! Just FYI I'll be out of the office for the next 10 days but I will check back in when I return.

I've left a few comments with some minor suggestions including a naming change. I'm not convinced that -without-datacenter-dimension is the best name for the flag. Since it's removing the label maybe it could be -without-datacenter-label? I'm open to suggestions!

This will also need to be added to domain/(metrics|process).go and origin/(metrics|process).go so that the flag affects all the subsystems. Once the implementation is complete this can be added to origins/domains.

Review suggestions (all since resolved) were left on: README.md, cmd/fastly-exporter/main.go, pkg/rt/subscriber.go, pkg/realtime/process.go, pkg/realtime/metrics.go, and pkg/prom/registry_test.go.
@mikelorant (Contributor Author)

@leklund Thanks for all the feedback!

All your points make sense and I am happy to make the changes you requested. I expect to have them implemented by the time you get back.

@mikelorant (Contributor Author)

Breaking this pull request up into smaller pieces.

@leklund (Member) commented May 20, 2024

@mikelorant Thanks for the new PR! I spent some time looking at this today and had some thoughts/ideas. I've left a summary at the top because it got long.

Summary

I had a realization that instead of adding complexity to remove the datacenter label, you could just use a constant for the datacenter, which has the same desired effect of reducing the /metrics response size. It could also skip the datacenter iteration completely and just use the Aggregate field from the Response struct (that's the last sample code at the very bottom).

Details

I was curious about performance with all the extra map allocations. I wrote a simple benchmark for Process():

main branch:

pkg: github.com/fastly/fastly-exporter/pkg/realtime
BenchmarkProcess-12    	    2840	    419057 ns/op	    1883 B/op	      57 allocs/op

mikelorant:refactor/labels using mergeLabels:

pkg: github.com/fastly/fastly-exporter/pkg/realtime
BenchmarkProcess-12    	    1012	    999219 ns/op	  264528 B/op	    1640 allocs/op

I haven't tested out the real-world implications but it could be noticeable if you happen to use the exporter with a large number of services.

I did a quick pass to use a slice instead of the map and it's better:

pkg: github.com/fastly/fastly-exporter/pkg/realtime
BenchmarkProcess-12    	    2558	    465463 ns/op	   74551 B/op	     815 allocs/op

That last benchmark used this code:

func Process(response *Response, serviceID, serviceName, serviceVersion string, m *Metrics) {
	// serviceID and serviceName stay fixed; the datacenter slot is overwritten
	// on each iteration of the inner loop.
	labels := make([]string, 3)
	labels[0] = serviceID
	labels[1] = serviceName

	for _, d := range response.Data {
		for datacenter, stats := range d.Datacenter {
			labels[2] = datacenter
			process(labels, stats, m)
		}
	}
}

func process(labels []string, stats Datacenter, m *Metrics) {
	...
	m.StatusCodeTotal.WithLabelValues(append(labels, "200")...).Add(float64(stats.Status200))
	...
}
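
(Editorial aside: because labels has len equal to cap, each append(labels, "200") inside process allocates a fresh backing array for the four-element argument slice, which likely accounts for a good share of the allocations still visible in the slice-based benchmark above.)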

I also tried a version where I used the prometheus.Labels map and would add the extra labels and then delete the key after use. It's not very elegant but it was slightly better memory-wise than the slices.

pkg: github.com/fastly/fastly-exporter/pkg/realtime
BenchmarkProcess-12    	    1513	    787019 ns/op	   31202 B/op	     241 allocs/op

It looked like this:

labels["status_group"] = "1xx"
m.ComputeRespStatusTotal.With(labels).Add(float64(stats.ComputeRespStatus1xx))
... etc.
delete(labels, "status_group")
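
For anyone who wants to reproduce numbers like these, a benchmark harness along the following lines is enough. This is an editorial sketch, not the benchmark used for the figures above; it lives in package realtime with the standard testing import, and loadTestResponse and newTestMetrics are hypothetical helpers (one decodes a recorded real-time API payload into a *Response, the other builds a *Metrics registered against a throwaway Prometheus registry):

func BenchmarkProcess(b *testing.B) {
	response := loadTestResponse(b) // hypothetical: fixture decoded from a saved API response
	m := newTestMetrics(b)          // hypothetical: *Metrics backed by a throwaway registry

	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// Signature as on the main branch at the time of this discussion.
		Process(response, "service-id", "service-name", "service-version", m)
	}
}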

While exploring ways to optimize this I had a realization. This code is still iterating over the per-datacenter metrics and incrementing counters. Instead I think it could just use the Aggregated metrics that are returned by the real-time API and only increment once. The datacenter label could still exist just with the value of "aggregated". This means not adding complexity to the metric initialization and we can retain the same Process() func. In fact, the simplest approach would be to just change the datacenter name to "aggregated" and keep the label.

const aggdc = "aggregated"

func Process(response *Response, serviceID, serviceName, serviceVersion string, m *Metrics, aggOnly bool) {
	for _, d := range response.Data {
		for datacenter, stats := range d.Datacenter {
			if aggOnly {
				datacenter = aggdc
			}
			// increment all the metrics using datacenter and stats
		}
	}
}

Or, using Aggregate (this would be my suggestion, as we skip all the per-datacenter iteration):

const aggdc = "aggregated"

func Process(response *Response, serviceID, serviceName, serviceVersion string, m *Metrics, aggOnly bool) {
	for _, d := range response.Data {
		if aggOnly {
			// use the response's aggregated stats once per service
			// (field name assumed here, based on the discussion above)
			process(serviceID, serviceName, aggdc, d.Aggregated, m)
		} else {
			for datacenter, stats := range d.Datacenter {
				process(serviceID, serviceName, datacenter, stats, m)
			}
		}
	}
}

func process(serviceID, serviceName, datacenter string, stats Datacenter, m *Metrics) {
	// increment all the things
	m.ComputeRespStatusTotal.WithLabelValues(serviceID, serviceName, datacenter, "1xx").Add(float64(stats.ComputeRespStatus1xx))
	// etc
}
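
A side note on why the simpler rename-to-"aggregated" variant still produces correct totals (editorial, not from the thread): with prometheus/client_golang, Add calls that share identical label values all land on the same time series, so collapsing every datacenter name to "aggregated" sums the per-datacenter stats into one series per metric. A minimal illustration, assuming the usual github.com/prometheus/client_golang/prometheus import:

c := prometheus.NewCounterVec(
	prometheus.CounterOpts{Name: "status_code_total", Help: "Example counter."},
	[]string{"service", "datacenter", "status_code"},
)
c.WithLabelValues("svc", "aggregated", "200").Add(10) // stats from one POP
c.WithLabelValues("svc", "aggregated", "200").Add(20) // stats from another POP
// The single series {service="svc",datacenter="aggregated",status_code="200"} now
// holds 30, the same total the API's aggregated stats would report.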

@leklund (Member) commented May 20, 2024

Since my last comment was very long and rambling I thought it might help to put my idea into code:

https://github.com/fastly/fastly-exporter/compare/agg-only?expand=1

@mikelorant (Contributor Author) commented May 21, 2024

On first pass, I totally agree with you on the new approach. My initial approach was to just set the datacenter to aggregated, but I expected that to be rejected because it seemed lazy, so I instead decided to do the proper implementation of removing the label.

How much refinement does your branch need to turn it into something functional? I assume we need to make it work for domain and origin metrics?

I really am hoping to convince you, based on the metrics, that defaulting to per-datacenter is a bad idea. The cardinality explosion is so major that if you have hundreds of active services you are going to swamp yourself in metrics. Add this to the cost of using the domain and origin inspectors and you are multiplying out yet again.

This is the type of breaking change that I feel would be beneficial to all users. Make them opt in to datacenter metrics if they need that detail. I doubt most realise how costly it is, and it likely doesn't provide tangible benefits.
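
(Editorial illustration with assumed numbers, not figures from this thread: Fastly runs on the order of a hundred POPs, and the exporter emits several hundred time series per service/datacenter combination. For 100 active services that is roughly 100 services × 100 datacenters × 400 series ≈ 4,000,000 series per scrape with the datacenter dimension, versus roughly 100 × 400 = 40,000 with a single aggregated datacenter value, about a 100× reduction before domain and origin metrics are even considered.)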

@leklund (Member) commented May 21, 2024

> On first pass, I totally agree with you on the new approach. My initial approach was to just set the datacenter to aggregated, but I expected that to be rejected because it seemed lazy, so I instead decided to do the proper implementation of removing the label.

While it does seem a little off to use "aggregated" as the datacenter I do like the simplicity of that implementation. I also like that it doesn't break existing dashboards that might rely on the datacenter label. I agree that the ideal case would probably be to remove the label completely but I try to be pragmatic about these things.

> How much refinement does your branch need to turn it into something functional? I assume we need to make it work for domain and origin metrics?

We need to update the README/documentation, add support for domains/origins, and add any test cases I missed.

> I really am hoping to convince you, based on the metrics, that defaulting to per-datacenter is a bad idea. The cardinality explosion is so major that if you have hundreds of active services you are going to swamp yourself in metrics. Add this to the cost of using the domain and origin inspectors and you are multiplying out yet again.
>
> This is the type of breaking change that I feel would be beneficial to all users. Make them opt in to datacenter metrics if they need that detail. I doubt most realise how costly it is, and it likely doesn't provide tangible benefits.

It's definitely a high-cardinality set of metrics and I'm sure there are other users like yourself who would benefit from the smaller set of aggregate-only metrics. But I also know users who rely on the per-datacenter information, so I am reluctant to make this breaking change.

Even if this is not the default I think we can add a prominent note in the README on using the new flag to reduce cardinality.

@mikelorant (Contributor Author)

> We need to update the README/documentation, add support for domains/origins, and add any test cases I missed.

Sounds like this is the path forward then.

Happy to help with copying the implementation and adding the same code to both origin and domain.

@leklund (Member) commented May 21, 2024

> Happy to help with copying the implementation and adding the same code to both origin and domain.

Thank you, that would be great! Feel free to copy/modify what I quickly wrote and add it to origins/domains.

@mikelorant (Contributor Author)

Have created the initial pull requests to prepare for the feature:

I was then able to take your branch and extend it to add support for the domain and origin systems. I did some minor refactoring to reduce indentation and remove the global.

You can see the results on my feature/aggregate-only branch.

Be aware, this branch by itself does not pass unit tests until we merge #168. My plan is that once both #168 and #169 are merged, I will update this existing pull request with the code from that branch and NOT open a new pull request.

@mikelorant (Contributor Author)

This is now the new implementation as we discussed.

Note that there is no documentation yet; if you are happy with this approach, I will add it.

@mikelorant (Contributor Author) commented May 27, 2024

Outstanding tasks:

  • Add documentation.

@mikelorant (Contributor Author)

Preliminary documentation (a modified version of what we previously had) has been added. I think it would be wise to educate users about the benefit of enabling this feature and how much of an impact it can have on reducing the number of time series emitted.

pkg/rt/subscriber.go (review comment on an outdated diff, since resolved):
@@ -71,6 +72,13 @@ func WithPostprocess(f func()) SubscriberOption {
return func(s *Subscriber) { s.postprocess = f }
}

// WithAggregateOnly sets whether aggregate metrics are emitted instead of

@mikelorant (Contributor Author) left a comment

@leklund Is this wording acceptable? Feel free to provide an update if you feel this can be explained better (I am not a wordsmith 😄).
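
For readers following along, the option under discussion uses the same functional-option pattern as WithPostprocess in the hunk above. A sketch of the shape it presumably takes (the parameter and Subscriber field names are assumptions, not quoted from the PR):

// WithAggregateOnly sets whether aggregated datacenter metrics are emitted
// instead of per-datacenter metrics.
func WithAggregateOnly(enabled bool) SubscriberOption {
	return func(s *Subscriber) { s.aggregateOnly = enabled } // field name assumed
}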

README.md (outdated diff):
@@ -119,6 +119,11 @@ Different flags (for the same filter target) combine with AND semantics. For
example, `-metric-allowlist 'bytes_total$' -metric-blocklist imgopto` would only
export metrics whose names ended in bytes_total, but didn't include imgopto.

### Aggregate Datacenter

@mikelorant (Contributor Author) left a comment

@leklund Recommendations for the type of content we want mentioned here?

@mikelorant (Contributor Author) left a comment

Updated to your revision.

@leklund (Member) left a comment

A couple of suggested wording changes to the docs, but this looks good.

Review suggestions on README.md and pkg/rt/subscriber.go (since resolved).

The merged commit's message:
Add flag to use aggregated datacenter metrics instead of per-datacenter.

Closes: #152
Co-authored-by: Lukas Eklund <[email protected]>
Helped-by: Matt Hope <[email protected]>
Signed-off-by: Michael Lorant <[email protected]>
@mikelorant (Contributor Author)

Wording has been changed per your recommendations.

Been a long road to get this pull request to where we want it, but it seems we finally have something worthy of merging!

@leklund (Member) commented May 29, 2024

@mikelorant Thanks for your patience and all the work you did in getting this merged! I'll cut a new release asap.

@leklund leklund merged commit 11aba6d into fastly:main May 29, 2024
6 checks passed
@leklund leklund added the enhancement New feature or request label May 29, 2024
@mikelorant (Contributor Author)

Thanks so much for your efforts in getting this implemented. It's been a great learning process, and the solution implemented will really benefit users with a large number of services.

@mikelorant mikelorant deleted the feature/aggregate-datacenter-combined branch May 29, 2024 22:05
Labels: enhancement (New feature or request)
May close: Reduce output size of metrics endpoint
Participants: 2