
Path-based sampling #124

Open

neilstuartcraig opened this issue Apr 23, 2020 · 6 comments

Comments

@neilstuartcraig
Contributor

Hi all.
We're using NEL with the Reporting API across our two main websites, www.bbc.co.uk and www.bbc.com (plus their apexes), along with most of our asset domains. @chrisn and I have been working to provide some feedback based on our experiences, which we hope is useful and constructive.

Since NEL policies are applied on a per-origin basis, we have difficulty deciding on a suitable sample rate (failure_fraction) because we have significant variation in the popularity of sections of our website, which we route to backend services by URL path. For example, /news and /sport are very high traffic, while other sections may be less so. If we set a sample rate to achieve a practical volume of NEL reports for the busier sections of our website, we end up with few to no reports for the quieter sections.

Of course, we could set a high sample rate which is suitable for the quieter sections, and then rate-limit reports from the busier sections. The problem with this is that receiving and parsing the report sufficiently to apply a rate limit is the bulk of the work and cost in processing reports, so we actually gain very little. It also introduces additional complexity at the reporting endpoint.

The ideal for us would be a mechanism for defining a baseline per-origin sample rate that could then be overridden on a per-path basis.

+@chrisn

@aaronpeters

This is indeed a real problem.
You need a sufficient number of reports to be able to derive insights with confidence, but you also don't want too many reports coming in, because that puts a burden on the systems without adding any value.

Of course, we could set a high sample rate which is suitable for the quieter sections, and then rate-limit reports from the busier sections. The problem with this is that receiving and parsing the report sufficiently to apply a rate limit is the bulk of the work and cost in processing reports, so we actually gain very little. It also introduces additional complexity at the reporting endpoint.

I'd definitely go for the high sample rate in the reporting policy and then do another round of sampling at the report collection endpoint, so the pipeline behind the endpoint does not need to process too many reports.
Yes, this adds complexity, but is it more complex than managing the per-path sample rate in the report-to policy? Honest question, and I'm very interested to learn more about this part.
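For illustration, a minimal sketch of what that second round of sampling could look like in a Node/TypeScript collection endpoint (the route, keep fractions and storage hand-off are all hypothetical); note that the report body still has to be received and parsed before the endpoint knows which section a report relates to, which is the cost described above:

// Sketch only: a second round of sampling at the collection endpoint.
// The route path, keep fractions and storage hand-off are all hypothetical.
import express from "express";

const app = express();

// The Reporting API delivers reports as application/reports+json.
app.use(express.json({ type: ["application/reports+json", "application/json"] }));

// Hypothetical per-section keep fractions; unlisted paths keep everything.
const KEEP_FRACTIONS: Array<[string, number]> = [
  ["/news", 0.1],  // busy section: keep 10% of received reports
  ["/sport", 0.1],
];

function keepFraction(path: string): number {
  const match = KEEP_FRACTIONS.find(([prefix]) => path.startsWith(prefix));
  return match ? match[1] : 1.0;
}

app.post("/reports", (req, res) => {
  // The body must be received and parsed before we can tell which section
  // each report belongs to -- this is the cost discussed above.
  const reports: Array<{ url?: string }> = Array.isArray(req.body) ? req.body : [];
  const kept = reports.filter((report) => {
    const path = report.url ? new URL(report.url).pathname : "/";
    return Math.random() < keepFraction(path);
  });
  // ... hand `kept` off to storage / the rest of the pipeline ...
  res.sendStatus(204);
});

app.listen(8080);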

@neilstuartcraig
Contributor Author

We considered using a high sample rate and imposing extra sampling at the reporting endpoint, but this isn't feasible for us: we'd get far too many reports, so even doing that sampling would cost a lot. The issue is, by the time you've received the reports and parsed them sufficiently to downsample, you may as well just store them all.

As an illustration, our "normal" daily traffic varies between about 10k and 25k RPS on www.bbc.co.uk and www.bbc.com, and we see roughly an additional 150k-350k RPS on our assets domain, varying by time of day, news events etc.
Right now we are receiving something like 50-100 reports per second, with some higher peaks; when a code release has an inadvertent problem, of course, we see many more reports.
If we scaled to 100% (i.e. no) sampling, that'd see us receiving something in the order of 10k RPS in reports at times, and we'd need to be scaled way higher than that to cope with peaks if we don't want to lose data (which is arguable).
Current costs are a few hundred dollars per month, which is OK; the value we get is worth that. But if I tried to spend many thousands, possibly tens of thousands, of dollars per month on this, I'd be told "no" (and frankly, that would not represent good value for our license fee payers).
This is really the issue: cost.

Hope that helps. Maybe there's a middle ground and maybe that's something we should explore further in some way.

Ideally, I'd basically split our estate in two: high- and low-traffic sections. That's crude but would do the job. The high-traffic sections would stay at 1% while the low-traffic ones would be perhaps 10% or 20%. This would be great from our PoV, though I recognise that it could introduce significant complexity in client implementations.

One idea I have is to create a mechanism like the way cookies and service workers are scoped - we could define the scope (path base) for a NEL policy and the default would be "/" (i.e. no change from now). An example would be:

{"report_to":"default","max_age":2592000,"include_subdomains":true,"failure_fraction":0.01,"path_scope":["/news", "/sport", "/weather"]}

Which would scope this policy to paths which begin with /news, /sport or /weather.
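For illustration only, a user agent's check for this proposed (hypothetical) path_scope field could be a simple prefix test, something like:

// Sketch of how a user agent might evaluate the hypothetical "path_scope" field
// (not part of the current spec): the policy applies only to requests whose path
// begins with one of the listed prefixes; an absent path_scope means "/" as today.
interface NelPolicy {
  report_to: string;
  max_age: number;
  include_subdomains?: boolean;
  failure_fraction?: number;
  path_scope?: string[]; // proposed field
}

function policyApplies(policy: NelPolicy, requestPath: string): boolean {
  if (!policy.path_scope || policy.path_scope.length === 0) {
    return true; // default scope is "/", i.e. no change from current behaviour
  }
  return policy.path_scope.some((prefix) => requestPath.startsWith(prefix));
}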

@aaronpeters

The issue is, by the time you've received the reports and parsed them sufficiently to downsample, you may as well just store them all

Yes, the collection endpoint still gets all the requests and those need to be handled.
I'd assume simply dropping many of the received reports is more efficient than storing them, but point taken.

Thanks for putting in those RPS numbers, very helpful and I understand the situation now better.

Which would scope this policy to paths which begin with /news, /sport or /weather.

And you'd then send another NEL policy for other sections of the site?
What if there are conflicts between two NEL policies for the same origin? More specific path_scope wins? Ah, details.

I actually like the concept!

@dcreager
Member

Thanks for starting this conversation, @neilstuartcraig, I love hearing about people's experiences deploying NEL!

My main concern with this approach is the amount of complexity that it adds in the user agent. We purposefully tried to keep that part as simple as possible. That makes it easier to standardize, and also makes it easier to "trust" that each user agent's implementation of the spec is correct. We've already had several bug reports about Chrome's implementation regressing in the reports that it generates, even with the currently simple client-side logic. I suspect that would be a much worse problem if we ask each user agent to implement more complex logic.

The balance that we settled on was scoping the configurations to an origin (which lines up well with other configurable Web standards) and having separate success and failure sampling rates. That gives you two levers (which you've already described) if you need more fine-grained policy rules: (a) use separate domains for your different classes of traffic, with separate configurations for each, or (b) put the more complex logic in your collector.
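For example, lever (a) amounts to serving different NEL policies from different hostnames (the hostnames and values below are purely illustrative):

On a busy hostname (e.g. www.example.com):
NEL: {"report_to":"default","max_age":2592000,"failure_fraction":0.01}

On a quieter hostname (e.g. downloads.example.com):
NEL: {"report_to":"default","max_age":2592000,"failure_fraction":0.2}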

In my experience the bulk of the cost of running collectors stems from having diversity of coverage — so that errors that affect your actual site don't also affect your collectors. I've had success with collectors based on the Go reference implementation we developed when I was at Google (google/nel-collector). There's definitely some CPU cost to applying additional filtering at the collector, but compared to running the collectors in many locations, and the long-term storage costs of the records we decided to keep, that CPU cost wasn't a deal-breaker.

(We published a paper about NEL at this year's NSDI, and §5 goes into the various deployment challenges that we encountered running a fleet of collectors at Google.)

@neilstuartcraig, it sounds like your experience is different, and I'd love to hear more. How are you implementing your collector? (Also, just to double-check, you're using separate success and failure sampling rates, right?)

@neilstuartcraig
Contributor Author

My apologies for taking so long to reply; I missed the notification (I think I need some email filtering). Thanks for your interest and feedback, here are my thoughts so far:

@aaronpeters:

And you'd then send another NEL policy for other sections of the site?
What if there are conflicts between two NEL policies for the same origin? More specific path_scope wins? Ah, details.

In some fashion, yes. My initial thought is to do something like this (suggestion on new line for clarity) so we can keep current behaviour as-is:

nel: {"report_to":"default","max_age":2592000,"include_subdomains":true,"failure_fraction":0.01,
"path_based_sampling:[{"paths":["/path-a","/path-b" ],"failure_fraction":0.05},{"paths":["/path-c" ],"failure_fraction":0.5}]"
}

I would imagine most people wouldn't use the extended syntax, but if it (or something similar) were palatable then I'd certainly value it.
It'd probably be best to use a path-matching algorithm that works the same as e.g. cookie paths, I guess (i.e. /some-path would match /some-path*). Would that make sense?
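For illustration, a sketch of how this hypothetical path_based_sampling field might be evaluated, using the simple prefix matching suggested above (the field and structure come from the proposal, not the current spec):

// Sketch of evaluating the hypothetical "path_based_sampling" field: the first
// group whose path prefix matches wins; otherwise the origin-wide failure_fraction
// applies. Matching is a simple prefix test ("/some-path" matches "/some-path*").
interface PathSamplingGroup {
  paths: string[];
  failure_fraction: number;
}

interface ExtendedNelPolicy {
  failure_fraction: number;                   // origin-wide baseline, e.g. 0.01
  path_based_sampling?: PathSamplingGroup[];  // proposed field, not in the current spec
}

function effectiveFailureFraction(policy: ExtendedNelPolicy, requestPath: string): number {
  for (const group of policy.path_based_sampling ?? []) {
    if (group.paths.some((p) => requestPath.startsWith(p))) {
      return group.failure_fraction;
    }
  }
  return policy.failure_fraction;
}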

@dcreager:

My main concern with this approach is the amount of complexity that it adds in the user agent. We purposefully tried to keep that part as simple as possible. That makes it easier to standardize, and also makes it easier to "trust" that each user agent's implementation of the spec is correct. We've already had several bug reports about Chrome's implementation regressing in the reports that it generates, even with the currently simple client-side logic. I suspect that would be a much worse problem if we ask each user agent to implement more complex logic.

I definitely see that. As a counter, though, keeping the UA side simple seems to me to make the report endpoint side more complex, and since that will be implemented distinctly many more times, I feel this results in more work and more risk of bugs overall. The complexity has to live somewhere.

In my experience the bulk of the cost of running collectors stems from having diversity of coverage — so that errors that affect your actual site don't also affect your collectors. I've had success with collectors based on the Go reference implementation we developed when I was at Google (google/nel-collector). There's definitely some CPU cost to applying additional filtering at the collector, but compared to running the collectors in many locations, and the long-term storage costs of the records we decided to keep, that CPU cost wasn't a deal-breaker.
@neilstuartcraig, it sounds like your experience is different, and I'd love to hear more. How are you implementing your collector? (Also, just to double-check, you're using separate success and failure sampling rates, right?)

GCP / Node. Yep, my experience is specific to how I built our report endpoint - which may reflect some poor choices on my side.
Our endpoint is a little Node app living on Google Cloud Functions - this is the HTTP endpoint. Since Functions are just containers running Node, wrapped by Express, they ought to be stateless. I kept them simple and (relatively) fast and low-cost by doing some simple in-memory report batching on each container. The received reports are written out every X seconds or Y reports (to limit any losses) to Google Cloud Storage. From there, our BigQuery ingest pipeline picks them up, processes, enriches and loads them into BigQuery. We then have a visualisation layer in Influx/Grafana. It costs very little and is pretty fast (usually < 5ms per report). I don't care that much about potential data loss since we're dropping ~99% of reports anyway due to our sampling.
Initially, I had a realtime DB in there but it was way too slow and way too expensive. This simplified method cut ~95% off the costs. Costs are critical to us since we're spending UK license fee payers' money, so we have a very strong duty to spend it wisely and constructively.
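As a rough sketch of that batching approach (not the actual BBC code; the thresholds and the storage call are placeholders), in TypeScript:

// Sketch only -- not the actual BBC collector. In-memory batching with a flush
// every X seconds or every Y reports; thresholds and the storage call are placeholders.
const MAX_BATCH_SIZE = 500;         // "Y reports"
const FLUSH_INTERVAL_MS = 30_000;   // "X seconds"

let batch: unknown[] = [];

async function writeBatchToStorage(reports: unknown[]): Promise<void> {
  // Placeholder: e.g. write newline-delimited JSON to a Cloud Storage bucket,
  // from where the BigQuery ingest pipeline picks it up.
}

async function flush(): Promise<void> {
  if (batch.length === 0) return;
  const toWrite = batch;
  batch = [];                       // swap first so new reports keep accumulating
  await writeBatchToStorage(toWrite);
}

export async function handleReports(reports: unknown[]): Promise<void> {
  batch.push(...reports);
  if (batch.length >= MAX_BATCH_SIZE) {
    await flush();
  }
}

// Periodic flush; some loss on instance shutdown is acceptable given the sampling.
setInterval(() => { void flush(); }, FLUSH_INTERVAL_MS);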

WRT sampling, we only collect failures currently, since we have a really nice access log collection method (the bit I mentioned above which takes data, enriches it, then writes to BigQuery).

I guess this maybe comes down to a question of how many NEL users might want some form of path-based sampling. If it's a small number, then added complexity in the report endpoint feels like the right choice; the opposite if not. Any ideas on how to gauge this, and/or am I missing anything?

@zarifis

zarifis commented Jul 17, 2020

Hi all,

I hopped on here to ask exactly the same question, and then noticed it being discussed in this thread. We're exploring NEL as a monitoring mechanism at a CDN, where the aggregate RPS across all customers, or even any large subset we'd turn it on for, is really high. What I too had in mind, and was hoping would exist, was something like a 'longest prefix match', with more specific path policies overriding the more general ones for a domain. While I appreciate the trade-offs of moving the complexity to the client, I just wanted to voice that having a knob like that would be helpful.

I wonder if a limit of N (a few?) overlapping rules per domain would make sense, so that the policy list on the UA doesn't explode from everyone using many rules for their domain. Although I understand that (a) the length of the policy list isn't the only added complexity, and (b) this would raise other implementation details, like which rule the client should evict when it receives the (N+1)th rule, etc.
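A minimal illustration of that longest-prefix-match selection with a cap of N rules per origin (names and the value of N are hypothetical):

// Sketch of the longest-prefix-match idea with a cap of N overlapping rules per
// origin. All names and the value of N are hypothetical.
interface PathRule {
  prefix: string;
  failure_fraction: number;
}

const MAX_RULES_PER_ORIGIN = 4; // the hypothetical cap "N"

function selectRule(rules: PathRule[], requestPath: string): PathRule | undefined {
  return rules
    .slice(0, MAX_RULES_PER_ORIGIN)                          // ignore anything past the cap
    .filter((rule) => requestPath.startsWith(rule.prefix))
    .sort((a, b) => b.prefix.length - a.prefix.length)[0];   // most specific prefix wins
}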
