Path-based sampling #124
Comments
This is indeed a real problem.
I'd definitely go for the high sample rate in the reporting policy, and then do another round of sampling at the report collection endpoint so the pipeline behind the endpoint doesn't need to process too many reports.
We considered using a high sample rate and imposing extra sampling at the reporting endpoint, but this isn't feasible for us: we'd get far too many reports, so even just doing that sampling would cost a lot. The issue is, by the time you've received the reports and parsed them sufficiently to downsample, you may as well just store them all. As an illustration, our "normal" daily traffic varies between about 10k and 25k RPS on www.bbc.co.uk and www.bbc.com, and we see roughly an additional 150k-350k RPS on our assets domain, varying by time of day, news events etc. Hope that helps.

Maybe there's a middle ground, and maybe that's something we should explore further in some way. Ideally, I'd basically split our estate in two: high and low traffic sections. That's crude but would do the job. The high traffic sections would stay at 1% but the low ones would be perhaps 10% or 20%. This would be great from our PoV, though I recognise that it could introduce significant complexity in client implementations.

One idea I have is to create a mechanism scoped the way cookies and service workers are: we could define the scope (path base) for a NEL policy, with a default of "/" (i.e. no change from now). An example would be:
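(A sketch only; `scope` is the hypothetical new member listing the path prefixes the policy applies to, and the other members are the standard NEL fields.)

```
NEL: {"report_to": "default", "max_age": 2592000, "failure_fraction": 0.1, "scope": ["/news", "/sport", "/weather"]}
```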
This would scope the policy to paths that begin with /news, /sport or /weather.
Yes, the collection endpoint still gets all the requests, and those need to be handled. Thanks for putting in those RPS numbers; very helpful, and I now understand the situation better.

And you'd then send another NEL policy for other sections of the site? I actually like the concept!
Thanks for starting this conversation, @neilstuartcraig, I love hearing about people's experiences deploying NEL!

My main concern with this approach is the amount of complexity that it adds in the user agent. We purposefully tried to keep that part as simple as possible. That makes it easier to standardize, and also makes it easier to "trust" that each user agent's implementation of the spec is correct. We've already had several bug reports about Chrome's implementation regressing in the reports that it generates, even with the currently simple client-side logic. I suspect that would be a much worse problem if we asked each user agent to implement more complex logic.

The balance that we settled on was scoping the configurations to an origin (which lines up well with other configurable Web standards), and having separate success and failure sampling rates. That gives you two levers (which you've already described) if you need more fine-grained policy rules: (a) use separate domains for your different classes of traffic, with separate configurations for each, or (b) put the more complex logic in your collector (a sketch of what I mean follows below).

In my experience, the bulk of the cost of running collectors stems from needing diversity of coverage, so that errors that affect your actual site don't also affect your collectors. I've had success with collectors based on the Go reference implementation we developed when I was at Google (google/nel-collector). There's definitely some CPU cost to applying additional filtering at the collector, but compared to running the collectors in many locations, and the long-term storage costs of the records we decided to keep, that CPU cost wasn't a deal-breaker. (We published a paper about NEL at this year's NSDI, and §5 goes into the various deployment challenges that we encountered running a fleet of collectors at Google.)

@neilstuartcraig, it sounds like your experience is different, and I'd love to hear more. How are you implementing your collector? (Also, just to double-check, you're using separate success and failure sampling rates, right?)
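To make (b) concrete, here's a minimal sketch in Go of collector-side downsampling, assuming the collector has already parsed the report body to get the URL path. This is not code from google/nel-collector, and the prefixes and keep-rates are made up:

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// keepRates maps path prefixes to the fraction of received reports to keep;
// the first matching prefix wins. Prefixes and rates are illustrative only.
var keepRates = []struct {
	prefix string
	rate   float64
}{
	{"/news", 0.05},  // very busy section: keep 5% of received reports
	{"/sport", 0.05}, // very busy section
	{"/", 1.0},       // everything else: keep every report
}

// keep decides whether to retain a report once its URL path is known.
func keep(path string) bool {
	for _, r := range keepRates {
		if strings.HasPrefix(path, r.prefix) {
			return rand.Float64() < r.rate
		}
	}
	return false
}

func main() {
	fmt.Println(keep("/news/some-article"), keep("/quiet-section/page"))
}
```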
My apologies for taking so long to reply, I missed the notification; I think I need some email filtering. Thanks for your interest and feedback, here are my thoughts so far:
In some fashion, yes. My initial thought is to do something like this (suggestion on a new line for clarity) so we can keep current behaviour as-is:
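(Again a sketch only; the first header is unchanged from today, the second carries the hypothetical `scope` member, and the path and values are illustrative.)

```
NEL: {"report_to": "default", "max_age": 2592000, "failure_fraction": 0.01}
NEL: {"report_to": "default", "max_age": 2592000, "failure_fraction": 0.1, "scope": ["/quieter-section"]}
```

With `scope` defaulting to "/", the first policy behaves exactly as policies do now, and a UA which doesn't understand `scope` could simply ignore the second.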
I would imagine most people wouldn't use the extended syntax, but if it (or something similar) were palatable then I'd value it for sure.
I definitely see that. As a counter though, keeping the UA side simple seems to me to push the complexity onto the report endpoint side, and since that will be implemented independently many more times, I feel this results in more work and more risk of bugs overall. The complexity has to live somewhere.
Yep, so my experience is specific to how I built our report endpoint, which may reflect some poor choices on my side. WRT sampling, we only collect failures currently, since we have a really nice access log collection method (the bit I mentioned above which takes data, enriches it, then writes to BigQuery).

I guess maybe this comes down to a question of how many users of NEL might want some form of path-based sampling. If it's a small number, then added complexity in the report endpoint feels like the right choice; the opposite if not. Any ideas how to gauge this, and/or am I missing anything?
Hi all, I hopped on here to ask exactly the same question, and then noticed it being discussed in this thread. We're exploring NEL as a monitoring mechanism at a CDN, where the aggregate RPS for all customers, or even any large subset that we'd turn it on for, is really high.

What I too had in mind, and was hoping would exist, was something like a longest-prefix match, where more specific path policies override the more general ones for a domain. While I appreciate the trade-offs of moving the complexity to the client, I just wanted to also voice that having a knob like that would be helpful (see the sketch below).

I wonder if a limit of N (a few?) overlapping rules per domain would make sense, so that the policy list on the UA doesn't explode from everyone using many rules for their domain. Although I understand that (a) the length of the policy list isn't the only added complexity, and (b) this would raise other implementation questions, like which rule the client should evict when it receives the (N+1)th rule, etc.
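To illustrate the selection rule, here's a minimal sketch in Go of longest-prefix matching over a small, capped rule list. The rule shape, names and values are all made up for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// policy is a hypothetical path-scoped sampling rule for one origin.
type policy struct {
	prefix          string
	failureFraction float64
}

// match returns the rule with the longest prefix matching path, so a more
// specific rule overrides the origin-wide default at "/".
func match(rules []policy, path string) (best policy, ok bool) {
	for _, p := range rules {
		if strings.HasPrefix(path, p.prefix) && len(p.prefix) >= len(best.prefix) {
			best, ok = p, true
		}
	}
	return best, ok
}

func main() {
	// A capped rule list (the cap N would be enforced by the UA when storing rules).
	rules := []policy{
		{"/", 0.2},      // origin-wide default for quieter sections
		{"/news", 0.01}, // very busy section: sample much less
	}
	if p, ok := match(rules, "/news/world"); ok {
		fmt.Printf("sample /news/world failures at %.2f\n", p.failureFraction)
	}
}
```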
Hi all.
We're using NEL with the Reporting API across our two main websites, www.bbc.co.uk and www.bbc.com (plus their apexes), along with most of our asset domains. @chrisn and I have been working to provide some feedback based on our experiences; we hope it's useful and constructive.
Since NEL policies are applied on a per-origin basis, we have difficulty deciding on a suitable sample rate (`failure_fraction`) because we have significant variation in the popularity of sections of our website, which we route to backend services by URL path. For example, /news and /sport are very high traffic; other sections may be less so. If we set a sample rate to achieve a practical volume of NEL reports for the busier sections of our website, we end up with few to no reports for the quieter sections.

Of course, we could set a high sample rate which is suitable for the quieter sections, and then rate-limit reports from the busier sections. The problem with this is that receiving and parsing a report sufficiently to apply a rate limit is the bulk of the work and cost in processing reports, so we'd actually gain very little. It also introduces additional complexity at the reporting endpoint.
The ideal for us would be a mechanism for defining a baseline per-origin sample rate that could then be overridden on a per-path basis.
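For reference, a typical policy today looks something like this (the values and the endpoint URL are illustrative), with `failure_fraction` as the single, origin-wide sampling knob:

```
Report-To: {"group": "default", "max_age": 2592000, "endpoints": [{"url": "https://reports.example.com/nel"}]}
NEL: {"report_to": "default", "max_age": 2592000, "success_fraction": 0.0, "failure_fraction": 0.01}
```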
+@chrisn