Sometimes generating 33 ARI requests in a single second #297
Comments
Oh yeah that makes sense. If you get like 33 requests within a second, then all 33 of them might be triggering ARI via on-demand maintenance. I think certmagic needs to use https://pkg.go.dev/sync#WaitGroup to make sure it only gets fired off a single time per window.
I think this is a thundering herd issue. We probably need https://pkg.go.dev/golang.org/x/sync/singleflight with the domain as key.
Is this zoned per queried hostname or per client?
Neither, it's per certificate. A single certificate may have multiple hostnames, but also a single client may manage multiple certificates.
I'll take a look into this when I'm back at my desk.
CertMagic does honor the Retry-After header, if present, by calling [...]. I do agree this is likely a thundering herd, where many calls to update ARI come in before the first one finishes, since it lacks synchronization. We can synchronize ARI fetching using the configured storage plugin. This will prevent more than one instance in a cluster from fetching ARI at the same time, and after the first one does, the others will load and use its result. Depending on the storage plugin, it's possible that this locking will be more expensive than actually fetching ARI, but it only happens once in a while, so maybe it's OK.
Thanks for the report! This should fix it, though without an offending client to test with I can only guess. It makes sense to me, at least. I've synchronized the ARI fetching by the ARI UniqueIdentifier.
Thank you for the investigation and fix! We plan to keep an intermittent eye on ARI traffic patterns for a while, so I'll let you know if I see anything else jump out at me.
What version of the package are you using?
User Agent: "CertMagic acmez (linux; 386)"
What are you trying to do?
Look at outliers in Let's Encrypt's ARI request data.
What steps did you take?
Queried LE's observability database for how many times ARI is queried for each certificate.
What did you expect to happen, and what actually happened instead?
I expect CertMagic to query ARI on a regular basis (e.g. every 6 hours, if respecting the Retry-After header), perhaps with some jitter to prevent clustering.
Instead, I see that CertMagic clients are requesting ARI data for a single serial 60+ times in a single day, including bursts of up to 33 requests in a single second.
How do you think this should be fixed?
I suspect this may be related to querying ARI in the "on-demand TLS" code path, and the misbehaving servers may be getting crawled and generating many requests due to that.
The ARI suggestedWindow should be cached for the duration provided by the Retry-After header in the ARI response.
Please link to any related issues, pull requests, and/or discussion
#286