Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sometimes generating 33 ARI requests in a single second #297

Closed
aarongable opened this issue Jun 26, 2024 · 15 comments
Closed

Sometimes generating 33 ARI requests in a single second #297

aarongable opened this issue Jun 26, 2024 · 15 comments
Assignees
Labels
bug Something isn't working

Comments

@aarongable
Copy link

aarongable commented Jun 26, 2024

What version of the package are you using?

User Agent: "CertMagic acmez (linux; 386)"

What are you trying to do?

Look at outliers in Let's Encrypt's ARI request data.

What steps did you take?

Queried LE's observability database for how many times ARI is queried for each certificate.

What did you expect to happen, and what actually happened instead?

I expect CertMagic to query ARI on a regular basis (e.g. every 6 hours, if respecting the Retry-After header), perhaps with some jitter to prevent clustering.

Instead, I see that CertMagic clients are requesting ARI data for a single serial 60+ times in a single day, including bursts of up to 33 requests in a single second.

How do you think this should be fixed?

I suspect this may be related to querying ARI in the "on-demand TLS" code path, and the misbehaving servers may be getting crawled and generating many requests due to that.

The ARI suggestedWindow should be cached for the duration provided by the Retry-After header in the ARI response.

Please link to any related issues, pull requests, and/or discussion

#286

@francislavoie
Copy link
Member

Oh yeah that makes sense. If you get like 33 requests within a second, then all 33 of them might be triggering ARI via on-demand maintenance. I think certmagic needs to use https://pkg.go.dev/sync#WaitGroup to make sure it only gets fired off a single time per window.

@mohammed90
Copy link
Member

I think this is a thundering herd issue. We probably need https://pkg.go.dev/golang.org/x/sync/singleflight with the domain as key.

every 6 hours, if respecting the Retry-After header

Is this zoned per queried hostname or per client?

@aarongable
Copy link
Author

Is this zoned per queried hostname or per client?

Neither, it's per certificate. A single certificate may have multiple hostnames, but also a single client may manage multiple certificates.

@mholt
Copy link
Member

mholt commented Jun 26, 2024

I'll take a look into this when I'm back at my desk

@mholt mholt self-assigned this Jun 26, 2024
@mholt
Copy link
Member

mholt commented Jun 28, 2024

The ARI suggestedWindow should be cached for the duration provided by the Retry-After header in the ARI response.

CertMagic does honor the Retry-After header, if present, by calling acme.RenewalInfo.NeedsRefresh().

I do agree this is likely a thundering herd, where many calls to update ARI come in before the first one finishes, since it lacks synchronization.

We can synchronize ARI fetching using the configured storage plugin. This will prevent any more than 1 instance in a cluster from fetching ARI at the same time, and after the first one does, the others will load and use its result.

Depending on the storage plugin, it's possible that this locking will be more expensive than actually fetching ARI, but it only happens once in a while, so maybe it's OK.

@mholt mholt closed this as completed in 16e2e0b Jun 28, 2024
@mholt
Copy link
Member

mholt commented Jun 28, 2024

Thanks for the report! This should fix it but without an offending client to test with I can only guess, but it makes sense to me.

I've synchronized the ARI fetching by the ARI UniqueIdentifier.

@mholt mholt added the bug Something isn't working label Jun 28, 2024
@aarongable
Copy link
Author

Thank you for the investigation and fix! We plan to keep an intermittent eye on ARI traffic patterns for a while, so I'll let you know if I see anything else jump out at me.

@Zenexer

This comment has been minimized.

@francislavoie

This comment has been minimized.

@Zenexer

This comment has been minimized.

@Zenexer

This comment has been minimized.

@mholt

This comment has been minimized.

@Zenexer

This comment has been minimized.

@mholt

This comment has been minimized.

@Zenexer

This comment has been minimized.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants