Envoy Sidecar Proxies not renewing certificate chains #10213
Comments
We had this happen again over the weekend, through CA verification failures. Here's a dump of a gRPC call to the sidecar that was failing:
On deeper investigation, this looks to be an issue where entire Envoy sidecar clusters are not getting updated. I can open a separate issue, but this is related. This is on Consul 1.10 and Nomad 1.1.2, fwiw.
Furthermore:
We have a tool we built to inspect Envoy's clusters endpoint, and (ip ranges redacted except last 2):
Note here the
You can see how this sidecar is not getting XDS updates for this upstream.
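For anyone wanting to run a similar check, here is a minimal sketch (not the tool referred to above) that parses the text output of Envoy's admin /clusters endpoint and lists hosts Envoy still holds that are absent from an expected set. The sample text, the cluster name `upstream-a`, and the `stale_hosts` helper are hypothetical, and the parser assumes IPv4 `ip:port` host addresses.

```python
from collections import defaultdict

# Made-up sample in the "cluster::host::stat::value" text format of /clusters.
SAMPLE = """\
upstream-a::default_priority::max_connections::1024
upstream-a::added_via_api::true
upstream-a::10.0.12.34:25001::health_flags::/failed_outlier_check
upstream-a::10.0.12.34:25001::cx_active::0
upstream-a::10.0.56.78:25002::health_flags::healthy
"""


def parse_clusters(text):
    """Return {cluster: {host: {stat: value}}} for host-level lines only."""
    clusters = defaultdict(lambda: defaultdict(dict))
    for line in text.splitlines():
        parts = line.strip().split("::")
        if len(parts) != 4:
            continue  # e.g. cluster::added_via_api::true
        cluster, host, stat, value = parts
        # Skip cluster-level lines such as default_priority / outlier,
        # which also have four fields but no ip:port in the second one.
        _, sep, port = host.rpartition(":")
        if not sep or not port.isdigit():
            continue
        clusters[cluster][host][stat] = value
    return clusters


def stale_hosts(clusters, cluster, expected):
    """Hosts Envoy still lists for `cluster` that are not in `expected`."""
    return set(clusters.get(cluster, {})) - set(expected)


if __name__ == "__main__":
    parsed = parse_clusters(SAMPLE)
    # `expected` would come from wherever the current instance list lives
    # (for example the Consul catalog); it is hard-coded here.
    print(stale_hosts(parsed, "upstream-a", {"10.0.56.78:25002"}))
```

The expected set has to be fetched separately; comparing it against several sidecars at once is what makes a single stale sidecar stand out.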
Other sidecars have the same LDS/CDS config versions, but with correct cluster instance IPs. We noticed this in errors:
and
Turning on debug logging for Envoy shows this:
Result: Outlier Detection Trips
Requests to stale clusters trip Envoy's outlier detection (expectedly so), and throw
Was this issue fixed with 1.10.3?
@leonardobsjr As far as we can tell, this issue has been resolved in 1.10.3+.
Overview of the Issue
We're experiencing an issue where Envoy sidecar proxies are not renewing their certificate chains, causing all further requests to that sidecar to fail. For some reason, the cluster update event to update the cert_chain is not making it to the Envoy sidecar.
We are alerted to this by failing traffic, with the corresponding envoy_listener_ssl_connection_error Prometheus metric raised for the service that is failing to update its certificate chain.
Reproduction Steps
We cannot consistently reproduce this issue; however, it recurs roughly every 1-3 days in a mesh with over 1600 Nomad tasks and around 75 Nomad clients.
Consul info for both Client and Server
Client info
Server info
Operating system and Environment details
Debian Buster:
Docker:
Nomad:
Envoy:
General Information
We're using Nomad 1.0.3/4 in these environments (slowly updating to 1.0.4), with Consul 1.9.3. We are using the standard Nomad Consul Connect + Envoy setup. We're not overriding any TLS or CA certificate chain settings (using only the defaults). We can provide more detailed information if necessary via email.
Connect CA Configuration:
We use the defaults:
Nomad Job
The Connect job stanza looks like:
Connect Stanza
Certs
The certs endpoint in the Envoy sidecar shows this (this was captured on May 6th).
As you can see, the certificate had already expired (since this was May 6th, and the expiration time was 5:30pm on May 5th). For some reason, the sidecar never received the new certificate.
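A check for this condition can be sketched as follows. This is an illustration only, not the script described in the next section. It assumes the JSON shape of Envoy's admin /certs endpoint (`certificates[].cert_chain[].expiration_time`) and the `ValidBefore` field of the Consul agent's leaf certificate response, both as RFC 3339 timestamps without nanosecond precision. The sample payloads, including the year and time zone, are made up.

```python
from datetime import datetime, timezone

# Made-up payloads shaped like Envoy's /certs and Consul's leaf cert responses.
SAMPLE_ENVOY = {
    "certificates": [
        {
            "ca_cert": [{"expiration_time": "2031-04-30T00:00:00Z"}],
            "cert_chain": [{"expiration_time": "2021-05-05T17:30:00Z"}],
        }
    ]
}
SAMPLE_LEAF = {"ValidBefore": "2021-05-09T17:30:00Z"}


def parse_ts(value):
    """Parse a timestamp such as 2021-05-05T17:30:00Z into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def chain_expiries(envoy_certs):
    """Collect the expiration times of every cert_chain entry."""
    return [
        parse_ts(detail["expiration_time"])
        for cert in envoy_certs.get("certificates", [])
        for detail in cert.get("cert_chain", [])
    ]


def has_expired_chain(envoy_certs, now):
    """True if any cert_chain entry Envoy is serving has already expired."""
    return any(expiry <= now for expiry in chain_expiries(envoy_certs))


def lags_consul(envoy_certs, consul_leaf):
    """True if Envoy's newest cert_chain expires before the agent's leaf cert."""
    expiries = chain_expiries(envoy_certs)
    if not expiries:
        return False  # nothing to compare
    return max(expiries) < parse_ts(consul_leaf["ValidBefore"])


if __name__ == "__main__":
    now = datetime(2021, 5, 6, tzinfo=timezone.utc)  # illustrative capture time
    print("expired:", has_expired_chain(SAMPLE_ENVOY, now))
    print("behind agent:", lags_consul(SAMPLE_ENVOY, SAMPLE_LEAF))
```

In practice the two payloads would be fetched from the sidecar's admin listener and the local Consul agent; those addresses are deployment-specific and are left out here.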
Script for Comparison
We wrote a small script to compare the dates in the /certs endpoint in Envoy to the /v1/agent/connect/ca/leaf/:svc endpoint in Consul agent:
This will output results like so:
This output shows that the /certs endpoint in Envoy is returning different results than the Consul Agent. We wondered if this might be related to #9862, but aren't sure. I further checked 0.0.0.0:8500/v1/agent/connect/ca/leaf/leaf-cert to ensure the Agent was always getting a renewed cert; it was:
CDS Notes
When investigating further, we see that there are no explicit failures from the cluster discovery service; however, the attempts are 1 more than the successes:
As noted, 31912 > 31911. Looking further, we see this, though we're unsure what this means:
We're continuing to experience this issue, and will update this ticket with more information as we get it.
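To track the CDS counters mentioned above over time, a small sketch like the following could be run against the text output of Envoy's admin /stats endpoint. It assumes the `cluster_manager.cds.*` counter names from Envoy's cluster manager statistics. The attempt and success values in the sample are the ones quoted above; the failure and rejected values of 0 are assumed for illustration.

```python
# Sample in the "name: value" text format of /stats. Attempt and success
# values are from this issue; failure/rejected values are assumed.
SAMPLE_STATS = """\
cluster_manager.cds.update_attempt: 31912
cluster_manager.cds.update_success: 31911
cluster_manager.cds.update_failure: 0
cluster_manager.cds.update_rejected: 0
"""


def parse_stats(text):
    """Parse integer-valued 'name: value' lines; other lines are skipped."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def cds_summary(stats, prefix="cluster_manager.cds."):
    """Summarise CDS update counters and how many attempts are unaccounted for."""
    attempt = stats.get(prefix + "update_attempt", 0)
    success = stats.get(prefix + "update_success", 0)
    failure = stats.get(prefix + "update_failure", 0)
    rejected = stats.get(prefix + "update_rejected", 0)
    return {
        "attempt": attempt,
        "success": success,
        "failure": failure,
        "rejected": rejected,
        # Attempts with no recorded outcome at the time of sampling.
        "unaccounted": attempt - success - failure - rejected,
    }


if __name__ == "__main__":
    print(cds_summary(parse_stats(SAMPLE_STATS)))
```

A gap of 1 may only reflect a request that was still in flight when the stats were read, so the number is more informative when compared across several samples or across healthy and unhealthy sidecars.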