
Envoy Sidecar Proxies not renewing certificate chains #10213

Closed
splittingred opened this issue May 10, 2021 · 4 comments
Labels: theme/connect, theme/consul-nomad, theme/envoy/xds, type/bug

splittingred commented May 10, 2021

Overview of the Issue

We're experiencing an issue where Envoy sidecar proxies are not renewing their certificate chains, causing all further requests to that sidecar to fail. For some reason, the cluster update that should refresh the cert_chain never reaches the Envoy sidecar.

We are alerted to this by failing traffic, along with a rise in the envoy_listener_ssl_connection_error Prometheus metric for the service that is failing to update its certificate chain.
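For reference, the same signal can be spot-checked on an individual sidecar through its admin interface; a rough sketch, assuming the admin port 19001 we use elsewhere in this issue:

# raw counter from the admin stats endpoint
curl -s 0.0.0.0:19001/stats | grep 'ssl.connection_error'
# or the Prometheus-formatted name we alert on
curl -s 0.0.0.0:19001/stats/prometheus | grep envoy_listener_ssl_connection_error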

Reproduction Steps

We cannot consistently reproduce this issue; however, it occurs roughly every 1-3 days in a mesh with over 1,600 Nomad tasks and around 75 Nomad clients.

Consul info for both Client and Server

Client info
agent:
	check_monitors = 1
	check_ttls = 39
	checks = 145
	services = 107
build:
	version = 1.9.3
consul:
	acl = disabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 12
	goroutines = 7278
	max_procs = 12
	os = linux
	version = go1.15.6
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 102
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 3
	member_time = 119087
	members = 221
	query_queue = 0
	query_time = 1
Server info
agent:
	check_monitors = 1
	check_ttls = 0
	checks = 5
	services = 4
build:
	version = 1.9.3
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 2
	leader = false
	leader_addr = REDACTED:8300
	server = true
raft:
	applied_index = 401710728
	commit_index = 401710728
	fsm_pending = 0
	last_contact = 59.817814ms
	last_log_index = 401710728
	last_log_term = 73570
	last_snapshot_index = 401694782
	last_snapshot_term = 73570
	latest_configuration = [{Suffrage:Voter ID:9b5e5016-a89e-a420-e7db-f51e7883ddd0 Address:REDACTED:8300} {Suffrage:Voter ID:147003f5-f498-a4ff-c6cd-2767b98fc8ca Address:REDACTED:8300} {Suffrage:Voter ID:72da8b90-45c2-3078-03db-d4c9999db483 Address:REDACTED:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 73570
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 4830
	max_procs = 2
	os = linux
	version = go1.15.6
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 96
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 57185
	members = 223
	query_queue = 0
	query_time = 4
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 321
	members = 6
	query_queue = 0
	query_time = 4

Operating system and Environment details

Debian Buster:

# cat /etc/debian_version
10.8

Docker:

# docker -v
Docker version 20.10.5, build 55c4c88

Nomad:

# nomad -v
Nomad v1.0.4 (9294f35f9aa8dbb4acb6e85fa88e3e2534a3e41a)

Envoy:

# envoy --version
envoy  version: e98e41a8e168af7acae8079fc0cd68155f699aa3/1.16.2/Clean/RELEASE/BoringSSL

General Information

We're using Nomad 1.0.3/4 in these environments (slowly updating to 1.0.4), with Consul 1.9.3. We are using the standard Nomad Consul Connect + Envoy setup. We're not overriding any TLS or CA certificate chain settings (using only the defaults). We can provide more detailed information if necessary via email.

Connect CA Configuration:

We use the defaults:

{
  "Provider": "consul",
  "Config": {
    "IntermediateCertTTL": "8760h",
    "LeafCertTTL": "72h",
    "RotationPeriod": "2160h"
  },
  "State": null,
  "ForceWithoutCrossSigning": false,
  "CreateIndex": 124844361,
  "ModifyIndex": 124844361
}
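For reference, this is what the Connect CA configuration endpoint returns; a minimal way to pull it from the agent (assuming the default HTTP address used elsewhere in this issue):

curl -s 0.0.0.0:8500/v1/connect/ca/configuration | jq .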

Nomad Job

The Connect job stanza looks like:

Connect Stanza
{"Connect": {
            "Native": false,
            "SidecarService": {
              "Tags": null,
              "Port": "",
              "Proxy": {
                "LocalServiceAddress": "",
                "LocalServicePort": 0,
                "Upstreams": [
                  {
                    "DestinationName": "service2",
                    "LocalBindPort": 7000,
                    "Datacenter": ""
                  },
                  {
                    "DestinationName": "service3",
                    "LocalBindPort": 8003,
                    "Datacenter": ""
                  }
                ],
                "Expose": null,
                "Config": {
                  "local_connect_timeout_ms": 5000
                }
              }
            },
            "SidecarTask": {
              "Name": "",
              "Driver": "docker",
              "User": "",
              "Config": {
                "image": "${meta.connect.sidecar_image}",
                "args": [
                  "-c",
                  "${NOMAD_SECRETS_DIR}/envoy_bootstrap.json",
                  "--log-level",
                  "${meta.connect.log_level}",
                  "--component-log-level",
                  "misc:error,upstream:warning",
                  "--disable-hot-restart"
                ]
              },
              "Env": null,
              "Resources": {
                "CPU": 64,
                "MemoryMB": 64,
                "DiskMB": 0,
                "IOPS": 0,
                "Networks": null,
                "Devices": null
              },
              "Meta": null,
              "KillTimeout": 5000000000,
              "LogConfig": {
                "MaxFiles": 10,
                "MaxFileSizeMB": 10
              },
              "ShutdownDelay": 0,
              "KillSignal": ""
            },
            "Gateway": null
          },
          "Meta": {
            "envoy-stats-port": "${NOMAD_HOST_PORT_envoy_stats}"
          },
          "CanaryMeta": null
        }
}}

Certs

The /certs endpoint on the Envoy sidecar's admin interface shows this (captured on May 6th):

   "cert_chain": [
    {
     "path": "\u003cinline\u003e",
     "serial_number": "redacted",
     "subject_alt_names": [
      {
       "uri": "spiffe://7e225fe1-9399-9945-8548-77c97d033ff2.consul/ns/default/dc/dc-central1/svc/service1"
      }
     ],
     "days_until_expiration": "0",
     "valid_from": "2021-05-02T17:30:10Z",
     "expiration_time": "2021-05-05T17:30:10Z"
    }
   ]

As you can see, the certificate had already expired (since this was May 6th, and the expiration time was 5:30pm on May 5th). For some reason, the sidecar never received the new certificate.
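A quick way to spot this on a single sidecar is to pull just the expiration fields from the admin /certs endpoint; a small sketch (assuming admin port 19001), which the script below generalizes across every Envoy process on a host:

curl -s 0.0.0.0:19001/certs \
  | jq -r '.certificates[].cert_chain[] | [.expiration_time, .days_until_expiration, (.subject_alt_names[].uri)] | @tsv'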

Script for Comparison

We wrote a small script to compare the dates in the /certs endpoint in Envoy to the /v1/agent/connect/ca/leaf/:service endpoint in the Consul agent:

#!/usr/bin/env bash
set -e

# "ok" and "info" are small output helpers provided by this util file
source "/usr/local/bin/.envoy.util"

ENVOY_ADDR=${ENVOY_ADDR:-"0.0.0.0:19001"}
CONSUL_ADDR=${CONSUL_ADDR:-"0.0.0.0:8500"}

ok "Expired Certificate Chains:\n"

# For every running Envoy process, enter its network namespace, pull its /certs
# output, and compare each chain's expiration against the Consul agent's leaf cert.
for envoy_pid in $(ps -C envoy -o pid --no-headers); do
  envoy_certs_output=$(nsenter -t "$envoy_pid" -n curl -s "$ENVOY_ADDR/certs")
  envoy_certs=$(echo "$envoy_certs_output" | jq -r '.certificates[].cert_chain[] | [.days_until_expiration, .expiration_time, (.subject_alt_names[].uri | split("/") | last)] | join("|")' | sort | uniq)
  for cert in $envoy_certs; do
    expiration=$(echo -n "$cert" | cut -d'|' -f2)
    days=$(echo -n "$cert" | cut -d'|' -f1)
    svc=$(echo -n "$cert" | cut -d'|' -f3)
    consul_expiration=$(curl -s "$CONSUL_ADDR/v1/agent/connect/ca/leaf/$svc" | jq -r '.ValidBefore')
    if [[ "$expiration" != "$consul_expiration" ]]; then
      info "$svc"
      info "  Envoy: $expiration ($days)"
      info "  Consul: $consul_expiration"
    fi
  done
done

This will output results like so:

~# ./envoy-certs-compare
service4
  Envoy: 2021-05-07T16:53:32Z (0)
  Consul: 2021-05-09T23:48:20Z
service5
  Envoy: 2021-05-08T18:59:06Z (0)
  Consul: 2021-05-11T03:11:40Z
~# date
Sat 08 May 2021 06:03:47 AM UTC

This output shows that the /certs endpoint in Envoy is returning different results from the Consul agent. We wondered if this might be related to #9862, but aren't sure. I also checked 0.0.0.0:8500/v1/agent/connect/ca/leaf/leaf-cert to ensure the agent itself was always getting a renewed cert; it was:

{
  "SerialNumber": "redacted",
  "CertPEM": "redacted",
  "PrivateKeyPEM": "redacted",
  "Service": "leaf-cert",
  "ServiceURI": "spiffe://7e225fe1-9399-9945-8548-77c97d033ff2.consul/ns/default/dc/dc-central1/svc/leaf-cert",
  "ValidAfter": "2021-05-06T16:29:08Z",
  "ValidBefore": "2021-05-09T16:29:08Z",
  "CreateIndex": 302613406,
  "ModifyIndex": 302613406
}
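For reference, that output came from the agent's leaf certificate endpoint; a minimal sketch of the check (assuming the default agent HTTP port):

curl -s 0.0.0.0:8500/v1/agent/connect/ca/leaf/leaf-cert | jq '{SerialNumber, ValidAfter, ValidBefore}'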

CDS Notes

When investigating further, we see that there are no explicit failures from the cluster discovery service; however, the update attempt count is one higher than the success count:

# curl 0.0.0.0:19001/stats | grep cds
cluster_manager.cds.init_fetch_timeout: 0
cluster_manager.cds.update_attempt: 31912
cluster_manager.cds.update_failure: 0
cluster_manager.cds.update_rejected: 0
cluster_manager.cds.update_success: 31911
cluster_manager.cds.update_time: 1619424457611
cluster_manager.cds.version: 13145068504206875241

As noted, 31912 > 31911. Looking further, we also see the following, though we're unsure what it means:

cluster_manager.update_merge_cancelled: 53
cluster_manager.update_out_of_merge_window: 3430
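To see whether the gap between update attempts and successes keeps growing, the counters can be polled periodically; a minimal sketch, assuming the same admin port:

# poll the CDS update counters every 30 seconds
while true; do
  date -u
  curl -s 0.0.0.0:19001/stats | grep -E 'cluster_manager\.cds\.update_(attempt|success|failure):'
  sleep 30
done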

We're continuing to experience this issue, and will update this ticket with more information as we get it.

jsosulska added the theme/connect, theme/consul-nomad, theme/envoy/xds, and type/bug labels on May 10, 2021
splittingred (Author) commented:

We had this happen again over the weekend, surfacing as CA verification failures. Here's a dump of a gRPC call to the failing sidecar:

14:upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268436501:SSL routines:OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_EXPIRED. debug_error_string:{"created":"@1623079883.837989025","description":"Error received from peer ipv4:127.0.0.1:8003","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268436501:SSL routines:OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_EXPIRED","grpc_status":14}

[Screenshot: ca-verification-failures]

splittingred (Author) commented Jul 6, 2021:

On deeper investigation, this looks to be an issue where entire Envoy sidecar clusters are not getting updated. I can open a separate issue, but this is related.

This is on Consul 1.10 and Nomad 1.1.2, for what it's worth.

cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.upstream_cx_connect_fail: 26
# ...
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.outlier_detection.ejections_active: 0
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.outlier_detection.ejections_consecutive_5xx: 5
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.outlier_detection.ejections_detected_consecutive_5xx: 5
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.outlier_detection.ejections_detected_consecutive_gateway_failure: 5

Furthermore:

cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.membership_change: 12
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.membership_degraded: 0
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.membership_excluded: 0
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.membership_healthy: 1
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.membership_total: 2

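These counters come from the same admin /stats endpoint as before; a minimal way to pull just this cluster's membership and outlier-detection stats (again assuming admin port 19001):

curl -s 0.0.0.0:19001/stats \
  | grep '^cluster\.svc-b\.' \
  | grep -E 'membership_|outlier_detection\.'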
We built a tool to inspect Envoy's /clusters endpoint; its output for the affected sidecar (IP ranges redacted except the last two octets):

---------------------
- Cluster Instances -
---------------------
         HOST         | CX ACTIVE | CX FAILED | CX TOTAL | REQ ACTIVE | REQ TIMEOUT | REQ SUCCESS | REQ ERROR | REQ TOTAL | SUCCESS RATE | LOCAL SUCCESS RATE |    HEALTH FLAGS    | REGION | ZONE | SUBZONE | CANARY
----------------------+-----------+-----------+----------+------------+-------------+-------------+-----------+-----------+--------------+--------------------+--------------------+--------+------+---------+---------
   xx.xx.64.88:21936 |         0 |        27 |       30 |          0 |           0 |           6 |        27 |         6 |         -1.0 |               -1.0 |            healthy |        |      |         |  false
  xx.xx.64.194:21753 |         0 |         0 |        0 |          0 |           0 |           0 |         0 |         0 |         -1.0 |               -1.0 | /failed_eds_health |        |      |         |  false

Note the failed_eds_health flag here. However, other sidecars with the same upstream show:

---------------------
- Cluster Instances -
---------------------
         HOST         | CX ACTIVE | CX FAILED | CX TOTAL | REQ ACTIVE | REQ TIMEOUT | REQ SUCCESS | REQ ERROR | REQ TOTAL | SUCCESS RATE | LOCAL SUCCESS RATE | HEALTH FLAGS | REGION | ZONE | SUBZONE | CANARY
----------------------+-----------+-----------+----------+------------+-------------+-------------+-----------+-----------+--------------+--------------------+--------------+--------+------+---------+---------
   xx.xx.64.88:23637 |         0 |         0 |        0 |          0 |           0 |           0 |         0 |         0 |         -1.0 |               -1.0 |      healthy |        |      |         |  false
  xx.xx.64.194:23829 |         0 |         0 |        0 |          0 |           0 |           0 |         0 |         0 |         -1.0 |               -1.0 |      healthy |        |      |         |  false

You can see that this sidecar is not getting xDS updates for this upstream.

cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.update_attempt: 29
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.update_empty: 0
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.update_failure: 8
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.update_no_rebuild: 0
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.update_rejected: 0
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.update_success: 16
# ...
listener_manager.lds.init_fetch_timeout: 0
listener_manager.lds.update_attempt: 12
listener_manager.lds.update_failure: 8
listener_manager.lds.update_rejected: 0
listener_manager.lds.update_success: 3
listener_manager.lds.update_time: 1625348489181
listener_manager.lds.version: 17241709254077376921
listener_manager.lds.version_text: ""

Other sidecars have the same LDS/CDS config versions, but with correct cluster instance IPs.
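For reference, that cross-sidecar comparison can be reproduced straight from the admin API; a rough sketch, assuming two sidecars exposing their admin interfaces on hypothetical ports 19001 and 19002:

for port in 19001 19002; do
  echo "== sidecar admin on :$port =="
  # xDS config versions this sidecar has accepted
  curl -s "0.0.0.0:$port/stats" | grep -E '(cds|lds)\.version:'
  # endpoint membership and health flags for the affected upstream cluster
  curl -s "0.0.0.0:$port/clusters" | grep '^svc-b' | grep -E 'health_flags|cx_connect_fail'
done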

We noticed these errors in the Envoy logs:

[2021-07-03 12:24:42.078][1][warning][config] [source/common/config/grpc_subscription_impl.cc:119] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment
[2021-07-03 21:40:39.863][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] DeltaAggregatedResources gRPC config stream closed: 13,
[2021-07-03 21:40:40.206][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] DeltaAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection termination

and

Jul  6 16:40:13 nomad-client-xxxx consul [13260]: 2021-07-06T16:40:13.066Z [ERROR] agent.envoy: Error receiving new DeltaDiscoveryRequest; closing request channel: error="rpc error: code = Canceled desc = context canceled"

Turning on debug logging for Envoy shows this:

[2021-07-06 18:20:36.678][15][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1222] membership update for TLS cluster svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul added 0 removed 0
[2021-07-06 18:20:36.678][19][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1222] membership update for TLS cluster svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul added 0 removed 0
[2021-07-06 18:20:36.678][16][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1222] membership update for TLS cluster svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul added 0 removed 0
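As an aside, the debug level can also be toggled at runtime through the admin endpoint rather than restarting Envoy with --log-level; a small sketch, assuming admin port 19001:

# raise all loggers to debug, then drop back down when finished
curl -s -X POST "0.0.0.0:19001/logging?level=debug"
curl -s -X POST "0.0.0.0:19001/logging?level=info"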

Result: Outlier Detection Trips

Requests to stale clusters trip Envoy's outlier detection (as expected) and fail with "upstream connect error or disconnect/reset before headers. reset reason: connection failure" errors:

[2021-07-06 18:22:39.680][16][debug][http] [source/common/http/conn_manager_impl.cc:261] [C35337] new stream
[2021-07-06 18:22:39.680][16][debug][http] [source/common/http/conn_manager_impl.cc:882] [C35337][S16870535376721757378] request headers complete (end_stream=true):
':authority', '0.0.0.0:8100'
':path', '/'
':method', 'POST'
'user-agent', 'curl/7.64.0'
'accept', '*/*'
'content-type', 'application/grpc'

[2021-07-06 18:22:39.680][16][debug][http] [source/common/http/filter_manager.cc:779] [C35337][S16870535376721757378] request end stream
[2021-07-06 18:22:39.680][16][debug][router] [source/common/router/router.cc:445] [C35337][S16870535376721757378] cluster 'svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul' match for URL '/'
[2021-07-06 18:22:39.680][16][debug][router] [source/common/router/router.cc:631] [C35337][S16870535376721757378] router decoding headers:
':authority', '0.0.0.0:8100'
':path', '/'
':method', 'POST'
':scheme', 'http'
'user-agent', 'curl/7.64.0'
'accept', '*/*'
'content-type', 'application/grpc'
'x-forwarded-proto', 'http'
'x-request-id', 'abcd1234'

[2021-07-06 18:22:39.680][16][debug][pool] [source/common/http/conn_pool_base.cc:74] queueing stream due to no available connections
[2021-07-06 18:22:39.680][16][debug][pool] [source/common/conn_pool/conn_pool_base.cc:241] trying to create new connection
[2021-07-06 18:22:39.680][16][debug][pool] [source/common/conn_pool/conn_pool_base.cc:143] creating a new connection
[2021-07-06 18:22:39.680][16][debug][client] [source/common/http/codec_client.cc:43] [C35338] connecting
[2021-07-06 18:22:39.680][16][debug][connection] [source/common/network/connection_impl.cc:860] [C35338] connecting to xx.xx.xx.88:21936
[2021-07-06 18:22:39.680][16][debug][connection] [source/common/network/connection_impl.cc:876] [C35338] connection in progress
[2021-07-06 18:22:39.680][16][debug][http2] [source/common/http/http2/codec_impl.cc:1222] [C35338] updating connection-level initial window size to 268435456
[2021-07-06 18:22:39.680][16][debug][connection] [source/common/network/connection_impl.cc:674] [C35338] delayed connection error: 111
[2021-07-06 18:22:39.680][16][debug][connection] [source/common/network/connection_impl.cc:243] [C35338] closing socket: 0
[2021-07-06 18:22:39.681][16][debug][client] [source/common/http/codec_client.cc:101] [C35338] disconnect. resetting 0 pending requests
[2021-07-06 18:22:39.681][16][debug][pool] [source/common/conn_pool/conn_pool_base.cc:393] [C35338] client disconnected, failure reason:
[2021-07-06 18:22:39.681][16][debug][router] [source/common/router/router.cc:1091] [C35337][S16870535376721757378] upstream reset: reset reason: connection failure, transport failure reason:
[2021-07-06 18:22:39.681][16][debug][http] [source/common/http/filter_manager.cc:883] [C35337][S16870535376721757378] Sending local reply with details upstream_reset_before_response_started{connection failure}
[2021-07-06 18:22:39.681][16][debug][http] [source/common/http/conn_manager_impl.cc:1469] [C35337][S16870535376721757378] encoding headers via codec (end_stream=true):
':status', '200'
'content-type', 'application/grpc'
'grpc-status', '14'
'grpc-message', 'upstream connect error or disconnect/reset before headers. reset reason: connection failure'
'date', 'Tue, 06 Jul 2021 18:22:39 GMT'
'server', 'envoy'
[2021-07-06 18:22:39.681][16][debug][connection] [source/common/network/connection_impl.cc:633] [C35337] remote close
[2021-07-06 18:22:39.681][16][debug][connection] [source/common/network/connection_impl.cc:243] [C35337] closing socket: 0
[2021-07-06 18:22:39.681][16][debug][conn_handler] [source/server/active_tcp_listener.cc:76] [C35337] adding to cleanup list

leonardobsjr commented:

Was this issue fixed with 1.10.3?

splittingred (Author) commented:

@leonardobsjr As far as we can tell, this issue has been resolved in 1.10.3+.
