
Envoy Sidecar Proxies not renewing certificate chains #10213

Closed
splittingred opened this issue May 10, 2021 · 4 comments
Labels: theme/connect, theme/consul-nomad, theme/envoy/xds, type/bug

splittingred commented May 10, 2021

Overview of the Issue

We're experiencing an issue where Envoy sidecar proxies are not renewing their certificate chains, causing all further requests to that sidecar to fail. For some reason, the cluster update that should refresh the cert_chain never reaches the Envoy sidecar.

We are alerted to this by failing traffic, along with a rise in the envoy_listener_ssl_connection_error Prometheus metric for the service that is failing to update its certificate chain.
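For reference, the same signal can be spot-checked on an individual sidecar through its admin interface; a rough sketch, assuming the admin port 19001 we use elsewhere in this issue:

# raw counter from the admin stats endpoint
curl -s 0.0.0.0:19001/stats | grep 'ssl.connection_error'
# or the Prometheus-formatted name we alert on
curl -s 0.0.0.0:19001/stats/prometheus | grep envoy_listener_ssl_connection_error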

Reproduction Steps

We cannot consistently reproduce this issue; however, it occurs roughly every 1-3 days in a mesh with over 1,600 Nomad tasks and around 75 Nomad clients.

Consul info for both Client and Server

Client info
agent:
	check_monitors = 1
	check_ttls = 39
	checks = 145
	services = 107
build:
	version = 1.9.3
consul:
	acl = disabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 12
	goroutines = 7278
	max_procs = 12
	os = linux
	version = go1.15.6
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 102
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 3
	member_time = 119087
	members = 221
	query_queue = 0
	query_time = 1
Server info
agent:
	check_monitors = 1
	check_ttls = 0
	checks = 5
	services = 4
build:
	version = 1.9.3
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 2
	leader = false
	leader_addr = REDACTED:8300
	server = true
raft:
	applied_index = 401710728
	commit_index = 401710728
	fsm_pending = 0
	last_contact = 59.817814ms
	last_log_index = 401710728
	last_log_term = 73570
	last_snapshot_index = 401694782
	last_snapshot_term = 73570
	latest_configuration = [{Suffrage:Voter ID:9b5e5016-a89e-a420-e7db-f51e7883ddd0 Address:REDACTED:8300} {Suffrage:Voter ID:147003f5-f498-a4ff-c6cd-2767b98fc8ca Address:REDACTED:8300} {Suffrage:Voter ID:72da8b90-45c2-3078-03db-d4c9999db483 Address:REDACTED:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 73570
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 4830
	max_procs = 2
	os = linux
	version = go1.15.6
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 96
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 57185
	members = 223
	query_queue = 0
	query_time = 4
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 321
	members = 6
	query_queue = 0
	query_time = 4

Operating system and Environment details

Debian Buster:

# cat /etc/debian_version
10.8

Docker:

# docker -v
Docker version 20.10.5, build 55c4c88

Nomad:

# nomad -v
Nomad v1.0.4 (9294f35f9aa8dbb4acb6e85fa88e3e2534a3e41a)

Envoy:

# envoy --version
envoy  version: e98e41a8e168af7acae8079fc0cd68155f699aa3/1.16.2/Clean/RELEASE/BoringSSL

General Information

We're using Nomad 1.0.3/4 in these environments (slowly updating to 1.0.4), with Consul 1.9.3. We are using the standard Nomad Consul Connect + Envoy setup. We're not overriding any TLS or CA certificate chain settings (using only the defaults). We can provide more detailed information if necessary via email.

Connect CA Configuration:

We use the defaults:

{
  "Provider": "consul",
  "Config": {
    "IntermediateCertTTL": "8760h",
    "LeafCertTTL": "72h",
    "RotationPeriod": "2160h"
  },
  "State": null,
  "ForceWithoutCrossSigning": false,
  "CreateIndex": 124844361,
  "ModifyIndex": 124844361
}
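For reference, this is what the Connect CA configuration endpoint returns; a minimal way to pull it from the agent (assuming the default HTTP address used elsewhere in this issue):

curl -s 0.0.0.0:8500/v1/connect/ca/configuration | jq .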

Nomad Job

The Connect job stanza looks like:

Connect Stanza
{"Connect": {
            "Native": false,
            "SidecarService": {
              "Tags": null,
              "Port": "",
              "Proxy": {
                "LocalServiceAddress": "",
                "LocalServicePort": 0,
                "Upstreams": [
                  {
                    "DestinationName": "service2",
                    "LocalBindPort": 7000,
                    "Datacenter": ""
                  },
                  {
                    "DestinationName": "service3",
                    "LocalBindPort": 8003,
                    "Datacenter": ""
                  }
                ],
                "Expose": null,
                "Config": {
                  "local_connect_timeout_ms": 5000
                }
              }
            },
            "SidecarTask": {
              "Name": "",
              "Driver": "docker",
              "User": "",
              "Config": {
                "image": "${meta.connect.sidecar_image}",
                "args": [
                  "-c",
                  "${NOMAD_SECRETS_DIR}/envoy_bootstrap.json",
                  "--log-level",
                  "${meta.connect.log_level}",
                  "--component-log-level",
                  "misc:error,upstream:warning",
                  "--disable-hot-restart"
                ]
              },
              "Env": null,
              "Resources": {
                "CPU": 64,
                "MemoryMB": 64,
                "DiskMB": 0,
                "IOPS": 0,
                "Networks": null,
                "Devices": null
              },
              "Meta": null,
              "KillTimeout": 5000000000,
              "LogConfig": {
                "MaxFiles": 10,
                "MaxFileSizeMB": 10
              },
              "ShutdownDelay": 0,
              "KillSignal": ""
            },
            "Gateway": null
          },
          "Meta": {
            "envoy-stats-port": "${NOMAD_HOST_PORT_envoy_stats}"
          },
          "CanaryMeta": null
        }
}}

Certs

The /certs endpoint on the Envoy sidecar's admin interface shows this (captured on May 6th):

   "cert_chain": [
    {
     "path": "\u003cinline\u003e",
     "serial_number": "redacted",
     "subject_alt_names": [
      {
       "uri": "spiffe://7e225fe1-9399-9945-8548-77c97d033ff2.consul/ns/default/dc/dc-central1/svc/service1"
      }
     ],
     "days_until_expiration": "0",
     "valid_from": "2021-05-02T17:30:10Z",
     "expiration_time": "2021-05-05T17:30:10Z"
    }
   ]

As you can see, the certificate had already expired (since this was May 6th, and the expiration time was 5:30pm on May 5th). For some reason, the sidecar never received the new certificate.
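A quick way to spot this on a single sidecar is to pull just the expiration fields from the admin /certs endpoint; a small sketch (assuming admin port 19001), which the script below generalizes across every Envoy process on a host:

curl -s 0.0.0.0:19001/certs \
  | jq -r '.certificates[].cert_chain[] | [.expiration_time, .days_until_expiration, (.subject_alt_names[].uri)] | @tsv'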

Script for Comparison

We wrote a small script to compare the dates in the /certs endpoint in Envoy to the /v1/agent/connect/ca/leaf/:service endpoint in the Consul agent:

#!/usr/bin/env bash
set -e

# "ok" and "info" are small output helpers provided by this util file
source "/usr/local/bin/.envoy.util"

ENVOY_ADDR=${ENVOY_ADDR:-"0.0.0.0:19001"}
CONSUL_ADDR=${CONSUL_ADDR:-"0.0.0.0:8500"}

ok "Expired Certificate Chains:\n"

# For every running Envoy process, enter its network namespace, pull its /certs
# output, and compare each chain's expiration against the Consul agent's leaf cert.
for envoy_pid in $(ps -C envoy -o pid --no-headers); do
  envoy_certs_output=$(nsenter -t "$envoy_pid" -n curl -s "$ENVOY_ADDR/certs")
  envoy_certs=$(echo "$envoy_certs_output" | jq -r '.certificates[].cert_chain[] | [.days_until_expiration, .expiration_time, (.subject_alt_names[].uri | split("/") | last)] | join("|")' | sort | uniq)
  for cert in $envoy_certs; do
    expiration=$(echo -n "$cert" | cut -d'|' -f2)
    days=$(echo -n "$cert" | cut -d'|' -f1)
    svc=$(echo -n "$cert" | cut -d'|' -f3)
    consul_expiration=$(curl -s "$CONSUL_ADDR/v1/agent/connect/ca/leaf/$svc" | jq -r '.ValidBefore')
    if [[ "$expiration" != "$consul_expiration" ]]; then
      info "$svc"
      info "  Envoy: $expiration ($days)"
      info "  Consul: $consul_expiration"
    fi
  done
done

This will output results like so:

~# ./envoy-certs-compare
service4
  Envoy: 2021-05-07T16:53:32Z (0)
  Consul: 2021-05-09T23:48:20Z
service5
  Envoy: 2021-05-08T18:59:06Z (0)
  Consul: 2021-05-11T03:11:40Z
~# date
Sat 08 May 2021 06:03:47 AM UTC

This output shows that the /certs endpoint in Envoy is returning different results from the Consul agent. We wondered if this might be related to #9862, but aren't sure. I also checked 0.0.0.0:8500/v1/agent/connect/ca/leaf/leaf-cert to ensure the agent itself was always getting a renewed cert; it was:

{
  "SerialNumber": "redacted",
  "CertPEM": "redacted",
  "PrivateKeyPEM": "redacted",
  "Service": "leaf-cert",
  "ServiceURI": "spiffe://7e225fe1-9399-9945-8548-77c97d033ff2.consul/ns/default/dc/dc-central1/svc/leaf-cert",
  "ValidAfter": "2021-05-06T16:29:08Z",
  "ValidBefore": "2021-05-09T16:29:08Z",
  "CreateIndex": 302613406,
  "ModifyIndex": 302613406
}
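For reference, that output came from the agent's leaf certificate endpoint; a minimal sketch of the check (assuming the default agent HTTP port):

curl -s 0.0.0.0:8500/v1/agent/connect/ca/leaf/leaf-cert | jq '{SerialNumber, ValidAfter, ValidBefore}'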

CDS Notes

When investigating further, we see that there are no explicit failures from the cluster discovery service; however, the update attempt count is one higher than the success count:

# curl 0.0.0.0:19001/stats | grep cds
cluster_manager.cds.init_fetch_timeout: 0
cluster_manager.cds.update_attempt: 31912
cluster_manager.cds.update_failure: 0
cluster_manager.cds.update_rejected: 0
cluster_manager.cds.update_success: 31911
cluster_manager.cds.update_time: 1619424457611
cluster_manager.cds.version: 13145068504206875241

As noted, 31912 > 31911. Looking further, we also see the following, though we're unsure what it means:

cluster_manager.update_merge_cancelled: 53
cluster_manager.update_out_of_merge_window: 3430
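To see whether the gap between update attempts and successes keeps growing, the counters can be polled periodically; a minimal sketch, assuming the same admin port:

# poll the CDS update counters every 30 seconds
while true; do
  date -u
  curl -s 0.0.0.0:19001/stats | grep -E 'cluster_manager\.cds\.update_(attempt|success|failure):'
  sleep 30
done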

We're continuing to experience this issue, and will update this ticket with more information as we get it.

jsosulska added the theme/connect, theme/consul-nomad, theme/envoy/xds, and type/bug labels on May 10, 2021
splittingred (Author) commented:

We had this happen again over the weekend, surfacing as CA verification failures. Here's a dump of a gRPC call to the failing sidecar:

14:upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268436501:SSL routines:OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_EXPIRED. debug_error_string:{"created":"@1623079883.837989025","description":"Error received from peer ipv4:127.0.0.1:8003","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268436501:SSL routines:OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_EXPIRED","grpc_status":14}

[Screenshot: ca-verification-failures]

splittingred (Author) commented Jul 6, 2021:

On deeper investigation, this looks to be an issue where entire Envoy sidecar clusters are not getting updated. I can open a separate issue, but this is related.

This is on Consul 1.10 and Nomad 1.1.2, for what it's worth.

cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.upstream_cx_connect_fail: 26
# ...
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.outlier_detection.ejections_active: 0
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.outlier_detection.ejections_consecutive_5xx: 5
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.outlier_detection.ejections_detected_consecutive_5xx: 5
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.outlier_detection.ejections_detected_consecutive_gateway_failure: 5

Furthermore:

cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.membership_change: 12
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.membership_degraded: 0
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.membership_excluded: 0
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.membership_healthy: 1
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.membership_total: 2

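These counters come from the same admin /stats endpoint as before; a minimal way to pull just this cluster's membership and outlier-detection stats (again assuming admin port 19001):

curl -s 0.0.0.0:19001/stats \
  | grep '^cluster\.svc-b\.' \
  | grep -E 'membership_|outlier_detection\.'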
We built a tool to inspect Envoy's /clusters endpoint; its output for the affected sidecar (IP ranges redacted except the last two octets):

---------------------
- Cluster Instances -
---------------------
         HOST         | CX ACTIVE | CX FAILED | CX TOTAL | REQ ACTIVE | REQ TIMEOUT | REQ SUCCESS | REQ ERROR | REQ TOTAL | SUCCESS RATE | LOCAL SUCCESS RATE |    HEALTH FLAGS    | REGION | ZONE | SUBZONE | CANARY
----------------------+-----------+-----------+----------+------------+-------------+-------------+-----------+-----------+--------------+--------------------+--------------------+--------+------+---------+---------
   xx.xx.64.88:21936 |         0 |        27 |       30 |          0 |           0 |           6 |        27 |         6 |         -1.0 |               -1.0 |            healthy |        |      |         |  false
  xx.xx.64.194:21753 |         0 |         0 |        0 |          0 |           0 |           0 |         0 |         0 |         -1.0 |               -1.0 | /failed_eds_health |        |      |         |  false

Note the failed_eds_health flag here. However, other sidecars with the same upstream show:

---------------------
- Cluster Instances -
---------------------
         HOST         | CX ACTIVE | CX FAILED | CX TOTAL | REQ ACTIVE | REQ TIMEOUT | REQ SUCCESS | REQ ERROR | REQ TOTAL | SUCCESS RATE | LOCAL SUCCESS RATE | HEALTH FLAGS | REGION | ZONE | SUBZONE | CANARY
----------------------+-----------+-----------+----------+------------+-------------+-------------+-----------+-----------+--------------+--------------------+--------------+--------+------+---------+---------
   xx.xx.64.88:23637 |         0 |         0 |        0 |          0 |           0 |           0 |         0 |         0 |         -1.0 |               -1.0 |      healthy |        |      |         |  false
  xx.xx.64.194:23829 |         0 |         0 |        0 |          0 |           0 |           0 |         0 |         0 |         -1.0 |               -1.0 |      healthy |        |      |         |  false

You can see that this sidecar is not getting xDS updates for this upstream.

cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.update_attempt: 29
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.update_empty: 0
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.update_failure: 8
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.update_no_rebuild: 0
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.update_rejected: 0
cluster.svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul.update_success: 16
# ...
listener_manager.lds.init_fetch_timeout: 0
listener_manager.lds.update_attempt: 12
listener_manager.lds.update_failure: 8
listener_manager.lds.update_rejected: 0
listener_manager.lds.update_success: 3
listener_manager.lds.update_time: 1625348489181
listener_manager.lds.version: 17241709254077376921
listener_manager.lds.version_text: ""

Other sidecars have the same LDS/CDS config versions, but with correct cluster instance IPs.
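For reference, that cross-sidecar comparison can be reproduced straight from the admin API; a rough sketch, assuming two sidecars exposing their admin interfaces on hypothetical ports 19001 and 19002:

for port in 19001 19002; do
  echo "== sidecar admin on :$port =="
  # xDS config versions this sidecar has accepted
  curl -s "0.0.0.0:$port/stats" | grep -E '(cds|lds)\.version:'
  # endpoint membership and health flags for the affected upstream cluster
  curl -s "0.0.0.0:$port/clusters" | grep '^svc-b' | grep -E 'health_flags|cx_connect_fail'
done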

We noticed these errors in the Envoy logs:

[2021-07-03 12:24:42.078][1][warning][config] [source/common/config/grpc_subscription_impl.cc:119] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment
[2021-07-03 21:40:39.863][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] DeltaAggregatedResources gRPC config stream closed: 13,
[2021-07-03 21:40:40.206][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] DeltaAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection termination

and

Jul  6 16:40:13 nomad-client-xxxx consul [13260]: 2021-07-06T16:40:13.066Z [ERROR] agent.envoy: Error receiving new DeltaDiscoveryRequest; closing request channel: error="rpc error: code = Canceled desc = context canceled"

Turning on debug logging for Envoy shows this:

[2021-07-06 18:20:36.678][15][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1222] membership update for TLS cluster svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul added 0 removed 0
[2021-07-06 18:20:36.678][19][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1222] membership update for TLS cluster svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul added 0 removed 0
[2021-07-06 18:20:36.678][16][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1222] membership update for TLS cluster svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul added 0 removed 0
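As an aside, the debug level can also be toggled at runtime through the admin endpoint rather than restarting Envoy with --log-level; a small sketch, assuming admin port 19001:

# raise all loggers to debug, then drop back down when finished
curl -s -X POST "0.0.0.0:19001/logging?level=debug"
curl -s -X POST "0.0.0.0:19001/logging?level=info"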

Result: Outlier Detection Trips

Requests to stale clusters trip Envoy's outlier detection (as expected) and fail with "upstream connect error or disconnect/reset before headers. reset reason: connection failure" errors:

[2021-07-06 18:22:39.680][16][debug][http] [source/common/http/conn_manager_impl.cc:261] [C35337] new stream
[2021-07-06 18:22:39.680][16][debug][http] [source/common/http/conn_manager_impl.cc:882] [C35337][S16870535376721757378] request headers complete (end_stream=true):
':authority', '0.0.0.0:8100'
':path', '/'
':method', 'POST'
'user-agent', 'curl/7.64.0'
'accept', '*/*'
'content-type', 'application/grpc'

[2021-07-06 18:22:39.680][16][debug][http] [source/common/http/filter_manager.cc:779] [C35337][S16870535376721757378] request end stream
[2021-07-06 18:22:39.680][16][debug][router] [source/common/router/router.cc:445] [C35337][S16870535376721757378] cluster 'svc-b.default.stg.internal.f73c09a4-ccb5-28bc-dc9e-1b5beff4b219.consul' match for URL '/'
[2021-07-06 18:22:39.680][16][debug][router] [source/common/router/router.cc:631] [C35337][S16870535376721757378] router decoding headers:
':authority', '0.0.0.0:8100'
':path', '/'
':method', 'POST'
':scheme', 'http'
'user-agent', 'curl/7.64.0'
'accept', '*/*'
'content-type', 'application/grpc'
'x-forwarded-proto', 'http'
'x-request-id', 'abcd1234'

[2021-07-06 18:22:39.680][16][debug][pool] [source/common/http/conn_pool_base.cc:74] queueing stream due to no available connections
[2021-07-06 18:22:39.680][16][debug][pool] [source/common/conn_pool/conn_pool_base.cc:241] trying to create new connection
[2021-07-06 18:22:39.680][16][debug][pool] [source/common/conn_pool/conn_pool_base.cc:143] creating a new connection
[2021-07-06 18:22:39.680][16][debug][client] [source/common/http/codec_client.cc:43] [C35338] connecting
[2021-07-06 18:22:39.680][16][debug][connection] [source/common/network/connection_impl.cc:860] [C35338] connecting to xx.xx.xx.88:21936
[2021-07-06 18:22:39.680][16][debug][connection] [source/common/network/connection_impl.cc:876] [C35338] connection in progress
[2021-07-06 18:22:39.680][16][debug][http2] [source/common/http/http2/codec_impl.cc:1222] [C35338] updating connection-level initial window size to 268435456
[2021-07-06 18:22:39.680][16][debug][connection] [source/common/network/connection_impl.cc:674] [C35338] delayed connection error: 111
[2021-07-06 18:22:39.680][16][debug][connection] [source/common/network/connection_impl.cc:243] [C35338] closing socket: 0
[2021-07-06 18:22:39.681][16][debug][client] [source/common/http/codec_client.cc:101] [C35338] disconnect. resetting 0 pending requests
[2021-07-06 18:22:39.681][16][debug][pool] [source/common/conn_pool/conn_pool_base.cc:393] [C35338] client disconnected, failure reason:
[2021-07-06 18:22:39.681][16][debug][router] [source/common/router/router.cc:1091] [C35337][S16870535376721757378] upstream reset: reset reason: connection failure, transport failure reason:
[2021-07-06 18:22:39.681][16][debug][http] [source/common/http/filter_manager.cc:883] [C35337][S16870535376721757378] Sending local reply with details upstream_reset_before_response_started{connection failure}
[2021-07-06 18:22:39.681][16][debug][http] [source/common/http/conn_manager_impl.cc:1469] [C35337][S16870535376721757378] encoding headers via codec (end_stream=true):
':status', '200'
'content-type', 'application/grpc'
'grpc-status', '14'
'grpc-message', 'upstream connect error or disconnect/reset before headers. reset reason: connection failure'
'date', 'Tue, 06 Jul 2021 18:22:39 GMT'
'server', 'envoy'
[2021-07-06 18:22:39.681][16][debug][connection] [source/common/network/connection_impl.cc:633] [C35337] remote close
[2021-07-06 18:22:39.681][16][debug][connection] [source/common/network/connection_impl.cc:243] [C35337] closing socket: 0
[2021-07-06 18:22:39.681][16][debug][conn_handler] [source/server/active_tcp_listener.cc:76] [C35337] adding to cleanup list

leonardobsjr commented:

Was this issue fixed with 1.10.3?

splittingred (Author) commented:

@leonardobsjr As far as we can tell, this issue has been resolved in 1.10.3+.
