
Config Tries to Be Loaded Before Secrets Have Been Injected Into Pod #9593

Open
Evesy opened this issue Feb 7, 2023 · 12 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@Evesy

Evesy commented Feb 7, 2023

What happened:

During startup of nginx we observed nginx emitting emergency-level logs because the configuration contained references to certificate files that the controller had not yet written into the pod

What you expected to happen:

ingress-nginx should fully write secrets to the pod before attempting to start up

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): 1.4.0

Kubernetes version (use kubectl version): 1.24.8

Environment:

  • Cloud provider or hardware configuration: GKE

  • OS (e.g. from /etc/os-release): ContainerOS

  • Kernel (e.g. uname -a): Linux ingress-nginx-external-controller-6c9449fbfd-p584h 5.10.147+ #1 SMP Thu Nov 10 04:41:53 UTC 2022 x86_64 Linux

  • Current state of ingress object, if applicable:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/auth-tls-secret: ingress-nginx/cloudflare-origin-pull-ca
    nginx.ingress.kubernetes.io/auth-tls-verify-client: "on"
    nginx.ingress.kubernetes.io/client-body-buffer-size: 256k
    nginx.ingress.kubernetes.io/configuration-snippet: |
      rewrite ^/$ /admin/master/console permanent;

      more_set_headers "strict-transport-security: max-age=31536000" always;

      more_clear_input_headers "x-auth-request-client-id" "x-auth-request-required-groups" "x-auth-user" "x-auth-email" "x-auth-profile";

      proxy_ignore_client_abort "on";
    nginx.ingress.kubernetes.io/proxy-buffer-size: 16k
    nginx.ingress.kubernetes.io/server-snippet: |
      location = /.well-known/ruok {
        if ($http_x_debug = 1) {
          return 200 "imok";
          add_header Content-Type text/plain;
        }
      }
    nginx.ingress.kubernetes.io/service-upstream: "true"
    prometheus.io/path: .well-known/ruok
  name: example-ingress
  namespace: default
spec:
  ingressClassName: nginx-external
  rules:
  - host: myapp.tld
    http:
      paths:
      - backend:
          service:
            name: app
            port:
              name: http-web
        path: /
        pathType: Prefix
  tls:
  - hosts:
    - myapp.tld
    secretName: myapp-tls
status:
  loadBalancer:
    ingress:
    - ip: 35.x.x.x
  • Others:
-------------------------------------------------------------------------------
I0206 17:24:42.780502       7 store.go:430] "Found valid IngressClass" ingress="platform-grafana/platform-grafana-public" ingressclass="nginx-external"
E0206 17:24:42.780475       7 queue.go:130] "requeuing" err=<

        -------------------------------------------------------------------------------
        Error: exit status 1
        2023/02/06 17:24:42 [warn] 34#34: the "http2_max_field_size" directive is obsolete, use the "large_client_header_buffers" directive instead in /tmp/nginx/nginx-cfg2721367965:141
        nginx: [warn] the "http2_max_field_size" directive is obsolete, use the "large_client_header_buffers" directive instead in /tmp/nginx/nginx-cfg2721367965:141
        2023/02/06 17:24:42 [warn] 34#34: the "http2_max_header_size" directive is obsolete, use the "large_client_header_buffers" directive instead in /tmp/nginx/nginx-cfg2721367965:142
        nginx: [warn] the "http2_max_header_size" directive is obsolete, use the "large_client_header_buffers" directive instead in /tmp/nginx/nginx-cfg2721367965:142
        2023/02/06 17:24:42 [warn] 34#34: the "http2_max_requests" directive is obsolete, use the "keepalive_requests" directive instead in /tmp/nginx/nginx-cfg2721367965:143
        nginx: [warn] the "http2_max_requests" directive is obsolete, use the "keepalive_requests" directive instead in /tmp/nginx/nginx-cfg2721367965:143
        2023/02/06 17:24:42 [emerg] 34#34: SSL_load_client_CA_file("/etc/ingress-controller/ssl/ca-ingress-nginx-cloudflare-origin-pull-ca.pem") failed (SSL: error:0908F066:PEM routines:get_header_and_data:bad end line)
        nginx: [emerg] SSL_load_client_CA_file("/etc/ingress-controller/ssl/ca-ingress-nginx-cloudflare-origin-pull-ca.pem") failed (SSL: error:0908F066:PEM routines:get_header_and_data:bad end line)
        nginx: configuration file /tmp/nginx/nginx-cfg2721367965 test failed

        -------------------------------------------------------------------------------
 > key="initial-sync"

How to reproduce this issue:
This hasn't been reproducible in a smaller test environment yet; it only seems to happen on our cluster with ~1000 ingresses. We've been on 1.4 for some time now, and this is the first time we've observed the issue while nginx was rolling out.

@Evesy Evesy added the kind/bug Categorizes issue or PR as related to a bug. label Feb 7, 2023
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Feb 7, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@longwuyuan
Contributor

This is not showing a bug.

/remove-kind bug

The error message states this:

SSL: error:0908F066:PEM routines:get_header_and_data:bad end line

and I suspect it is related to that PEM file, which in turn could be related to the auth-tls-secret annotation.

You could create another app and ingress with a vanilla nginx:alpine image and see if a simple ingress with no extra annotations works. If the simple ingress works, then add that annotation and see if the previously working ingress fails after adding it; a sketch of this test follows below.
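
For illustration, a minimal version of that test could look roughly like this (namespace, names, and host are hypothetical; the ingress class matches the one used in the report):

# Hypothetical sketch of the suggested plain-ingress test.
kubectl create namespace plain-test
kubectl create deployment plain-test --image=nginx:alpine -n plain-test
kubectl expose deployment plain-test --port=80 -n plain-test
kubectl create ingress plain-test -n plain-test \
  --class=nginx-external \
  --rule="plain-test.myapp.tld/*=plain-test:80"

# If the plain ingress works, add the annotation from the report and re-test:
kubectl annotate ingress plain-test -n plain-test \
  nginx.ingress.kubernetes.io/auth-tls-secret=ingress-nginx/cloudflare-origin-pull-ca \
  nginx.ingress.kubernetes.io/auth-tls-verify-client=on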

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Feb 7, 2023
@longwuyuan
Contributor

This file: /etc/ingress-controller/ssl/ca-ingress-nginx-cloudflare-origin-pull-ca.pem

@Evesy
Author

Evesy commented Feb 7, 2023

@longwuyuan Is this not showing a bug?

The file (/etc/ingress-controller/ssl/ca-ingress-nginx-cloudflare-origin-pull-ca.pem) is loaded in by Nginx based on the annotation: nginx.ingress.kubernetes.io/auth-tls-secret: ingress-nginx/cloudflare-origin-pull-ca

The referenced Kubernetes secret, ingress-nginx/cloudflare-origin-pull-ca, does not change while nginx is being rolling-restarted. The data in the secret is static and sound, and ingress-nginx also eventually loads it correctly without intervention.

This leads me to think ingress-nginx is attempting to validate/load the nginx config, which references that PEM on disk, before ingress-nginx has actually read the secret and written it to its local filesystem.

What are your thoughts?

@longwuyuan
Contributor

longwuyuan commented Feb 7, 2023

hi @Evesy ,
thanks for reporting this. The requirement is complete, detailed data on that error.

  • kubectl get po,svc,ing,secret -n $application-namespace
  • kubectl describe ing -n $application-namespace
  • Your complete and real curl command with -v
  • The details of the cert you are using. For example, does it have the full chain of the CA?
  • If your data is not posted here on the issue, then it implies you want someone else to simulate the problem, configure all that TLS, and base it all on guesswork. I don't know if that is fast or best.
  • Ideally, a step-by-step instruction set that someone can copy/paste on their minikube/kind cluster will help. If you do this, you can also make it really helpful by including the steps to create a self-signed cert with openssl (see the sketch after this list).
  • There is an acute shortage of developer time. An issue with clear data as described above reduces the time others need to reproduce and triage.
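
For example, a self-signed CA for such a reproduction could be generated roughly like this (file and secret names are illustrative):

# Sketch: create a self-signed CA and load it as the secret referenced by
# the auth-tls-secret annotation.
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout ca.key -out ca.crt -subj "/CN=example-test-ca"
kubectl create secret generic cloudflare-origin-pull-ca \
  -n ingress-nginx --from-file=ca.crt=ca.crt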

With the Cloudflare CA involved in your post, I think there is a lot to be considered, so the minute details of the problem will help a lot. The Cloudflare CA, full chains, and client auth are a specialist's area.

@github-actions

This is stale, but we won't close it automatically; just bear in mind the maintainers may be busy with other tasks and will get to your issue as soon as possible. If you have any questions or want to request prioritization, please reach out on #ingress-nginx-dev on Kubernetes Slack.

@github-actions github-actions bot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Mar 13, 2023
@Restless-ET

Hello 👋

By chance, do you have any other findings around this @Evesy ?

I believe I'm experiencing a similar situation, but with the CA CRL file instead of the CRT (the secret provided by my annotation holds both "ca.crt" and "ca.crl").

I've confirmed it's happening on versions 1.1.3, 1.3.1, 1.4.0, and 1.5.1, although on v1.1.3 the logging format appears slightly different.

@Evesy
Author

Evesy commented Jun 16, 2023

Hey @Restless-ET -- unfortunately we haven't seen a recurrence of this since I raised the issue, and I was never able to reliably reproduce it either.

@Restless-ET

Yes, I experience the same... when I release a new version or simply do a rollout restart, it doesn't happen every time, and even when it does, it's not for all the controller pods.

It doesn't seem to affect functionality on any of the configured endpoints, so I guess at this stage it's really more about log-noise reduction (and quicker detection of actual problems) than anything else.

Anyway, thank you for getting back on this. :)

@613andred

This problem has severely impacted us in the past; I have just now been able to compile the information and replicate it.

I also believe it's the same underlying issue causing #10234 and #10265

Our context

  • Default ingress controller deployment
  • Ingress controller deployed as a DaemonSet (but can be reproduced with deployment)
  • Multiple mTLS ingresses with their own trust stores

The following makes this issue occur more often

  • Large truststore files (Ex. use the unix /etc/ssl/certs/ca-certificates.crt which is ~206 KB)
  • Large number of ingress controllers on the cluster
  • Many ingresses (close to 1000)

Symptoms

  • Controller pods in a crash-backoff loop due to failure to validate the ingress configuration (same errors as the next point)
  • Intermittent failure of ingress validation (admission webhook) with one of the following errors (in controller logs and/or in Events).
Error: UPGRADE FAILED: failed to create resource: admission webhook "validate.nginx.ingress.kubernetes.io" denied the request:
-------------------------------------------------------------------------------
Error: exit status 1
2023/06/06 16:55:24 [emerg] 4002#4002: SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/test-mtls-truststore.pem") failed (SSL: error:0B084088:x509 certificate routines:X509_load_cert_crl_file:no certificate or crl found)
nginx: [emerg] SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/test-mtls-truststore.pem") failed (SSL: error:0B084088:x509 certificate routines:X509_load_cert_crl_file:no certificate or crl found)
nginx: configuration file /tmp/nginx/nginx-cfg636383756 test failed

or

2023/02/06 17:24:42 [emerg] 34#34: SSL_load_client_CA_file("/etc/ingress-controller/ssl/test-mtls-truststore.pem") failed (SSL: error:0908F066:PEM routines:get_header_and_data:bad end line)
nginx: [emerg] SSL_load_client_CA_file("/etc/ingress-controller/ssl/test-mtls-truststore.pem") failed (SSL: error:0908F066:PEM routines:get_header_and_data:bad end line)

Replicating the problem

I did the following in minikube

  • Deploy basic ingress controller
  • Deploy basic mTLS ingress with large truststore
#!/bin/bash
VERSION=4.7.1
NS=ingress-test

# Install ingress controller
helm upgrade nginx ingress-nginx/ingress-nginx -i --version ${VERSION} -n ${NS} --create-namespace

echo Wait for ingress controller to be live
until kubectl wait -n ${NS} --for=condition=Ready pod --selector app.kubernetes.io/component=controller
do
  sleep 1
done

# Create large truststore (increased likelihood of race condition)
cat << EOF | kubectl apply -n ${NS} -f - --server-side
apiVersion: v1
data:
  ca.crt: |
$(cat /etc/ssl/certs/ca-certificates.crt | base64 | sed "s/^/    /")
kind: Secret
metadata:
  name: truststore
type: Opaque
EOF

# Create ingress
cat <<EOF | kubectl apply -n ${NS} -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/auth-tls-pass-certificate-to-upstream: "true"
    nginx.ingress.kubernetes.io/auth-tls-secret: ingress-test/truststore
    nginx.ingress.kubernetes.io/auth-tls-verify-client: "on"
    nginx.ingress.kubernetes.io/auth-tls-verify-depth: "1"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    update-time: ""
  name: ingress
spec:
  ingressClassName: nginx
  rules:
  - host: dummy.host.com
    http:
      paths:
      - backend:
          service:
            name: dummy-service
            port:
              number: 8080
        path: /
        pathType: ImplementationSpecific
  tls:
  - hosts:
    - dummy.host.com
EOF

Use 2 terminals

Terminal 1

Exec into the controller pod:
kubectl exec -it -n ingress-test deployment.apps/nginx-ingress-nginx-controller -- bash

Run the following command/script which:

  • saves the current md5 hash of the truststore
  • loops as quickly as possible, checking whether the truststore still matches
  • when it does not, prints the current time and how many previous successful iterations occurred
expected_md5=$(md5sum /etc/ingress-controller/ssl/ca-ingress-test-truststore.pem)
cnt=0
while true
do
  if [[ "$(md5sum /etc/ingress-controller/ssl/ca-ingress-test-truststore.pem)" == "${expected_md5}" ]] ; then
    let cnt++
  else
    echo "success count: $cnt"
    cnt=0
    echo "failure! $(date)"
  fi
done

outputs:

success count: 1272
failure! Thu Aug 24 15:41:06 UTC 2023
success count: 663
failure! Thu Aug 24 15:41:12 UTC 2023
success count: 402
failure! Thu Aug 24 15:41:16 UTC 2023
success count: 392
failure! Thu Aug 24 15:41:19 UTC 2023
success count: 246
failure! Thu Aug 24 15:41:22 UTC 2023

or run the following (this is what the controller performs internally to validate the config):

cnt=0
while true
do
  if nginx -tq ; then
    let cnt++
  else
    echo "success count: $cnt"
    cnt=0
    echo "failure! $(date)"
  fi
done

outputs:

2023/08/24 15:50:18 [emerg] 4320#4320: SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-ingress-test-truststore.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: [emerg] SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-ingress-test-truststore.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: configuration file /etc/nginx/nginx.conf test failed
success count: 9
failure! Thu Aug 24 15:50:18 UTC 2023
2023/08/24 15:50:21 [emerg] 4332#4332: SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-ingress-test-truststore.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: [emerg] SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-ingress-test-truststore.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: configuration file /etc/nginx/nginx.conf test failed
success count: 7
failure! Thu Aug 24 15:50:21 UTC 2023
2023/08/24 15:50:24 [emerg] 4347#4347: SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-ingress-test-truststore.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: [emerg] SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-ingress-test-truststore.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: configuration file /etc/nginx/nginx.conf test failed
success count: 9
failure! Thu Aug 24 15:50:24 UTC 2023
2023/08/24 15:50:25 [emerg] 4352#4352: SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-ingress-test-truststore.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: [emerg] SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-ingress-test-truststore.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: configuration file /etc/nginx/nginx.conf test failed
success count: 2
failure! Thu Aug 24 15:50:25 UTC 2023

Terminal 2

After the monitoring is running in terminal 1, create an update storm by constantly patching the Ingress resource:

while true; do
    kubectl patch -n ingress-test ingress ingress  --type merge --patch "metadata: {annotations: {update-time: \"$(date)\"}}"
done

outputs:

ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched (no change)
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched (no change)
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched (no change)
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched (no change)
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched (no change)
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched
ingress.networking.k8s.io/ingress patched (no change)

Causes

  • Non-atomic file writes cause a race condition between writing the certificate (controller) and using the certificate (nginx): internal/net/ssl/ssl.go#L254 (see the sketch after this list)
  • Unnecessary certificate updates: any update to the Ingress resource causes the certificate to be rewritten. A common cause is ingress controllers restarting, which updates the Ingress.Status.
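
As a standalone illustration of the first cause (this is not the controller's code), the following Go sketch shows why a plain os.WriteFile is not atomic for concurrent readers: the target file is truncated and rewritten in place, so a reader such as nginx -t can observe an empty or partially written PEM file.

package main

import (
	"bytes"
	"fmt"
	"os"
)

func main() {
	// ~272 KB payload, comparable to a large CA bundle.
	payload := bytes.Repeat([]byte("0123456789abcdef\n"), 1<<14)
	const path = "/tmp/truststore.pem"

	// Writer: rewrite the file in a tight loop, the way repeated ingress
	// syncs rewrite the CA file on disk.
	go func() {
		for {
			// os.WriteFile opens the target with O_TRUNC and writes into it
			// in place, so readers can race against the write.
			_ = os.WriteFile(path, payload, 0o644)
		}
	}()

	// Reader: a stand-in for "nginx -t" parsing the PEM file.
	for i := 0; ; i++ {
		got, err := os.ReadFile(path)
		if err == nil && len(got) > 0 && len(got) < len(payload) {
			fmt.Printf("iteration %d: partial read, %d of %d bytes\n", i, len(got), len(payload))
		}
	}
}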

Mitigations that we applied

  • Reduce size of mTLS truststores / fix bad truststores (We had some clients use /etc/ssl/certs/ca-certificates.crt + our internal CA)
  • Reduce the number of mTLS truststores by using a shared instance (reduces the number of potential failure points)

Ex. where ingress is the ingress namespace that holds the Secret mtls-truststore

metadata:
  annotations:
    nginx.ingress.kubernetes.io/auth-tls-secret: ingress/mtls-truststore
  • Disable update-status via controller.extraArgs.update-status: "false" (see the values sketch below). Controller restarts then no longer cause the Ingress resource to change (by updating its status).
status:
  loadBalancer:
    ingress:
    - ip: x.x.x.x
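
Expressed as Helm values, that mitigation would look roughly like the following (a sketch only; verify the key layout against your chart version):

# Pass --update-status=false to the controller via the chart's extraArgs.
controller:
  extraArgs:
    update-status: "false"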

Possible solutions

func ConfigureCACert(name string, ca []byte, sslCert *ingress.SSLCert) error {
	caName := fmt.Sprintf("ca-%v.pem", name)
+	tmpFileName := fmt.Sprintf("%v/.%v", file.DefaultSSLDirectory, caName)
	fileName := fmt.Sprintf("%v/%v", file.DefaultSSLDirectory, caName)

+	// Perform atomic write by doing a write followed by a rename (unix only)
-	err := os.WriteFile(fileName, ca, 0644)
+	err := os.WriteFile(tmpFileName, ca, 0644)

+	if err == nil {
+		err = os.Rename(tmpFileName, fileName)
+	}

	if err != nil {
		return fmt.Errorf("could not write CA file %v: %v", fileName, err)
	}

	sslCert.CAFileName = fileName

	klog.V(3).InfoS("Created CA Certificate for Authentication", "path", fileName)

	return nil
}
  • Remove unnecessary certificate write operations by checking if there is a change (using a file hash function)
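
A rough sketch of that idea (caUnchanged is a hypothetical helper, not the controller's existing API): hash the incoming CA bytes against what is already on disk and skip the write entirely when nothing changed.

package ssl

import (
	"crypto/sha256"
	"os"
)

// caUnchanged reports whether fileName already holds exactly the bytes in ca,
// so an identical CA is never rewritten and nginx cannot catch a half-written
// file during a no-op update.
func caUnchanged(fileName string, ca []byte) bool {
	existing, err := os.ReadFile(fileName)
	if err != nil {
		// Missing or unreadable file: treat it as changed and let the caller write.
		return false
	}
	return sha256.Sum256(existing) == sha256.Sum256(ca)
}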

Other potentially affected pieces of code:

I am willing to provide a PR with fixes if you can provide some guidance on my proposed solution(s).

@qds-x

qds-x commented Mar 4, 2024

Just to add some information on this, we are able to consistently reproduce the issue by deploying ingresses with the following annotations

  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
    nginx.ingress.kubernetes.io/proxy-ssl-name: non-existent-service.user-xx-yy-sandbox.svc.cluster.local
    nginx.ingress.kubernetes.io/proxy-ssl-secret: user-xx-yy-sandbox/dummy-proxy-ssl-secret
    nginx.ingress.kubernetes.io/proxy-ssl-verify: "on"
    nginx.ingress.kubernetes.io/proxy-ssl-verify-depth: "2"

Attempts to deploy many such ingresses simultaneously gives errors such as

-------------------------------------------------------------------------------

        * admission webhook "validate.nginx.ingress.kubernetes.io" denied the request: 
-------------------------------------------------------------------------------
Error: exit status 1
2024/03/04 12:27:39 [warn] 2185398#2185398: the "http2_max_field_size" directive is obsolete, use the "large_client_header_buffers" directive instead in /tmp/nginx/nginx-cfg16865532:145
nginx: [warn] the "http2_max_field_size" directive is obsolete, use the "large_client_header_buffers" directive instead in /tmp/nginx/nginx-cfg16865532:145
2024/03/04 12:27:39 [warn] 2185398#2185398: the "http2_max_header_size" directive is obsolete, use the "large_client_header_buffers" directive instead in /tmp/nginx/nginx-cfg16865532:146
nginx: [warn] the "http2_max_header_size" directive is obsolete, use the "large_client_header_buffers" directive instead in /tmp/nginx/nginx-cfg16865532:146
2024/03/04 12:27:39 [warn] 2185398#2185398: the "http2_max_requests" directive is obsolete, use the "keepalive_requests" directive instead in /tmp/nginx/nginx-cfg16865532:147
nginx: [warn] the "http2_max_requests" directive is obsolete, use the "keepalive_requests" directive instead in /tmp/nginx/nginx-cfg16865532:147
2024/03/04 12:27:39 [emerg] 2185398#2185398: SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-user-xx-yy-sandbox-dummy-proxy-ssl-secret.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: [emerg] SSL_CTX_load_verify_locations("/etc/ingress-controller/ssl/ca-user-xx-yy-sandbox-dummy-proxy-ssl-secret.pem") failed (SSL: error:04800066:PEM routines::bad end line error:05880009:x509 certificate routines::PEM lib)
nginx: configuration file /tmp/nginx/nginx-cfg16865532 test failed

Observations:

  • In our testing ~50 ingresses were required for consistent reproduction, but the issue is sporadic for as few as 5 simultaneous ingress deploys
  • The 'bundle' provided as 'proxy-ssl-secret' is in our case a single certificate, which would seem to count against the idea that trust store size is relevant
  • The secret can be a placeholder certificate, but reproduction was not successful with base64-encoded gibberish as the secret data, nor when the referenced secret did not exist
  • The ingress need not be functional; it's an admission issue

As our ingresses only need a single shared CA bundle that doesn't change often, our workaround right now is to mount said bundle as a configmap into the nginx pods, then use a configuration snippet to turn on TLS verification to the backend pods, referencing the mounted file.

    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_ssl_trusted_certificate           /path/to/mounted/bundle.pem;
      proxy_ssl_verify                        on;
      proxy_ssl_verify_depth                  2;
      proxy_ssl_name                          non-existent-service.user-xx-yy-sandbox.svc.cluster.local; 

This seems to dodge the race condition but is far from ideal, not least because enabling configuration snippets can expose vulnerabilities.
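
For reference, mounting such a shared bundle into the controller pods can be done through the chart's extraVolumes/extraVolumeMounts; a hypothetical sketch (ConfigMap name and mount path are illustrative):

controller:
  extraVolumeMounts:
  - name: shared-ca-bundle
    mountPath: /path/to/mounted
    readOnly: true
  extraVolumes:
  - name: shared-ca-bundle
    configMap:
      name: shared-ca-bundle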

I created a helm chart which consistently reproduces the issue. It deploys a placeholder secret, then deploys many ingresses with the above annotations which reference said secret.

@longwuyuan
Contributor

Hi,

It seems clear that an event such as a rollout of the controller, where existing controller pods terminate and new controller pods are created, is required to cause this. A large volume of ingresses with the relevant secret-injecting annotation also seems to trigger it. I see that some comments concur that race-condition-like situations are not ruled out.

To state the obvious, just one or a few ingresses syncing concurrently does not cause this problem. It is also obvious that users who have mTLS secrets in ingresses, either in large volumes or involved in rollouts during upgrades, deserve a better experience.

But the project is extremely short on resources and there is no developer time available to work on this. If a PR is submitted, it is likely to get reviewed, but an e2e test that mirrors these conditions in a kind cluster is an absolute requirement. I see the need for lots of certs there.

The project's resources are prioritized for securing the controller by default and implementing the Gateway API. We have actually deprecated features that fall outside the Ingress API spec, such as TCP/UDP forwarding.

But the best step forward is to join the community meeting, announce your intent to work on this, and discuss it in the ingress-nginx-dev channel of the Kubernetes Slack. It would help a lot.
