Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect path for nginx ingress health probes #2903

Closed
shiv-rajawat opened this issue Apr 22, 2022 · 23 comments
Closed

Incorrect path for nginx ingress health probes #2903

shiv-rajawat opened this issue Apr 22, 2022 · 23 comments

Comments

@shiv-rajawat
Copy link

shiv-rajawat commented Apr 22, 2022

What happened: Upon installing nginx ingress using helm, the default health probe with HTTP/HTTPS is getting configured with path as "/". As a result, the inbound connectivity to that nginx public IP is not working.

What you expected to happen: The path should be "/healthz" with HTTP/HTTPS health probes.

How to reproduce it: Install nginx ingress using helm
helm install ingress-nginx ingress-nginx/ingress-nginx

Check the AKS loadbalancer health probes for path in nginx ingress probes (HTTP/HTTPS).

health-probe

Anything else we need to know?: The issue is only reproducing for us in selected West Europe AKS clusters.

Environment:

  • Kubernetes version: 1.22.6
  • Size of cluster: 30 nodes
  • General description of workloads in the cluster - HTTP microservices, Angular applications
@ghost ghost added the triage label Apr 22, 2022
@ghost
Copy link

ghost commented Apr 22, 2022

Hi shiv-rajawat, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure your subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@LennartHaggkvistHM
Copy link

LennartHaggkvistHM commented Apr 22, 2022

This is also affecting us.

My conclusions are:

  • It is not only west Europe - we have the problem in east Australia as well
  • Old clusters has LB health probes of type TCP - but the newly created clusters use HTTP or HTTPS
  • Those health probes seems to be using the wrong path - manually changing to /healtz worked
  • Changing the HTTP/HTTPS probe to type TCP worked also (note that the port from the HTTP/HTTPS must be used and not the one that the portal suggests...)

@mkoertgen
Copy link

mkoertgen commented Apr 22, 2022

Same for me. Scenario: AKS (EastUS2) + ingress-nginx (via helm).

My observations:

  • Frontend-IP-config for Azure Load Balancer in the MC-rg then fails.
  • Old clusters are using TCP-Probes mapped to the k8s-service-ports.
  • New clusters are using HTTP/HTTPS-Probes mapped to 80/443 respectively.
  • As a result no traffic is routed to the ingress-controller since health probes fail.

Workaround:
When manually changing health probes back to TCP and mapped ports, health probes reconcile after a while.

image

Load Balancer starts routing traffic again

image

And we finally get our nice nginx 404-page again

image

@jabbera
Copy link

jabbera commented Apr 23, 2022

I just lost 4 hours of my weekend trying to diagnose this issue..... Eastus2

@shiv-rajawat
Copy link
Author

Well, I lost my entire day to debug this issue. Surprisingly I’m not able to reproduce this issue in other clusters in the same region.

@jabbera
Copy link

jabbera commented Apr 23, 2022

The only notable change I can think of is my dev cluster that didn't see the issue was created as 1.22.06 whereas my production cluster that saw the issue was an upgrade from 1.21.9

@jabbera
Copy link

jabbera commented Apr 23, 2022

@miwithro support might be saved a ton of tickets if this could be tracked down. I almost opened a sev A over it.

@shiv-rajawat
Copy link
Author

@miwithro support might be saved a ton of tickets if this could be tracked down. I almost opened a sev A over it.

We already have a Sev B opened.

@jabbera
Copy link

jabbera commented Apr 23, 2022

Same issue with my DR cluster in centralus.

@sebader
Copy link
Member

sebader commented Apr 25, 2022

Same issue here, AKS 1.22.6 in various regions (at least Sweden Central and Brazil south). Did anybody figure out a solution yet apart from the manual workaround?

@aristosvo
Copy link

aristosvo commented Apr 25, 2022

It seems the same or similar as this solved issue on the kubernetes/ingress-nginx repo itself.

Usage of the flag --set controller.service.externalTrafficPolicy=Local should probably fix it.

I'm not 100% sure if this is just a workaround or the fix, but I'll leave that discussion for the AKS team.

@ivanthelad
Copy link

ivanthelad commented Apr 26, 2022

Hi,
is this behaviour by design?The cloud provider documentations states that the default path for the probe will be "/" if not specified.
The table thats that for http/https endpoints the default probe will be "/"

previously the behaviour used to be "/healthz" but a recent change altered this kubernetes-sigs/cloud-provider-azure@6cdd799 where the default path was modified

The cloud provides logic can be found here.
https://github.com/kubernetes-sigs/cloud-provider-azure/blob/2b91a85af9436e8826bf7c8c79fb055e6d58ad8a/pkg/provider/azure_loadbalancer.go#L1977

one work around would be to supply the annotation "service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path" to the nginx helm install
https://github.com/kubernetes/ingress-nginx/blob/main/charts/ingress-nginx/values.yaml#L454

@heoelri
Copy link

heoelri commented Apr 26, 2022

It seems the same or similar as this solved issue on the kubernetes/ingress-nginx repo itself.

Usage of the flag --set controller.service.externalTrafficPolicy=Local should probably fix it.

I'm not 100% sure if this is just a workaround or the fix, but I'll leave that discussion for the AKS team.

Setting controller.service.externalTrafficPolicy to Local when deploying the ingress-nginx helm chart as mentioned by @aristosvo in #2903 (comment) did the trick on our end and solved the issue. The LB's health probe then points again to path /healthz instead of / - now for HTTP and HTTPS in a single health probe.

@ivanthelad
Copy link

ivanthelad commented Apr 26, 2022

What I see, the reason for this issue comes down to the helm chart adopting appProtocol spec combined with the cloud provider respecting the appProtocol setting. So when a service is created with appProtocol the LB will probe directly to pod process (via the kube-proxy hop). I think previously the probe was based on the NodeHealthPort created on each node. In this case the 503 would be returned from the kubernetes api if the pod endpoint was not available on the target host.
https://github.com/kubernetes/ingress-nginx/blob/main/charts/ingress-nginx/templates/controller-service.yaml#L70

In general, i would guess the underlying CloudProvider was updated on AKS that has resulted in the appProtocol been respected and hence the changes to the probe where "/" is set when no path annotation is supplied on the service definition. possible fixes

  • Manually set the annotation on the service that includes the health check path
  • upgrade your nginx helm chart setting the param "Values.controller.service.appProtocol" to false

@ivanthelad
Copy link

ivanthelad commented Apr 26, 2022

i expect other applications that use to service spec appProtocol to be effected #2907

@phealy
Copy link
Contributor

phealy commented Apr 26, 2022

A documentation PR has been merged to add --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz to the helm commands, which will tell Cloud Provider how to set the health probe properly. Docs will be updated in the next build, later today.

@sebader
Copy link
Member

sebader commented Apr 26, 2022

@phealy thanks for updating the docs. While this makes all sense to me, including the solution, I want to point out this this is a major breaking change for many people out there and it was clearly not announced as such.

@idmurphy
Copy link

Looks like this issue started around 20/04 or slightly after. For AKS 1.22.6 built before then, we find the health probe is HTTPS with /healthz path added already, since then it has plain '/'. This is for exact same AKS 1.22.6 and same ingress/version helm automated deployment pipeline.

One other thing I note is that earlier clusters had aks-link in kube-system and the newer ones which are broken have tunnelfront in kube-system. Is this handling the health probe?

I find it very odd that the fix for this this is a documentation update and everyone stumbling into this issue needs to have a work-around hack on the helm deployment rather than the issue itself being fixed?! This is going to start causing more and more issues for everyone, especially anyone who has happily testing AKS upgrade to 1.22 in UAT before 20/04 and then suddenly find when they go to a production roll out it starts breaking. Shouldn't there be a general support notification of this issue?

@jabbera
Copy link

jabbera commented Apr 28, 2022

happily testing AKS upgrade to 1.22 in UAT before 20/04 and then suddenly find when they go to a production roll out it starts breaking.

This is exactly what happened to me on Saturday. I had a literal panic since there is no rollback. Also missed a little league game over it.

@idmurphy
Copy link

@phealy - is a "beta" option really considered a suitable solution for a production issue? When is this GA?

@sebader
Copy link
Member

sebader commented Apr 29, 2022

@idmurphy all the cloud provider-specific annotations are using the same syntax. This does not mean any of these are in preview. https://kubernetes-sigs.github.io/cloud-provider-azure/topics/loadbalancer/#loadbalancer-annotations (same for AWS https://kubernetes.io/docs/concepts/services-networking/service/#other-elb-annotations)

@knunery
Copy link

knunery commented May 1, 2022

THANK YOU mkoertgen! Running into this issue. I was about to open a ticket for this issue.

@feiskyer
Copy link
Member

feiskyer commented May 3, 2022

Sorry to see the appProtocol support inside cloud provider has broken ingress-nginx for AKS clusters >=1.22. The issue was caused by two reasons:

    1. the new version of nginx ingress controller added appProtocol and its probe path has to be /healthz;
    1. the new version of cloud-controller-manager added HTTP probing with default path / for appProtocol=http services.

AKS is rolling out a fix to keep backward compatibility (ETA is 48 hours from now) with fix cloud-provider-azure#1626. After this fix, the default probe protocol would only be changed to appProtocol for clusters >=1.24 (refer docs for more details).

Before the rollout of this fix, please upgrade the existing ingress-nginx with --set controller.service.externalTrafficPolicy=Local for faster mitigation.

And for newly created helm releases, you could also set --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz to ensure /healthz probe path be used with HTTP/HTTPS protocol (AKS docs has been updated with this annotation as well, e.g. here).

@feiskyer feiskyer closed this as completed May 3, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jun 2, 2022
martingegenleitner added a commit to martingegenleitner/thales-dke-service-setup that referenced this issue Apr 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests