Private AKS with NGINX ingress controller using Azure Internal Load Balancer. Can't reach service via Load Balancer outside subnet/pod. #2907

Closed
KIRY4 opened this issue Apr 25, 2022 · 15 comments
Labels
docs resolution/answer-provided Provided answer to issue, question or feedback.

Comments

@KIRY4

KIRY4 commented Apr 25, 2022

What happened:
Hello everyone! I have the following issue (probably a misconfiguration), but I can't figure out what exactly I'm doing wrong.

From the following VMs: vm-ubuntu-aks-mgmt-01 (10.3.0.4) and vm-ilb-01 (10.3.1.6), I'm getting the following.

curl 10.3.1.100 -k -v
Trying 10.3.1.100:80...
TCP_NODELAY set

curl http://helloworld.spg.internal -k -v
Trying 10.3.1.100:80...
TCP_NODELAY set

nslookup helloworld.spg.internal
Server: 127.0.0.53
Address: 127.0.0.53#53

Non-authoritative answer:
Name: helloworld.spg.internal
Address: 10.3.1.100

But if I jump inside a pod, for example kubectl exec -it nginx-ingress-ingress-nginx-controller-7475cb6b9f-x7nf2 -n ingress-basic sh, and then curl from inside the pod:

I'm getting SUCCESS: status code 200 OK and the HTML page.

What you expected to happen:
curl http://helloworld.spg.internal - success (status code: 200 OK) from everywhere in the VNET, i.e. from all the VMs mentioned below (vm-ubuntu-aks-mgmt-kmazurov-01 - 10.3.0.4, vm-ilb-01 - 10.3.1.6).

How to reproduce it (as minimally and precisely as possible):
For testing/learning purposes I deployed a private AKS cluster in a pre-created VNET.

Here is the VNET configuration (no NSG configured, since this is a playground for learning purposes):
vnet-cloud-01: 10.3.0.0/22

Subnets:
snet-cloud-mgmt-01: 10.3.0.0/24 - here I deployed the mgmt VM (vm-ubuntu-aks-mgmt-01 - 10.3.0.4) with kubectl, helm, and az cli to access the cluster;
snet-cloud-priv-aks-01: 10.3.1.0/24 - here I deployed the private AKS cluster plus one VM just for testing purposes (vm-ilb-01 - 10.3.1.6);
GatewaySubnet: 10.3.2.0/24 - I'm using it for the P2S VPN connection to the VNET.

Following this article: https://docs.microsoft.com/en-us/azure/aks/ingress-internal-ip, I deployed the NGINX ingress controller with the internal Load Balancer IP 10.3.1.100. The service was created successfully and the IP was assigned correctly.

I also created a private DNS zone (spg.internal), linked it to vnet-cloud-01, and added the following A record:
Name       | Type | TTL | Value
helloworld | A    | 10  | 10.3.1.100
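
For reference, the zone link and A record can also be created via the Azure CLI; this is only a minimal sketch, assuming a hypothetical resource group named rg-cloud-01 (the real group name in my setup may differ):

    # Link the private DNS zone to the VNET (rg-cloud-01 is a hypothetical resource group name)
    az network private-dns link vnet create \
        --resource-group rg-cloud-01 \
        --zone-name spg.internal \
        --name link-vnet-cloud-01 \
        --virtual-network vnet-cloud-01 \
        --registration-enabled false

    # Add the A record pointing at the internal load balancer IP
    az network private-dns record-set a add-record \
        --resource-group rg-cloud-01 \
        --zone-name spg.internal \
        --record-set-name helloworld \
        --ipv4-address 10.3.1.100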

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.22.6 - cluster, v1.23.6 - kubectl.
  • Size of cluster (how many worker nodes are in the cluster?). 1 worker node.
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.)
    https://docs.microsoft.com/en-us/azure/aks/ingress-internal-ip - workload from this article.
  • Others:
@ghost ghost added the triage label Apr 25, 2022
@ghost

ghost commented Apr 25, 2022

Hi KIRY4, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@KIRY4 KIRY4 changed the title Private AKS with NGINX ingress controlled using Azure Internal Load Balancer. Can't reach service via Load Balancer outside subnet. Private AKS with NGINX ingress controller using Azure Internal Load Balancer. Can't reach service via Load Balancer outside subnet/pod. Apr 25, 2022
@PAKalucki

@KIRY4 Hey, I'm facing an almost identical issue, and after 2 days of troubleshooting I managed to establish that the internal load balancer is unhealthy. It would seem that the health probes are misconfigured, but I'm still unclear on how to fix that part. Any and all input from the AKS team would be appreciated.
[screenshot]

@KIRY4
Author

KIRY4 commented Apr 25, 2022

Hi @PAKalucki! How did you get this diagram? I want to generate the same diagram on my side...

@PAKalucki

@KIRY4 In the Azure Portal, go to Load Balancing and find the load balancer created by AKS. In my case it was called "kubernetes-internal". Then go to Insights (you might need to enable Insights in AKS).
[screenshot]

@KIRY4
Author

KIRY4 commented Apr 25, 2022

@PAKalucki thank you! I have the same issue.
[screenshot]

@PAKalucki

PAKalucki commented Apr 25, 2022

@KIRY4 I have found a workaround in the meantime. Go to Load balancer > Health probes. If you only have probes for the TCP protocol, make sure they point to the proper ports (you can find them in AKS, in the nginx ingress service). If you have HTTP/HTTPS protocol probes, make sure they use the proper ports from AKS and that the path is /healthz. The rules created by AKS that I had in there were invalid.
It should look similar to mine (your ports might be different); a few seconds after applying this change, all traffic started working:
[screenshot]

In case anyone from the AKS team stumbles onto this: it's only a workaround. Please propose a solution for fixing this misconfiguration at the AKS or Helm level so we don't have to change it manually every time.
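
If you prefer the CLI over the portal, the same probe change can be sketched with az network lb probe; the node resource group, probe name, and port below are placeholders for whatever your AKS-managed load balancer actually shows:

    # Inspect the probes on the AKS-managed internal load balancer
    az network lb probe list \
        --resource-group MC_myResourceGroup_myAKSCluster_myRegion \
        --lb-name kubernetes-internal \
        --output table

    # Point an existing probe at the ingress controller's HTTP health endpoint
    az network lb probe update \
        --resource-group MC_myResourceGroup_myAKSCluster_myRegion \
        --lb-name kubernetes-internal \
        --name <probe-name> \
        --protocol Http \
        --port <ingress-health-node-port> \
        --path /healthz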

@aristosvo

aristosvo commented Apr 25, 2022

It seems to be the same as, or similar to, this solved issue on the kubernetes/ingress-nginx repo itself.

Usage of the flag --set controller.service.externalTrafficPolicy=Local should probably fix it.

I'm not 100% sure if this is just a nicer workaround or the fix, but I'll leave that discussion up to the AKS team.

@KIRY4
Author

KIRY4 commented Apr 25, 2022

@aristosvo Thank you, --set controller.service.externalTrafficPolicy=Local solved the issue for me. I should note that a simple reinstall with this flag was not enough for me. I did the following:

  1. Uninstall the helm chart;
  2. Delete the existing internal load balancer (without this step it didn't work for me);
  3. Install the chart again with the flag mentioned above:
helm install nginx-ingress ingress-nginx/ingress-nginx \
    --namespace ingress-basic --create-namespace \
    --set controller.replicaCount=1 \
    --set controller.service.externalTrafficPolicy=Local \
    -f internal-ingress.yaml
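
For reference, internal-ingress.yaml here is the values file from the AKS internal-ingress article linked above; a minimal sketch of its contents, assuming only the internal load balancer annotation and my internal IP are set:

    controller:
      service:
        loadBalancerIP: 10.3.1.100
        annotations:
          service.beta.kubernetes.io/azure-load-balancer-internal: "true"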

@PAKalucki

In my case the --set controller.service.externalTrafficPolicy=Local flag is only a partial fix. The load balancer is not only missing the /healthz path in the health probes, the ports are also completely wrong. This might be caused by the fact that I'm using both HTTP and HTTPS, but it's hard to tell.

@jamesstanleystewart

We hit this same issue today; something seems to have changed in the configuration of the health probes for new AKS installations between February and now. We have several AKS instances (the most recent installed in February) and found they all had TCP health probes by default; only a recently added one had HTTP probes. Service health was showing "degraded" when the new AKS cluster was first installed. When we changed the probes to TCP to match our other installations, traffic flowed to our ingresses as it had with previous installs, and the service health event was resolved.

@andbaraks

andbaraks commented Apr 26, 2022

I was able to reproduce the behavior by using the following yaml file for testing and two AKS clusters, one with Kubernetes version 1.21 and the other with 1.22.

apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress-ingress-nginx-controller
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: nginx-ingress
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/version: 1.1.0
    helm.sh/chart: ingress-nginx-4.0.12
    helm.toolkit.fluxcd.io/name: nginx-ingress
    helm.toolkit.fluxcd.io/namespace: pluops-shared-nginx-ingress-main
  annotations:
    meta.helm.sh/release-name: nginx-ingress
    meta.helm.sh/release-namespace: pluops-shared-nginx-ingress-main
  finalizers:
    - service.kubernetes.io/load-balancer-cleanup
spec:
  ports:
    - name: https
      protocol: TCP
      appProtocol: https
      port: 443
      targetPort: https
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: nginx-ingress
    app.kubernetes.io/name: ingress-nginx
  type: LoadBalancer
  sessionAffinity: None
  externalTrafficPolicy: Cluster
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  allocateLoadBalancerNodePorts: true
  internalTrafficPolicy: Cluster

Then I removed appProtocol: https from the yaml, created the same service under another name, and concluded that it creates the health probe with protocol TCP:

1.21 AKS cluster
[screenshot]

1.22 AKS cluster
[screenshot]
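
For clarity, the only change in the second (renamed) service was dropping the appProtocol field, so the ports section of the yaml above becomes:

    ports:
      - name: https
        protocol: TCP
        port: 443
        targetPort: https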

About appProtocol / health probe:
https://kubernetes-sigs.github.io/cloud-provider-azure/topics/loadbalancer/#custom-load-balancer-health-probe
https://kubernetes.io/docs/concepts/services-networking/service/#application-protocol

I think this is the expected behavior (when you have appProtocol: https, the HTTPS protocol is used for the health probe), and 1.22 just adopts it. The reason this happens is that 1.22 uses the out-of-tree Cloud Controller Manager by default (https://docs.microsoft.com/en-us/azure/aks/out-of-tree).

To confirm that, I enabled CCM (there is an issue in the documentation; the correct command is “az aks update -n aks -g myResourceGroup --aks-custom-headers EnableCloudControllerManager=True”) on my 1.21 AKS cluster, applied the service yaml which includes appProtocol: https (just with a different name), and confirmed the above.
[screenshot]

This confirms that this is the expected behavior: in Kubernetes version 1.22, "appProtocol" is reflected in the health probe of the Load Balancer.

@ivanthelad

@andbaraks @KIRY4 see #2903 (comment)

@phealy
Contributor

phealy commented Apr 26, 2022

A documentation PR has been merged to add --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz to the helm commands, which will tell the cloud provider how to set the health probe properly. The docs will be updated in the next build, later today. Please follow #2903 for updates on the general issue.

@phealy phealy closed this as completed Apr 26, 2022
@phealy phealy reopened this Apr 26, 2022
@phealy phealy added docs resolution/answer-provided Provided answer to issue, question or feedback. labels Apr 26, 2022
@ghost ghost removed the triage label Apr 26, 2022
@ghost

ghost commented Apr 28, 2022

Thanks for reaching out. I'm closing this issue as it was marked with "Answer Provided" and it hasn't had activity for 2 days.

@feiskyer
Member

feiskyer commented May 3, 2022

Sorry to see that the appProtocol support inside the cloud provider has broken ingress-nginx for AKS clusters >=1.22. The issue was caused by two things:

    1. the new version of the nginx ingress controller added appProtocol, and its probe path has to be /healthz;
    2. the new version of cloud-controller-manager added HTTP probing with the default path / for appProtocol=http services.

AKS is rolling out a fix (cloud-provider-azure#1626) to keep backward compatibility (ETA is 48 hours from now). After this fix, the default probe protocol will only be changed to appProtocol for clusters >=1.24 (refer to the docs for more details).

Before the rollout of this fix, please upgrade the existing ingress-nginx with --set controller.service.externalTrafficPolicy=Local for faster mitigation.

And for newly created helm releases, you can also set --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz to ensure the /healthz probe path is used with the HTTP/HTTPS protocol (the AKS docs have been updated with this annotation as well, e.g. here).
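
Putting the two mitigations together, a newly created release could be installed roughly like this; this is only a sketch combining the flags mentioned in this thread, reusing the release name, namespace, and values file from the earlier commands:

    helm install nginx-ingress ingress-nginx/ingress-nginx \
        --namespace ingress-basic --create-namespace \
        --set controller.replicaCount=1 \
        --set controller.service.externalTrafficPolicy=Local \
        --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz \
        -f internal-ingress.yaml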

@ghost ghost locked as resolved and limited conversation to collaborators Jun 2, 2022
This issue was closed.