DNS resolution of *.mybinder.org on turing #1468
Hmmm, perhaps `kubectl exec` into the hub pod and try `nslookup mybinder.org 8.8.8.8` and then `nslookup mybinder.org`, and if the latter doesn't work, inspect the k8s cluster's DNS server and potentially network policies that could block/allow access to it. The DNS server can sometimes be found as the service `kube-dns.kube-system`, even though coredns is now typically what acts as the kubernetes cluster's DNS server. `kubectl get pods -n kube-system` (look for kube-dns or coredns)
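For reference, that sequence as concrete commands (a sketch; the namespace and pod name are placeholders you would need to adjust):

```sh
# Find the hub pod (namespace and label are placeholders for your deployment)
kubectl get pods -n turing -l component=hub

# From inside the hub pod: test an external resolver, then the cluster resolver
kubectl exec -n turing <hub-pod-name> -- nslookup mybinder.org 8.8.8.8
kubectl exec -n turing <hub-pod-name> -- nslookup mybinder.org

# Inspect the cluster's DNS pods
kubectl get pods -n kube-system | grep -E 'kube-dns|coredns'
```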
From the hub pod:
There are two coredns pods running. The log of the coredns autoscaler pod:
I am not quite sure how to read the date/timestamp. Does `I0616` mean "Information", June 16th? In which case it would have been from yesterday. The coredns pods print the following two lines to their log over and over again:
You can test it with python:
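The original snippet was not preserved here; a minimal sketch of such a test, using only the standard library:

```python
import socket

# Uses the pod's configured resolver; raises socket.gaierror if resolution fails
print(socket.getaddrinfo("mybinder.org", 443))
```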
Is there anything that I might need to raise with Turing IT to get this resolved? They set up the subdomain `turing.mybinder.org`.
I don't think it's the Turing DNS that's the problem. If you launch https://turing.mybinder.org/v2/gist/manics/2545224d3c19ab381bfc899fa34c6e44/master?filepath=checkdns.ipynb on the Turing cluster you'll see that queries to the default DNS resolver are blocked, so it's most likely a K8s or Z2JH configuration issue. Could you try temporarily disabling the Z2JH network policies?
@manics @sgibson91 oh, in the latest version of the z2jh helm chart, I've made the default network policies allow DNS traffic! It has also enabled network policies by default, which is a breaking change.
So from what I can tell, we inherit network policies directly from the chart, here. The turing config contains no references to network policies. So why is this only a problem on turing? Or @consideRatio, should we remove the network policies from the chart completely?
@sgibson91 I don't know enough, but to clarify the situation, you can do the following (see the sketch after the notes below):
Relevant facts:
Suggestion:
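The commands themselves did not survive the formatting here, but based on the follow-up below, the trial could look something like this (the label values are assumptions, adjust to match your user pods):

```sh
# Run a throwaway busybox pod carrying the same labels as a user pod,
# so the same network policies select it (labels are assumed)
kubectl run netpol-test --image=busybox:1.32 --restart=Never -n turing \
  --labels="app=jupyterhub,component=singleuser-server,release=turing" \
  -- sleep 3600

# Trial DNS from under the user-pod network policies, then clean up
kubectl exec -n turing netpol-test -- nslookup mybinder.org
kubectl delete pod -n turing netpol-test
```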
For the network policy controller, I used Azure's plugin. The option was either this or Calico.
@sgibson91 I added a labeling step in the comment I wrote above; that way, you can trial network connectivity from a busybox container influenced by the same networking constraints as a user pod.
I wonder if the issue that makes turing run into something but not GKE etc. is that the DNS lookup ends up being influenced in your deployment but not in the other clusters, perhaps because a somewhat local DNS server is redirecting traffic that is routed without being associated with the pod itself? Hmm... In general though, I wonder if the other deployments and turing have explicitly allowed DNS traffic, because to my knowledge it is forbidden by the network policies in old z2jh helm charts.
You already know way more than I do 🙂
Output of this:
... is the following:
I don't really know what this means?
I think it means your lookup resulted in an answer which was cached from a DNS server, but when that led to another lookup, that ended up failing. I'm not sure, but the desired outcome is that you get a response without any hiccup like that final message.
I followed up with Turing IT on this and now I'm pretty sure it's something within the cluster. Question is, what? I wonder if it's related to Azure's network policy https://docs.microsoft.com/en-us/azure/aks/use-network-policies#create-an-aks-cluster-and-enable-network-policy
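For context, the two options in that doc come down to a flag at cluster-creation time; a sketch (the resource group and cluster names are placeholders):

```sh
# Azure's own network policy manager (NPM); requires the azure CNI plugin
az aks create -g myGroup -n myCluster --network-plugin azure --network-policy azure

# Calico as the policy controller; also works with the kubenet plugin
az aks create -g myGroup -n myCluster --network-plugin kubenet --network-policy calico
```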
This is still a problem, which means notebooks that retrieve data from an external web resource will fail (unfortunately most of the ones I use do, though today is the first time I've ended up on the Turing cluster). If you're allowed to give others access I'm happy to poke around your cluster when I have time.
Allowed, yes. Can do it easily, not really. Last time I got an external collaborator added to a subscription, it involved getting them a Turing email address and signing Turing's privacy policy. @betatim has access though.
Is this cluster the only AKS-based cluster in the mybinder federation? Perhaps this is relevant, an issue that covers two potentially separate topics related to AKS:
If @consideRatio's suggestions don't help we could perhaps have a screen-sharing session where you share your view of the Azure admin interface and a few of us think of random things to poke? 😄
Now I'm back from leave - I'm very happy to share screen with anyone interested!
How about next week? Do you want to start a <doodle-or-something-similar> poll?
Poll link here: https://terminplaner.dfn.de/hLK738XeCNeWjeYe |
Thanks folks, I've closed the poll. @consideRatio and @manics, I've sent you both a calendar invite for Tuesday next week :)
Summary of discussion
Documentation on K8s network policies suggests that if multiple network policies apply to a pod the result should be that any ingress or egress permitted by any of the rules will apply ("Combining Multiple Policies" on https://www.tufin.com/blog/kubernetes-network-policies-for-security-people). This isn't seen on the other federation members, which suggests one of:
#1715 is a proposed quick-fix, but we should keep this issue open until we understand the underlying problem.
Wonderful summary @manics! ❤️
I understand you as saying that any one of the NetworkPolicy resources will make the call, while I understand the referenced document to say that all NetworkPolicy resources combine like a logical OR statement: every NetworkPolicy resource is considered, and if any one of them allows the network traffic, it is okay. The official k8s documentation summarizes it well.
With this, I think we can conclude that there is a bug in the AKS NetworkPolicy enforcement somehow, because what we observed broke this premise.
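A concrete illustration of that OR semantics (the names and selectors below are hypothetical, not taken from our charts): given these two policies, a pod labeled `app: web` gets the union of both allowances, HTTP ingress from the first and DNS egress from the second.

```yaml
# Hypothetical policy 1: allow ingress on port 80
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-http-in
spec:
  podSelector:
    matchLabels:
      app: web
  ingress:
    - ports:
        - port: 80
---
# Hypothetical policy 2: allow DNS egress; combined with the policy above,
# the pod may receive port-80 traffic AND send DNS queries (logical OR)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-out
spec:
  podSelector:
    matchLabels:
      app: web
  egress:
    - ports:
        - port: 53
          protocol: UDP
```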
I'm not sure what bug it is, but it seems they do have bugs in their NetworkPolicy enforcement specific to AKS, see for example: Azure/AKS#1135.
I looked around in the Azure/AKS repo and could not find an issue that was a clear representation of ours, but I think it is safe to say that Azure "NPM", or "Network Policy Manager", is not ready for production, after seeing several failures to comply with the Kubernetes intent for NetworkPolicy resources, and that their NPM differs from Calico.
#1715 was merged. Testing with https://turing.mybinder.org/v2/gist/manics/2545224d3c19ab381bfc899fa34c6e44/master?filepath=checkdns.ipynb
How did you make the second cell (of the notebook) fail? I ran it many times (like 30 times) and it seemed to always work :-/
I didn't do anything special, I just ran it several times!
Seems to be working fine for me now.... Maybe it was a transient K8s networking problem that's resolved itself?
Given that it's blocked traffic to the Kubernetes-internal DNS, this should be fixed by jupyterhub/zero-to-jupyterhub-k8s#1670, specifically the explicit, unconditional addition of port 53 to the allowed egress: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/pull/1670/files#diff-70983994b57e310d348c747400dd1ae5ea9f3e4efe63da0155789c3a6bd2a411R41-R46
So there are two possible fixes that might help:
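The rule added by that PR is along these lines (paraphrased from the linked diff; a sketch rather than the verbatim chart):

```yaml
# Always allow DNS egress on port 53, unconditionally,
# whatever else the policy restricts
egress:
  - ports:
      - protocol: UDP
        port: 53
      - protocol: TCP
        port: 53
```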
I've never seen netshoot, that's cool! I ran my own test today to see if I could learn anything and accidentally stumbled upon a real mystery! I duplicated the two network policies, adding

```yaml
podSelector:
  matchExpressions:
    - key: component
      operator: In
      values:
        - dind-test
        - singleuser-server-test
  matchLabels:
    release: turing
```

and

```yaml
podSelector:
  matchLabels:
    app: jupyterhub
    component: singleuser-server-test
    release: turing
```

Creating a pod with labels:

```yaml
app: jupyterhub
component: singleuser-server-test
release: turing
```

However, deleting the

This means that the bug was that our binder policy was (and is) not being applied at all to user pods on AKS. The result is that the jupyterhub netpol is the only policy, and it denies DNS by default. Disabling the singleuser policy fixes the DNS issue because it means that there is no network policy that applies to single-user pods, returning to allowing all egress traffic.
@minrk ah, nice pinning down of the issue! I think you can report this in https://github.com/Azure/AKS
Opened Azure/AKS#2006. We could get around this by using a loop and generating two policies using matchLabels.
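A sketch of what that loop-based workaround could look like as a Helm template (the component list and names here are hypothetical):

```yaml
# Hypothetical Helm template: emit one matchLabels-only policy per component,
# rather than a single policy using matchExpressions (which AKS NPM ignored)
{{- range $component := list "dind" "singleuser-server" }}
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: binder-{{ $component }}
spec:
  podSelector:
    matchLabels:
      component: {{ $component }}
      release: {{ $.Release.Name }}
  egress:
    - {}  # placeholder: the real egress rules from the original policy go here
{{- end }}
```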
I spun up a new (small) cluster with the kubenet plugin and the Calico policy controller, and this seems to have resolved the issue. Now we know that it'll be worth my time to tear the Turing cluster down and redeploy :D
I was trying to demo binderlyzer and was assigned to the turing cluster. The notebook fetches from https://archive.analytics.mybinder.org and it failed DNS resolution for mybinder.org. I tested a few mybinder.org subdomains, including the top-level mybinder.org, and it couldn't resolve any mybinder.org domain. Manually assigning the pod to gke or ovh (didn't try gesis) doesn't have the same problem, so it appears to be a cluster DNS issue with turing specifically.
Simple test:

will hang at

```
Resolving mybinder.org (mybinder.org)...
```
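That output line matches wget's format, so the test was presumably along these lines (an assumption, since the command itself was not preserved):

```sh
# From a pod on the turing cluster; hangs at the name-resolution step
wget https://mybinder.org
```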
I'm not sure what's up there, maybe something with cert-manager?