Add a KEP for DNS Autopath in pod API #967
Conversation
In general I like the idea, but I want to be REALLY careful about future safety and node-local cache failure modes.
This approach minimizes the number of DNS queries at the client side to at most 2 (A, AAAA). The searchpath expansion logic moves to the server side. The server (clusterDNS - CoreDNS by default) will require additional metadata in order to complete searchpath expansion.
The metadata can be attached as an EDNS0 option. We can define a new EDNS0 option - SearchPaths, option code 15, which is currently [unassigned](https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml#dns-parameters-11). We would need to reserve this or use an option number from the experimental range - 65001 to 65534.
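To make the option format concrete, here is a minimal sketch of how a client (or a forwarding cache) might attach such an option, using the github.com/miekg/dns Go library that CoreDNS itself is built on. The option code 65001 from the experimental range and the comma-separated encoding are assumptions taken from this discussion, not an IANA allocation.

```go
package main

import (
	"fmt"

	"github.com/miekg/dns"
)

// searchPathsOptionCode is an assumed code from the experimental range
// (65001-65534); a real allocation may end up different.
const searchPathsOptionCode = 65001

// newAutopathQuery builds an A query carrying the proposed SearchPaths
// EDNS0 option with a comma-separated list of search domains.
func newAutopathQuery(name, searchPaths string) *dns.Msg {
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn(name), dns.TypeA)

	opt := &dns.OPT{Hdr: dns.RR_Header{Name: ".", Rrtype: dns.TypeOPT}}
	opt.SetUDPSize(4096)
	opt.Option = append(opt.Option, &dns.EDNS0_LOCAL{
		Code: searchPathsOptionCode,
		Data: []byte(searchPaths),
	})
	m.Extra = append(m.Extra, opt)
	return m
}

func main() {
	q := newAutopathQuery("mytestsvc",
		"default.svc.cluster.local,svc.cluster.local,cluster.local")
	fmt.Println(q.String())
}
```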
I think starting with an experimental option code makes sense -- maybe CoreDNS can accept both if/when we get a real allocation?
yes, that would be fine
The value of this option will be a comma-separated string consisting of all the searchpaths which are to be appended to the main query and looked up. This option can be useful outside of Kubernetes as well.
Instead of modifying all client pod images to insert an EDNS0 option in their requests, we will put this logic in [NodeLocal DNSCache](https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/0030-nodelocal-dns-cache.md), which is a daemonset running a per-node DNS cache.
All pods using clusterDNS will send requests to the local instance running on the same node. This instance will look for the custom suffix that we defined above in incoming query names. In case of a match, it will extract the namespace and cluster suffix values and construct the EDNS0 option with the searchpath strings.
This defeats the HA discussion for node-local cache, doesn't it? You can't flip back to cluster DNS -- it won't know the magic.
But if we push the magic to the server, then it needs to know the expansion schema. Yuck
Yes, the current HA plan proposed in #1005 won't work if the searchpath is set to something only node-local cache can understand. I added this to the "risks" as well.
Pushing it to server, specifically CoreDNS might not be that bad.. the current autopath plugin in CoreDNS supports getting searchpaths from another plugin. kubernetes plugin has an implementation to provide the list of paths today - https://github.com/coredns/coredns/blob/72a633ba0956f771342f1f2993696f37e1dee6db/plugin/kubernetes/autopath.go#L11
We could extend this to match the domain name suffix and figure it out. Maybe pass in a version too.
As long as the magic string unambiguously contains the client pod's namespace, putting the logic in the server is not an issue. It would just be a change to the current autopath feature. The magic string would have all of the context so there would be no need for the pod watch.
The only real value of putting this into an EDNS0 option is that: 1) the EDNS0-based search path is usable in other contexts; 2) the magic domain name is tightly coupled to this particular use case rather than being more broadly visible; and 3) we could expand the EDNS0 option to include other info and implement more complex server-side DNS policy (needs use cases...).
I agree with putting the logic in the server to parse out the required fields and generate the searchpaths. The only concern is that then this feature is supported only when CoreDNS is the DNS server. That might be ok to start with since it is the default clusterDNS server now.
An optimization to consider: We can use the EDNS0 option to include a version number and metadata and move the logic of determining searchpaths to the server side. The benefit of this approach is that the size overhead of the EDNS0 option is minimal. But it requires the server to have logic to determine searchpaths in addition to performing the expanded queries.
If we want this to be widely adopted, this seems like a way to slow that down. I don't really want to convene a committee to govern the search path schema.
I agree.
Not sure I understand the concern. We are already defining an EDNS0 option that is just a list of search paths. My understanding of this request would be to instead populate that EDNS0 option with general metadata, and then it's up to the server to do with that what it will. I guess that is a more complex structure for the options, but either way we are defining our own option.
This is what I meant above when I said "implement more complex server-side DNS policy". For example, if the metadata included the source Pod namespace, and of course we know the destination namespace, we can enforce that these match (DNS-level multi-tenancy).
With this approach, I would consider this a generalization of the search path problem. "Search Path" is just another piece of meta-data.
Each DNS vendor will need to interpret the metadata option to generate the searchpaths, right? Explicitly listing it in the client makes it more straightforward for the server - perform the queries with each of the paths listed in the option. That seems easier to standardize than "generate searchpaths given all this metadata".
Yes, agreed. These things are not mutually exclusive in any case - we could have an EDNS0 option for search path and another for general meta-data.
good point
`dnsPolicy: clusterFirstWithAutopath`
Using this mode will set the searchpath on the pod as `search.$NS.$SUFFIX.k`, where $NS is the namespace of the pod and $SUFFIX is the cluster suffix. k is a one-letter suffix that identifies the domain name as a query that needs search expansion. Since there are [no single-letter TLDs so far](http://data.iana.org/TLD/tlds-alpha-by-domain.txt), this suffix will help identify queries needing search expansion.
I am thinking over the design here. Assume that we (one day) want to change the kubernetes DNS schema (we should). That ends up being a user-facing change. How does that impact this?
Let's be concrete: Today we have streamlined access to `$S.$N.svc.$Z`, but this has problems. One proposal was to change to `$S.s.$N.$Z`. This is strictly ambiguous if there happens to be a namespace `s` and you want to look up namespace `svc`, so we can't really just try both schemas: `foo.s.svc` can mean both.
So I think (please help me be wrong) we need to specify at minimum a schema version in the magic string. Something like `search.$NS.$SUFFIX.v1.k`, so `foo.s.svc` => `foo.s.svc.search.$NS.$SUFFIX.v1.k`, which we can parse as a request for service "foo" in namespace "s". V2 would be `foo.s.svc.search.$NS.$SUFFIX.v2.k`, which we can (unambiguously!) parse as a request for service "foo" in namespace "svc".
Last thought -- the `.k` TLD seems unlikely to happen, but it's worth thinking about a less terse solution. `.kubernetes-autopath-v1` also seems unlikely as a TLD. Is it better or worse?
With node-local cache doing the search expansion, it seemed unnecessary to pass in a version if the node-local cache version was tied to the schema version. This could still be the case if we moved the logic to CoreDNS and the version were specified in the Corefile at cluster bringup time. Like:
`autopath @kubernetes <version>`
The kubernetes plugin will have the logic to construct the searchpath based on this version parameter. However, if this is a DNS server that clusters using different schema versions could talk to, we need the version to be part of the request, so yes, part of the searchpath.
"kubernetes-autopath-v1" is a little long; we want to pick a label that won't result in domain names becoming longer than 255 characters, which is the DNS name limit. It is only 10 chars more than "cluster.local". Or we could do "k8s-v1".
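For reference, the existing CoreDNS wiring looks roughly like the Corefile stanza below (autopath @kubernetes requires `pods verified` in the kubernetes plugin); the schema-version argument discussed here is shown only as a hypothetical addition, not current syntax.

```
cluster.local:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods verified
    }
    # Today: search paths come from the kubernetes plugin's pod watch.
    # A schema-version argument (e.g. "autopath @kubernetes v1") would be new.
    autopath @kubernetes
    forward . /etc/resolv.conf
    cache 30
}
```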
> The kubernetes plugin will have the logic to construct the searchpath based on this version parameter. However, if this is a DNS server that clusters using different schema versions could talk to, we need the version to be part of the request, so yes, part of the searchpath.
Even apart from this KEP, if clients are talking to the same DNS server expecting different DNS schemas, then that server will already need to know which version to serve up. So, I don't think we need to put it in the magic string - any pod using a different DNS policy would have this same problem, so we'll need to solve that if and when we change the schema version.
For the label, why not something like "ap.k8s.io" which is relatively short and we already control k8s.io so we can reserve that (and yes, it's three labels not one but I don't think that's an issue).
> The kubernetes plugin will have the logic to construct the searchpath based on this version parameter. However, if this is a DNS server that clusters using different schema versions could talk to, we need the version to be part of the request, so yes, part of the searchpath.
> Even apart from this KEP, if clients are talking to the same DNS server expecting different DNS schemas, then that server will already need to know which version to serve up. So, I don't think we need to put it in the magic string - any pod using a different DNS policy would have this same problem, so we'll need to solve that if and when we change the schema version.
Outside of this KEP and without autopath, the DNS server does not need to know about versions. With the current autopath plugin implementation and CoreDNS, if the searchpath schema changed, the autopath/kubernetes plugin will need to serve the right version. This really becomes an issue only when multiple k8s clusters with different versions might talk to the same CoreDNS instance. I think narrowing the scope to make this Kubernetes + CoreDNS specific might simplify some of the decisions here.
Not sure I understand how a different dnsPolicy will have a problem. Apart from ClusterFirst*, the other modes do not use the kubelet-generated searchpaths. They are using the node's searchpath or something custom. Why would anything change there when the searchpath schema is changed?
> For the label, why not something like "ap.k8s.io" which is relatively short and we already control k8s.io so we can reserve that (and yes, it's three labels not one but I don't think that's an issue).
If we don't need version in the label, then ap.k8s.io is ok. Looks like there is a limit on the name length and label length, not number of labels.
"schema version" here is not specific to this kep. For example, if you request TXT dns-version.cluster.local it will return a version number to you (well, CoreDNS does, i don't know that it was ever added to kube-dns, but it is in the spec).
There is an old issue (I can't find it though) discussing changing from names like `service.namespace.svc.cluster.local` to `service.svc.namespace.cluster.local`. That is what we mean by the schema version in this context.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle rotten
Will send some updates here this week.
To get the CI to pass you need to get rid of spaces in the file name, add the toc tags around the table of contents, and then run the hack/update-toc.sh command.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closed this PR.
/reopen
@prameshj: Reopened this PR.
It's been a while since I thought this one through. I apologize for getting lost in my mailbox.
I think this is a clever solution to the problem we have made for ourselves, but I fear it may be too clever.
First: is this really SUCH a big problem with the node-local cache in play? I would assume that gives us a lot of opportunity to look for solutions to the fan-out, and maybe moves this into a "fix costs more than the problem" category.
I know this KEP says "changing the schema" is out of scope, but maybe that's a more complete solution that has roughly the same level of impact and risk? Or maybe we don't even need to change the schema, just reduce the fan-out and ndots.
E.g. what if we made it easy for end-users and/or cluster admins to opt out of the fancy search paths and instead just get `$NS.svc.$SUFFIX` and `ndots: 1`? That would be a breaking change in that some queries would no longer work, but it would eliminate most of the problems, wouldn't it?
Then we could discuss whether some sort of variable expansion for DNSConfig in the API is useful to let people opt back in to parts of the old behavior.
These searchpaths are included in pods' /etc/resolv.conf by kubelet and are enforced by setting ndots to 5. This means any hostname lookups with fewer than 5 dots will be expanded using all the search paths listed.
When a pod issues a query to look up the hostname "service123", it is expanded to 6 queries - one for the original hostname and one with each of the searchpaths appended. Some resolvers issue both A and AAAA queries, so this can be a total of 12 or more queries for every single DNS lookup. When these queries are issued in parallel, they end up at the node with the same source tuple and need to be DNAT'ed, increasing the chance of a [netfilter race condition](https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts).
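For illustration, a kubelet-generated resolv.conf and the resulting fan-out might look like the following; the nameserver IP and the trailing node-level search domains are placeholders that vary per cluster and provider.

```
# /etc/resolv.conf written by kubelet for a pod in the "default" namespace
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local example.internal corp.example.com
options ndots:5

# A lookup of "service123" (fewer than 5 dots) is tried against each search
# domain before the literal name, for each of A and AAAA:
#   service123.default.svc.cluster.local
#   service123.svc.cluster.local
#   service123.cluster.local
#   service123.example.internal
#   service123.corp.example.com
#   service123.
```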
I thought one of the libcs would do them in parallel, but I forget which. Regardless, I think it's not illegal to do so, with reasonable respect for ordering when you get multiple responses.
When a clientPod in the "default" namespace looks up a service - "mytestsvc":
1) The query will be sent out as `mytestsvc.search.default.cluster.local.ap.k8s.io`.
With autopath, a bad-acting client can cause more total load to be generated while spending less CPU itself, because "the system" is doing work that was previously done by the client.
2) The clusterDNS server (CoreDNS by default) receives this, strips off the delimiter "ap.k8s.io", and identifies "search" as the start of the custom searchpath. The namespace and cluster suffix can be obtained from the rest of the string. The DNS server can now construct the full service name for this query as `mytestsvc.default.svc.cluster.local`.
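A minimal sketch of that server-side parsing, assuming the "search" marker label and the "ap.k8s.io" delimiter from this proposal; error handling and multi-label edge cases are simplified, and the candidate list mirrors the three standard cluster search paths.

```go
package autopath

import (
	"fmt"
	"strings"
)

// expandAutopathQuery turns a query such as
//   mytestsvc.search.default.cluster.local.ap.k8s.io.
// into the candidate names the server would try, ending with the literal name.
func expandAutopathQuery(qname string) ([]string, error) {
	qname = strings.TrimSuffix(qname, ".")
	const delim = ".ap.k8s.io"
	if !strings.HasSuffix(qname, delim) {
		return nil, fmt.Errorf("not an autopath query: %s", qname)
	}
	rest := strings.TrimSuffix(qname, delim) // mytestsvc.search.default.cluster.local
	i := strings.Index(rest, ".search.")
	if i < 0 {
		return nil, fmt.Errorf("missing search marker: %s", qname)
	}
	name := rest[:i]                        // mytestsvc
	nsAndSuffix := rest[i+len(".search."):] // default.cluster.local
	parts := strings.SplitN(nsAndSuffix, ".", 2)
	if len(parts) != 2 {
		return nil, fmt.Errorf("malformed namespace/suffix: %s", nsAndSuffix)
	}
	ns, suffix := parts[0], parts[1]

	return []string{
		fmt.Sprintf("%s.%s.svc.%s.", name, ns, suffix), // mytestsvc.default.svc.cluster.local.
		fmt.Sprintf("%s.svc.%s.", name, suffix),
		fmt.Sprintf("%s.%s.", name, suffix),
		name + ".", // finally, the name as typed
	}, nil
}
```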
It's not clear to me why. The "pod" subdomain was never part of any standard search, so why do we need to accommodate it?
As observed from the list of searchpaths in the current DNS schema, the searchpaths used for each pod are the same, except for the namespace. If this information can be included in a single searchpath, some entity that is aware of the DNS schema can expand this to a list of searchpaths. This entity could be the clusterDNS server (CoreDNS), NodeLocal DNSCache, or a sidecar container running in the client pod.
The new searchpath is of the form:
`search.$NS.$SUFFIX.ap.k8s.io`, where $NS is the namespace of the pod and $SUFFIX is the cluster suffix. ap.k8s.io is the delimiter that identifies whether a query needs search expansion.
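Under this scheme, a pod's resolv.conf could shrink to a single cluster search domain, along the lines of the sketch below; the nameserver IP and the ndots value are illustrative, not part of the proposal text.

```
nameserver 10.96.0.10
search search.default.cluster.local.ap.k8s.io
options ndots:5
```

A lookup of `mytestsvc` then leaves the pod as `mytestsvc.search.default.cluster.local.ap.k8s.io`, matching the worked example above.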
It's not clear why SUFFIX is needed - I guess it is to allow servers to handle multiple clusters?
However, the above approach requires the clusterDNS server to know about the k8s DNS schema as well as the syntax of this custom searchpath. It does know about the k8s DNS schema anyway, but knowledge of the custom searchpath is an additional requirement.
Also, we would certainly have to version this, so that it becomes at least `search.$VERSION.$NS.$SUFFIX.ap.k8s.io` - this allows us to eventually change the schema if we need to. In fact we might choose to distinguish "DNS schema version" from "well-known search-expansion set".
E.g. even in the v1 DNS schema we all know and love, it might be valuable to change the search expansion to drop the #2 search (dodging the "namespace com" ambiguity).
This searchpath can be set by using `dnsPolicy: None` to start with, as described [here](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-dns-config).
Once this feature graduates to GA, the default searchpath for `ClusterFirst` and `ClusterFirstWithHostNet` will include this new searchpath instead of the 3 different k8s-specific searchpaths.
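Before any defaults change, a pod could opt in on its own via `dnsPolicy: None`, roughly as sketched below; the nameserver address, image, and ndots value are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: autopath-client
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep", "3600"]
  dnsPolicy: "None"
  dnsConfig:
    nameservers:
    - 10.96.0.10                              # cluster DNS (or node-local cache) IP
    searches:
    - search.default.cluster.local.ap.k8s.io  # the single autopath search domain
    options:
    - name: ndots
      value: "5"
```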
I am not sure how we can do that - we need to know that the cluster's DNS resolver supports it (either query parsing or EDNS0), so it seems like it has to be opt-in basically forever. It might be possible to change the default behavior on a cluster-by-cluster basis but not globally.
`search.$NS.$SUFFIX.ap.k8s.io`, where $NS is the namespace of the pod and $SUFFIX is the cluster suffix. ap.k8s.io is the delimiter that identifies whether a query needs search expansion.
Problem: any DNS processor who ISN'T aware of this magic might route the query to the real k8s.io zone servers. That's an info leak at best, and more likely it's a CVE.
I think you would have to preserve the suffix, so maybe `$NS.autopath.$SUFFIX` instead? That should route to the cluster zone and not leak out into the world. "autopath" could be "ap" or "search" or "srch" or something.
The clusterDNS service needs to support this new EDNS0 option and look up multiple query names by expanding a single incoming query. This was tried on a test setup by [modifying the autopath plugin](https://github.com/coredns/coredns/compare/master...prameshj:auto) in CoreDNS to extract the searchpath from an EDNS0 option, as a proof of concept.
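A sketch of the extraction step on the server side, again assuming the experimental option code 65001 used in the earlier client-side sketch:

```go
package autopath

import (
	"strings"

	"github.com/miekg/dns"
)

const searchPathsOptionCode = 65001 // assumed experimental code, see above

// searchPathsFromRequest returns the comma-separated search path list carried
// in the request's SearchPaths EDNS0 option, or nil if the option is absent.
func searchPathsFromRequest(req *dns.Msg) []string {
	opt := req.IsEdns0()
	if opt == nil {
		return nil
	}
	for _, o := range opt.Option {
		if local, ok := o.(*dns.EDNS0_LOCAL); ok && local.Option() == searchPathsOptionCode {
			return strings.Split(string(local.Data), ",")
		}
	}
	return nil
}
```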
If upstream ClusterDNS uses something other than CoreDNS, support for this EDNS0 option should be added for autopath to work.
s/should/must
### Risks and Mitigations
1) If NodeLocal DNSCache is responsible for adding the EDNS0 option, DNS resolution can break if NodeLocal DNSCache is down or if pods configured with the special searchpath point to the kube-dns service directly to resolve query names. This is because, without the EDNS0 option, the custom searchpath is not resolvable by kube-dns/CoreDNS. Running 2 DNSCache instances would be necessary to keep searchpath expansion working during upgrades.
This is a significant limitation, IMO, and could make it infeasible to rely on the proxy.
@prameshj just out of curiosity, my experience is that DNS timeouts, due to packet drops for example, have much more impact on latency than multiple "unsuccessful" queries. I think that if that works, we can add an option that enables TCP, or make it the default, but using environment variables instead of modifying resolv.conf, to avoid incompatibilities.
Would this KEP be a reasonable solution to kubernetes/kubernetes#90571? We have been working on finding a means of allowing searchpaths longer than the historical glibc/kubelet limitations (256 chars / 6 names) in OpenShift via CoreDNS+autopath.
Yes, with this enhancement, the searchpath for k8s drops to 1, leaving more room for other custom searchpaths to be specified. The limit does not get lifted, but there is the option to specify more custom values.
I have to agree this has lost a lot of urgency given Node Local DNS. Autopath was conceived well before Node Local existed and has certainly helped some users, but as written it is only narrowly useful. We could integrate with Node Local in this way, but I think Node Local solves enough of this problem that it's probably unnecessary. My preferred solution to this problem is to solve it more generally in DNS, with an RFC that defines a new option that clients can use to attach the search path, and with eventual updates in glibc and other resolvers (i.e., a new option in resolv.conf). That's at least a several-years-long approach, with an uncertain future.
It would be great to solve this generally in DNS and expose a resolv.conf option. I agree that makes it a much larger scope and a longer project. NodeLocalDNS has solved a bunch of issues with DNS latency, but autopath will still be useful. It will minimize resource consumption on the cache pods and avoid the upstream CoreDNS getting overwhelmed with queries (we sometimes see nodelocaldns time out queries since the upstream CoreDNS took > 2s to return). I have not seen issues about this recently, so I am planning to keep the KEP open, without a milestone for now.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-contributor-experience at kubernetes/community.
/cc
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-contributor-experience at kubernetes/community.
In order to graduate to beta, we will need:
- Conformance tests exercising this new searchpath.
- Verify performance/scalability via automated testing and measure the improvement over the existing `clusterFirst` mode.
In some use cases, we want to enable autopath in some of our clusters; however, we still want a benchmark comparing autopath enabled vs. disabled, to make sure that there is no performance decrease after we enable it. The benchmark would be appreciated when promoting to beta or GA.
/remove-lifecycle rotten
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-contributor-experience at kubernetes/community.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closed this PR.