
Process packets only on the POSTROUTING hook (before SNAT) #54

Merged: 7 commits into kubernetes-sigs:main on Jul 19, 2024

Conversation

@aojea (Contributor) commented Jul 18, 2024

We only need to process packets that are destined to Pods, never packets destined to the local host.
Based on the netfilter diagram https://wiki.nftables.org/wiki-nftables/index.php/Netfilter_hooks, this can be done on FORWARD or POSTROUTING.
However, FORWARD does not work with IPVS, since the IPVS kernel logic appears to follow a different path: http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.filter_rules.html
Using OUTPUT forces us to single out the traffic destined to localhost so it is not impacted by network policies, usually by adding a rule to accept packets to lo.
Using POSTROUTING, before SNAT happens (to avoid losing the original source IP), seems to be the right place.

Fixes: #46
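
For illustration, the placement described above corresponds to an nftables chain roughly like the sketch below (this matches the table dumped later in this thread; only the chain declaration is shown here):

table inet kube-network-policies {
        chain postrouting {
                # hook at postrouting just before SNAT (srcnat - 5), so the
                # original source IP is still visible when the verdict is taken
                type filter hook postrouting priority srcnat - 5; policy accept;
        }
}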

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the cncf-cla: yes, approved, size/XS, and size/S labels and removed the size/XS label on Jul 18, 2024
Change-Id: I774e00a799b9f790924468ece37ced419618c6a4
@k8s-ci-robot added the size/M label and removed the size/S label on Jul 19, 2024
@aojea changed the title from "Add an output-chain, needed for proxy-mode=ipvs" to "Process packets only on the POSTROUTING hook (before SNAT)" on Jul 19, 2024
@aojea (Contributor, Author) commented Jul 19, 2024

hmm, ipv6 flakes now 🤔

@aojea (Contributor, Author) commented Jul 19, 2024

2 / 2 green with this PR: https://github.com/kubernetes-sigs/kube-network-policies/actions

Always accepting ICMPv6 ND seems to make it work reliably:

// IPv6 needs ICMP Neighbor Discovery to work
tx.Add(&knftables.Rule{
Chain: chainName,
Rule: knftables.Concat(
A contributor left a review comment on this code:

nit: it's more readable to write all-constant strings as one string (#49 (comment)):

		tx.Add(&knftables.Rule{
			Chain: chainName,
			Rule: "icmpv6 type { nd-neighbor-solicit, nd-neighbor-advert } accept",
		})

(same for other similar cases)

@uablrek (Contributor) commented Jul 19, 2024

Besides the nit, this LGTM. For documentation, this is how it looks now:

# nft list table inet kube-network-policies
table inet kube-network-policies {
        comment "rules for kubernetes NetworkPolicy"
        set podips-v4 {
                type ipv4_addr
                comment "Local V4 Pod IPs with Network Policies"
        }

        set podips-v6 {
                type ipv6_addr
                comment "Local V6 Pod IPs with Network Policies"
        }

        chain postrouting {
                type filter hook postrouting priority srcnat - 5; policy accept;
                icmpv6 type { nd-neighbor-solicit, nd-neighbor-advert } accept
                ct state established,related accept
                ip saddr @podips-v4 queue to 100 comment "process IPv4 traffic with network policy enforcement"
                ip daddr @podips-v4 queue to 100 comment "process IPv4 traffic with network policy enforcement"
                ip6 saddr @podips-v6 queue to 100 comment "process IPv6 traffic with network policy enforcement"
                ip6 daddr @podips-v6 queue to 100 comment "process IPv6 traffic with network policy enforcement"
        }
}

I tested e2e in dual-stack with IPv6 as "base family" on K8s v1.30.3.

@uablrek (Contributor) commented Jul 19, 2024

An unrelated comment: when building, a bin/ dir is created. A .gitignore with "/bin" would be appropriate IMHO.
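
For example, a minimal entry of that kind (illustrative only, not the exact file added to the repo):

    # .gitignore
    /bin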

@aojea (Contributor, Author) commented Jul 19, 2024

Thanks Lars, do you mind following up on those nits?

I'd like to get a release with ipvs working and the nftables initialization fix

@aojea merged commit 16bd899 into kubernetes-sigs:main on Jul 19, 2024
10 of 11 checks passed
@vaskozl (Contributor) commented Jul 22, 2024

This appears to break my setup with kube-proxy in nftables mode and flannel with EnableNFTables (which enables masquerading in POSTROUTING).

I haven't had time to dig into the issue yet, but figured I'd leave a comment in case this wasn't tested with kube-proxy in nftables mode.

@uablrek (Contributor) commented Jul 22, 2024

I can confirm: the e2e test with FOCUS=Netpol fails with proxy-mode=nftables on main. I will check more tomorrow. I hope it is enough to change the chain priority (based on the fact that proxy-mode=iptables works, with iptables in "nft-mode").

Update: my mistake, sorry. I ran on k8s v1.30.3 without the nftables feature-gate. I ran again on v1.31.0-beta.0 and the e2e tests pass with proxy-mode=nftables.

@aojea (Contributor, Author) commented Jul 22, 2024

Once we have support in kind (kubernetes-sigs/kind#3612) we can have CI against Kubernetes master. Please confirm whether it is about the priority; the nftables proxy code only uses the postrouting hook for masquerading, AFAIK:
https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/nftables/proxier.go

@uablrek (Contributor) commented Jul 23, 2024

@vaskozl Can you please write an issue? Not too elaborate, basically what's in #54 (comment).

Hold that, please see the updated #54 (comment)

@uablrek (Contributor) commented Jul 23, 2024

This is how it looks with proxy-mode=nftables (narrowed to postrouting chains):

table ip kube-proxy {
      chain nat-postrouting {
              type nat hook postrouting priority srcnat; policy accept;
              jump masquerading
      }
      chain masquerading {
              meta mark & 0x00004000 == 0x00000000 return
              meta mark set meta mark ^ 0x00004000
              masquerade fully-random
      }
}
# (same for ip6)
table inet kube-network-policies {
      chain postrouting {
              type filter hook postrouting priority srcnat - 5; policy accept;
              icmpv6 type { nd-neighbor-solicit, nd-neighbor-advert } accept
              ct state established,related accept
              ip saddr @podips-v4 queue to 100 comment "process IPv4 traffic with network policy enforcement"
              ip daddr @podips-v4 queue to 100 comment "process IPv4 traffic with network policy enforcement"
              ip6 saddr @podips-v6 queue to 100 comment "process IPv6 traffic with network policy enforcement"
              ip6 daddr @podips-v6 queue to 100 comment "process IPv6 traffic with network policy enforcement"
      }
}

The netpol chain already hooks earlier, at priority srcnat - 5 (95 runs before srcnat = 100, since lower priority values run first); that was what I intended to check, and this looks OK to me.

@vaskozl What problem do you see? The netpol e2e is only cluster internal (I think). Do you have problems with external traffic?

@vaskozl (Contributor) commented Jul 23, 2024

Sorry, I wasn't sure there was an issue and didn't have details, hence I didn't want to create a potentially false issue if nftables had tested successfully:

In addition to the two tables you listed, I have the following chain at postrouting priority srcnat; from flannel:

table ip flannel-ipv4 {
	comment "rules for flannel-ipv4"
	chain postrtg {
		comment "chain to manage traffic masquerading by flannel"
		type nat hook postrouting priority srcnat; policy accept;
		meta mark 0x00004000 return
		ip saddr 10.244.15.0/24 ip daddr 10.244.0.0/16 return
		ip saddr 10.244.0.0/16 ip daddr 10.244.15.0/24 return
		ip saddr != 10.244.15.0/24 ip daddr 10.244.0.0/16 return
		ip saddr 10.244.0.0/16 ip daddr != 224.0.0.0/4 masquerade fully-random
		ip saddr != 10.244.0.0/16 ip daddr 10.244.0.0/16 masquerade fully-random
	}
}

I tried upgrading twice, and within seconds I saw failed liveness probes across all pods and failed image pulls. External connections from within the cluster appeared to time out too. Downgrading required full node restarts.

It is a very weird issue indeed, given that for some reason I required a restart (I did check that the forward rules were reapplied on revert). I can try to debug more over the weekend and will create a proper issue with details if I find anything wrong.

@aojea (Contributor, Author) commented Jul 23, 2024

Sorry, I wasn't sure there was an issue and didn't have details, hence I didn't want to create a potentially false issue if nftables had tested successfully:

No worries about that; feel free to open issues like that to discuss. I'm not worried about issues that turn out to be environment problems when they come from people like you who spend the time to discuss and troubleshoot; this feedback is very welcome.

I'm going to set up a job to test nftables so we can rely on automated testing to validate nftables functionality.

@uablrek (Contributor) commented Jul 23, 2024

failed image pulls

This is interesting. It might be that the netpol postrouting chain interferes in some way with external traffic. It makes sense to me, since masquerading should affect pod-IP -> non-pod-IP traffic only.

I tested with the versions below, and Netpol e2e passes.

  11:35:48 Server Version: v1.30.3
  11:35:48 CNI-plugin; flannel:v0.24.2
  11:35:48 Proxy-mode: "nftables"
  11:35:48 crio version 1.29.2
  11:35:48 BaseFamily=IPv6

BTW "BaseFamily=IPv6" meaning that the default family for services is IPv6 in a dual-stack cluster:

# kubectl get svc kubernetes
NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   fd00:4000::1   <none>        443/TCP   18m

@uablrek (Contributor) commented Jul 23, 2024

🤔 but then external traffic problems should be a v0.5.0 issue, rather than an nftables issue. I'll run some tests with outgoing traffic from pods. I have no image-load problems though.

@uablrek (Contributor) commented Jul 23, 2024

@vaskozl What kind of netpol rules do you use? The most logical explanation to me is that egress-blocking rules would be causing issues for external traffic, but I may be wrong. In my own tests I use ingress-blocking rules, so I want to extend them with relevant rules.

@uablrek (Contributor) commented Jul 23, 2024

I added TCP liveness/readiness probes, and if I apply my "deny ingress" rule, the probes fail. This makes sense to me: a deny-ingress rule should block the probes. @aojea Do you know if there is some "don't do that" guidance for this situation documented somewhere? Please don't search for it, I can do that myself, but if you happen to know...
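
For reference, a "deny ingress" policy of the kind described here is typically written like the manifest below (illustrative only, not necessarily the exact rule used in these tests):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress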

@vaskozl (Contributor) commented Jul 23, 2024

I don't have any egress rules, but I apply traditional (not BANP/ANP) ingress netpols to almost everything in the cluster. AFAIK host-to-local-workload traffic should not be blocked by network policies, and it isn't in the CNIs that support netpols.

I use 1.30 with flannel 0.25.3 and IPv4 single-stack. I'll file an issue after I debug, ASAP.

@uablrek (Contributor) commented Jul 23, 2024

I backed down to v0.4.0, and TCP probes work with a "deny-all-ingress" rule applied.

@aojea I think this is the root of the problem.

@uablrek (Contributor) commented Jul 23, 2024

AFAIK host-to-local-workload traffic should not be blocked by network policies, and it isn't in the CNIs that support netpols.

You are right: neither v0.4.0, nor Calico or Cilium netpol, does that.

@uablrek (Contributor) commented Jul 23, 2024

I can't explain the failing image loads though.

@vaskozl (Contributor) commented Jul 23, 2024

That may have been a knock-on effect of the cluster-wide liveness probe failures causing restarts, plus some rolling restarts I issued that slowed down my kubelet, and as such an indirect observation on my relatively weak setup. Otherwise, maybe it was the hostNetwork pods that I also have rules on? I'll try to have more accurate information about the failure soon.

@aojea (Contributor, Author) commented Jul 23, 2024

I added TCP liveness/readiness probes, and if I apply my "deny ingress" rule, the probes fail. This makes sense to me: a deny-ingress rule should block the probes. @aojea Do you know if there is some "don't do that" guidance for this situation documented somewhere? Please don't search for it, I can do that myself, but if you happen to know...

@danwinship is the source of truth for network policies 😄 and has also revisited the problem of probes recently; the KEP kubernetes/enhancements#4558 also mentions this:

Changing the definition of NetworkPolicy v1 to remove the "kubelet
  can access all local pods" hole, even in the case where the cluster
  is configured in a way to forbid the use of probes that require the
  hole. Even if _kubelet_ no longer needs to access pods, some users
  likely depend on that rule to allow access to some pods from other
  host-level processes, and there is no other way in NetworkPolicy v1
  to write such a rule if we remove the built-in one

I think that traffic originating from the node should not be subject to network policies.

I checked with an nftables trace, and all traffic coming from the node has iif "lo", independently of whether the destination IP belongs to another interface:

trace id 2da4b9ff ip filter INPUT packet: iif "lo" @ll,0,112 0x800 ip saddr 192.168.8.2 ip daddr 192.168.8.2 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 30324 ip length 60 tcp sport 41872 tcp dport 10250 tcp flags == syn tcp window 33280
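
Based on that observation, one possible way to exempt node-originated traffic would be an accept rule on lo placed before the queue rules, roughly like the sketch below (this only illustrates the idea; it is not necessarily what the follow-up change does):

table inet kube-network-policies {
        chain postrouting {
                type filter hook postrouting priority srcnat - 5; policy accept;
                # node-originated traffic arrives with iif "lo" (see the trace above),
                # so accept it before the rules that queue packets to userspace
                iif "lo" accept
                # ... existing icmpv6 / ct / queue rules follow ...
        }
}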

@uablrek (Contributor) commented Jul 23, 2024

For the record: this is not a proxy-mode=nftables problem.

@aojea (Contributor, Author) commented Jul 23, 2024

Yeah, I opened a new issue so we have this well documented: #65. Please continue the conversation there; I also submitted a PR and tagged you both for feedback.

Labels: approved, cncf-cla: yes, size/M
Merging this pull request may close: Doesn't work with proxy-mode=ipvs (#46)
4 participants