Antrea wildcard fqdn netpolicy not working #3680

Closed
jsalatiel opened this issue Apr 21, 2022 · 30 comments

Labels: area/network-policy (Issues or PRs related to network policies), kind/bug (Categorizes issue or PR as related to a bug)

Comments

@jsalatiel

Describe the bug
According to the netpol documentation, one could use an example like the following to match an FQDN:

apiVersion: crd.antrea.io/v1alpha1
kind: ClusterNetworkPolicy
metadata:
  name: acnp-fqdn
spec:
  priority: 1
  appliedTo:
  - podSelector: {}
  egress:
  - action: Allow
    to:
      - fqdn: "*.google.com"

I also have a cluster netpol default-deny with priority 999 in the Baseline tier. So by default all traffic should be denied except traffic to google. The problem is that if I try to curl www.google.com from the container, it is still being denied by the default-deny baseline rule. If I change the fqdn policy to allow "www.google.com" instead of "*.google.com" it does work, so for some reason the wildcard fqdn is not working.

To Reproduce

  1. Create an FQDN wildcard policy that matches all Pods, with a higher priority
  2. Create a default-deny policy with the lowest priority that also matches all Pods
  3. Try to access some URL matching the FQDN

Expected
It should work.

Actual behavior
The wildcard FQDN is not matched; only an explicit FQDN works.

Versions:

  • Antrea version: 1.6.0
  • Kubernetes version (use kubectl version): 1.22.8
  • Container runtime: cri-o

Additional info: There are some other rules that allow the Pods to resolve DNS, for example, but I removed those because they are not related to the problem.

jsalatiel added the kind/bug label Apr 21, 2022
@antoninbas
Contributor

@Dyanngg could you help triage this issue? IIRC, wildcard rules rely on DNS response interception (instead of proactive querying), so there could be an issue with that code?

antoninbas added the area/network-policy label Apr 25, 2022
@Dyanngg
Contributor

Dyanngg commented Apr 26, 2022

@jsalatiel Actually, could you please share in the issue the rules you use to allow Pods to resolve DNS? Those might very well be relevant. Also, trying to understand what you mean by "the problem is that if I try to curl www.google.com from the container it is still being denied by the default-deny baseline rule": was this verified by policy rule stats etc., or is it just that we're seeing requests to www.google.com being denied?
In the meantime, I will experiment with this setup in my own env and try to reproduce this

@Dyanngg
Contributor

Dyanngg commented Apr 26, 2022

Update: I have tried this on my own test setup and wasn't able to reproduce. @jsalatiel one thing I've noticed, however, is that you used

 appliedTo:
  - podSelector: {}

in the fqdn policy. Did you put the same appliedTo for the baseline deny-all policy? The reason I'm asking is, we need to make sure that the two-way communication between the client Pod and the DNS Pod is not dropped by the baseline deny rule.
In other words, for the FQDN rule to work there need to be explicit ACNP rules ensuring that 1. client Pod -> CoreDNS Pods traffic is allowed, and 2. CoreDNS Pods -> client Pod traffic is allowed (a rough sketch of such rules follows the example below). Or, define the baseline deny-all as the following:

apiVersion: crd.antrea.io/v1alpha1
kind: ClusterNetworkPolicy
metadata:
  name: cnp-baseline-deny
spec:
  tier: baseline
  priority: 1
  appliedTo:
  - namespaceSelector:          # Selects all non-system Namespaces in the cluster
     matchExpressions:
     - {key:  kubernetes.io/metadata.name, operator: NotIn, values: [kube-system]}
  ingress:
  - action: Drop
  egress:
  - action: Drop
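
For reference, the first option (explicit allow rules so that DNS traffic between client Pods and CoreDNS survives the baseline deny) might look roughly like the sketch below. This is an illustration only: the policy name is hypothetical, and the selector labels and ports assume a standard CoreDNS deployment labeled k8s-app: kube-dns in kube-system.

apiVersion: crd.antrea.io/v1alpha1
kind: ClusterNetworkPolicy
metadata:
  name: acnp-allow-dns           # hypothetical name, not from this thread
spec:
  priority: 1
  appliedTo:
  - namespaceSelector: {}
  egress:
  - action: Allow                # client Pods -> CoreDNS (DNS queries)
    to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP
  ingress:
  - action: Allow                # CoreDNS -> client Pods (DNS replies)
    from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns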

@jsalatiel
Author

Hi @Dyanngg. These are my other policies:

apiVersion: crd.antrea.io/v1alpha1
kind: ClusterNetworkPolicy
metadata:
  name: 101-kube-system
spec:
    priority: 1
    tier: Emergency
    ingress:
    - name: kubesystem-ingress-intra-namespace
      action: Allow
      enableLogging: false
      appliedTo:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: kube-system

    egress:
    - name: kubesystem-egress-intra-namespace
      action: Allow
      enableLogging: false
      appliedTo:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: kube-system

    - name: all-pods-to-kubedns-and-localdns
      action: Allow
      enableLogging: false
      appliedTo:
      - namespaceSelector: {}
      to:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: kube-system
        podSelector:
          matchLabels:
            k8s-app: kube-dns
      - ipBlock:
          cidr: 169.254.25.10/32
      ports:
      - port: 53
        protocol: UDP

That should allow both ingress/egress from/to kube-system and egress to kube-dns and nodelocaldns from anywhere.

---
apiVersion: crd.antrea.io/v1alpha1
kind: ClusterNetworkPolicy
metadata:
  name: 799-default-deny
spec:
  priority: 99
  tier: Baseline
  appliedTo:
   - namespaceSelector: {}
  ingress:
    - action: Drop
      enableLogging: true
  egress:
    - action: Drop
      enableLogging: true

169.254.25.10/32 is the nodelocaldns address deployed by kubespray. (Could this be related to the problem?)

@jsalatiel
Author

More debugging here.

I have changed the default Drop to Reject in the baseline tier to make sure I would get a single line in the netpol.log for the curl.

For the following netpol (changed from ACNP to ANP):

apiVersion: crd.antrea.io/v1alpha1
kind: NetworkPolicy
metadata:
  name: test-anp
spec:
    priority: 1
    appliedTo:
      - podSelector: {}
    egress:
      - action: Allow
        to:
        - fqdn: "*.google.com"
        enableLogging: true

I get the output:

curl https://www.google.com  -I                                                                                                
curl: (7) Failed to connect to www.google.com port 443 after 8 ms: Connection refused  

and the respective netpol.log:

2022/04/26 11:08:50.851043 EgressDefaultRule AntreaClusterNetworkPolicy:799-default-deny Reject 16 10.239.67.7 42074 142.250.218.36 443 TCP 60

When I change the netpol to:

apiVersion: crd.antrea.io/v1alpha1
kind: NetworkPolicy
metadata:
  name: test-anp
spec:
    priority: 1
    appliedTo:
      - podSelector: {}
    egress:
      - action: Allow
        to:
        - fqdn: "www.google.com"
        enableLogging: true

I get the following output on curl:

curl https://www.google.com  -I
HTTP/2 200
content-type: text/html; charset=ISO-8859-1
p3p: CP="This is not a P3P policy! See g.co/p3phelp for more info."
date: Tue, 26 Apr 2022 11:09:22 GMT
server: gws
x-xss-protection: 0
x-frame-options: SAMEORIGIN
expires: Tue, 26 Apr 2022 11:09:22 GMT
...

and the following netpol.log

2022/04/26 11:09:22.585779 AntreaPolicyEgressRule AntreaNetworkPolicy:testing/test-anp Allow 14900 10.239.67.7 42284 142.250.218.36

The only netpolicies I have are:

# kubectl get ClusterNetworkPolicy  
NAME               TIER        PRIORITY   DESIRED NODES   CURRENT NODES   AGE
101-kube-system    Emergency   1          4               4               161m
799-default-deny   Baseline    99         4               4               6d12h

# kubectl get networkpolicies.crd.antrea.io
NAME       TIER          PRIORITY   DESIRED NODES   CURRENT NODES   AGE
test-anp   application   1          2               2               4d12h

@Dyanngg
Contributor

Dyanngg commented Apr 26, 2022

Hi @jsalatiel,
Thanks so much for providing this additional information! I applied the exact same policies in my local testbed (except for the 169.254.25.10/32 bit, because I do not have nodelocaldns), and again the wildcard policy is working as expected. So it seems that the problem could be that Antrea fails to intercept the DNS response from nodelocaldns. Maybe nodelocaldns is not listening on UDP port 53? I will dig more into this.
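
One way to check this, assuming the standard ss and dig utilities are available (the Pod name below is a placeholder), is to confirm the listener on the node and force a UDP query from a client Pod:

# On the node: UDP sockets bound to the nodelocaldns address
sudo ss -ulpn | grep 169.254.25.10
# From a client Pod: force a UDP query against nodelocaldns
kubectl exec -it <client-pod> -- dig +notcp www.google.com @169.254.25.10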

@antoninbas
Contributor

@jsalatiel Since you are using nodelocaldns, I also have a few questions:

@jsalatiel
Author

@Dyanngg This is the result from dig on the localdns IP from a container.

dig www.google.com @169.254.25.10

; <<>> DiG 9.11.26-RedHat-9.11.26-6.el8 <<>> www.google.com @169.254.25.10
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53784
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: eb59d029bd7ad0d0 (echoed)
;; QUESTION SECTION:
;www.google.com.			IN	A

;; ANSWER SECTION:
www.google.com.		10	IN	A	142.250.78.228

;; Query time: 0 msec
;; SERVER: 169.254.25.10#53(169.254.25.10)
;; WHEN: Tue Apr 26 22:12:11 UTC 2022
;; MSG SIZE  rcvd: 85

So apparently it is resolving and it is on UDP 53.

Why would Antrea intercept www.google.com correctly and not *.google.com, if those are all DNS queries?

@jsalatiel
Author

@antoninbas

  1. No, I have not. I just kubectl apply antrea.yml (so, all defaults). I will check and get the results back.

  2. antrea-agent:

     cat /etc/resolv.conf
     search kube-system.svc.cluster.lan svc.cluster.lan cluster.lan default.svc.cluster.lan
     nameserver 169.254.25.10
     options ndots:5

  3. Node. Except for "nameserver 192.168.254.1", the other entries have been added by the kubespray playbook:

     cat /etc/resolv.conf
     # Ansible entries BEGIN
     domain cluster.lan
     search default.svc.cluster.lan svc.cluster.lan
     nameserver 169.254.25.10
     nameserver 10.239.0.3
     nameserver 192.168.254.1
     options ndots:2
     options timeout:2
     options attempts:2
     # Ansible entries END
     # Generated by NetworkManager

  4. "regular" workload Pod:

     search testing.svc.cluster.lan svc.cluster.lan cluster.lan default.svc.cluster.lan
     nameserver 169.254.25.10
     options ndots:5

@Dyanngg
Contributor

Dyanngg commented Apr 26, 2022

Why would antrea intercept www.google.com correctly and not *.google.com if those are all DNS queries ?

For policies with specific FQDNs (as opposed to wildcard FQDNs), Antrea will directly contact the DNS server specified by the env variables KUBE_DNS_SERVICE_HOST and KUBE_DNS_SERVICE_PORT in the agent and use the result to program FQDN policy datapath rules. So, for www.google.com, it is possible that the DNS query result does not come from nodelocaldns.

You can check this by looking at Antrea agent logs:

		host, port := os.Getenv(kubeDNSServiceHost), os.Getenv(kubeDNSServicePort)
		if host == "" || port == "" {
			klog.InfoS("Unable to derive DNS server from the kube-dns Service, will fall back to local resolver")
			controller.dnsServerAddr = ""
		} else {
			controller.dnsServerAddr = host + ":" + port
			klog.InfoS("Using kube-dns Service for DNS requests", "dnsServer", controller.dnsServerAddr)
		}

which should produce a log line like

I0426 20:22:07.342192       1 fqdn.go:185] "Using kube-dns Service for DNS requests" dnsServer="10.96.0.10:53"
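
A quick way to confirm which path the agent took (a sketch, assuming a default Antrea install with the antrea-agent DaemonSet in kube-system) is to check whether the kube-dns Service environment variables were injected into the agent container:

kubectl -n kube-system exec ds/antrea-agent -c antrea-agent -- sh -c 'env | grep KUBE_DNS'

Kubernetes only injects KUBE_DNS_SERVICE_HOST/KUBE_DNS_SERVICE_PORT when a Service literally named kube-dns exists in the agent's Namespace, so an empty result means the agent falls back to the local resolver.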

@jsalatiel
Author

@antoninbas I have added skipServices: ["kube-system/kube-dns"], restarted all Pods, but I still get the same problem. So I think it is not related.

@antoninbas
Contributor

antoninbas commented Apr 26, 2022

@jsalatiel your answers to 2 & 4 are surprising to me. I would have expected the ClusterIP for CoreDNS there, even when using nodelocaldns. The dnsPolicy for the antrea-agent Pod is ClusterFirstWithHostNet.

BTW, do you see anything interesting in the antrea-agent logs about DNS queries?
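
For example (the Pod name is a placeholder; pick the antrea-agent Pod running on the same Node as the client Pod):

kubectl -n kube-system logs <antrea-agent-pod> -c antrea-agent | grep -i dns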

@antoninbas
Contributor

@jsalatiel your answers to 2 & 4 are surprising to me. I would have expected the ClusterIP for CoreDNS there, even when using nodelocaldns. The dnsPolicy for the antrea-agent Pod is ClusterFirstWithHostNet.

Ignore this, I see that kubespray configures kubelet this way, so it is expected:
https://github.com/kubernetes-sigs/kubespray/blob/d57ddf0be805407239141b334c6425717aa1cf3f/roles/kubernetes/node/templates/kubelet-config.v1beta1.yaml.j2#L45-L60

@antoninbas
Contributor

@Dyanngg If you cannot reproduce after deploying NodeLocal DNSCache to your cluster, you may need to provision a cluster with Kubespray.

@jsalatiel
Author

For policies with specific FQDNs (as opposed to wildcard FQDNs), Antrea will directly contact the DNS server specified by the env variables KUBE_DNS_SERVICE_HOST and KUBE_DNS_SERVICE_PORT in the agent [...] You can check this by looking at Antrea agent logs

@antoninbas The agent log shows:
I0427 00:16:20.205809 1 fqdn.go:181] "Unable to derive DNS server from the kube-dns Service, will fall back to local resolver"

@antoninbas
Contributor

I0427 00:16:20.205809 1 fqdn.go:181] "Unable to derive DNS server from the kube-dns Service, will fall back to local resolver"

This is not really an issue. Apparently kubespray uses a different name for the CoreDNS service.

@Dyanngg
Contributor

Dyanngg commented Apr 27, 2022

I0427 00:16:20.205809 1 fqdn.go:181] "Unable to derive DNS server from the kube-dns Service, will fall back to local resolver"

This is not really an issue. Apparently kubespray uses a different name for the CoreDNS service.

But it does explain why the policy works when www.google.com is in the spec. The fqdn controller will fall back to net.DefaultResolver and use that to resolve www.google.com directly. It further confirms that the problem lies in Antrea not being able to intercept the DNS response from nodelocaldns.
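
For illustration, the fallback path amounts to something like this minimal Go sketch (not the actual Antrea code; resolveFQDN is a hypothetical helper): the agent resolves a concrete FQDN with Go's default resolver, which is why a literal name like www.google.com still works, while a wildcard cannot be looked up this way and depends entirely on DNS response interception.

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// resolveFQDN is a hypothetical helper mirroring the fallback behavior:
// with no kube-dns Service address available, use the local resolver
// (net.DefaultResolver) to look up a concrete FQDN directly.
func resolveFQDN(fqdn string) ([]net.IPAddr, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	return net.DefaultResolver.LookupIPAddr(ctx, fqdn)
}

func main() {
	ips, err := resolveFQDN("www.google.com")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	// "*.google.com" cannot be resolved this way, which is why wildcard
	// rules rely on intercepting DNS responses instead.
	fmt.Println("resolved:", ips)
}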

@jsalatiel
Author

I0427 00:16:20.205809 1 fqdn.go:181] "Unable to derive DNS server from the kube-dns Service, will fall back to local resolver"

This is not really an issue. Apparently kubespray uses a different name for the CoreDNS service.

# kubectl  get svc -n kube-system
NAME                   TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                  AGE
antrea                 ClusterIP   10.239.5.222   <none>        443/TCP                  9d
coredns                ClusterIP   10.239.0.3     <none>        53/UDP,53/TCP,9153/TCP   9d

@antoninbas
Contributor

@jsalatiel that would be what I mean by a different name. Most clusters I have encountered (including the ones provisioned by kubeadm) use kube-dns as the Service name, for legacy reasons.

@Dyanngg
Contributor

Dyanngg commented Apr 27, 2022

I'm trying to repro the issue with a kubespray-provisioned cluster. @jsalatiel Could you confirm that you installed Antrea on such a kubespray cluster after uninstalling the original CNI (I see flannel as the default)? Just trying to get the same setup.

@jsalatiel
Author

jsalatiel commented Apr 28, 2022

@Dyanngg Use this override.yml and pass -e '@override.yml' on the ansible command line. You will get exactly my setup (no default CNI).

download_container: false
container_manager: crio
kube_network_plugin: cni
etcd_kubeadm_enabled: true
dashboard_enabled: false
metrics_server_enabled: false
podsecuritypolicy_enabled: false
cert_manager_enabled: false
helm_enabled: true
kube_service_addresses: 10.239.0.0/18
kube_pods_subnet: 10.239.64.0/18
cluster_name: cluster.lan
upstream_dns_servers:
  - 192.168.254.1
kube_proxy_metrics_bind_address: 0.0.0.0:10249
ingress_nginx_enabled: false
calico_iptables_backend: "Auto"
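
For reference, the invocation would be something like the following (a sketch; the inventory path is a placeholder for your own kubespray inventory):

ansible-playbook -i inventory/mycluster/hosts.yaml -b -e '@override.yml' cluster.yml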

@jsalatiel
Author

Btw, the underlying OS in my kubespray cluster is AlmaLinux 8.5.

@Dyanngg
Contributor

Dyanngg commented Apr 29, 2022

I was able to reproduce this issue on a kubespray cluster with nodelocaldns enabled, using an Antrea v1.6 build.
TL;DR - this issue can be solved with #3510, which was intended to fix another issue and is already merged in the main branch (after the 1.6 release cutoff).

Wildcard FQDN rule matching is made possible by Antrea installing a DNS reply packet interception rule at the highest priority in the AntreaIngressRuleTable. Before PR #3510, however, there was a flow which bypasses ingress rule evaluation for packets as long as the packet's pkt_mark=0x80000000/0x80000000 is set (which means the packet comes from the localhost). This flow exists in IngressSecurityClassifierTable to make sure that liveness probes etc. are not dropped by netpol rules.

Unfortunately, with nodelocaldns the DNS query response packet matches this bypass flow and thus skips the DNS interception flow. With PR #3510, the above-mentioned flow is changed to match only non-reply packets, and thus no longer bypasses DNS reply packets around the ingress tables.
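
One way to observe this on an affected node (a sketch: the agent Pod name is a placeholder, and it assumes the default Antrea bridge name br-int and the antrea-ovs container in the agent Pod) is to dump the OVS flows and look for the pkt_mark bypass rule:

kubectl -n kube-system exec <antrea-agent-pod> -c antrea-ovs -- ovs-ofctl dump-flows br-int | grep pkt_mark=0x80000000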

@hongliangl Since we have this bug that can be resolved with #3510, maybe we could backport it to v1.6?

(special thanks to @antoninbas for help in debugging this issue)

@hongliangl
Contributor

hongliangl commented Apr 29, 2022

If we backport #3510, we also need to backport #3630.

Backport PRs are created. See #3715, #3716

@antoninbas
Contributor

@hongliangl I approved both PRs. BTW, I think that it is possible to cherry-pick 2 separate changes with a single PR.
It may have been appropriate in that case:

    ./hack/cherry-pick-pull.sh upstream/release-3.14 12345 56789  # Cherry-picks PR 12345, then 56789 and proposes the combination as a single PR.

@hongliangl
Contributor

Thanks @antoninbas

@jsalatiel
Author

Are there plans to release 1.6.1? Or will the fix only be available on 1.7?

@antoninbas
Contributor

@jsalatiel the fix will be included in 1.6.1. The release will be late this week or next.

@tnqn
Member

tnqn commented May 12, 2022

@jsalatiel https://github.com/antrea-io/antrea/releases/tag/v1.6.1 has been released, which should have the fix. Please let us know if the issue is resolved. Other minor releases should not have this issue.

@tnqn tnqn closed this as completed May 12, 2022
@jsalatiel
Author

Working perfectly.
