Pods running on windows nodes cannot access the API server using the internal Kubernetes Service #1759

Closed
perithompson opened this issue Jan 19, 2021 · 5 comments · Fixed by #1824
Assignees
Labels
area/OS/windows Issues or PRs related to the Windows operating system. kind/bug Categorizes issue or PR as related to a bug. p0

Comments

@perithompson
Contributor

Describe the bug
Pods running on Windows nodes cannot access the API server when trying to connect via the internal kubernetes Service. The ClusterIP answers ping from the Pod, but HTTPS traffic to it fails.

PS C:\> tnc kubernetes.default.svc.cluster.local -p 443
WARNING: TCP connect to (10.96.0.1 : 443) failed
ComputerName           : kubernetes.default.svc.cluster.local
RemoteAddress          : 10.96.0.1
RemotePort             : 443
InterfaceAlias         : vEthernet (iis-2019-6b0636)
SourceAddress          : 192.168.4.16
PingSucceeded          : True
PingReplyDetails (RTT) : 86 ms
TcpTestSucceeded       : False

This is not the case when accessing the API server via this service from the host.

To Reproduce

  • Deploy an Antrea cluster with a Windows node.
  • Start a Windows Pod.
  • Attempt to connect to the kubernetes Service via ClusterIP or DNS name.

Expected
The API server should be accessible through the ClusterIP Service.

Actual behavior
TCP connections from the Pod to the ClusterIP on port 443 time out, even though ping to the same address succeeds.

Versions:
Please provide the following information:

  • Antrea version (Docker image tag): v0.12.0 (Containerd) from @ruicao93, but I think other versions are affected
  • Kubernetes version: v1.19.1
  • Container runtime: Containerd 1.4.3
  • OVS: 2.14.0

Additional context

Tracing route to kubernetes.default.svc.cluster.local [10.96.0.1]
over a maximum of 30 hops:
  1    79 ms    74 ms    73 ms  192.168.4.1
  2    84 ms    77 ms    79 ms  kubernetes.default.svc.cluster.local [10.96.0.1]
Trace complete.
PS C:\> curl.exe -vvv https://kubernetes.default.svc.cluster.local
* Rebuilt URL to: https://kubernetes.default.svc.cluster.local/
*   Trying 10.96.0.1...
* TCP_NODELAY set
* connect to 10.96.0.1 port 443 failed: Timed out
* Failed to connect to kubernetes.default.svc.cluster.local port 443: Timed out
* Closing connection 0
curl: (7) Failed to connect to kubernetes.default.svc.cluster.local port 443: Timed out
E0119 02:51:23.483912    1128 proxysocket.go:208] I/O error: readfrom tcp 10.96.0.1:443->10.96.0.1:49752: read tcp 10.176.37.239:49753->10.176.37.186:6443: wsarecv: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
.0.0,proto=0,tp_src=0,tp_dst=0),eth(src=68:4f:64:14:d8:9a,dst=01:00:0c:cc:cc:cd),in_port(2),eth_type(0x05ff)
2021-01-19T11:29:56.667Z|92859|dpif(handler11)|WARN|system@ovs-system: execute ct(zone=65520,nat),recirc(0x1) failed (not supported) on packet ip,vlan_tci=0x0000,dl_src=00:50:56:a6:97:d2,dl_dst=00:50:56:a6:fb:e3,nw_src=10.176.37.205,nw_dst=10.176.37.208,nw_proto=4,nw_tos=0,nw_ecn=0,nw_ttl=63
 with metadata skb_priority(0),skb_mark(0),in_port(2) mtu 0
2021-01-19T11:29:56.667Z|92860|dpif(handler11)|WARN|system@ovs-system: execute ct(zone=65520,nat),recirc(0x1) failed (not supported) on packet ip,vlan_tci=0x0000,dl_src=00:50:56:a6:fb:e3,dl_dst=00:50:56:a6:97:d2,nw_src=10.176.37.208,nw_dst=10.176.37.205,nw_proto=4,nw_tos=0,nw_ecn=0,nw_ttl=63
 with metadata skb_priority(0),skb_mark(0),in_port(2) mtu 0
2021-01-19T11:29:56.950Z|92861|dpif(handler11)|WARN|system@ovs-system: execute ct(zone=65520,nat),recirc(0x1) failed (not supported) on packet ip,vlan_tci=0x0000,dl_src=00:50:56:a6:95:2e,dl_dst=00:50:56:a6:83:25,nw_src=10.176.37.224,nw_dst=10.176.37.218,nw_proto=4,nw_tos=0,nw_ecn=0,nw_ttl=63
 with metadata skb_priority(0),skb_mark(0),in_port(2) mtu 0
2021-01-19T11:29:56.950Z|92862|dpif(handler11)|WARN|system@ovs-system: execute ct(zone=65520,nat),recirc(0x1) failed (not supported) on packet ip,vlan_tci=0x0000,dl_src=00:50:56:a6:95:2e,dl_dst=00:50:56:a6:83:25,nw_src=10.176.37.224,nw_dst=10.176.37.218,nw_proto=4,nw_tos=0,nw_ecn=0,nw_ttl=63
 with metadata skb_priority(0),skb_mark(0),in_port(2) mtu 0
@perithompson perithompson added the kind/bug Categorizes issue or PR as related to a bug. label Jan 19, 2021
@ruicao93 ruicao93 self-assigned this Jan 19, 2021
@ruicao93 ruicao93 added the area/OS/windows Issues or PRs related to the Windows operating system. label Jan 19, 2021
@antoninbas
Contributor

That doesn't really help with this issue :), but I just wanted to point out that the ping test "succeeding" is a bit misleading. ICMP traffic is not load-balanced to an actual Pod here. If I recall correctly, for Windows Nodes, we create a "Service interface" on the Node and all ClusterIPs are assigned to it. So when the ping test "succeeds" (or the traceroute completes), it's just the host replying directly to the ICMP requests, with no proxy / load-balancing involved.
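
Since the ICMP reply comes from the host itself, a TCP connection attempt is the more meaningful check. A minimal sketch in Python, assuming it is run from inside the Pod; the helper name tcp_reachable is made up for illustration and is not part of Antrea or Kubernetes:

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP handshake with host:port completes within timeout.

    Unlike ping, this exercises the same path as real Service traffic:
    the connection has to be translated to a backend that answers.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS resolution failures.
        return False


# Example call (not executed here):
# tcp_reachable("kubernetes.default.svc.cluster.local", 443)
```

From an affected Windows Pod, the example call would be expected to return False, in line with the tnc and curl output in the report.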

@antoninbas
Contributor

@ruicao93 does it mean that all Pod-to-Service traffic is broken for Windows Nodes in the latest release, or is this some edge case?

@antoninbas antoninbas added this to the Antrea v0.13.0 release milestone Jan 19, 2021
@ruicao93
Contributor

> @ruicao93 does it mean that all Pod-to-Service traffic is broken for Windows Nodes in the latest release, or is this some edge case?

@antoninbas: I think only Services like "kubernetes" are broken, because the Endpoint IP is a Node IP instead of a Pod IP.
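
One way to tell the two kinds of endpoint apart is to check whether the endpoint IP falls inside the Pod CIDR. A minimal sketch, assuming a Pod CIDR of 192.168.0.0/16; that value is only a guess based on the Pod address 192.168.4.16 in the report, and the function name is hypothetical:

```python
import ipaddress


def endpoint_outside_pod_cidr(endpoint_ip: str, pod_cidr: str) -> bool:
    """Return True if the endpoint IP is not part of the Pod CIDR.

    Such endpoints (for example a hostNetwork Pod using the Node IP)
    are the ones affected in this issue.
    """
    return ipaddress.ip_address(endpoint_ip) not in ipaddress.ip_network(pod_cidr)


# Assumed Pod CIDR; the real value is not given in this thread.
ASSUMED_POD_CIDR = "192.168.0.0/16"

# API server endpoint taken from the kube-proxy log line in the report.
print(endpoint_outside_pod_cidr("10.176.37.186", ASSUMED_POD_CIDR))
# Pod address taken from the tnc output in the report.
print(endpoint_outside_pod_cidr("192.168.4.16", ASSUMED_POD_CIDR))
```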

@ruicao93 ruicao93 added the p0 label Jan 20, 2021
@perithompson
Contributor Author

@antoninbas, @ruicao93 is correct. We've only seen this on the Kubernetes API internal Service; I was going through this with him yesterday. The only thing I could think of was that the Service has a double hop (readfrom tcp 10.96.0.1:443->10.96.0.1:49752: read tcp 10.176.37.239:49753->10.176.37.186:6443), although I don't know if that is unique. I wondered whether the result would be the same if you created a Service pointing to other external resources via an Endpoint IP, but I haven't tested that yet.

@ruicao93
Contributor

ruicao93 commented Jan 20, 2021

@perithompson: The root cause should be that, in this case, we do both DNAT and SNAT in the same pipeline, which, as tested, is not supported by OVS.

So we need a new design that splits these two NAT operations into different CT_ZONEs. I will propose a new design soon.
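
To make the DNAT-then-SNAT sequence concrete, here is a toy Python model. It is not Antrea's or OVS's actual implementation: the zone numbers, class and function names are all hypothetical, and the addresses are taken from the kube-proxy log line in the report. It only illustrates recording each translation in its own zone.

```python
from dataclasses import dataclass, replace

# Hypothetical zone IDs for illustration only; not the values Antrea uses.
DNAT_ZONE = 1
SNAT_ZONE = 2


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    dport: int


def dnat(pkt, services, conntrack):
    """Rewrite a ClusterIP destination to the selected endpoint (DNAT zone)."""
    endpoint = services.get((pkt.dst, pkt.dport))
    if endpoint is None:
        return pkt  # not a Service address, leave the packet untouched
    translated = replace(pkt, dst=endpoint[0], dport=endpoint[1])
    conntrack.setdefault(DNAT_ZONE, []).append((pkt, translated))
    return translated


def snat(pkt, node_ip, conntrack):
    """Rewrite the source to the Node IP, tracked in a separate zone."""
    translated = replace(pkt, src=node_ip)
    conntrack.setdefault(SNAT_ZONE, []).append((pkt, translated))
    return translated


# ClusterIP 10.96.0.1:443 backed by the API server at 10.176.37.186:6443.
services = {("10.96.0.1", 443): ("10.176.37.186", 6443)}
conntrack = {}

request = Packet(src="192.168.4.16", dst="10.96.0.1", dport=443)
after_dnat = dnat(request, services, conntrack)
after_snat = snat(after_dnat, "10.176.37.239", conntrack)
print(after_snat)
```

Keeping one list per zone mirrors the idea of the proposed fix at a very high level: each NAT step has its own connection-tracking state instead of both sharing one.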

ruicao93 added a commit to ruicao93/antrea that referenced this issue Feb 8, 2021
When a Pod accesses a cluster Service and the selected endpoint uses
the Node IP (hostNetwork mode), the request packets need to be SNATed
after they have been DNATed. On Windows Nodes, Antrea applied both
DNAT and SNAT in the same ct_zone. That's not supported by OVS.

In this patch, we introduce a new ct_zone to track this kind of
SNATed connection in a different ct_zone.

Fixes: antrea-io#1759

Signed-off-by: Rui Cao <[email protected]>
ruicao93 added a commit to ruicao93/antrea that referenced this issue Feb 8, 2021
When a Pod accesses a ClusterIP Service and the IP of the selected
endpoint is not in "cluster-cidr", the request packets need to be
SNAT'd after they have been DNAT'd. For example, the endpoint Pod may
run in hostNetwork and the IP of the endpoint is the current
Node IP. Currently, on Windows Nodes, Antrea applies both DNAT
and SNAT in the same ct_zone. That's not supported by OVS.

In this patch, we introduce a new ct_zone to track this kind of
SNATed connection in a different ct_zone.

Fixes: antrea-io#1759

Signed-off-by: Rui Cao <[email protected]>
ruicao93 added a commit that referenced this issue Feb 9, 2021
…usterIP Service (#1824)

When a Pod accesses a ClusterIP Service and the IP of the selected
endpoint is not in "cluster-cidr", the request packets need to be
SNAT'd after they have been DNAT'd. For example, the endpoint Pod may
run in hostNetwork and the IP of the endpoint is the current
Node IP. Currently, on Windows Nodes, Antrea applies both DNAT
and SNAT in the same ct_zone. That's not supported by OVS.

In this patch, we introduce a new ct_zone to track this kind of
SNATed connection in a different ct_zone.

Fixes: #1759

Signed-off-by: Rui Cao <[email protected]>
antoninbas pushed a commit to antoninbas/antrea that referenced this issue Feb 10, 2021
…usterIP Service (antrea-io#1824)

antoninbas pushed a commit that referenced this issue Feb 11, 2021
…usterIP Service (#1824)
