
support ipvs mode for kube-proxy #692

Closed
wants to merge 1 commit into from

Conversation


@m1093782566 m1093782566 commented Jun 7, 2017

Implement IPVS-based in-cluster service load balancing. It can provide a performance improvement and other benefits to kube-proxy compared with the iptables and userspace modes. Besides, it also supports more sophisticated load balancing algorithms than iptables (least connections, weighted, hash, and so on).

related issue: kubernetes/kubernetes#17470 kubernetes/kubernetes#44063

related PR: kubernetes/kubernetes#46580 kubernetes/kubernetes#48994

@thockin @quinton-hoole @wojtek-t

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jun 7, 2017
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 7, 2017
@dhilipkumars
Contributor

@m1093782566 Thank you for reacting quickly. Could we add some of the statistics we collected during our experiment, like a table that compares iptables vs. IPVS with a few thousand services created?

@m1093782566
Author

@dhilipkumars I think @haibinxie has the original statistics.

@m1093782566
Author

IPVS vs. IPTables Latency to Add Rules

Measured with iptables and ipvsadm. Observations:

  • In iptables mode, the latency to add rules increases significantly as the number of services grows.

  • In IPVS mode, the latency to add a VIP and backend IPs does not increase as the number of services grows.

| | 1 service | 5000 services | 20000 services |
| --- | --- | --- | --- |
| number of rules | 8 | 40000 | 160000 |
| iptables latency to add rules | 2ms | 11min | 5hours |
| ipvs latency to add rules | 2ms | 2ms | 2ms |

@dhilipkumars I am not sure if these statistics are sufficient.

@spiffxp
Member

spiffxp commented Jun 7, 2017

@spiffxp
Member

spiffxp commented Jun 7, 2017

@kubernetes/sig-network-proposals


### Network policy

For IPVS NAT mode to work, **all packets from the realservers to the client must go through the director**. It means the ipvs proxy should do SNAT in an L3-overlay network (such as flannel) for cross-host network communication. When a container requests a cluster IP to visit an endpoint on another host, we should enable `--masquerade-all` for the ipvs proxy, **which will break network policy**.


Breaking a soon-to-be GA feature seems like an edge case that we probably cannot overlook. Even with something in alpha form, I think we should be sure that all existing functionality is supported.


SNAT is required for cross-host communication, so there has to be a compromise somewhere, probably an awareness note about the situation in the release notes. I am also open to any solution/fix that we can work towards.

Contributor

Apart from being a soon-to-be GA feature, network policy is important enough that people would probably trade performance for it. A draft idea is to not do SNAT; rather, on each host, add a new routing table and fwmark all service traffic into this table. The routing table will take care of routing the packet back. There are a lot of caveats, and I don't know if it will ever work at all.

Member

When I prototyped IPVS, and I just re-did my tests, I don't see a need to SNAT. I tcpdumped it at both ends.

client pod has IP P
service has IP S
backend has IP B

P -> S packet
leaves pod netns to root netns
IPVS rewrites destination to B (packet is src:P, dst:B)
arrives at B's node's root netns
forwarded to B
response is src:B, dst:P
arrives at P's node's root netns
conntrack converts packet to src:S, dest:P

This works (on GCP, at least) because the only route to P is back to P's node. I can see how it might not work in some environments, but frankly, if this can't work in most environments, we should maybe not put it in kube-proxy.

Preserving client IP has been a huge concern, and I am not inclined to throw that away. WRT NetworkPolicy, this doesn't break the API, but it does end up breaking almost every implementation (in a really not-obviously-fixable way).

Author

@m1093782566 m1093782566 Jun 13, 2017

I am not sure if the L3 overlay (flannel with VXLAN backend) is the issue. Maybe I need the reverse traffic to go through the IPVS proxy so that the traffic reaches the source pod (after performing DNAT). I will re-test it in my environment.

Author

@murali-reddy do you have any idea?

@murali-reddy murali-reddy Jun 14, 2017

Agree with @thockin, there is no need for SNAT, at least not for all traffic paths. When a pod accesses a cluster IP or node port, IPVS does DNAT. On the reverse path, pretty much always (I can not think of any pod networking solution where it's not true), the route to the source pod is back through the source pod's node.

Where it gets tricky is when a node port is accessed from outside the cluster. The node on which the destination pod is running may route traffic directly back to the client through its default gateway. To prevent that, we need to SNAT the traffic so the return traffic goes through the same node through which the client accessed the node port. We do lose the source IP in this case. But AFAIK this is not unique to the use of IPVS; even the iptables kube-proxy has to do that SNAT.

FWIW, I have implemented logic in kube-router to deal with an external client accessing a node port. I just tested Flannel VXLAN + IPVS service proxy + network policy, and I don't see any issue. I don't see a reason to do SNAT. Please test it with your POC and see if you can remove this restriction.

Author

Thanks @murali-reddy.

Author

@m1093782566 m1093782566 Jun 14, 2017

I re-tested in my environment (Flannel VXLAN + IPVS service proxy) and found something different.

There is an ipvs service with 2 destinations which are on different hosts.

[root@SHA1000130405 home]# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn     
TCP  10.102.128.4:3080 rr
  -> 10.244.0.235:8080            Masq    1      0          0         
  -> 10.244.2.123:8080            Masq    1      0          0 

There were no SNAT rules applied, and I curled the VIP from each container.

# In container 10.244.0.235

$curl 10.102.128.4:3080
# get response from 10.244.2.123:8080

I ran tcpdump on the other host to see what the source IP was.

$ tcpdump -i flannel.1

20:44:48.021765 IP 10.244.0.235.36844 > 10.244.2.123.webcache: Flags [S], seq 519767460, win 28200, options [mss 1410,sackOK,TS val 416
20:44:48.021998 IP 10.244.2.123.webcache > 10.244.0.235.36844: Flags [S.], seq 1131844123, ack 519767461, win 27960, options [mss 1410,76,nop,wscale 7], length 0

The output shows that the source IP was not changed.

So, it seems that there is no need to do SNAT for cross-host communication.

However, when ipvs scheduled the request to the source container itself, there was no response returned. It seems that the packet is dropped when the traffic reaches the source container.

I can confirm that it has nothing to do with same host vs. cross host; the issue is whether the request hits the source container itself.

When I applied SNAT rules as the iptables proxy does, the container could reach itself via the VIP.

iptables -t nat -A PREROUTING -s 10.244.0.235 -j KUBE-MARK-MASQ

Unfortunately, the source container lost its IP and the source IP became flannel.1's IP.

Anyway, it's good news to me that SNAT is unnecessary for cross-host communication, so it won't break network policy, though I need more knowledge to fix the issue I mentioned above.


Again, this is nothing specific to IPVS; please search for hairpin-related issues with kube-proxy/kubelet: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/#a-pod-cannot-reach-itself-via-service-ip

- container -> host (cross host)


## TODO


I'm not sure if the TODO is necessary since it should be part of the PR steps.

Author

Fixed. Thanks.

@m1093782566
Author

@spiffxp Yes. @haibinxie wrote the doc, and I translated it to a markdown file and added some details.

Contributor

@ddysher ddysher left a comment

Thanks for putting this together.

@@ -0,0 +1,152 @@
# Alpha Version IPVS Load Balancing Mode in Kubernetes
Contributor

It's odd to see a proposal just for the alpha version. What's the plan for beta and stable?

Member

Yeah, the proposal is not "for alpha". alpha is just a milestone.

Author

Okay, will fix.


For more details about it, refer to [http://kb.linuxvirtualserver.org/wiki/Ipvsadm](http://kb.linuxvirtualserver.org/wiki/Ipvsadm)

In order to clean up inactive rules (including iptables rules and ipvs rules), we will introduce a new kube-proxy parameter `--cleanup-proxyrules` and mark the older `--cleanup-iptables` deprecated. Unfortunately, since there is no way to distinguish whether an ipvs service was created by the ipvs proxy or by another process, `--cleanup-proxyrules` will clear all ipvs services on a host.
Contributor

Is there absolutely no way? We tainted some cluster nodes with ipvs for external load balancing; if cleanup clears all rules, things will break. Using ipvs to load balance external traffic is not uncommon, I think.

`--cleanup-proxyrules` will clear all ipvs services on a host.

Author

@ddysher Do you have any good ideas?

Contributor

Not that I know of. It's probably OK to just call this out in the document or the flag comment. Users with existing ipvs services should use it with caution. If one just wants to clean up kubernetes-related ipvs services, just start the proxy and clear any unwanted k8s services.

Contributor

Yes. We can add this to the document or the flag comment.

Member

Can you only clean rules with RFC 1918 addresses? Or only rules in the service IP range?

Author

@m1093782566 m1093782566 Jun 15, 2017

Can you only clean rules with RFC 1918 addresses?
[m1093782566] It still has a possibility of clearing ipvs rules created by other processes, although the possibility is low.

[m1093782566] Kube-apiserver knows the service cluster IP range through its `--service-cluster-ip-range` parameter. However, kube-proxy knows nothing about that. I don't suggest adding a new `--service-cluster-ip-range` flag to kube-proxy, since it would easily conflict with kube-apiserver's flag. Even if kube-proxy cleared all ipvs rules in the service cluster IP range, it might still leave some ipvs rules for external IPs, external LB ingress IPs, and node IPs.

Clearing all ipvs rules with RFC 1918 addresses is probably easier to implement and has a lower chance of clearing a user's existing ipvs rules.

@fisherxu What do you think about it?
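A minimal Go sketch of the RFC 1918 filtering idea discussed here (standard library only; the helper name and the cleanup hook are illustrative, not actual kube-proxy code):

```go
package main

import (
	"fmt"
	"net"
)

// rfc1918Blocks are the private ranges referred to above.
var rfc1918Blocks = []string{"10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"}

// isRFC1918 reports whether ip falls inside any RFC 1918 range.
func isRFC1918(ip net.IP) bool {
	for _, cidr := range rfc1918Blocks {
		_, block, err := net.ParseCIDR(cidr)
		if err != nil {
			continue
		}
		if block.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	// A cleanup pass could skip any IPVS virtual server whose VIP is not private.
	for _, vip := range []string{"10.102.128.4", "8.8.8.8"} {
		fmt.Printf("%s private=%v\n", vip, isRFC1918(net.ParseIP(vip)))
	}
}
```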


In order to clean up inactive rules (including iptables rules and ipvs rules), we will introduce a new kube-proxy parameter `--cleanup-proxyrules` and mark the older `--cleanup-iptables` deprecated. Unfortunately, since there is no way to distinguish whether an ipvs service was created by the ipvs proxy or by another process, `--cleanup-proxyrules` will clear all ipvs services on a host.

### Change to build
Contributor

The doc linked somewhere uses seesaw; I suppose we changed to libnetwork afterwards?

Author

Yes. We changed to libnetwork to avoid the cgo and libnl dependencies.


### IPVS setup and network topology

IPVS is a replacement for IPTables as the load balancer; it's assumed the reader of this proposal is familiar with the IPTables load balancer mode. We will create a dummy interface and assign all service cluster IPs to the dummy interface (maybe called `kube0`). In the alpha version, we will implicitly use NAT mode.
Contributor

iptables is already widely used now; it's better to say 'an alternative to' instead of replacement IMO

Author

Fixed. Thanks for reminding.


IPVS is a replacement for IPTables as the load balancer; it's assumed the reader of this proposal is familiar with the IPTables load balancer mode. We will create a dummy interface and assign all service cluster IPs to the dummy interface (maybe called `kube0`). In the alpha version, we will implicitly use NAT mode.

We will create some ipvs services for each kubernetes service. The VIP of an ipvs service corresponds to an accessible IP (such as the cluster IP, external IP, node IP, ingress IP, etc.) of the kubernetes service. Each destination of an ipvs service corresponds to a kubernetes service endpoint.
Contributor

@ddysher ddysher Jun 8, 2017

Can we be more specific about "some ipvs services"? I suppose it's "one ipvs for each kubernetes service, port and protocol combination"?

Author

Suppose a kubernetes service is of NodePort type. Then the ipvs proxier will create two ipvs services: one is NodeIP:NodePort and the other is ClusterIP:Port.

Of course, I will explain it in the doc. Thanks.

Author

Update:

Since the ipvs proxier will fall back on iptables to support NodePort-type services, I should give another example.

Note that the relationship between a kubernetes service and ipvs services is 1:N. The address of an ipvs service corresponds to one of the service's access IPs, such as the cluster IP, an external IP, or LB.ingress.IP. If a kubernetes service has more than one access IP (for example, an external-IP-type service has 2 access IPs: the cluster IP and the external IP), then the ipvs proxier will create 2 ipvs services, one for the cluster IP and the other for the external IP.
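For illustration, a small Go sketch of this 1:N mapping (the types and field names are hypothetical, not the real kube-proxy structures):

```go
package main

import "fmt"

// servicePort is a hypothetical, simplified view of a kubernetes service port.
type servicePort struct {
	clusterIP    string
	externalIPs  []string
	lbIngressIPs []string
	port         int
}

// virtualServerAddrs returns one IPVS virtual-server address per access IP,
// mirroring the 1:N relationship described above.
func virtualServerAddrs(sp servicePort) []string {
	addrs := []string{fmt.Sprintf("%s:%d", sp.clusterIP, sp.port)}
	for _, ip := range append(sp.externalIPs, sp.lbIngressIPs...) {
		addrs = append(addrs, fmt.Sprintf("%s:%d", ip, sp.port))
	}
	return addrs
}

func main() {
	sp := servicePort{clusterIP: "10.102.128.4", externalIPs: []string{"1.2.3.4"}, port: 3080}
	fmt.Println(virtualServerAddrs(sp)) // [10.102.128.4:3080 1.2.3.4:3080]
}
```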

```
ipvsadm -A -t 10.244.1.100:8080 -s rr -p [timeout]
```

When a service specifies session affinity, the ipvs proxy will assign a timeout value (180min by default) to the ipvs service.
Contributor

180s?

Author


## Other design considerations

### IPVS setup and network topology
Contributor

The design mixes ipvs and iptables rules; can we have a section dedicated to explaining the interaction between ipvs and iptables, and which is responsible for which requirements?

Member

+1 - this needs to detail every sort of flow and every feature of Services.

Author

@m1093782566 m1093782566 Jul 24, 2017

Yes, I created a section "when fall back on iptables" to explain the interaction between ipvs and iptables. Thanks!

@m1093782566
Author

@@ -0,0 +1,152 @@
# Alpha Version IPVS Load Balancing Mode in Kubernetes
Contributor

Please add the author's name

Author

Fixed. Thanks!


### NodePort type service support

For a NodePort-type service, the IPVS proxy will take all accessible IPs on a host as the virtual IPs of the ipvs service. Specifically, accessible IPs exclude `lo`, `docker0`, `vethxxx`, `cni0`, `flannel0`, etc. Currently, we assume they are the IPs bound to `eth{i}`.
Contributor

@ddysher ddysher Jun 8, 2017

Sorry, this comment somehow got lost.

Why do we enforce the interface name? E.g. assuming `eth{x}` won't work with predictable network interface names on newer systemd.

Member

Correct. This is a challenging design constraint. Do we need to use IPVS for NodePorts or can that fall back on iptables?

Author

What about taking the addresses of all network interfaces in the UP state (except the `vethxxx` ones) as the node IPs?
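A rough Go sketch of that idea using only the standard library (the interface-name filter is a heuristic for illustration, not the actual kube-proxy logic):

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// nodeLocalIPs collects the IPv4 addresses of every interface that is UP,
// skipping loopback and veth devices.
func nodeLocalIPs() ([]net.IP, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return nil, err
	}
	var ips []net.IP
	for _, iface := range ifaces {
		if iface.Flags&net.FlagUp == 0 ||
			iface.Flags&net.FlagLoopback != 0 ||
			strings.HasPrefix(iface.Name, "veth") {
			continue
		}
		addrs, err := iface.Addrs()
		if err != nil {
			continue
		}
		for _, a := range addrs {
			if ipnet, ok := a.(*net.IPNet); ok && ipnet.IP.To4() != nil {
				ips = append(ips, ipnet.IP)
			}
		}
	}
	return ips, nil
}

func main() {
	ips, err := nodeLocalIPs()
	if err != nil {
		panic(err)
	}
	fmt.Println(ips) // candidate node IPs for NodePort-type ipvs services
}
```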

Author

Updates:

As discussed in the sig-network meeting, the IPVS proxier will fall back on iptables to support NodePort-type services.

Member

@thockin thockin left a comment

Thanks for the doc. This needs a LOT more detail, though. I flagged a few big issues, but I don't feel like this is covering the depth we need.

Is this built around exec ipvsadm?

I'd like to see pseudo-code explaining how the resync loop works, and what the intermediate state looks like.

How do you prevent dropped packets during updates?

How do you do cleanups across proxy restarts, where you might have lost information? (e.g. create service A and B, you crash, service A gets deleted, you restart, you get a sync for B - what happens to A?)

How does this scale (if I have 10,000 services and 5 backends each, is this 50,000 exec calls?)

How will you use/expose the different algorithms?

I am sure I have more questions. This is one of the most significant changes in the history of kube-proxy and kubernetes services. Please spend some time helping me not be scared of it. :)



### Sync period

Similar to the iptables proxy, the IPVS proxy will do a full sync loop every 10 seconds by default. Besides, every update to a kubernetes service or endpoint will trigger an ipvs service and destination update.
Member

How will you do changes to Services without downtime? If I change the session affinity, for example, you shouldn't take a service down to change it.

Author

@m1093782566 m1093782566 Jun 15, 2017

Changing session affinity will call the UpdateService API, which directly sends an update command to the kernel via socket communication and won't take the service down.

Author

@dhilipkumars did a test and found that an ipvs update did not disrupt the service or even the existing connections.




For IPVS NAT mode to work, **all packets from the realservers to the client must go through the director**. It means the ipvs proxy should do SNAT in an L3-overlay network (such as flannel) for cross-host network communication. When a container requests a cluster IP to visit an endpoint on another host, we should enable `--masquerade-all` for the ipvs proxy, **which will break network policy**.

## Test validation
Member

I tried to enumerate everything you have to test:

pod -> pod, same VM
pod -> pod, other VM
pod -> own VM, own hostPort
pod -> own VM, other hostPort
pod -> other VM, other hostPort

pod -> own VM
pod -> other VM
pod -> internet
pod -> http://metadata

VM -> pod, same VM
VM -> pod, other VM
VM -> same VM hostPort
VM -> other VM hostPort

pod -> own clusterIP, hairpin
pod -> own clusterIP, same VM, other pod, no port remap
pod -> own clusterIP, same VM, other pod, port remap
pod -> own clusterIP, other VM, other pod, no port remap
pod -> own clusterIP, other VM, other pod, port remap
pod -> other clusterIP, same VM, no port remap
pod -> other clusterIP, same VM, port remap
pod -> other clusterIP, other VM, no port remap
pod -> other clusterIP, other VM, port remap
pod -> own node, own nodePort, hairpin
pod -> own node, own nodePort, policy=local
pod -> own node, own nodePort, same VM
pod -> own node, own nodePort, other VM
pod -> own node, other nodePort, policy=local
pod -> own node, other nodePort, same VM
pod -> own node, other nodePort, other VM
pod -> other node, own nodeport, policy=local
pod -> other node, own nodeport, same VM
pod -> other node, own nodeport, other VM
pod -> other node, other nodeport, policy=local
pod -> other node, other nodeport, same VM
pod -> other node, other nodeport, other VM
pod -> own external LB, no remap, policy=local
pod -> own external LB, no remap, same VM
pod -> own external LB, no remap, other VM
pod -> own external LB, remap, policy=local
pod -> own external LB, remap, same VM
pod -> own external LB, remap, other VM

VM -> same VM nodePort, policy=local
VM -> same VM nodePort, same VM
VM -> same VM nodePort, other VM
VM -> other VM nodePort, policy=local
VM -> other VM nodePort, same VM
VM -> other VM nodePort, other VM

VM -> external LB

public -> nodeport, policy=local
public -> nodeport, policy=global
public -> external LB, no remap, policy=local
public -> external LB, no remap, policy=global
public -> external LB, remap, policy=local
public -> external LB, remap, policy=global

public -> nodeport, manual backend
public -> external LB, manual backend

Author

super!

@haibinxie

@thockin
Let me know if this helps.

Is this built around exec ipvsadm?
[Haibin Michael Xie] This is built on top of libnetwork, which talks to the kernel via socket communication, not on top of ipvsadm.

I'd like to see pseudo-code explaining how the resync loop works, and what the intermediate state looks like.

How do you prevent dropped packets during updates?
[Haibin Michael Xie] Do you mean OS updates? I don't know the full picture of how iptables handles it; IMO this is no different from iptables.

How do you do cleanups across proxy restarts, where you might have lost information? (e.g. create service A and B, you crash, service A gets deleted, you restart, you get a sync for B - what happens to A?)
[Haibin Michael Xie] There is a periodic full resync and an in-memory-cache-based diff. This is already handled in the iptables proxy, and there is no difference in this regard.

How does this scale (if I have 10,000 services and 5 backends each, is this 50,000 exec calls?)
[Haibin Michael Xie] Same as above: libnetwork uses a socket to talk to the kernel, which is very efficient.

How will you use/expose the different algorithms?
[Haibin Michael Xie] If you mean the LB algorithm, it is already mentioned in the proposal: kube-proxy has a new parameter `--ipvs-scheduler`.

@m1093782566
Author

m1093782566 commented Jun 14, 2017

@thockin

I re-tested and found something different. SNAT is not required for cross-host communication, so it won't break network policy. It's really a big finding to me :)

I found that the packet is dropped when a container visits the VIP and the real backend is itself. I have no idea why yet, but I will try to find out.

@m1093782566
Author

m1093782566 commented Jun 15, 2017

I will update the proposal and try to address the review comments in the newer proposal.

@thockin I will add pseudo-code explaining how the resync loop works. Thanks.
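A compilable Go sketch of what such a resync loop could look like (the `vserver` type and `ipvsInterface` are hypothetical stand-ins, not the real kube-proxy or libnetwork APIs); adds and in-place updates are applied first and stale entries are deleted last, so existing virtual servers are never torn down just to be re-added:

```go
package ipvsproxy

// vserver describes an IPVS virtual server in this sketch.
type vserver struct {
	addr      string // VIP:port
	scheduler string // rr, wlc, sh, ...
}

// ipvsInterface is a hypothetical stand-in for the kernel IPVS API.
type ipvsInterface interface {
	GetServices() ([]vserver, error)
	AddService(vserver) error
	UpdateService(vserver) error
	DelService(vserver) error
}

// syncProxyRules reconciles the kernel state with the desired state built
// from the in-memory service/endpoints cache.
func syncProxyRules(ipvs ipvsInterface, desired map[string]vserver) error {
	current, err := ipvs.GetServices()
	if err != nil {
		return err
	}
	existing := make(map[string]vserver, len(current))
	for _, s := range current {
		existing[s.addr] = s
	}
	for addr, want := range desired {
		got, ok := existing[addr]
		switch {
		case !ok:
			if err := ipvs.AddService(want); err != nil {
				return err
			}
		case got != want:
			if err := ipvs.UpdateService(want); err != nil {
				return err
			}
		}
		// Destinations (endpoints) for each virtual server would be
		// reconciled with the same add/update/delete pattern.
	}
	// Delete only the stale virtual servers that are no longer desired.
	for addr, got := range existing {
		if _, ok := desired[addr]; !ok {
			if err := ipvs.DelService(got); err != nil {
				return err
			}
		}
	}
	return nil
}
```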

@m1093782566 m1093782566 force-pushed the ipvs-proxy branch 2 times, most recently from c0a5b3e to ebb958f Compare June 16, 2017 07:08
@m1093782566
Author

m1093782566 commented Jun 16, 2017

@thockin @ddysher @cmluciano @murali-reddy I updated the proposal according to the review comments and added more details. PTAL.

Any comments are welcome. Thanks.

/cc @haibinxie @ThomasZhou

@k8s-ci-robot
Contributor

@m1093782566: GitHub didn't allow me to request PR reviews from the following users: haibinxie.

Note that only kubernetes members can review this PR, and authors cannot review their own PRs.

In response to this:

@thockin @ddysher @cmluciano I update the proposals according review comments and add more details. PTAL.

Any comments are welcomed. Thanks.

/cc @haibinxie

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@m1093782566 m1093782566 force-pushed the ipvs-proxy branch 2 times, most recently from 934152c to a552531 Compare June 16, 2017 07:39
@m1093782566
Author

m1093782566 commented Jun 16, 2017

@thockin

How do you prevent dropped packets during updates?

How will you do changes to Services without downtime? If I change the session affinity, for example, you shouldn't take a service down to change it.

According to @dhilipkumars's test result, an ipvs update did not disrupt the service or even the existing connection:

sudo ipvsadm -L -n
[sudo] password for d:
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.11.12.13:6379 wlc persistent 10
  -> 10.192.0.1:32768             Masq    1      1          0

The real backend service is redis-alpine. Connecting to the service:

docker run --net=host -it redis:3.0-alpine redis-cli -h 10.11.12.13 -p 6379
10.11.12.13:6379> info Clients
# Clients
connected_clients:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

In a parallel session, if I update the scheduler algorithm or the persistence timeout, the service is not disrupted. Ipvsadm and libnetwork's ipvs pkg work on the same principle, firing netlink messages to the kernel, so the behaviour should be the same.

Updating the persistence timeout: still no disruption.

$sudo ipvsadm -E -t 10.11.12.13:6379 -p 60
$sudo ipvsadm -L -n --persistent-conn
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port            Weight    PersistConn ActiveConn InActConn
  -> RemoteAddress:Port
TCP  10.11.12.13:6379 wlc persistent 60
  -> 10.192.0.1:32768             1         1           1          0

What other parameters should we test?

@haibinxie

@thockin Could you confirm whether @m1093782566's comment above is the right thing to do, and whether anything else is left before closing it? Please expect me to keep bothering you until it's closed :)

If you get a chance, we can have a quick phone call to review and address the issues on this; we want to commit to releasing the feature in 1.8.

@m1093782566
Author

@danwinship Do you have interest in reviewing this design proposal? Any comments are welcome. :)

@m1093782566
Author

m1093782566 commented Aug 14, 2017

Hi @thockin @ddysher

I just came up with an idea for implementing NodePort-type services via ipvs.

Can we take all the IP addresses that match `ADDRTYPE match dst-type LOCAL` as the addresses of the ipvs service? For example,

[root@100-106-179-225 ~]# ip route show table local type local
100.106.179.225 dev eth0  proto kernel  scope host  src 100.106.179.225 
127.0.0.0/8 dev lo  proto kernel  scope host  src 127.0.0.1 
127.0.0.1 dev lo  proto kernel  scope host  src 127.0.0.1 
172.16.0.0 dev flannel.1  proto kernel  scope host  src 172.16.0.0 
172.17.0.1 dev docker0  proto kernel  scope host  src 172.17.0.1 
192.168.122.1 dev virbr0  proto kernel  scope host  src 192.168.122.1 

Then, [100.106.179.225, 127.0.0.0/8, 127.0.0.1, 172.16.0.0, 172.17.0.1, 192.168.122.1] would be the addresses of the ipvs services for a NodePort service.

I assume the KUBE-NODEPORTS chain created by the iptables proxier does the same thing? For example,

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination         
KUBE-NODEPORTS  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

I am not opposed to implementing NodePort services via iptables; I just want to know whether the approach mentioned above makes sense. Or am I wrong?

Looking forward to receiving your opinions.

@m1093782566
Author

m1093782566 commented Aug 14, 2017

By doing this, I think we can remove the design constraint of assuming that the node IP is the address of `eth{x}`?

@m1093782566
Author

m1093782566 commented Aug 14, 2017

@feiskyer Do you have bandwidth to take a look at this proposal? Thanks :)

@feiskyer
Member

Using a list of IP addresses for nodePort has potential problems, e.g. IP addresses may change or new NICs may be added later. And I don't think watching for changes of IP addresses and NICs is a good idea.

Maybe using iptables for nodePort services is a better choice.

@m1093782566
Author

Glad to receive your feedback, @feiskyer

And I don't think watching for changes of IP addresses and NICs is a good idea.

Yes, I agree. Thanks for your thoughts.

@k8s-github-robot k8s-github-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 15, 2017
@luxas
Member

luxas commented Oct 8, 2017

ping @kubernetes/sig-network-feature-requests
Any movement here lately?

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 8, 2017
@haibinxie

@haibinxie

@luxas Alpha in 1.8; we are working on the beta release in 1.9.

@castrojo
Member

This change is Reviewable

@cmluciano

/keep-open

@spiffxp
Member

spiffxp commented Dec 14, 2017

/lifecycle frozen
@cmluciano I'm keeping this open on your behalf, if this is no longer relevant to keep open please /remove-lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Dec 14, 2017
@cmluciano

@m1093782566 Is there a PR that supersedes this one?

@m1093782566
Author

@cmluciano

NO.

This PR is the only design proposal. The IPVS proxier has already reached beta while this document is still pending, unfortunately.

@m1093782566
Author

/close

danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023