
support ipvs mode for kube-proxy #692

Closed
wants to merge 1 commit into from

Conversation


@m1093782566 m1093782566 commented Jun 7, 2017

Implement IPVS-based in-cluster service load balancing. It can provide a performance improvement and other benefits to kube-proxy compared with the iptables and userspace modes. Besides, it also supports more sophisticated load balancing algorithms than iptables (least connections, weighted, hash, and so on).

related issue: kubernetes/kubernetes#17470 kubernetes/kubernetes#44063

related PR: kubernetes/kubernetes#46580 kubernetes/kubernetes#48994

@thockin @quinton-hoole @wojtek-t

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jun 7, 2017
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 7, 2017
@dhilipkumars
Contributor

@m1093782566 Thank you for reacting quickly. Could we add some of the statistics we collected during our experiment, like a table that compares iptables vs. IPVS with a few thousand services created?

@m1093782566
Author

@dhilipkumars I think @haibinxie has the original statistics.

@m1093782566
Author

IPVS vs. IPTables Latency to Add Rules

Measured with iptables and ipvsadm. Observations:

  • In iptables mode, the latency to add rules increases significantly as the number of services grows.

  • In IPVS mode, the latency to add a VIP and backend IPs does not increase as the number of services grows.

| | 1 service | 5000 services | 20000 services |
| --- | --- | --- | --- |
| number of rules | 8 | 40000 | 160000 |
| iptables latency to add rules | 2ms | 11min | 5hours |
| ipvs latency to add rules | 2ms | 2ms | 2ms |

@dhilipkumars I am not sure if these statistics are sufficient.

@spiffxp
Member

spiffxp commented Jun 7, 2017

@spiffxp
Member

spiffxp commented Jun 7, 2017

@kubernetes/sig-network-proposals


### Network policy

For IPVS NAT mode to work, **all packets from the realservers to the client must go through the director**. It means the ipvs proxy should do SNAT in an L3-overlay network (such as flannel) for cross-host network communication. When a container requests a cluster IP to visit an endpoint on another host, we should enable `--masquerade-all` for the ipvs proxy, **which will break network policy**.


Breaking a soon-to-be GA feature seems like an edge case that we probably cannot overlook. Even with something in alpha form, I think we should be sure that all existing functionality is supported.


SNAT is required for cross-host communication, so there has to be a compromise somewhere, probably an awareness note about the situation in the release notes. I am also open to any solution/fix that we can work towards.

Contributor

Apart from being a soon-to-be GA feature, network policy is important enough that people would probably trade performance for it. A draft idea is to not do SNAT; rather, on each host, add a new routing table and fwmark all service traffic into this table. The routing table will take care of routing the packet back. There are a lot of caveats, and I don't know if it will ever work at all.

Member

When I prototyped IPVS, and I just re-did my tests, I don't see a need to SNAT. I tcpdumped it at both ends.

client pod has IP P
service has IP S
backend has IP B

P -> S packet
leaves pod netns to root netns
IPVS rewrites destination to B (packet is src:P, dst:B)
arrives at B's node's root netns
forwarded to B
response is src:B, dst:P
arrives at P's node's root netns
conntrack converts packet to src:S, dest:P

This works (on GCP, at least) because the only route to P is back to P's node. I can see how it might not work in some environments, but frankly, if this can't work in most environments, we should maybe not put it in kube-proxy.

Preserving client IP has been a huge concern, and I am not inclined to throw that away. WRT NetworkPolicy, this doesn't break the API, but it does end up breaking almost every implementation (in a really not-obviously-fixable way).

Author

@m1093782566 m1093782566 Jun 13, 2017

I am not sure if the L3 overlay (flannel with VXLAN backend) is the issue. Maybe I need the reverse traffic to go through the IPVS proxy so that the traffic reaches the source pod (after performing DNAT). I will re-test it in my environment.

Author

@murali-reddy do you have any idea?

@murali-reddy murali-reddy Jun 14, 2017

Agree with @thockin, there is no need for SNAT, at least not for all traffic paths. When a pod accesses a cluster IP or node port, IPVS does DNAT. On the reverse path, pretty much always (I can not think of any pod networking solution where it's not true), the route to the source pod is back through the source pod's node.

Where it gets tricky is when a node port is accessed from outside the cluster. The node on which the destination pod is running may route traffic directly back to the client through its default gateway. To prevent that, we need to SNAT the traffic so the return traffic goes through the same node through which the client accessed the node port. We do lose the source IP in this case. But AFAIK this is not unique to the use of IPVS; even the iptables kube-proxy has to do that SNAT.

FWIW, I have implemented logic in kube-router to deal with an external client accessing a node port. I just tested Flannel VXLAN + IPVS service proxy + network policy, and I don't see any issue. I don't see a reason to do SNAT. Please test it with your POC and see if you can remove this restriction.

Author

Thanks @murali-reddy.

Author

@m1093782566 m1093782566 Jun 14, 2017

I re-tested in my environment (Flannel VXLAN + IPVS service proxy) and found something different.

There is an ipvs service with 2 destinations which are on different hosts.

[root@SHA1000130405 home]# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn     
TCP  10.102.128.4:3080 rr
  -> 10.244.0.235:8080            Masq    1      0          0         
  -> 10.244.2.123:8080            Masq    1      0          0 

There were no SNAT rules applied, and I curled the VIP from each container.

# In container 10.244.0.235

$curl 10.102.128.4:3080
# get response from 10.244.2.123:8080

I ran tcpdump on the other host to see what the source IP was.

$ tcpdump -i flannel.1

20:44:48.021765 IP 10.244.0.235.36844 > 10.244.2.123.webcache: Flags [S], seq 519767460, win 28200, options [mss 1410,sackOK,TS val 416
20:44:48.021998 IP 10.244.2.123.webcache > 10.244.0.235.36844: Flags [S.], seq 1131844123, ack 519767461, win 27960, options [mss 1410,76,nop,wscale 7], length 0

The output shows that the source IP was not changed.

So, it seems that there is no need to do SNAT for cross-host communication.

However, when ipvs scheduled the request to the source container itself, there was no response returned. It seems that the packet is dropped when the traffic reaches the source container.

I can confirm that it has nothing to do with same host vs. cross host; the issue is whether the request hits the source container itself.

When I applied SNAT rules as the iptables proxy does, the container could reach itself via the VIP.

iptables -t nat -A PREROUTING -s 10.244.0.235 -j KUBE-MARK-MASQ

Unfortunately, the source container lost its IP and the source IP became flannel.1's IP.

Anyway, it's good news to me that SNAT is unnecessary for cross-host communication, so it won't break network policy, though I need more knowledge to fix the issue I mentioned above.


Again, this is nothing specific to IPVS; please search for hairpin-related issues with kube-proxy/kubelet: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/#a-pod-cannot-reach-itself-via-service-ip

- container -> host (cross host)


## TODO


I'm not sure if the TODO is necessary since it should be part of the PR steps.

Author

Fixed. Thanks.

@m1093782566
Author

@spiffxp Yes. @haibinxie wrote the doc, and I translated it to a markdown file and added some details.

Contributor

@ddysher ddysher left a comment

Thanks for putting this together.

@@ -0,0 +1,152 @@
# Alpha Version IPVS Load Balancing Mode in Kubernetes
Contributor

It's odd to see a proposal just for the alpha version. What's the plan for beta and stable?

Member

Yeah, the proposal is not "for alpha". alpha is just a milestone.

Author

Okay, will fix.


For more details about it, refer to [http://kb.linuxvirtualserver.org/wiki/Ipvsadm](http://kb.linuxvirtualserver.org/wiki/Ipvsadm)

In order to clean up inactive rules (including iptables rules and ipvs rules), we will introduce a new kube-proxy parameter `--cleanup-proxyrules` and mark the older `--cleanup-iptables` deprecated. Unfortunately, since there is no way to distinguish whether an ipvs service was created by the ipvs proxy or by another process, `--cleanup-proxyrules` will clear all ipvs services on a host.
Contributor

Is there absolutely no way? We tainted some cluster nodes with ipvs for external load balancing; if cleanup clears all rules, things will break. Using ipvs to load balance external traffic is not uncommon, I think.

`--cleanup-proxyrules` will clear all ipvs services on a host.

Author

@ddysher Do you have any good ideas?

Contributor

Not that I know of. It's probably OK to just call this out in the document or the flag comment. Users with existing ipvs services should use it with caution. If one just wants to clean up kubernetes-related ipvs services, just start the proxy and clear any unwanted k8s services.

Contributor

Yes. We can add this to the document or the flag comment.

Member

Can you only clean rules with RFC 1918 addresses? Or only rules in the service IP range?

Author

@m1093782566 m1093782566 Jun 15, 2017

Can you only clean rules with RFC 1918 addresses?
[m1093782566] It still has a possibility of clearing ipvs rules created by other processes, although the possibility is low.

[m1093782566] Kube-apiserver knows the service cluster IP range through its `--service-cluster-ip-range` parameter. However, kube-proxy knows nothing about that. I don't suggest adding a new `--service-cluster-ip-range` flag to kube-proxy, since it would easily conflict with kube-apiserver's flag. Even if kube-proxy cleared all ipvs rules in the service cluster IP range, it might still leave some ipvs rules for external IPs, external LB ingress IPs, and node IPs.

Clearing all ipvs rules with RFC 1918 addresses is probably easier to implement and has a lower chance of clearing a user's existing ipvs rules.

@fisherxu What do you think about it?
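A minimal Go sketch of the RFC 1918 filtering idea discussed here (standard library only; the helper name and the cleanup hook are illustrative, not actual kube-proxy code):

```go
package main

import (
	"fmt"
	"net"
)

// rfc1918Blocks are the private ranges referred to above.
var rfc1918Blocks = []string{"10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"}

// isRFC1918 reports whether ip falls inside any RFC 1918 range.
func isRFC1918(ip net.IP) bool {
	for _, cidr := range rfc1918Blocks {
		_, block, err := net.ParseCIDR(cidr)
		if err != nil {
			continue
		}
		if block.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	// A cleanup pass could skip any IPVS virtual server whose VIP is not private.
	for _, vip := range []string{"10.102.128.4", "8.8.8.8"} {
		fmt.Printf("%s private=%v\n", vip, isRFC1918(net.ParseIP(vip)))
	}
}
```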


In order to clean up inactive rules (including iptables rules and ipvs rules), we will introduce a new kube-proxy parameter `--cleanup-proxyrules` and mark the older `--cleanup-iptables` deprecated. Unfortunately, since there is no way to distinguish whether an ipvs service was created by the ipvs proxy or by another process, `--cleanup-proxyrules` will clear all ipvs services on a host.

### Change to build
Contributor

The doc linked somewhere uses seesaw; I suppose we changed to libnetwork afterwards?

Author

Yes. We changed to libnetwork to avoid the cgo and libnl dependencies.


### IPVS setup and network topology

IPVS is a replacement for IPTables as the load balancer; it's assumed the reader of this proposal is familiar with the IPTables load balancer mode. We will create a dummy interface and assign all service cluster IPs to the dummy interface (maybe called `kube0`). In the alpha version, we will implicitly use NAT mode.
Contributor

iptables is already widely used now; it's better to say 'an alternative to' instead of replacement IMO

Author

Fixed. Thanks for reminding.


IPVS is a replacement for IPTables as the load balancer; it's assumed the reader of this proposal is familiar with the IPTables load balancer mode. We will create a dummy interface and assign all service cluster IPs to the dummy interface (maybe called `kube0`). In the alpha version, we will implicitly use NAT mode.

We will create some ipvs services for each kubernetes service. The VIP of an ipvs service corresponds to an accessible IP (such as the cluster IP, external IP, node IP, ingress IP, etc.) of the kubernetes service. Each destination of an ipvs service corresponds to a kubernetes service endpoint.
Contributor

@ddysher ddysher Jun 8, 2017

Can we be more specific about "some ipvs services"? I suppose it's "one ipvs for each kubernetes service, port and protocol combination"?

Author

Suppose a kubernetes service is of NodePort type. Then the ipvs proxier will create two ipvs services: one is NodeIP:NodePort and the other is ClusterIP:Port.

Of course, I will explain it in the doc. Thanks.

Author

Update:

Since the ipvs proxier will fall back on iptables to support NodePort-type services, I should give another example.

Note that the relationship between a kubernetes service and ipvs services is 1:N. The address of an ipvs service corresponds to one of the service's access IPs, such as the cluster IP, an external IP, or LB.ingress.IP. If a kubernetes service has more than one access IP (for example, an external-IP-type service has 2 access IPs: the cluster IP and the external IP), then the ipvs proxier will create 2 ipvs services, one for the cluster IP and the other for the external IP.
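For illustration, a small Go sketch of this 1:N mapping (the types and field names are hypothetical, not the real kube-proxy structures):

```go
package main

import "fmt"

// servicePort is a hypothetical, simplified view of a kubernetes service port.
type servicePort struct {
	clusterIP    string
	externalIPs  []string
	lbIngressIPs []string
	port         int
}

// virtualServerAddrs returns one IPVS virtual-server address per access IP,
// mirroring the 1:N relationship described above.
func virtualServerAddrs(sp servicePort) []string {
	addrs := []string{fmt.Sprintf("%s:%d", sp.clusterIP, sp.port)}
	for _, ip := range append(sp.externalIPs, sp.lbIngressIPs...) {
		addrs = append(addrs, fmt.Sprintf("%s:%d", ip, sp.port))
	}
	return addrs
}

func main() {
	sp := servicePort{clusterIP: "10.102.128.4", externalIPs: []string{"1.2.3.4"}, port: 3080}
	fmt.Println(virtualServerAddrs(sp)) // [10.102.128.4:3080 1.2.3.4:3080]
}
```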

```
ipvsadm -A -t 10.244.1.100:8080 -s rr -p [timeout]
```

When a service specifies session affinity, the ipvs proxy will assign a timeout value (180min by default) to the ipvs service.
Contributor

180s?

Author


## Other design considerations

### IPVS setup and network topology
Contributor

The design mixes ipvs and iptables rules; can we have a section dedicated to explaining the interaction between ipvs and iptables, and which is responsible for which requirements?

Member

+1 - this needs to detail every sort of flow and every feature of Services.

Author

@m1093782566 m1093782566 Jul 24, 2017

Yes, I created a section "when fall back on iptables" to explain the interaction between ipvs and iptables. Thanks!

@m1093782566
Author

@@ -0,0 +1,152 @@
# Alpha Version IPVS Load Balancing Mode in Kubernetes
Contributor

Please add the author's name

Author

Fixed. Thanks!


### NodePort type service support

For a NodePort-type service, the IPVS proxy will take all accessible IPs on a host as the virtual IPs of the ipvs service. Specifically, accessible IPs exclude `lo`, `docker0`, `vethxxx`, `cni0`, `flannel0`, etc. Currently, we assume they are the IPs bound to `eth{i}`.
Contributor

@ddysher ddysher Jun 8, 2017

Sorry, this comment somehow got lost.

Why do we enforce the interface name? E.g. assuming `eth{x}` won't work with predictable network interface names on newer systemd.

Member

Correct. This is a challenging design constraint. Do we need to use IPVS for NodePorts or can that fall back on iptables?

Author

What about taking the addresses of all network interfaces in the UP state (except the `vethxxx` ones) as the node IPs?
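A rough Go sketch of that idea using only the standard library (the interface-name filter is a heuristic for illustration, not the actual kube-proxy logic):

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// nodeLocalIPs collects the IPv4 addresses of every interface that is UP,
// skipping loopback and veth devices.
func nodeLocalIPs() ([]net.IP, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return nil, err
	}
	var ips []net.IP
	for _, iface := range ifaces {
		if iface.Flags&net.FlagUp == 0 ||
			iface.Flags&net.FlagLoopback != 0 ||
			strings.HasPrefix(iface.Name, "veth") {
			continue
		}
		addrs, err := iface.Addrs()
		if err != nil {
			continue
		}
		for _, a := range addrs {
			if ipnet, ok := a.(*net.IPNet); ok && ipnet.IP.To4() != nil {
				ips = append(ips, ipnet.IP)
			}
		}
	}
	return ips, nil
}

func main() {
	ips, err := nodeLocalIPs()
	if err != nil {
		panic(err)
	}
	fmt.Println(ips) // candidate node IPs for NodePort-type ipvs services
}
```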

Author

Updates:

As discussed in the sig-network meeting, the IPVS proxier will fall back on iptables to support NodePort-type services.

Member

@thockin thockin left a comment

Thanks for the doc. This needs a LOT more detail, though. I flagged a few big issues, but I don't feel like this is covering the depth we need.

Is this built around exec ipvsadm?

I'd like to see pseudo-code explaining how the resync loop works, and what the intermediate state looks like.

How do you prevent dropped packets during updates?

How do you do cleanups across proxy restarts, where you might have lost information? (e.g. create service A and B, you crash, service A gets deleted, you restart, you get a sync for B - what happens to A?)

How does this scale (if I have 10,000 services and 5 backends each, is this 50,000 exec calls?)

How will you use/expose the different algorithms?

I am sure I have more questions. This is one of the most significant changes in the history of kube-proxy and kubernetes services. Please spend some time helping me not be scared of it. :)



### Sync period

Similar to the iptables proxy, the IPVS proxy will do a full sync loop every 10 seconds by default. Besides, every update to a kubernetes service or endpoint will trigger an ipvs service and destination update.
Member

How will you do changes to Services without downtime? If I change the session affinity, for example, you shouldn't take a service down to change it.

Author

@m1093782566 m1093782566 Jun 15, 2017

Changing session affinity will call the UpdateService API, which directly sends an update command to the kernel via socket communication and won't take the service down.

Author

@dhilipkumars did a test and found that an ipvs update did not disrupt the service or even the existing connections.




For IPVS NAT mode to work, **all packets from the realservers to the client must go through the director**. It means the ipvs proxy should do SNAT in an L3-overlay network (such as flannel) for cross-host network communication. When a container requests a cluster IP to visit an endpoint on another host, we should enable `--masquerade-all` for the ipvs proxy, **which will break network policy**.

## Test validation
Member

I tried to enumerate everything you have to test:

pod -> pod, same VM
pod -> pod, other VM
pod -> own VM, own hostPort
pod -> own VM, other hostPort
pod -> other VM, other hostPort

pod -> own VM
pod -> other VM
pod -> internet
pod -> http://metadata

VM -> pod, same VM
VM -> pod, other VM
VM -> same VM hostPort
VM -> other VM hostPort

pod -> own clusterIP, hairpin
pod -> own clusterIP, same VM, other pod, no port remap
pod -> own clusterIP, same VM, other pod, port remap
pod -> own clusterIP, other VM, other pod, no port remap
pod -> own clusterIP, other VM, other pod, port remap
pod -> other clusterIP, same VM, no port remap
pod -> other clusterIP, same VM, port remap
pod -> other clusterIP, other VM, no port remap
pod -> other clusterIP, other VM, port remap
pod -> own node, own nodePort, hairpin
pod -> own node, own nodePort, policy=local
pod -> own node, own nodePort, same VM
pod -> own node, own nodePort, other VM
pod -> own node, other nodePort, policy=local
pod -> own node, other nodePort, same VM
pod -> own node, other nodePort, other VM
pod -> other node, own nodeport, policy=local
pod -> other node, own nodeport, same VM
pod -> other node, own nodeport, other VM
pod -> other node, other nodeport, policy=local
pod -> other node, other nodeport, same VM
pod -> other node, other nodeport, other VM
pod -> own external LB, no remap, policy=local
pod -> own external LB, no remap, same VM
pod -> own external LB, no remap, other VM
pod -> own external LB, remap, policy=local
pod -> own external LB, remap, same VM
pod -> own external LB, remap, other VM

VM -> same VM nodePort, policy=local
VM -> same VM nodePort, same VM
VM -> same VM nodePort, other VM
VM -> other VM nodePort, policy=local
VM -> other VM nodePort, same VM
VM -> other VM nodePort, other VM

VM -> external LB

public -> nodeport, policy=local
public -> nodeport, policy=global
public -> external LB, no remap, policy=local
public -> external LB, no remap, policy=global
public -> external LB, remap, policy=local
public -> external LB, remap, policy=global

public -> nodeport, manual backend
public -> external LB, manual backend

Author

super!

@haibinxie

@thockin
Let me know if this helps.

Is this built around exec ipvsadm?
[Haibin Michael Xie] This is built on top of libnetwork, which talks to the kernel via socket communication, not on top of ipvsadm.

I'd like to see pseudo-code explaining how the resync loop works, and what the intermediate state looks like.

How do you prevent dropped packets during updates?
[Haibin Michael Xie] Do you mean OS updates? I don't know the full picture of how iptables handles it; IMO this is no different from iptables.

How do you do cleanups across proxy restarts, where you might have lost information? (e.g. create service A and B, you crash, service A gets deleted, you restart, you get a sync for B - what happens to A?)
[Haibin Michael Xie] There is a periodic full resync and an in-memory-cache-based diff. This is already handled in the iptables proxy, and there is no difference in this regard.

How does this scale (if I have 10,000 services and 5 backends each, is this 50,000 exec calls?)
[Haibin Michael Xie] Same as above: libnetwork uses a socket to talk to the kernel, which is very efficient.

How will you use/expose the different algorithms?
[Haibin Michael Xie] If you mean the LB algorithm, it is already mentioned in the proposal: kube-proxy has a new parameter `--ipvs-scheduler`.

@m1093782566
Author

m1093782566 commented Jun 14, 2017

@thockin

I re-tested and found something different. SNAT is not required for cross-host communication, so it won't break network policy. It's really a big finding to me :)

I found that the packet is dropped when a container visits the VIP and the real backend is itself. I have no idea why yet, but I will try to find out.

@m1093782566
Author

m1093782566 commented Jun 15, 2017

I will update the proposal and try to address the review comments in the newer proposal.

@thockin I will add pseudo-code explaining how the resync loop works. Thanks.
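A compilable Go sketch of what such a resync loop could look like (the `vserver` type and `ipvsInterface` are hypothetical stand-ins, not the real kube-proxy or libnetwork APIs); adds and in-place updates are applied first and stale entries are deleted last, so existing virtual servers are never torn down just to be re-added:

```go
package ipvsproxy

// vserver describes an IPVS virtual server in this sketch.
type vserver struct {
	addr      string // VIP:port
	scheduler string // rr, wlc, sh, ...
}

// ipvsInterface is a hypothetical stand-in for the kernel IPVS API.
type ipvsInterface interface {
	GetServices() ([]vserver, error)
	AddService(vserver) error
	UpdateService(vserver) error
	DelService(vserver) error
}

// syncProxyRules reconciles the kernel state with the desired state built
// from the in-memory service/endpoints cache.
func syncProxyRules(ipvs ipvsInterface, desired map[string]vserver) error {
	current, err := ipvs.GetServices()
	if err != nil {
		return err
	}
	existing := make(map[string]vserver, len(current))
	for _, s := range current {
		existing[s.addr] = s
	}
	for addr, want := range desired {
		got, ok := existing[addr]
		switch {
		case !ok:
			if err := ipvs.AddService(want); err != nil {
				return err
			}
		case got != want:
			if err := ipvs.UpdateService(want); err != nil {
				return err
			}
		}
		// Destinations (endpoints) for each virtual server would be
		// reconciled with the same add/update/delete pattern.
	}
	// Delete only the stale virtual servers that are no longer desired.
	for addr, got := range existing {
		if _, ok := desired[addr]; !ok {
			if err := ipvs.DelService(got); err != nil {
				return err
			}
		}
	}
	return nil
}
```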

@m1093782566 m1093782566 force-pushed the ipvs-proxy branch 2 times, most recently from c0a5b3e to ebb958f Compare June 16, 2017 07:08
@m1093782566
Author

m1093782566 commented Jun 16, 2017

@thockin @ddysher @cmluciano @murali-reddy I updated the proposal according to the review comments and added more details. PTAL.

Any comments are welcome. Thanks.

/cc @haibinxie @ThomasZhou

@k8s-ci-robot
Contributor

@m1093782566: GitHub didn't allow me to request PR reviews from the following users: haibinxie.

Note that only kubernetes members can review this PR, and authors cannot review their own PRs.

In response to this:

@thockin @ddysher @cmluciano I update the proposals according review comments and add more details. PTAL.

Any comments are welcomed. Thanks.

/cc @haibinxie

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@m1093782566 m1093782566 force-pushed the ipvs-proxy branch 2 times, most recently from 934152c to a552531 Compare June 16, 2017 07:39
@m1093782566
Author

m1093782566 commented Jun 16, 2017

@thockin

How do you prevent dropped packets during updates?

How will you do changes to Services without downtime? If I change the session affinity, for example, you shouldn't take a service down to change it.

According to @dhilipkumars's test result, an ipvs update did not disrupt the service or even the existing connection:

sudo ipvsadm -L -n
[sudo] password for d:
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.11.12.13:6379 wlc persistent 10
  -> 10.192.0.1:32768             Masq    1      1          0

The real backend service is redis-alpine. Connecting to the service:

docker run --net=host -it redis:3.0-alpine redis-cli -h 10.11.12.13 -p 6379
10.11.12.13:6379> info Clients
# Clients
connected_clients:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

In a parallel session, if I update the scheduler algorithm or the persistence timeout, the service is not disrupted. Ipvsadm and libnetwork's ipvs pkg work on the same principle, firing netlink messages to the kernel, so the behaviour should be the same.

Updating the persistence timeout: still no disruption.

$sudo ipvsadm -E -t 10.11.12.13:6379 -p 60
$sudo ipvsadm -L -n --persistent-conn
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port            Weight    PersistConn ActiveConn InActConn
  -> RemoteAddress:Port
TCP  10.11.12.13:6379 wlc persistent 60
  -> 10.192.0.1:32768             1         1           1          0

What other parameters should we test?

@haibinxie

@thockin Could you confirm whether @m1093782566's comment above is the right thing to do, and whether anything else is left before closing it? Please expect me to keep bothering you until it's closed :)

If you get a chance, we can have a quick phone call to review and address the issues on this; we want to commit to releasing the feature in 1.8.

@m1093782566
Author

@danwinship Do you have interest in reviewing this design proposal? Any comments are welcome. :)

@m1093782566
Author

m1093782566 commented Aug 14, 2017

Hi @thockin @ddysher

I just came up with an idea for implementing NodePort-type services via ipvs.

Can we take all the IP addresses that match `ADDRTYPE match dst-type LOCAL` as the addresses of the ipvs service? For example,

[root@100-106-179-225 ~]# ip route show table local type local
100.106.179.225 dev eth0  proto kernel  scope host  src 100.106.179.225 
127.0.0.0/8 dev lo  proto kernel  scope host  src 127.0.0.1 
127.0.0.1 dev lo  proto kernel  scope host  src 127.0.0.1 
172.16.0.0 dev flannel.1  proto kernel  scope host  src 172.16.0.0 
172.17.0.1 dev docker0  proto kernel  scope host  src 172.17.0.1 
192.168.122.1 dev virbr0  proto kernel  scope host  src 192.168.122.1 

Then, [100.106.179.225, 127.0.0.0/8, 127.0.0.1, 172.16.0.0, 172.17.0.1, 192.168.122.1] would be the addresses of the ipvs services for a NodePort service.

I assume the KUBE-NODEPORTS chain created by the iptables proxier does the same thing? For example,

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination         
KUBE-NODEPORTS  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

I am not opposed to implementing NodePort services via iptables; I just want to know whether the approach mentioned above makes sense. Or am I wrong?

Looking forward to receiving your opinions.

@m1093782566
Author

m1093782566 commented Aug 14, 2017

By doing this, I think we can remove the design constraint of assuming that the node IP is the address of `eth{x}`?

@m1093782566
Author

m1093782566 commented Aug 14, 2017

@feiskyer Do you have bandwidth to take a look at this proposal? Thanks :)

@feiskyer
Member

Using a list of IP addresses for nodePort has potential problems, e.g. IP addresses may change or new NICs may be added later. And I don't think watching for changes of IP addresses and NICs is a good idea.

Maybe using iptables for nodePort services is a better choice.

@m1093782566
Author

Glad to receive your feedback, @feiskyer

And I don't think watching for changes of IP addresses and NICs is a good idea.

Yes, I agree. Thanks for your thoughts.

@k8s-github-robot k8s-github-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 15, 2017
@luxas
Member

luxas commented Oct 8, 2017

ping @kubernetes/sig-network-feature-requests
Any movement here lately?

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 8, 2017
@haibinxie

@haibinxie

@luxas Alpha in 1.8; we are working on the beta release in 1.9.

@castrojo
Member

This change is Reviewable

@cmluciano

/keep-open

@spiffxp
Member

spiffxp commented Dec 14, 2017

/lifecycle frozen
@cmluciano I'm keeping this open on your behalf, if this is no longer relevant to keep open please /remove-lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Dec 14, 2017
@cmluciano

@m1093782566 Is there a PR that supersedes this one?

@m1093782566
Author

@cmluciano

NO.

This PR is the only design proposal. The IPVS proxier has already reached beta while this document is still pending, unfortunately.

@m1093782566
Author

/close

danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023