Fix Multiple Reductions in MTU #1086
Conversation
@zerkms any chance you would be willing to test this patch in your environment and see if it has the effect you're looking for? We don't use overlay networks in any clusters that I have access to. If you're willing, it would be exceptionally helpful if you would run iperf before and after this patch, both directly against a pod and through a service to a pod, on a node running this patch. That way we could be sure that we weren't causing a regression.
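For reference, a rough sketch of the kind of before/after comparison being asked for, assuming iperf3 is available in the pods; the pod IP, service IP, and port below are placeholders, not values from any real cluster:

```sh
# Inside the target pod (on a node running the patch): start an iperf3 server
iperf3 -s

# From a client pod: measure pod -> pod throughput directly
iperf3 -c <target-pod-ip> -t 30

# From the same client pod: measure pod -> service -> pod throughput
iperf3 -c <service-cluster-ip> -p <service-port> -t 30

# Repeat both runs before and after applying the patch and compare the results
```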
Force-pushed from 6c110b4 to 8ba3da9
@aauren would it be possible to have a dedicated image built with that patch? (available on Docker Hub)
@aauren awesome, please allow me a couple of days to schedule that.
@aauren I only deployed it to a cluster without overlay networks, and I have found that a bridge interface is still created with MTU 1480.
Why is that? I use bridged networks to run KVM virtual machines, and I use the default 1500 for both the physical and the bridge interface.
@aauren wondering if you really wanted to get rid of accounting for the overlay (when it is enabled) when calculating the MTU? I think we will end up in a situation where
I believe reducing the MTU to avoid fragmentation was the reason for the perf improvements.
@murali-reddy I'll admit that I'm not the most knowledgeable person about the ipip tunnels that Linux builds out. However, it seems like the ip tooling is already accounting for the MTU reduction on the tunnel side of the connection. So when we reduce the MTU ahead of time, it appears as though the MTU gets reduced by double the amount it should (see the MTU arithmetic sketch after the scenarios below). This led me to believe that reducing it ahead of time was an error on our part. Thinking about this a bit, I think that we have the following flows that will behave differently based upon whether or not we reduce the MTU ahead of time. Let me know what you think @murali-reddy:

Summary/TL;DR

MTU preemptively set on kube-bridge has the chance to avoid fragmentation and make the communication more performant in the following scenarios:
MTU preemptively set on kube-bridge has the chance to make communication less performant in the following scenarios:
MTU preemptively set on kube-bridge has the chance to cause fragmentation and make the communication less performant in the following scenarios:
Scenarios for Consideration

For these experiments we'll assume that:
External -> Service
Results: MTU preemptively set on kube-bridge to 1480 has the chance to cause fragmentation unnecessarily.

Pod -> IPVS Local Endpoint Service
Results: MTU preemptively set on kube-bridge reduces the performance of traffic that could otherwise be satisfied by a 1500 MTU.

Pod -> IPVS Remote Endpoint Service
Results: MTU preemptively set on kube-bridge has the chance to avoid fragmentation and allow communication to be more performant.

Pod -> BGP ECMP Service
Results: MTU preemptively set on kube-bridge has the chance to make communication less performant, as communication could have stayed at 1500 MTU without fragmentation.

Pod -> Local Destination Pod
Results: MTU preemptively set on kube-bridge has the chance to make communication less performant, as communication could have stayed at 1500 MTU without fragmentation.

Pod -> Remote Destination Pod
Results: MTU preemptively set on kube-bridge has the chance to avoid fragmentation and allow communication to be more performant.
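To make the double reduction concrete, here is a rough sketch of the arithmetic, assuming a 1500-byte physical MTU and the standard 20-byte IPv4-in-IPv4 encapsulation overhead (the numbers are illustrative, not measurements from a real cluster):

```sh
# Double-reduction arithmetic for a 1500-byte physical interface.
PHYS_MTU=1500
IPIP_OVERHEAD=20   # outer IPv4 header added by ipip encapsulation

# With the pre-emptive reduction: kube-bridge gets 1480, and the ip tooling
# subtracts another 20 bytes when it builds the tunnel.
echo "bridge: $(( PHYS_MTU - IPIP_OVERHEAD ))"                   # 1480
echo "tunnel: $(( PHYS_MTU - IPIP_OVERHEAD - IPIP_OVERHEAD ))"   # 1460

# Without the pre-emptive reduction (this PR): kube-bridge stays at 1500 and
# only the tunnel loses the 20 bytes it actually needs.
echo "bridge: $PHYS_MTU"                                         # 1500
echo "tunnel: $(( PHYS_MTU - IPIP_OVERHEAD ))"                   # 1480
```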
@zerkms I'm not sure what happened to the PR build process, but it appears that the version of kube-router in that docker image didn't contain the patches from this PR. I've pushed over
@aauren last week I conducted the verification for the iptables bug that you reopened (with the "wrong" docker image) - I should not rerun it again, right? And I haven't run the MTU verification yet (but I have scheduled it for tomorrow). I will take that latest image indeed. I plan to run the following iperf tests: a) pod - pod, the same network. Anything else?
If you could also test from a non-Kubernetes node to a service, that would also be helpful. Taking before and after measurements is crucial so that we have something to compare against. If you would like, I could build you a container that includes #1090 as well if you wanted to test both in one shot.
@aauren no need for that: my only cluster with overlay networks runs on AWS and serves production traffic: I can only run fixes there that affect it directly. I'm keen to help with verifying #1090 as well, and will do it next week in a test cluster (as that bug does not require ipip) :-) Thanks for what you do for the project!
here are the results (for every case I have run more than 2 times, but they were close enough, so I include only 2 runs for every case)

old version: 1.1.1
MTUs: physical interface - 9001, bridge - 8981, tunnel interface - 8961

1. Same AZ (no tunnel), pod-pod
2. Same AZ (no tunnel), pod-svc
3. Different AZ (tunnel), pod-pod
4. Different AZ (tunnel), pod-svc
5. outside (vpn, my laptop)
new version: v1.2.1-47-g3effa257, built on 2021-05-24T13:06:46-0500, go1.16.4
MTUs: physical interface - 9001, bridge - 9001, tunnel interface - 8981

1. Same AZ (no tunnel), pod-pod
2. Same AZ (no tunnel), pod-svc
3. Different AZ (tunnel), pod-pod
4. Different AZ (tunnel), pod-svc
5. outside (vpn, my laptop)
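For anyone wanting to reproduce the MTU readings above, a quick way to check them on a node (the interface names here are common defaults and may differ per environment):

```sh
# MTU of the physical interface and of kube-bridge
cat /sys/class/net/eth0/mtu
cat /sys/class/net/kube-bridge/mtu

# MTU of any ipip tunnel interfaces created for the overlay
ip -d link show type ipip
```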
Thanks @zerkms! I appreciate you taking the time to test it out. From going over the results briefly, it looks very similar between the two, not really any performance gained or lost. Is that about what you got from your tests as well? @murali-reddy What are your thoughts? Do you want to / have time to do your own iperf testing with and without this patch? Any thoughts on my use-cases?
Yep, I'm not a professional network engineer, so I don't know how to conduct more precise tests, but from those iperf runs it looks the same.
@aauren Second this. I was going to report this double MTU reduction, but was wondering if maybe on some older Linux the ip tool doesn't account for the 20 bytes. But this applies to the overlay IPIP tunnel interface; the kube-bridge and pod MTU should be a separate issue (somewhere else in the code). Can we merge and release this PR sooner, and work on the kube-bridge/pod MTU in a separate issue please? Thanks.
Force-pushed from 8ba3da9 to 24d7721
@kailunshi This one is sticky; it's going to take me a while to get back into this thread to fully test. I just updated this PR so that it is actually mergeable, but it's been so long now that it's going to be a bit before I can get the context again and look into merging it.
Force-pushed from 24d7721 to d1211de
After looking through this issue yet again, I still think it makes more sense not to incur the double MTU reduction. I went ahead and reproduced the same set of tests that @zerkms did in their environment and I find that the same holds true. For all conceivable flows, the performance is either the same or just slightly better without the MTU reduction: Results from iperf3 running with
While I do still hold to my comment here: #1086 (comment) that there may be flows that are less performant because they become fragmented without the MTU reduction, I think that they are in the minority when compared to the flows that see performance increases without the MTU reduction. Additionally, in practice I'm unable to find a situation where increasing the MTU causes a performance regression. Given all of that, I'm going to merge this PR.
Fixes kube-router so that it mostly leaves the MTU alone. I'm not 100% sure about this PR, as these MTU reductions seem to have come out of #102, #108, & #109, all of which are a bit light on details about why reducing the MTU saw performance improvements.
What I suspect may be the case is that it's possible that the ip command may not have originally reduced the MTU when building ipip tunnels. However, it is now doing so, so our reduction of the MTU actually results in a very small MTU size when overlay networking is enabled, which caused #1033.

Proof that creating an ipip device type reduces the MTU of the tunnel link automatically in recent tooling/kernels (a reproduction sketch is shown below):

Given that, I don't see why we would reduce the kube-bridge interface ahead of the MTU reduction that will already be part of the tunnel. It is possible that it might benefit us to reduce the MTU on the CNI configuration in case traffic routes across the pod network or to a k8s service, but we don't do this currently (except if auto-mtu is enabled), so I'm not sure it's worth introducing that now.