
Fix Multiple Reductions in MTU #1086

Merged

Conversation

@aauren (Collaborator) commented May 17, 2021

Fixes kube-router so that it mostly leaves MTU alone. I'm not 100% sure about this PR, as these MTU reductions seem to have come out of #102, #108, & #109, all of which are a bit light on details about why reducing the MTU saw performance improvements.

What I suspect is that the ip command may not have originally reduced the MTU when building ipip tunnels. However, it does so now, so our own reduction of the MTU results in an unnecessarily small MTU when overlay networking is enabled, which caused #1033.

Proof that creating an ipip device type reduces the MTU of the tunnel link automatically in recent tooling/kernels:

# ip addr
...
3: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
...
# ip tunnel add testtun1 mode ipip local <ip> remote <ip> dev br0
# ip addr
...
65: testtun1@br0: <POINTOPOINT,NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
...

Given that, I don't see why we would reduce the kube-bridge interface ahead of the MTU reduction that will already be part of the tunnel. It might benefit us to reduce the MTU in the CNI configuration in case traffic routes across the pod network or to a k8s service, but we don't do this currently (except if automtu is enabled), so I'm not sure it's worth introducing that now.
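
For illustration only (nothing in this PR changes the CNI config): if we ever did want the pod side of the link to carry a reduced MTU, the bridge CNI plugin accepts an mtu field. The snippet below is a hedged sketch; the file path and surrounding fields are assumptions based on a typical kube-router conflist, trimmed to the relevant plugin entry.

# hypothetical path; adjust to your deployment
cat /etc/cni/net.d/10-kuberouter.conflist
{
  "cniVersion": "0.3.0",
  "name": "mynet",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "kube-bridge",
      "mtu": 1480,
      "ipam": { "type": "host-local" }
    }
  ]
}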

@aauren (Collaborator, Author) commented May 17, 2021

@zerkms any chance you would be willing to test this patch in your environment and see if it has the effect you're looking for? We don't use overlay networks in any clusters that I have access to.

If you're willing, it would be exceptionally helpful if you would run iperf before and after this patch against the pod and through a service to a pod on a node running this patch. That way we could be sure that we weren't causing a regression.

@aauren force-pushed the fix_multiple_reductions_in_mtu branch from 6c110b4 to 8ba3da9 on May 17, 2021 21:33
@zerkms (Contributor) commented May 17, 2021

@aauren would it be possible to have a dedicated image built with that patch? (available on docker hub)

@aauren (Collaborator, Author) commented May 18, 2021

@zerkms (Contributor) commented May 18, 2021

@aauren awesome, please allow me a couple of days to schedule that.

@zerkms (Contributor) commented May 21, 2021

@aauren I only deployed it to a cluster without ipip overlay (but I verified #1056 (comment), which unfortunately is not fixed).

I have found that a bridge interface is still created with MTU 1480.

4: kube-bridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP group default qlen 1000

Why is that? I use bridged networks to run KVM virtual machines, and I use the default 1500 for both the physical and the bridge interfaces.

@murali-reddy (Member) commented:

@aauren wondering if you really wanted to get rid of accounting for the overlay (when enabled) while calculating the MTU?

I think we will end up in a situation where the kube-bridge interface MTU will be 20 bytes larger than the IPIP tunnel interfaces, which will result in fragmentation.

Fixes kube-router so that it mostly leaves MTU alone. I'm not 100% sure about this PR, as these MTU reductions seem to have come out of #102, #108, & #109, all of which are a bit light on details about why reducing the MTU saw performance improvements.

I believe reducing the MTU to avoid fragmentation was the reason for the perf improvements.

@aauren (Collaborator, Author) commented May 24, 2021

@murali-reddy I'll admit that I'm not the most knowledgeable person about the ipip tunnels that Linux builds out. However, it seems like the ip tooling is already accounting for the MTU reduction on the tunnel side of the connection. So when we reduce the MTU ahead of time, it appears as though the MTU gets reduced by double the amount it should. This led me to believe that reducing it ahead of time was an error on our part.
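
To make the double reduction concrete, here is a minimal sketch using plain ip commands (the tunnel names and 192.0.2.x addresses are placeholders; it assumes a kube-bridge link already exists). The kernel already subtracts 20 bytes for the outer IPv4 header when the tunnel is created, so lowering kube-bridge first stacks a second reduction on top of it.

# bridge pre-lowered to 1480 (current behaviour before this PR)
ip link set dev kube-bridge mtu 1480
ip tunnel add demotun1 mode ipip local 192.0.2.10 remote 192.0.2.20 dev kube-bridge
ip -o link show demotun1      # expect: mtu 1460 (1480 - 20)

# bridge left at 1500 (behaviour with this PR)
ip link set dev kube-bridge mtu 1500
ip tunnel add demotun2 mode ipip local 192.0.2.10 remote 192.0.2.21 dev kube-bridge
ip -o link show demotun2      # expect: mtu 1480 (1500 - 20)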

Thinking about this a bit, I think that we have the following flows that will behave differently based upon whether or not we reduce the MTU ahead of time. Let me know what you think @murali-reddy:

Summary / TL;DR

MTU preemptively set on kube-bridge has the chance to avoid fragmentation and make the communication more performant in the following scenarios:

  • Pod -> IPVS Remote Endpoint Service
  • Pod -> Remote Destination Pod

MTU preemptively set on kube-bridge has the chance to make communication less performant in the following scenarios:

  • Pod -> Local Destination Pod
  • Pod -> BGP ECMP Service
  • Pod -> IPVS Local Endpoint Service

MTU preemptively set on kube-bridge has the chance to cause fragmentation and make the communication less performant in the following scenarios:

  • External -> Service

Scenarios for Consideration

For these experiments we'll assume that:

  • the CNI and kube-bridge interfaces are configured in pairs so that they aren't skewed
  • the MTU isn't being subtracted twice as it currently is (i.e. we assume that after creating the IPIP tunnel the link MTU is manually increased by 20 to undo the extra reduction), since the current double subtraction essentially forces fragmentation no matter what
    • i.e. if kube-bridge is set to 1480 MTU and the ipip link connected to kube-bridge is set to 1460 MTU, fragmentation will have to occur because bytes 1460 to 1480 are now used for the new IP header, so a packet at the maximum size will be fragmented into 2 packets

External -> Service

  • will ingress at 1500 MTU via the node's primary physical interface
  • will be accepted by kube-dummy-if at 1500 MTU and processed by IPVS which will translate traffic to a pod IP addr
  • it will then traverse kube-bridge where
    • if the MTU has been reduced to 1480, this will possibly result in fragmentation
    • if the kube-bridge has been left alone it will have no effect
  • it will then enter the pod's namespace where its side of the tunnel will be configured by the CNI configuration
    • if the MTU has been reduced on the kube-bridge we'll assume the pod's side of the link will also be 1480 MTU and the packet will continue unchanged
    • if the MTU has not been reduced on the kube-bridge we'll assume the pod's side of the link will also be 1500 MTU and the packet will continue unchanged
  • pod processes the traffic and egresses

Results

MTU preemptively set on kube-bridge to 1480 has the chance to cause fragmentation unnecessarily

Pod -> IPVS Local Endpoint Service

  • traffic will egress from the pod to its namespace link at:
    • 1480 MTU if pre MTU reduction has been enabled
    • 1500 MTU if pre MTU reduction has not been enabled
  • traffic will emerge from pod's namespace to kube-bridge also set to MTU of link, packet will be left as is
  • traffic will be accepted by kube-dummy-if at 1500 MTU, frame will be enlarged if pre MTU reduction has been enabled but no fragmentation can occur
  • traffic will be processed by IPVS which will translate to a pod IP addr
  • traffic will be accepted by local node by kube-bridge and:
    • will be reduced back to 1480 MTU if pre MTU reduction has been enabled, however, since that part of the packet was unused it will not result in fragmentation
    • will be left unchanged at 1500 MTU if pre MTU reduction was not enabled

Results

MTU preemptively set on kube-bridge reduces the performance of traffic that could otherwise be satisfied by 1500 MTU

Pod -> IPVS Remote Endpoint Service

  • traffic will egress from the pod to its namespace link at:
    • 1480 MTU if pre MTU reduction has been enabled
    • 1500 MTU if pre MTU reduction has not been enabled
  • traffic will emerge from pod's namespace to kube-bridge also set to MTU of link, packet will be left as is
  • traffic will be accepted by kube-dummy-if at 1500 MTU, frame will be enlarged if pre MTU reduction has been enabled but no fragmentation can occur
  • traffic will be processed by IPVS which will translate to a pod IP addr
  • traffic will be accepted by an IPIP tunnel interface and:
    • if pre MTU reduction was enabled, no fragmentation should occur as bytes 1480 to 1500 were unused and blank
    • if pre MTU reduction was not enabled, fragmentation has the chance to occur if packet was at its maximum transmit size
  • <further analysis of this flow is unimportant as it follows similar flows already detailed>

Results

MTU preemptively set on kube-bridge has the chance to avoid fragmentation and allow communication to be more performant

Pod -> BGP ECMP Service

  • traffic will egress from the pod to its namespace link at:
    • 1480 MTU if pre MTU reduction has been enabled
    • 1500 MTU if pre MTU reduction has not been enabled
  • traffic will emerge from pod's namespace to kube-bridge also set to MTU of link, packet will be left as is
  • traffic will be accepted by the node's physical interface at 1500 MTU, frame will be enlarged if pre MTU reduction has been enabled but no fragmentation can occur
  • will ingress remote node at 1500 MTU via the node's primary physical interface
  • will be accepted by kube-dummy-if at 1500 MTU and processed by IPVS which will translate traffic to a pod IP addr
  • it will then traverse kube-bridge where
    • if the MTU has been reduced to 1480 no fragmentation will occur as these bytes are empty
    • if the kube-bridge has been left alone it will have no effect, packet will stay at 1500 MTU
  • it will then enter the pod's namespace where its side of the tunnel will be configured by the CNI configuration
    • if the MTU has been reduced on the kube-bridge we'll assume the pod's side of the link will also be 1480 MTU and the packet will continue unchanged
    • if the MTU has not been reduced on the kube-bridge we'll assume the pod's side of the link will also be 1500 MTU and the packet will continue unchanged
  • pod processes the traffic and egresses

Results

MTU preemptively set on kube-bridge has the chance to make communication less performant as communication could have stayed at 1500 MTU without fragmentation

Pod -> Local Destination Pod

  • traffic will egress from the pod to its namespace link at:
    • 1480 MTU if pre MTU reduction has been enabled
    • 1500 MTU if pre MTU reduction has not been enabled
  • traffic will emerge from pod's namespace to kube-bridge also set to MTU of link, packet will be left as is
  • traffic will route into local destination pod's namespace via kube-bridge also set to MTU of link, packet will be left as is
  • traffic will be received into the local destination pod's namespace at the same MTU, packet will be left as is
  • pod will process traffic and return

Results

MTU preemptively set on kube-bridge has the chance to make communication less performant as communication could have stayed at 1500 MTU without fragmentation

Pod -> Remote Destination Pod

  • traffic will egress from the pod to its namespace link at:
    • 1480 MTU if pre MTU reduction has been enabled
    • 1500 MTU if pre MTU reduction has not been enabled
  • traffic will emerge from pod's namespace to kube-bridge also set to MTU of link, packet will be left as is
  • traffic will be accepted by an IPIP tunnel interface and:
    • if pre MTU reduction was enabled, no fragmentation should occur
    • if pre MTU reduction was not enabled, fragmentation has the chance to occur if packet was at its maximum transmit size
  • <further analysis of this flow is unimportant as it follows similar flows already detailed>

Results

MTU preemptively set on kube-bridge has the chance to avoid fragmentation and allow communication to be more performant

@aauren (Collaborator, Author) commented May 24, 2021

@zerkms I'm not sure what happened to the PR build process, but it appears that the version of kube-router in that docker image didn't contain the patches from this PR.

I've pushed over docker.io/cloudnativelabs/kube-router-git:amd64-fix_multiple_reductions_in_mtu with the correct changes and validated that it no longer reduces the MTU in a VM. Please try https://hub.docker.com/layers/cloudnativelabs/kube-router-git/amd64-fix_multiple_reductions_in_mtu/images/sha256-a7c28c97e888e4ca0c2a5b83b1ac8ea35f299d7c8ca0a858d93b9ae3544ac90d?context=explore again.
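
For anyone else following along, a hedged sketch of swapping the test image in (the DaemonSet name, namespace, and container name are assumptions; adjust to your deployment):

kubectl -n kube-system set image daemonset/kube-router \
  kube-router=docker.io/cloudnativelabs/kube-router-git:amd64-fix_multiple_reductions_in_mtu
kubectl -n kube-system rollout status daemonset/kube-router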

@zerkms (Contributor) commented May 25, 2021

@aauren last week I ran the verification for the iptables bug that you reopened (with the "wrong" docker image) - I should not need to rerun it again, right?

And I haven't run the MTU verification yet (but I've scheduled it for tomorrow). I will take that latest image indeed.

I plan to run the following iperf tests:

a) pod - pod, the same network
b) pod - pod, different networks connected via ipip
c) pod - service, the same network
d) pod - service, different networks connected via ipip

Anything else?

@aauren (Collaborator, Author) commented May 25, 2021

I plan to run the following iperf tests:

a) pod - pod, the same network
b) pod - pod, different networks connected via ipip
c) pod - service, the same network
d) pod - service, different networks connected via ipip

If you could also test from a non-Kubernetes node to a service, that would also be helpful.

Also, taking before and after measurements is crucial so that we have something to compare against.

If you would like I could build you a container that included #1090 as well if you wanted to test both in one shot.

@zerkms (Contributor) commented May 25, 2021

@aauren no need for that: my only cluster with overlay networks runs on AWS and serves production traffic, so I can only run fixes there that affect it directly.

I'm keen to help with verifying #1090 as well, and will do it next week in a test cluster (as that bug does not require ipip) :-)

Thanks for what you do for the project!

@zerkms (Contributor) commented May 27, 2021

@aauren

here are the results (I ran every case more than 2 times, but the numbers were close enough, so I include only 2 runs per case)

old version: 1.1.1

MTUs: physical interface - 9001, bridge - 8981, tunnel interface - 8961
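
(For reference, a sketch of how these MTUs can be read on a node; the interface names are assumptions, e.g. the primary NIC may be eth0 or ens5 depending on the instance type:)

ip -o link show eth0 | grep -o 'mtu [0-9]*'
ip -o link show kube-bridge | grep -o 'mtu [0-9]*'
ip -d link show type ipip | grep -o 'mtu [0-9]*'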

1. Same AZ (no tunnel), pod-pod

root@iperf-client-dc485c95d-gs25b:/# iperf -c 10.70.68.225
------------------------------------------------------------
Client connecting to 10.70.68.225, TCP port 5001
TCP window size: 3.49 MByte (default)
------------------------------------------------------------
[  3] local 10.70.70.139 port 50990 connected with 10.70.68.225 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  5.29 GBytes  4.54 Gbits/sec
root@iperf-client-dc485c95d-gs25b:/# iperf -c 10.70.68.225
------------------------------------------------------------
Client connecting to 10.70.68.225, TCP port 5001
TCP window size: 4.00 MByte (default)
------------------------------------------------------------
[  3] local 10.70.70.139 port 51234 connected with 10.70.68.225 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  5.33 GBytes  4.58 Gbits/sec

2. Same AZ (no tunnel), pod-svc

root@iperf-client-dc485c95d-gs25b:/# iperf -c 10.70.187.228
------------------------------------------------------------
Client connecting to 10.70.187.228, TCP port 5001
TCP window size: 1.71 MByte (default)
------------------------------------------------------------
[  3] local 10.70.70.139 port 48688 connected with 10.70.187.228 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  5.30 GBytes  4.55 Gbits/sec
root@iperf-client-dc485c95d-gs25b:/# iperf -c 10.70.187.228
------------------------------------------------------------
Client connecting to 10.70.187.228, TCP port 5001
TCP window size: 2.79 MByte (default)
------------------------------------------------------------
[  3] local 10.70.70.139 port 49462 connected with 10.70.187.228 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  5.31 GBytes  4.56 Gbits/sec

3. Different AZ (tunnel), pod-pod

root@iperf-client-db6b46ccf-bw2vv:/# iperf -c 10.70.68.228
------------------------------------------------------------
Client connecting to 10.70.68.228, TCP port 5001
TCP window size: 1008 KByte (default)
------------------------------------------------------------
[  3] local 10.70.69.246 port 51648 connected with 10.70.68.228 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  4.91 GBytes  4.22 Gbits/sec
root@iperf-client-db6b46ccf-bw2vv:/# iperf -c 10.70.68.228
------------------------------------------------------------
Client connecting to 10.70.68.228, TCP port 5001
TCP window size:  878 KByte (default)
------------------------------------------------------------
[  3] local 10.70.69.246 port 52112 connected with 10.70.68.228 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  4.99 GBytes  4.28 Gbits/sec

4. Different AZ (tunnel), pod-svc

root@iperf-client-db6b46ccf-7pt8c:/# iperf -c 10.70.175.218 -p 5002
------------------------------------------------------------
Client connecting to 10.70.175.218, TCP port 5002
TCP window size:  942 KByte (default)
------------------------------------------------------------
[  3] local 10.70.69.247 port 60660 connected with 10.70.175.218 port 5002
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  4.83 GBytes  4.15 Gbits/sec
root@iperf-client-db6b46ccf-7pt8c:/# iperf -c 10.70.175.218 -p 5002
------------------------------------------------------------
Client connecting to 10.70.175.218, TCP port 5002
TCP window size:  942 KByte (default)
------------------------------------------------------------
[  3] local 10.70.69.247 port 32862 connected with 10.70.175.218 port 5002
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  4.86 GBytes  4.18 Gbits/sec

5. outside (vpn, my laptop)

11:43:31 in ~ took 24s
➜ iperf -c 10.70.68.228
------------------------------------------------------------
Client connecting to 10.70.68.228, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 10.250.1.2 port 50094 connected with 10.70.68.228 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  94.4 MBytes  78.8 Mbits/sec

11:43:43 in ~ took 10s
➜ iperf -c 10.70.68.228
------------------------------------------------------------
Client connecting to 10.70.68.228, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 10.250.1.2 port 50198 connected with 10.70.68.228 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  93.6 MBytes  78.3 Mbits/sec

new version: v1.2.1-47-g3effa257, built on 2021-05-24T13:06:46-0500, go1.16.4

MTUs: physical interface - 9001, bridge - 9001, tunnel interface - 8981

1. Same AZ (no tunnel), pod-pod

root@iperf-client-dc485c95d-md7gq:/# iperf -c 10.70.68.233
------------------------------------------------------------
Client connecting to 10.70.68.233, TCP port 5001
TCP window size: 3.59 MByte (default)
------------------------------------------------------------
[  3] local 10.70.70.145 port 41222 connected with 10.70.68.233 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  5.20 GBytes  4.47 Gbits/sec
root@iperf-client-dc485c95d-md7gq:/# iperf -c 10.70.68.233
------------------------------------------------------------
Client connecting to 10.70.68.233, TCP port 5001
TCP window size: 3.05 MByte (default)
------------------------------------------------------------
[  3] local 10.70.70.145 port 41744 connected with 10.70.68.233 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  5.38 GBytes  4.62 Gbits/sec

2. Same AZ (no tunnel), pod-svc

root@iperf-client-dc485c95d-md7gq:/# iperf -c 10.70.134.148 -p 5002
------------------------------------------------------------
Client connecting to 10.70.134.148, TCP port 5002
TCP window size: 3.46 MByte (default)
------------------------------------------------------------
[  3] local 10.70.70.145 port 45792 connected with 10.70.134.148 port 5002
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  5.21 GBytes  4.48 Gbits/sec
root@iperf-client-dc485c95d-md7gq:/# iperf -c 10.70.134.148 -p 5002
------------------------------------------------------------
Client connecting to 10.70.134.148, TCP port 5002
TCP window size: 3.65 MByte (default)
------------------------------------------------------------
[  3] local 10.70.70.145 port 46290 connected with 10.70.134.148 port 5002
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  5.03 GBytes  4.32 Gbits/sec

3. Different AZ (tunnel), pod-pod

root@iperf-client-db6b46ccf-42mb5:/# iperf -c 10.70.68.233
------------------------------------------------------------
Client connecting to 10.70.68.233, TCP port 5001
TCP window size:  812 KByte (default)
------------------------------------------------------------
[  3] local 10.70.69.249 port 53394 connected with 10.70.68.233 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  5.06 GBytes  4.34 Gbits/sec
root@iperf-client-db6b46ccf-42mb5:/# iperf -c 10.70.68.233
------------------------------------------------------------
Client connecting to 10.70.68.233, TCP port 5001
TCP window size: 1008 KByte (default)
------------------------------------------------------------
[  3] local 10.70.69.249 port 53828 connected with 10.70.68.233 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  5.10 GBytes  4.38 Gbits/sec

4. Different AZ (tunnel), pod-svc

root@iperf-client-db6b46ccf-xw88x:/# iperf -c 10.70.134.148 -p 5002
------------------------------------------------------------
Client connecting to 10.70.134.148, TCP port 5002
TCP window size:  780 KByte (default)
------------------------------------------------------------
[  3] local 10.70.69.250 port 33326 connected with 10.70.134.148 port 5002
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  5.01 GBytes  4.31 Gbits/sec
root@iperf-client-db6b46ccf-xw88x:/# iperf -c 10.70.134.148 -p 5002
------------------------------------------------------------
Client connecting to 10.70.134.148, TCP port 5002
TCP window size:  975 KByte (default)
------------------------------------------------------------
[  3] local 10.70.69.250 port 33724 connected with 10.70.134.148 port 5002
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  5.15 GBytes  4.43 Gbits/sec

5. outside (vpn, my laptop)

12:05:46 in ~
➜ iperf -c 10.70.134.148 -p 5002
------------------------------------------------------------
Client connecting to 10.70.134.148, TCP port 5002
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 10.250.1.2 port 40132 connected with 10.70.134.148 port 5002
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  95.9 MBytes  80.1 Mbits/sec

12:13:25 in ~ took 10s
➜ iperf -c 10.70.134.148 -p 5002
------------------------------------------------------------
Client connecting to 10.70.134.148, TCP port 5002
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 10.250.1.2 port 40140 connected with 10.70.134.148 port 5002
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  95.0 MBytes  79.6 Mbits/sec

@aauren (Collaborator, Author) commented May 28, 2021

Thanks @zerkms!

I appreciate you taking the time to test it out. From going over the results briefly, the two look very similar: not really any performance gained or lost. Is that about what you got from your tests as well?

@murali-reddy What are your thoughts? Do you want to / have time to do your own iperf testing with and without this patch? Any thoughts on my use cases?

@zerkms (Contributor) commented May 29, 2021

Is that about what you got from your tests as well?

Yep, I'm not a professional network engineer, so I don't know how to conduct more precise tests, but from those iperf runs it looks the same.

@kailunshi (Contributor) commented:

@murali-reddy I'll admit that I'm not the most knowledgeable person about the ipip tunnels that Linux builds out. However, it seems like the ip tooling is already accounting for the MTU reduction on the tunnel side of the connection. So when we reduce the MTU ahead of time, it appears as though the MTU gets reduced by double the amount it should. This led me to believe that reducing it ahead of time was an error on our part.

@aauren Second this. I was going to report this double MTU reduction but was wondering if maybe on some older Linux the ip tool doesn't account for the 20 bytes. But this applies to the overlay IPIP tunnel interface.

The kube-bridge and pod MTU, though, should be a separate issue (handled somewhere else in the code).

Can we merge and release this PR sooner and work on the kube-bridge/pod MTU in a separate issue, please? Thanks.

@aauren force-pushed the fix_multiple_reductions_in_mtu branch from 8ba3da9 to 24d7721 on June 11, 2022 17:15
@aauren (Collaborator, Author) commented Jun 11, 2022

@kailunshi This one is sticky; it's going to take me a while to get back into this thread and fully test.

I just updated this PR so that it is actually mergeable, but it's been so long now that it's going to be a bit before I can regain the context and look into merging it.

@aauren force-pushed the fix_multiple_reductions_in_mtu branch from 24d7721 to d1211de on June 24, 2022 21:00
@aauren (Collaborator, Author) commented Jun 24, 2022

After looking through this issue yet again, I still think it makes more sense not to incur the double MTU reduction. I went ahead and reproduced the same set of tests that @zerkms ran in their environment and found that the same holds true: for all conceivable flows, the performance is either the same or just slightly better without the MTU reduction.

Results from iperf3 running with -u -b 0 and only grabbing from the summary line

 Without MTU Reduction:
	From External -> Service: AVG 2.27 Gbits/sec
		Fri Jun 24 22:13:27 UTC 2022 - 2.31 Gbits/sec
		Fri Jun 24 22:13:37 UTC 2022 - 2.35 Gbits/sec
		Fri Jun 24 22:13:47 UTC 2022 - 2.19 Gbits/sec
		Fri Jun 24 22:13:57 UTC 2022 - 2.25 Gbits/sec
		Fri Jun 24 22:14:07 UTC 2022 - 2.25 Gbits/sec
	From Pod -> Service: 1.30 Gbits/sec
		Fri Jun 24 22:26:48 UTC 2022 - 1.33 Gbits/sec
		Fri Jun 24 22:27:02 UTC 2022 - 1.35 Gbits/sec
		Fri Jun 24 22:27:16 UTC 2022 - 1.32 Gbits/sec
		Fri Jun 24 22:27:30 UTC 2022 - 1.23 Gbits/sec
		Fri Jun 24 22:27:44 UTC 2022 - 1.29 Gbits/sec
	From Pod -> Pod Same Node: 2.24 Gbits/sec
		Fri Jun 24 22:29:57 UTC 2022 - 2.24 Gbits/sec
		Fri Jun 24 22:30:11 UTC 2022 - 2.28 Gbits/sec
		Fri Jun 24 22:30:25 UTC 2022 - 2.22 Gbits/sec
		Fri Jun 24 22:30:39 UTC 2022 - 2.22 Gbits/sec
		Fri Jun 24 22:30:53 UTC 2022 - 2.23 Gbits/sec
	From Pod -> Pod Different Node: 1.01 Gbits/sec


With MTU Reduction:
	From External -> Service: AVG 2.20 Gbits/sec
		Fri Jun 24 22:39:59 UTC 2022 - 2.22 Gbits/sec
		Fri Jun 24 22:40:09 UTC 2022 - 2.14 Gbits/sec
		Fri Jun 24 22:40:19 UTC 2022 - 2.25 Gbits/sec
		Fri Jun 24 22:40:29 UTC 2022 - 2.27 Gbits/sec
		Fri Jun 24 22:40:39 UTC 2022 - 2.14 Gbits/sec
	From Pod -> Service: 1.22 Gbits/sec
		Fri Jun 24 22:41:50 UTC 2022 - 1.23 Gbits/sec
		Fri Jun 24 22:42:04 UTC 2022 - 1.21 Gbits/sec
		Fri Jun 24 22:42:18 UTC 2022 - 1.22 Gbits/sec
		Fri Jun 24 22:42:31 UTC 2022 - 1.22 Gbits/sec
		Fri Jun 24 22:42:45 UTC 2022 - 1.22 Gbits/sec
	From Pod -> Pod Same Node: 1.99 Gbits/sec
		Fri Jun 24 22:44:11 UTC 2022 - 2.00 Gbits/sec
		Fri Jun 24 22:44:25 UTC 2022 - 1.94 Gbits/sec
		Fri Jun 24 22:44:39 UTC 2022 - 2.02 Gbits/sec
		Fri Jun 24 22:44:53 UTC 2022 - 2.01 Gbits/sec
		Fri Jun 24 22:45:07 UTC 2022 - 2.00 Gbits/sec
	From Pod -> Pod Different Node: 1.01 Gbits/sec

While I still stand by my comment here: #1086 (comment) that there may be flows that are less performant because they become fragmented without the MTU reduction, I think they are in the minority compared to the flows that see performance increases without it. Additionally, in practice I'm unable to find a situation where increasing the MTU causes a performance regression.

Given all of that, I'm going to merge this PR.

@aauren merged commit f97eb7c into cloudnativelabs:master on Jun 24, 2022