Add support to toggle TCP early demux #1212

Merged: 1 commit, Sep 16, 2020

SaranBalaji90 commented Sep 13, 2020

What type of PR is this?
bug

Which issue does this PR fix:

What does this PR do / Why do we need it:
This PR adds support for toggling tcp_early_demux. The change is required to support liveness/readiness checks on pods that use their own security groups. Unlike regular pods, kubelet traffic to pods using security groups doesn't reach the pod's host veth directly: it goes out of eth0 and then comes back in through the trunk ENI. At that point, response packets from the pod (SYN-ACK or ACK) are dropped because of the tcp_early_demux setting.

The two scenarios in which packets were dropped are:

  1. From kubelet to pod-eni: kubelet sent a SYN packet and the pod responded with SYN-ACK, but the kernel dropped the packet after receiving it on the host veth device.
  2. From pod-eni to kubelet: pod-eni sent an ACK in response to the SYN-ACK from kubelet, but the kernel again dropped the ACK packet after the host veth device.

tcp_early_demux is enabled by default. Disabling tcp_early_demux adds 10-15 ms to overall packet throughput.
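
A minimal sketch of what such a toggle might look like in an init script. It maps a true/false "disable" setting onto the value written to net.ipv4.tcp_early_demux. The function names and the DRY_RUN switch are hypothetical and exist only for illustration; this is not the code merged in scripts/init.sh.

```shell
#!/bin/sh
# Hypothetical helpers; the names and DRY_RUN are assumptions for illustration.

# Map a "disable early demux?" setting (true/false) to the sysctl value.
# tcp_early_demux: 1 = enabled (kernel default), 0 = disabled.
early_demux_value() {
    case "$1" in
        true)  echo 0 ;;
        false) echo 1 ;;
        *)     echo "invalid value: $1 (expected true or false)" >&2; return 1 ;;
    esac
}

# Print the sysctl command, and run it unless DRY_RUN=1.
# The body runs in a subshell so its variables do not leak to the caller.
set_early_demux() (
    value=$(early_demux_value "$1") || exit 1
    cmd="sysctl -w net.ipv4.tcp_early_demux=$value"
    echo "$cmd"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        $cmd
    fi
)
```

With DRY_RUN=1 set, `set_early_demux true` only prints the command it would run, which makes the mapping easy to check without root privileges.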

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
Without this change, a curl from the host network namespace to a branch ENI pod fails.

Testing done on this change:

After updating the CNI init container, tcp_early_demux was disabled on the instance and curl started working.
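
One way to confirm the setting on an instance is sketched below. It reads the value the kernel exposes under /proc/sys and reports it in words. The helper name and its optional path argument are hypothetical, added only so the check can be exercised against a temporary file.

```shell
# Hypothetical helper: report the early demux state from a proc-style file.
# The optional argument (an assumption, for testing) overrides the path.
early_demux_state() (
    path="${1:-/proc/sys/net/ipv4/tcp_early_demux}"
    value=$(cat "$path") || exit 1
    case "$value" in
        0) echo "disabled" ;;
        1) echo "enabled" ;;
        *) echo "unexpected value: $value" >&2; exit 1 ;;
    esac
)
```

Calling `early_demux_state` with no argument on a node reads the live value. Following it with the same curl from the host namespace to the pod is the manual test described here.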

Automation added to e2e:

NA

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
No

Does this change require updates to the CNI daemonset config files to work?:

Yes

Does this PR introduce any user-facing change?:


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

SaranBalaji90 changed the title from "Disable TCP early demux when pod-eni is enabled" to "Add support to toggle TCP early demux" on Sep 13, 2020
scripts/init.sh Outdated
else
sysctl -w "net.ipv4.tcp_early_demux=0"
fi

cat "/proc/sys/net/ipv4/conf/$PRIMARY_IF/rp_filter"

I think this cat rp_filter line should go above with the other rp_filter code, otherwise the output is going to be quite confusing :P


I think we also want to add env: [ {name: "ENABLE_TCP_EARLY_DEMUX", value: "false"} ], on line 229 of config/master/manifests.jsonnet


ah true .. missed it, will update.

anguslees commented Sep 14, 2020

> I think we also want to add env: [ {name: "ENABLE_TCP_EARLY_DEMUX", value: "false"} ], on line 229 of config/master/manifests.jsonnet

Re ^this: Are we going to disable early_demux for everyone? Only for sg-pp users? Ask users to disable the option explicitly as part of the instructions for enabling sg-pp?

Proposed logic disables it for everyone by default. That's probably ok, but we're removing a kernel optimisation and it would be nice to at least measure the impact before doing that unilaterally.

SaranBalaji90 commented Sep 14, 2020

We discussed this. The current plan is to run performance tests on traffic from kubelet to other pods and then release it as part of 1.7.3. A couple of blog posts call out that disabling this can actually be beneficial (https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/#tuning-adjusting-ip-protocol-early-demux and http://www.newfreesoft.com/linux/linux_kernel_socket_protocol_stack_routing_lookup_cache_mechanism_1567/, the latter of which describes that caching happens in two places, one of them through conntrack).

Disabling it by default also removes one more flag that would otherwise need to be turned on to enable sg-pp. We can also change the default value to true if it causes performance issues on some clusters.


Adding to what @SaranBalaji90 commented:

These are the kernel functions that are executed.

Legend: skb_sk_is_empty is 1 when sk is NULL and 0 when sk is not NULL.

vlan990070403b2 | 192.168.78.231 -> 192.168.94.234 |ip_rcv_finish -> 0 -> 0 [skb_sk_is_empty 1 ]  
vlan990070403b2 | 192.168.78.231 -> 192.168.94.234 |ip_route_input_rcu -> 0 -> 0 [skb_sk_is_empty 0 ] 
vlan990070403b2 | 192.168.78.231 -> 192.168.94.234 |e__ip_route_input_rcu:0.0.0.0/0 is FIB HIT [RTN_UNICAST(Gateway or direct route)] 
vlan990070403b2 | 192.168.78.231 -> 192.168.94.234 | e__ip_route_input_rcu -> 0 -> 0 [skb_sk_is_empty 0 ]
vlan990070403b2 | 192.168.78.231 -> 192.168.94.234 |ip_forward -> 0 -> 0 [skb_sk_is_empty 0 ] 
vlan990070403b2 | 192.168.78.231 -> 192.168.94.234 |e__ip_forward -> 1 -> 0 [skb_sk_is_empty 0 ] 
vlan990070403b2 | 192.168.78.231 -> 192.168.94.234 |e__ip_rcv_finish -> 1 -> 0 [skb_sk_is_empty 0 ] 

For the branch ENI case, the packet enters both if conditions in ip_rcv_finish:

	if (net->ipv4.sysctl_ip_early_demux &&
	    !skb_dst(skb) &&
	    !skb->sk &&
	    !ip_is_fragment(iph)) {
		const struct net_protocol *ipprot;
		int protocol = iph->protocol;
		ipprot = rcu_dereference(inet_protos[protocol]);
		if (ipprot && (edemux = READ_ONCE(ipprot->early_demux))) {
			err = edemux(skb);
			if (unlikely(err))
				goto drop_error;
			/* must reload iph, skb->head might have changed */
			iph = ip_hdr(skb);
		}
	}
	/*
	 *	Initialise the virtual path cache for the packet. It describes
	 *	how the packet travels inside Linux networking.
	 */
	if (!skb_valid_dst(skb)) {
		err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
					   iph->tos, dev);
		if (unlikely(err))
			goto drop_error;
	}

The first if checks !skb_dst(skb) && !skb->sk, so on entry both sk and dst are NULL, and early demux runs and attaches the matching socket to the skb. However, skb_valid_dst then fails, so a route lookup is also performed. That lookup hands the packet to ip_forward, and because skb->sk is set, the packet is dropped. In short, the purpose of the early demux optimization, using a cached value instead of a routing lookup, wasn't being achieved here.
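
The decision described above can be written as a small toy model. This is a sketch and not kernel code: the function name and its three inputs are hypothetical. It assumes the matched socket has no valid cached dst, so a route lookup always happens, as in the trace above.

```shell
# Toy model (not kernel code) of the receive-path decision described above.
# Arguments:
#   $1  early_demux   1 if net.ipv4.tcp_early_demux is enabled, else 0
#   $2  socket_match  1 if early demux would find a matching socket, else 0
#   $3  route         "local" or "forward" (result of the route lookup)
packet_fate() (
    early_demux=$1 socket_match=$2 route=$3
    sk_set=0
    if [ "$early_demux" = 1 ] && [ "$socket_match" = 1 ]; then
        sk_set=1    # early demux attached a socket: skb->sk is not NULL
    fi
    case "$route" in
        local)
            echo "deliver" ;;       # local route: packet is delivered locally
        forward)
            if [ "$sk_set" = 1 ]; then
                echo "drop"         # ip_forward refuses packets with skb->sk set
            else
                echo "forward"
            fi ;;
        *)
            echo "unknown route: $route" >&2; exit 1 ;;
    esac
)
```

For example, `packet_fate 1 1 forward` models the branch ENI case and prints "drop", while `packet_fate 0 1 forward` models the same packet with early demux disabled and prints "forward".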

anguslees commented Sep 14, 2020

Also the PR description doesn't mention sg-pp at all. In 10 years time, we're going to wonder why this sysctl was set in the way it was, and this PR description is going to be our only record of the motivation for this change. It needs to explain the whole story for that team member archaeologist from the future.

SaranBalaji90 commented

> Also the PR description doesn't mention sg-pp at all. In 10 years time, we're going to wonder why this sysctl was set in the way it was, and this PR description is going to be our only record of the motivation for this change. It needs to explain the whole story for that team member archaeologist from the future.

I agree. Will update the description @anguslees

mogren left a comment

Testing looks good, but this PR needs a rebase.

mogren left a comment

Since DISABLE_TCP_EARLY_DEMUX=true adds some overhead under load, and it's only needed for kubelet to pod-eni communication, it makes sense to have it off by default.

jayanthvn commented Sep 16, 2020

Another option, if net->ipv4.sysctl_ip_early_demux needs to stay enabled: even for a curl to a regular pod, we see tcp_v4_early_demux being called in ip_rcv_finish, and it finds a socket that doesn't have a valid destination, i.e. struct dst_entry *dst = READ_ONCE(sk->sk_rx_dst) is NULL. sk_rx_dst seems to be populated only after a successful 3-way handshake, so for the SYN-ACK it is still NULL. The only difference is in the IP lookup pipeline. Here is the call trace for a curl from the host instance to a pod, with and without a branch ENI:

Host - branch ENI pod


  1. TCP SYN (host to pod) -> inet_ehash_nolisten/inet_ehash_insert will insert the socket information into the hash table
    <Contents of sk->(src 192.168.94.234 dest 192.168.78.231 sport 50110 dport 80 netns 4026531993), SK->DST->[is_null 1 flags 0 ob 0]>
    but sk->sk_rx_dst is still NULL

  2. This goes through regular IP table lookup and goes out of eth0

  3. TCP SYN re-enters on vlan.eth.3 routes to branch eni vlan990070403b2 and the packet is sent to pod.

  4. TCP SYN ACK (pod to host) will enter on vlan990070403b2 and go through ip tables lookup. In ip_rcv_finish -> tcp_v4_early_demux, which searches for the socket, it finds the socket from step 1
    <Contents of sk->(src 192.168.94.234 dest 192.168.78.231 sport 50110 dport 80 netns 4026531993), DST data[is_null 1 flags 0 obs 0]>

  5. Since the dst is not valid, kernel will invoke route table lookup. Here since the route is not local, ip_route_input_rcu -> ip_route_input_slow->ip_mkroute_input sets rth->dst.input = ip_forward

  6. In ip_forward the packet is dropped, since skb->sk is set.

if (res->type == RTN_LOCAL) {
		err = fib_validate_source(skb, saddr, daddr, tos,
					  0, dev, in_dev, &itag);
		if (err < 0)
			goto martian_source;
		goto local_input;
	}

	if (!IN_DEV_FORWARD(in_dev)) {
		err = -EHOSTUNREACH;
		goto no_route;
	}
	if (res->type != RTN_UNICAST)
		goto martian_destination;

	err = ip_mkroute_input(skb, res, in_dev, daddr, saddr, tos);

Host - regular pod


  1. TCP SYN (host to pod) - inet_ehash_nolisten/inet_ehash_insert
    <Contents of sk->(src 192.168.64.14 dest 192.168.95.235 sport 60914 dport 80 netns 4026531993) , SK->DST->[is_null 1 flags 0 ob 0]>

  2. This goes through regular IP table lookup and goes out of eni797ffc04d9d and the packet is sent to pod.

  3. TCP SYN ACK (pod to host) will enter on eni797ffc04d9d and go through ip tables lookup. In ip_rcv_finish -> tcp_v4_early_demux, which searches for the socket, it finds the socket from step 1
    <Contents of sk->(src 192.168.64.14 dest 192.168.95.235 sport 60914 dport 80 netns 4026531993) , DST data[is_null 1 flags 0 obs 0]>

  4. Since the dst is not valid, the kernel will invoke a route table lookup. This time it will hit the local route. The SYN ACK goes ahead with the ip tables lookup in ipt_do_table

Since this is the difference between regular and branch ENI pods, we would have to take the steps below to avoid disabling TCP early demux:

  1. Add a local route in the branch ENI table. [This alone won't be sufficient, since the packet will be classified as a martian source because the RPF check fails.]

    local 192.168.94.234 dev vlan.eth.3 proto kernel scope host

  2. net.ipv4.conf.all.rp_filter = 0

  3. net.ipv4.conf.vlan990070403b2.rp_filter = 0

  4. net.ipv4.conf.vlan/eth/3.rp_filter = 2

  5. net.ipv4.conf.vlan/eth/3.accept_local = 1

/* Ignore rp_filter for packets protected by IPsec. */
int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
			u8 tos, int oif, struct net_device *dev,
			struct in_device *idev, u32 *itag)
{
	int r = secpath_exists(skb) ? 0 : IN_DEV_RPFILTER(idev);

	if (!r && !fib_num_tclassid_users(dev_net(dev)) &&
	    IN_DEV_ACCEPT_LOCAL(idev) &&
	    (dev->ifindex != oif || !IN_DEV_TX_REDIRECTS(idev))) {
		*itag = 0;
		return 0;
	}
	return __fib_validate_source(skb, src, dst, tos, oif, dev, r, idev, itag);
}
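
The alternative steps listed above could be collected into a helper like the sketch below. It only prints the commands and never runs them. The function name and the route table argument are hypothetical: the thread does not say which table the branch ENI uses, so the caller has to supply it.

```shell
# Hypothetical helper: print (never run) the commands for the alternative.
# Arguments (all supplied by the caller):
#   $1  host IP to add as a local route, e.g. 192.168.94.234
#   $2  trunk VLAN interface, e.g. vlan.eth.3
#   $3  branch ENI interface, e.g. vlan990070403b2
#   $4  route table for the branch ENI (an assumption; not given in the thread)
print_alt_workaround() (
    ip_addr=$1 trunk_if=$2 branch_if=$3 table=$4
    # sysctl accepts "/" in place of "." inside interface names.
    trunk_key=$(printf '%s' "$trunk_if" | tr '.' '/')
    branch_key=$(printf '%s' "$branch_if" | tr '.' '/')
    echo "ip route add local $ip_addr dev $trunk_if proto kernel scope host table $table"
    echo "sysctl -w net.ipv4.conf.all.rp_filter=0"
    echo "sysctl -w net.ipv4.conf.$branch_key.rp_filter=0"
    echo "sysctl -w net.ipv4.conf.$trunk_key.rp_filter=2"
    echo "sysctl -w net.ipv4.conf.$trunk_key.accept_local=1"
)
```

Printing rather than executing keeps the sketch safe to try. Whether this set of changes is preferable to disabling early demux is the open question in this comment, not something the sketch settles.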

srini-ram commented

For completeness, let's include the node port and service IP data path logs and explain the reason tcp_early_demux is not invoked (i.e., DNAT / dport translation leading to the socket hash not being the same for TCP SYN and TCP SYN+ACK).

jayanthvn commented

Curl to Service IP ->

  1. This time inet_ehash_nolisten will cache SIP 192.168.94.234, DIP 10.100.164.240, sport 45564, dport 80:
192.168.94.234 -> 10.100.164.240 inet_ehash_nolisten -> 0 -> 0 [skb_sk_is_empty 0 iif 0]
Contents of sk->(src 192.168.94.234 dest 10.100.164.240 sport 45564 dport 80 netns 4026531993)
TCP HDR->[src 0 dest 0 seq 0 ack 0]                           
SK->DST->[is_null 1 flags 0 ob 0]                                
  2. The SYN ACK will carry the pod's source IP, so tcp_v4_early_demux won't find a matching socket. In ip_forward the kernel therefore won't drop the packet, since skb->sk is NULL, and it continues with the regular IP lookup pipeline.
192.168.78.231 -> 192.168.158.209 T_ACK,SYN tcp_v4_early_demux -> 0 -> 0 [skb_sk_is_empty 1 iif 11]
Contents of sk->(src 0.0.0.0 dest 0.0.0.0 sport 0 dport 0 netns 0) 
TCP HDR->[src 80 dest 55972 seq 396975263 ack 3085782939]
SDK buffer DST data[is_null 1 flags 0 obs 0]

192.168.78.231 -> 192.168.158.209 T_ACK,SYN e__tcp_v4_early_demux -> 0 -> 0 [skb_sk_is_empty 1 iif 11]
Contents of sk->(src 0.0.0.0 dest 0.0.0.0 sport 0 dport 0 netns 0)
TCP HDR->[src 80 dest 55972 seq 396975263 ack 3085782939]
SDK buffer DST data[is_null 1 flags 0 obs 0]

192.168.78.231 -> 192.168.158.209 T_ACK,SYN ip_forward -> 0 -> 0 [skb_sk_is_empty 1 iif 11]
Contents of sk->(src 0.0.0.0 dest 0.0.0.0 sport 0 dport 0 netns 0)
TCP HDR->[src 80 dest 55972 seq 396975263 ack 3085782939]
SDK buffer DST data[is_null 0 flags 0 obs 255]

192.168.78.231 -> 192.168.158.209 T_ACK,SYN ipt_do_table -> 0 -> 0 [skb_sk_is_empty 1 iif 11]
Contents of sk->(src 0.0.0.0 dest 0.0.0.0 sport 0 dport 0 netns 0)
TCP HDR->[src 80 dest 55972 seq 396975263 ack 3085782939] 
SDK buffer DST data[is_null 0 flags 0 obs 255] 

192.168.78.231 -> 192.168.158.209  e__ipt_do_table:mangle.FORWARD is ACCEPT
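
Why the lookup misses in the service IP case can be shown with a toy model of the socket match. It is a simplified sketch with a hypothetical function name: it compares only addresses and ports, whereas the real lookup also considers things such as the network namespace. The SYN ACK tuple used in the example is an illustrative assumption built from the socket contents logged in step 1.

```shell
# Toy model of an established-socket lookup for an incoming packet.
# A packet matches a socket when its source equals the socket's remote end
# and its destination equals the socket's local end.
#   $1 $2  socket local address and port
#   $3 $4  socket remote address and port
#   $5 $6  packet source address and port
#   $7 $8  packet destination address and port
early_demux_lookup() {
    if [ "$5:$6" = "$3:$4" ] && [ "$7:$8" = "$1:$2" ]; then
        echo "hit"
    else
        echo "miss"
    fi
}
```

With the socket cached as 192.168.94.234:45564 -> 10.100.164.240:80, a SYN ACK sourced from the pod IP 192.168.78.231:80 does not match the socket's remote end, so `early_demux_lookup 192.168.94.234 45564 10.100.164.240 80 192.168.78.231 80 192.168.94.234 45564` prints "miss". A socket opened directly to the pod IP would match the same packet.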

jayanthvn commented Sep 17, 2020

Curl to branch eni pod on remote instance ->

  1. The received SYN packet is not cached in the socket buffer:
vlan.eth.3   192.168.158.209 -> 192.168.78.231 T_SYN ip_rcv -> 0 -> 0 [skb_sk_is_empty 1 iif 12]
Contents of sk->(src 0.0.0.0 dest 0.0.0.0 sport 0 dport 0 netns 0) 
TCP HDR->[src 55972 dest 80 seq 3085782938 ack 0]
SDK buffer DST data[is_null 1 flags 0 obs 0]  
  2. The SYN packet goes through IP lookup: tcp_v4_early_demux doesn't find any cached socket, and since the route is not local, ip_forward is invoked and processing continues with the iptables lookup pipeline.
192.168.158.209 -> 192.168.78.231 T_SYN tcp_v4_early_demux -> 0 -> 0 [skb_sk_is_empty 1 iif 12]
Contents of sk->(src 0.0.0.0 dest 0.0.0.0 sport 0 dport 0 netns 0)
TCP HDR->[src 55972 dest 80 seq 3085782938 ack 0]
SDK buffer DST data[is_null 1 flags 0 obs 0]

192.168.158.209 -> 192.168.78.231 T_SYN e__tcp_v4_early_demux -> 0 -> 0 [skb_sk_is_empty 1 iif 12] 
Contents of sk->(src 0.0.0.0 dest 0.0.0.0 sport 0 dport 0 netns 0)
TCP HDR->[src 55972 dest 80 seq 3085782938 ack 0]
SDK buffer DST data[is_null 1 flags 0 obs 0] 

192.168.158.209 -> 192.168.78.231 T_SYN ip_forward -> 0 -> 0 [skb_sk_is_empty 1 iif 12]
Contents of sk->(src 0.0.0.0 dest 0.0.0.0 sport 0 dport 0 netns 0)
TCP HDR->[src 55972 dest 80 seq 3085782938 ack 0]
SDK buffer DST data[is_null 0 flags 0 obs 255]
  3. The SYN ACK will be inserted into the socket cache:
192.168.78.231 -> 192.168.158.209  inet_ehash_insert -> 0 -> 0 [skb_sk_is_empty 0 iif 0] 
Contents of sk->(src 192.168.78.231 dest 192.168.158.209 sport 0 dport 55972 netns 4026532666)
TCP HDR->[src 0 dest 0 seq 0 ack 0]
SK->DST->[is_null 0 flags 0 ob 0] 
  4. But the SYN ACK won't find the socket in tcp_v4_early_demux, so in ip_forward the kernel won't drop the packet, since skb->sk is NULL. It then continues with the regular IP lookup pipeline.
192.168.78.231 -> 192.168.158.209 T_ACK,SYN tcp_v4_early_demux -> 0 -> 0 [skb_sk_is_empty 1 iif 11]
Contents of sk->(src 0.0.0.0 dest 0.0.0.0 sport 0 dport 0 netns 0)
TCP HDR->[src 80 dest 55972 seq 396975263 ack 3085782939]
SDK buffer DST data[is_null 1 flags 0 obs 0]

192.168.78.231 -> 192.168.158.209 T_ACK,SYN e__tcp_v4_early_demux -> 0 -> 0 [skb_sk_is_empty 1 iif 11]  
Contents of sk->(src 0.0.0.0 dest 0.0.0.0 sport 0 dport 0 netns 0)
TCP HDR->[src 80 dest 55972 seq 396975263 ack 3085782939] 
SDK buffer DST data[is_null 1 flags 0 obs 0]

192.168.78.231 -> 192.168.158.209 T_ACK,SYN ip_forward -> 0 -> 0 [skb_sk_is_empty 1 iif 11]
Contents of sk->(src 0.0.0.0 dest 0.0.0.0 sport 0 dport 0 netns 0) 
TCP HDR->[src 80 dest 55972 seq 396975263 ack 3085782939] 
SDK buffer DST data[is_null 0 flags 0 obs 255]                        

6 participants