patch/optimize(bpf): improve lan hijack datapath performance #466

Merged 3 commits on Mar 8, 2024

jschwinger233 (Member) commented Mar 2, 2024

Background

This PR introduces three performance optimizations for the lan datapath. First, let's review the datapath:

                ┌──────────────────┐ 
  1             │ 2                │ 
┌────┐     ┌────┼────┐      ┌───┐  │ 
│    ├─────►    │    ├──────►   │  │ 
│lan0│     │dae0│peer│      │dae│  │ 
│    ◄─────┤    │    ◄──────┤   │  │ 
└────┘     └────┼────┘      └───┘  │ 
             3  │     dae netns    │ 
                └──────────────────┘ 

a. bpf_lan_ingress (1 in the diagram): makes the split-routing decision. Direct traffic passes on into the network stack; split (proxied) traffic is redirected to dae0 with bpf_redirect.
b. bpf_peer_ingress (2 in the diagram): only split traffic can reach this point. It calls bpf_skc_lookup and bpf_sk_assign to assign the traffic to the dae socket.
c. bpf_dae0_ingress (3 in the diagram): only **replies** to split traffic can reach this point. It calls bpf_redirect to send them back to lan0.

Optimization 1: The BPF programs at points a and b both parse the layer 2, 3, and 4 packet headers. Parsing twice is unnecessary: after parsing at point a, the information that point b needs can be carried over in skb->cb.

Optimization 2: The peer_ingress BPF program at point b doesn't need to call bpf_skc_lookup for established TCP connections, because the kernel can do the socket lookup itself. With tcp_early_demux enabled, the kernel can also skip the routing decision and deliver the packet locally right away.

Optimization 3: The lan_ingress program at point a redirects the skb from lan0 to dae0, and the skb then crosses into the netns to reach the peer. bpf_redirect_peer shortens this path: it redirects the skb directly from lan0 to the peer inside the netns, avoiding the performance cost of enqueue_to_backlog.

Recommendation: Review by commit.

jschwinger233 changed the title from "perf: Improve hijack datapath performance" to "perf(bpf): Improve hijack datapath performance" Mar 2, 2024
jschwinger233 changed the title from "perf(bpf): Improve hijack datapath performance" to "perf(bpf): improve hijack datapath performance" Mar 2, 2024
jschwinger233 changed the title from "perf(bpf): improve hijack datapath performance" to "patch/optimize(bpf): improve hijack datapath performance" Mar 2, 2024
jschwinger233 marked this pull request as ready for review March 2, 2024 09:42
jschwinger233 requested a review from a team as a code owner March 2, 2024 09:42

sdgrfe commented Mar 2, 2024

Tests passed.

amtoaer commented Mar 2, 2024

[image] Works fine.

Mitsuhaxy

It's working fine.
[image]

dae-prow bot (Contributor) previously approved these changes Mar 2, 2024

🧪 Since the PR has been fully tested, please consider merging it.

dae-prow bot (Contributor) commented Mar 3, 2024

❌ Your branch is currently out-of-sync to main. No worry, I will fix it for you.

Previously we parsed skb->data twice: once at wan_egress/lan_ingress and
once at dae0peer_ingress. This was due to a limitation of bpf_sk_assign:
it has to be called within the netns where the socket lives.

This patch parses skb->data only once, at wan_egress/lan_ingress, and
leaves a value in skb->cb[1] to tell dae0peer_ingress what to do:
1. if skb->cb[1] == TCP, it's a new TCP connection: assign the skb to the
   TCP listener;
2. if skb->cb[1] == UDP, it's a UDP packet: assign the skb to the UDP
   listener;
3. otherwise it's an established TCP connection, and the stack can take
   care of the socket lookup.
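
The three-way dispatch in this commit message can be modeled in plain userspace C. This is only a sketch, not the PR's BPF code. The marker values (taken here to be IPPROTO_TCP and IPPROTO_UDP), the rule that a "new" TCP connection means SYN set and ACK clear, and the names lan_ingress_mark, peer_ingress_action, and enum peer_action are all assumptions made for illustration.

```c
#include <netinet/in.h> /* IPPROTO_TCP, IPPROTO_UDP */
#include <stdbool.h>
#include <stdint.h>

/* What dae0peer_ingress does with an skb, following the commit message. */
enum peer_action {
    ASSIGN_TCP_LISTENER, /* case 1: new TCP connection */
    ASSIGN_UDP_LISTENER, /* case 2: UDP packet */
    LEAVE_TO_STACK,      /* case 3: established TCP, the kernel finds the socket */
};

/* Hypothetical model of the value wan_egress/lan_ingress leaves in
 * skb->cb[1] after parsing the headers once.
 * Assumption: a "new" TCP connection is a segment with SYN set and ACK clear. */
static uint32_t lan_ingress_mark(uint8_t l4proto, bool syn, bool ack)
{
    if (l4proto == IPPROTO_UDP)
        return IPPROTO_UDP;
    if (l4proto == IPPROTO_TCP && syn && !ack)
        return IPPROTO_TCP;
    return 0; /* anything else carries no marker */
}

/* Hypothetical model of dae0peer_ingress: branch on the marker alone,
 * without parsing the packet a second time. */
static enum peer_action peer_ingress_action(uint32_t cb1)
{
    switch (cb1) {
    case IPPROTO_TCP:
        return ASSIGN_TCP_LISTENER;
    case IPPROTO_UDP:
        return ASSIGN_UDP_LISTENER;
    default:
        return LEAVE_TO_STACK;
    }
}
```

In this model, peer_ingress_action(lan_ingress_mark(IPPROTO_TCP, true, false)) selects the TCP listener, while a later segment of the same connection (SYN clear) falls through to the stack's own socket lookup.
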

douglarek (Contributor) commented Mar 4, 2024

Tested in the following environment, works very well.

A router: Linux ImmortalWrt 6.1.78 #0 SMP PREEMPT Mon Feb 19 15:48:41 2024 aarch64 GNU/Linux
A workstation: Linux Manjaro 6.7.7-1-MANJARO #1 SMP PREEMPT_DYNAMIC Fri Mar  1 18:26:06 UTC 2024 x86_64 GNU/Linux

jschwinger233 requested a review from a team as a code owner March 6, 2024 17:45
jschwinger233 force-pushed the pr/gray/datapath-perf branch 2 times, most recently from 9db0e34 to 026f23b March 6, 2024 17:54

jschwinger233 (Member, Author)

Thanks to everyone who keeps testing this PR. 5badabf is the last low-hanging fruit whose temptation I couldn't resist. Hope this small patch doesn't break anything 🤞

LPC 2020 had a talk introducing bpf_redirect_peer, which allows ingress-to-ingress redirection without going through the CPU's backlog queue. Cilium saw a +1.3 Gbit/s performance boost from using it.

douglarek (Contributor) commented Mar 7, 2024

After binding docker0 to the LAN and testing 5badabf, everything works perfectly. There are no issues with direct connection diversion. Well done.

A workstation: Linux Manjaro 6.7.7-1-MANJARO #1 SMP PREEMPT_DYNAMIC Fri Mar  1 18:26:06 UTC 2024 x86_64 GNU/Linux

amtoaer commented Mar 7, 2024

Tested successfully with the latest CI build in the following environments:

Linux GracPC 6.7.5-zen1-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Sat, 17 Feb 2024 14:02:21 +0000 x86_64 GNU/Linux
Linux NAS 6.7.4-arch1-1 #1 SMP PREEMPT_DYNAMIC Mon, 05 Feb 2024 22:07:49 +0000 x86_64 GNU/Linux

jschwinger233 changed the title from "patch/optimize(bpf): improve hijack datapath performance" to "patch/optimize(bpf): improve lan hijack datapath performance" Mar 7, 2024

jschwinger233 (Member, Author)

Benchmark (lan only)

1. Env: Linux 6.6.17 KVM, 4 cores, 12G memory.

2. Setup

Run two Docker containers: one runs dae, the other runs v2ray. This is almost the same as dae's GitHub Actions test; just treat the two containers as two nodes.

I am using sockperf:

  1. Run the sockperf server on the v2ray side (for the UDP test, drop --tcp):
nsenter -t $(pidof v2ray) -n sockperf server -i 172.18.0.3 --tcp --daemonize
  2. Run the sockperf client inside the "pod" to emulate proxied lan traffic (for the UDP test, drop --tcp):
nsenter -t $(pidof pod) -n sockperf ping-pong -i 172.18.0.3 --tcp --time 10

3. TCP

dae-0.4.0: avg-latency=37.310 (std-dev=7.352)
this pr: avg-latency=36.792 (std-dev=7.437)

avg-latency improves by about 1.4%.

That is not much, because the test environment is clean and free of netfilter rules.

After adding a simple iptables rule on the dae node:

iptables -t raw -A PREROUTING -p tcp -m tcp --dport 11111 -j ACCEPT

With this rule in place, dae-0.4.0 performs worse, with avg-latency sometimes going as high as 38+, while dae-next (this PR) is not affected at all because it bypasses the stack. In that case the improvement is 3.1%.

4. UDP

The normal UDP test result is:

dae-0.4.0: avg-latency=58.275 (std-dev=50.721)
dae-next: avg-latency=55.927 (std-dev=48.332)

A 4% improvement.

However, dae-0.4.0 is also known to fall back to encapsulation to avoid a port conflict when another process is already listening on port 53, which hurts performance badly. When that fallback kicks in, dae-0.4.0's avg-latency degrades to 60.412 (std-dev=47.764), and dae-next is 7%+ better.
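
The percentages quoted in this benchmark can be re-derived from the reported avg-latency figures. The helper below is only an illustrative sketch: the name latency_improvement_pct is made up here, and it assumes the improvement is measured relative to the dae-0.4.0 baseline.

```c
#include <assert.h>

/* Relative latency improvement of `candidate` over `baseline`, in percent.
 * A positive result means the candidate has the lower average latency. */
static double latency_improvement_pct(double baseline, double candidate)
{
    assert(baseline > 0.0); /* a latency baseline must be positive */
    return (baseline - candidate) / baseline * 100.0;
}
```

With the numbers reported above, latency_improvement_pct(37.310, 36.792) comes to about 1.39 for TCP, latency_improvement_pct(58.275, 55.927) to about 4.03 for UDP, and latency_improvement_pct(60.412, 55.927) to about 7.42 for the UDP fallback case.
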

sumire88 (Contributor) left a comment

Thanks for your groundbreaking work!

sumire88 added the tested label Mar 8, 2024

dae-prow bot (Contributor) left a comment

🧪 Since the PR has been fully tested, please consider merging it.

mzz2017 (Contributor) left a comment

Brilliant code!

jschwinger233 merged commit 49f576e into main Mar 8, 2024
30 checks passed
jschwinger233 deleted the pr/gray/datapath-perf branch March 8, 2024 15:28
dae-prow bot mentioned this pull request Apr 2, 2024
dae-prow bot mentioned this pull request Jun 11, 2024