patch/optimize(bpf): improve lan hijack datapath performance #466

jschwinger233 · 2024-03-02T09:35:52Z

Background

This PR introduces 3 performance optimizations. First, let's review the datapath:

                ┌──────────────────┐ 
  1             │ 2                │ 
┌────┐     ┌────┼────┐      ┌───┐  │ 
│    ├─────►    │    ├──────►   │  │ 
│lan0│     │dae0│peer│      │dae│  │ 
│    ◄─────┤    │    ◄──────┤   │  │ 
└────┘     └────┼────┘      └───┘  │ 
             3  │     dae netns    │ 
                └──────────────────┘ 

a. bpf_lan_ingress (point 1 in the diagram): Makes the split-routing decision. Direct traffic is passed on into the network stack; split traffic is redirected to dae0 with bpf_redirect.
b. bpf_peer_ingress (point 2): Only split traffic can reach this point. It calls bpf_skc_lookup and bpf_sk_assign to assign the traffic to the dae socket.
c. bpf_dae0_ingress (point 3): Only **replies** to split traffic can reach this point. It calls bpf_redirect to send them back to lan0.

Optimization 1: The BPF programs at points a and b both parse the layer 2, 3, and 4 packet headers. Parsing twice is unnecessary: after parsing at point a, the information needed at point b can be carried over in skb->cb.

Optimization 2: The peer_ingress BPF program at point b doesn't need to call bpf_skc_lookup for established TCP connections, because the kernel can do the socket lookup itself. With tcp_early_demux enabled, the kernel can also skip the routing decision and perform local delivery directly.
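A minimal userspace sketch of the marking side of optimizations 1 and 2, written as plain C so it can run outside the kernel. It assumes that a new TCP connection is recognized by a SYN without ACK and that the marks reuse the IP protocol numbers; `struct pkt_info`, `classify_mark`, and the `MARK_*` names are hypothetical, and the PR's actual BPF code may differ. The returned value stands in for what the program at point a would leave in skb->cb for point b.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <netinet/in.h> /* IPPROTO_TCP, IPPROTO_UDP */

/* Hypothetical summary of the headers that point a has already parsed. */
struct pkt_info {
    uint8_t l4proto; /* IPPROTO_TCP or IPPROTO_UDP */
    bool syn;        /* TCP SYN flag */
    bool ack;        /* TCP ACK flag */
};

/* Hypothetical marks; MARK_NONE means "let the kernel look up the socket". */
enum { MARK_NONE = 0, MARK_TCP = IPPROTO_TCP, MARK_UDP = IPPROTO_UDP };

/* Decide what point a would hand over to point b for this packet. */
static inline uint32_t classify_mark(const struct pkt_info *p)
{
    assert(p); /* the caller must pass the parsed header summary */

    if (p->l4proto == IPPROTO_UDP)
        return MARK_UDP;
    /* Assumption: SYN without ACK opens a new connection. */
    if (p->l4proto == IPPROTO_TCP && p->syn && !p->ack)
        return MARK_TCP;
    /* Everything else, including established TCP, is left to the stack. */
    return MARK_NONE;
}
```

With this split, only the first packet of a TCP connection and UDP packets need an explicit socket assignment at point b.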

Optimization 3: The lan_ingress program at point a redirects the skb from lan0 to dae0, and the skb then crosses into the netns to reach the peer. bpf_redirect_peer simplifies this step by redirecting the skb directly from lan0 to the peer inside the netns, avoiding the performance cost of enqueue_to_backlog.

Recommendation: Review by commit.

Checklist

The Pull Request has been fully tested
There's an entry in the CHANGELOGS
There is a user-facing docs PR against https://github.com/daeuniverse/dae

Full Changelogs

[Implement ...]

Issue Reference

Closes #[issue number]

Test Result

sdgrfe · 2024-03-02T10:15:44Z

Tests passed.

amtoaer · 2024-03-02T15:51:16Z

Works fine.

Mitsuhaxy · 2024-03-02T16:22:20Z

It's working fine

dae-prow

🧪 Since the PR has been fully tested, please consider merging it.

dae-prow · 2024-03-03T03:21:26Z

❌ Your branch is currently out-of-sync to main. No worry, I will fix it for you.

dae-prow · 2024-03-03T09:16:02Z

❌ Your branch is currently out-of-sync to main. No worry, I will fix it for you.

Previously we parsed skb->data twice: once at wan_egress/lan_ingress and once at dae0peer_ingress. This was because of a limitation of bpf_sk_assign: it has to be called within the netns where the socket lives. This patch parses skb->data only once, at wan_egress/lan_ingress, and leaves a value in skb->cb[1] to tell dae0peer_ingress what to do:

1. if skb->cb[1] == TCP, it's a new TCP connection: assign the skb to the TCP listener;
2. if skb->cb[1] == UDP, it's a UDP packet: assign the skb to the UDP listener;
3. otherwise it's an established TCP connection, and the stack can take care of the socket lookup.
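The three cases can be modeled in plain C roughly as follows. This is a userspace sketch, not the patch itself: `struct model_skb`, `enum peer_action`, and `peer_dispatch` are hypothetical names, and it assumes that "TCP" and "UDP" in the commit message mean IPPROTO_TCP and IPPROTO_UDP. In the real program the first two branches would be where bpf_sk_assign is called.

```c
#include <assert.h>
#include <stdint.h>
#include <netinet/in.h> /* IPPROTO_TCP, IPPROTO_UDP */

/* Hypothetical stand-in for the part of struct __sk_buff used here. */
struct model_skb {
    uint32_t cb[5]; /* __sk_buff exposes five 32-bit control-buffer slots */
};

/* What dae0peer_ingress should do with the packet. */
enum peer_action {
    ASSIGN_TCP_LISTENER, /* new TCP connection */
    ASSIGN_UDP_LISTENER, /* UDP packet */
    LEAVE_TO_STACK,      /* established TCP: kernel does the socket lookup */
};

/* Dispatch on the value left in cb[1], without re-parsing the headers. */
static inline enum peer_action peer_dispatch(const struct model_skb *skb)
{
    assert(skb); /* the caller must pass a valid packet context */

    switch (skb->cb[1]) {
    case IPPROTO_TCP:
        return ASSIGN_TCP_LISTENER;
    case IPPROTO_UDP:
        return ASSIGN_UDP_LISTENER;
    default:
        return LEAVE_TO_STACK;
    }
}
```

The default branch is what removes the explicit lookup for established connections: any value other than the two marks falls through to the kernel stack.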

douglarek · 2024-03-04T13:25:51Z

Tested in the following environment, works very well.

A router: Linux ImmortalWrt 6.1.78 #0 SMP PREEMPT Mon Feb 19 15:48:41 2024 aarch64 GNU/Linux

A workstation: Linux Manjaro 6.7.7-1-MANJARO #1 SMP PREEMPT_DYNAMIC Fri Mar  1 18:26:06 UTC 2024 x86_64 GNU/Linux

jschwinger233 · 2024-03-06T18:50:31Z

Thanks to everyone who keeps testing this PR. 5badabf is the last low-hanging fruit whose temptation I couldn't resist. Hope this small patch doesn't break anything 🤞

A talk at LPC 2020 introduced bpf_redirect_peer, which allows ingress-to-ingress redirection without going through the CPU's backlog queue. Cilium sees a +1.3 Gbit/s performance boost by using it.

douglarek · 2024-03-07T02:11:07Z

After binding docker0 to the LAN and testing 5badabf, everything works perfectly. There are no issues with direct connection diversion. Well done.

A workstation: Linux Manjaro 6.7.7-1-MANJARO #1 SMP PREEMPT_DYNAMIC Fri Mar  1 18:26:06 UTC 2024 x86_64 GNU/Linux

amtoaer · 2024-03-07T03:49:39Z

Tested successfully with the latest CI build in the following environments:

Linux GracPC 6.7.5-zen1-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Sat, 17 Feb 2024 14:02:21 +0000 x86_64 GNU/Linux

Linux NAS 6.7.4-arch1-1 #1 SMP PREEMPT_DYNAMIC Mon, 05 Feb 2024 22:07:49 +0000 x86_64 GNU/Linux

jschwinger233 · 2024-03-07T18:37:30Z

Benchmark (lan only)

1. Env: Linux 6.6.17 KVM, 4 cores, 12G memory.

2. Setup

Run two docker containers: one runs dae, the other runs v2ray. The setup is almost the same as dae's GitHub Actions test; just treat the two containers as two nodes.

I am using sockperf:

Run sockperf server on the v2ray side: (for UDP test, delete --tcp)

nsenter -t $(pidof v2ray) -n sockperf server -i 172.18.0.3 --tcp --daemonize

Run sockperf client inside the "pod" to emulate lan proxy: (for UDP test, delete --tcp)

nsenter -t $(pidof pod) -n sockperf ping-pong -i 172.18.0.3 --tcp --time 10

3. TCP

dae-0.4.0: avg-latency=37.310 (std-dev=7.352)
this pr: avg-latency=36.792 (std-dev=7.437)

avg-latency improves by about 1.4%.

The gain looks modest because the testing environment is clean and free of netfilter rules.

After adding a simple iptables rule on the dae node:

iptables -t raw -A PREROUTING -p tcp -m tcp --dport 11111 -j ACCEPT

dae-0.4.0 performs worse, with avg-latency sometimes going as high as 38+, while dae-next (this PR) is not affected at all because it bypasses the stack. In that case the improvement is 3.1%.

4. UDP

The normal UDP test result is:

dae-0.4.0: avg-latency=58.275 (std-dev=50.721)
dae-next: avg-latency=55.927 (std-dev=48.332)

4% boost.

However, dae-0.4.0 is also known to fall back to encapsulation to avoid a port conflict when another process is already listening on port 53, which hurts performance badly. When that fallback takes place, dae-0.4.0's avg-latency rises to 60.412 (std-dev=47.764), and dae-next is 7%+ better.
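For reference, the percentages in this benchmark can be reproduced from the reported averages. The helper below is a small sketch, assuming the gain is measured as the latency reduction relative to dae-0.4.0; `latency_gain_pct` is a hypothetical name, not part of dae or sockperf.

```c
#include <assert.h>

/*
 * Relative latency reduction in percent:
 * (baseline - candidate) / baseline * 100.
 */
static inline double latency_gain_pct(double baseline, double candidate)
{
    assert(baseline > 0.0); /* a zero baseline would divide by zero */
    return (baseline - candidate) / baseline * 100.0;
}
```

For example, `latency_gain_pct(58.275, 55.927)` evaluates to roughly 4.0 for the normal UDP run, and `latency_gain_pct(60.412, 55.927)` to roughly 7.4 for the encapsulation fallback case.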

sumire88

Thanks for your groundbreaking work!

dae-prow

🧪 Since the PR has been fully tested, please consider merging it.

mzz2017

Brilliant code!

dae-prow bot assigned jschwinger233 Mar 2, 2024

jschwinger233 requested a review from mzz2017 March 2, 2024 09:36

jschwinger233 changed the title ~~perf: Improve hijack datapath performance~~ perf(bpf): Improve hijack datapath performance Mar 2, 2024

jschwinger233 changed the title ~~perf(bpf): Improve hijack datapath performance~~ perf(bpf): improve hijack datapath performance Mar 2, 2024

jschwinger233 changed the title ~~perf(bpf): improve hijack datapath performance~~ patch/optimize(bpf): improve hijack datapath performance Mar 2, 2024

jschwinger233 marked this pull request as ready for review March 2, 2024 09:42

jschwinger233 requested a review from a team as a code owner March 2, 2024 09:42

sumire88 added patch optimize labels Mar 2, 2024

sumire88 requested a review from a team March 2, 2024 10:24

sumire88 added the not-yet-tested label Mar 2, 2024

wanlce added tested and removed not-yet-tested labels Mar 2, 2024

dae-prow bot previously approved these changes Mar 2, 2024

jschwinger233 dismissed dae-prow[bot]’s stale review via d6d4408 March 3, 2024 03:21

jschwinger233 force-pushed the pr/gray/datapath-perf branch from a29b998 to d12323b Compare March 3, 2024 09:15

jschwinger233 force-pushed the pr/gray/datapath-perf branch from 8c54f88 to 769d2a4 Compare March 4, 2024 06:46

wanlce removed the tested label Mar 4, 2024

bpf: use bpf_redirect_peer for lan_ingress!!!

5badabf

jschwinger233 requested a review from a team as a code owner March 6, 2024 17:45

jschwinger233 force-pushed the pr/gray/datapath-perf branch 2 times, most recently from 9db0e34 to 026f23b Compare March 6, 2024 17:54

ci: update lvh-images

8cc3e8a

Because apt.k8s.io no longer exists: https://kubernetes.io/blog/2023/08/31/legacy-package-repository-deprecation/

jschwinger233 force-pushed the pr/gray/datapath-perf branch from 026f23b to 8cc3e8a Compare March 6, 2024 18:11

jschwinger233 changed the title ~~patch/optimize(bpf): improve hijack datapath performance~~ patch/optimize(bpf): improve lan hijack datapath performance Mar 7, 2024

sumire88 approved these changes Mar 8, 2024

sumire88 added the tested label Mar 8, 2024

dae-prow bot approved these changes Mar 8, 2024

mzz2017 approved these changes Mar 8, 2024

akiooo45 approved these changes Mar 8, 2024

jschwinger233 merged commit 49f576e into main Mar 8, 2024
30 checks passed

jschwinger233 deleted the pr/gray/datapath-perf branch March 8, 2024 15:28

dae-prow bot mentioned this pull request Mar 8, 2024

chore(sync): keep upstream source up-to-date daeuniverse/dae-wing#142

Merged

LostAttractor mentioned this pull request Mar 9, 2024

[Bug Report] pname(xxx) -> must_rules 无论是否匹配都会触发 #474

Open

3 tasks

dae-prow bot mentioned this pull request Apr 2, 2024

[Release Changelogs] v0.6.0rc1 #490

Closed

dae-prow bot mentioned this pull request Apr 15, 2024

[Release Changelogs] v0.6.0rc2 #501

Closed

dae-prow bot mentioned this pull request Jun 11, 2024

[Release Changelogs] v0.6.0 #534

Closed
