
does OpenOnload + AF_XDP + Intel NIC support nginx multiple workers ? #70

ligong234 opened this issue Mar 24, 2022 · 24 comments

@ligong234

Hello Onload Team,

I am following the instructions in https://www.xilinx.com/publications/onload/sf-onload-nginx-proxy-cookbook.pdf and managed to make OpenOnload work on CentOS Linux release 7.6.1810 (Core) with the CentOS 8.4 kernel 4.18.0-305.12.1. All onload drivers were loaded successfully, and I can register the Intel XXV710 NIC with onload. I want to run an nginx proxy benchmark, starting with four worker processes; from another machine I use wrk to generate the HTTP requests. I noticed that only one nginx process is handling requests while the others are all idle. If I kill this busy nginx, a new nginx process is forked and starts handling HTTP requests, but the other three nginx workers stay idle. So the question is: does OpenOnload + AF_XDP + Intel NIC support multiple nginx workers?

my environment setup is as below:

[root@localhost openonload]# cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)

[root@localhost openonload]# uname -a
Linux localhost 4.18.0-305.12.1.el7.centos.x86_64 #1 SMP Wed Aug 25 14:27:38 CST 2021 x86_64 x86_64 x86_64 GNU/Linux

[root@localhost openonload]# ethtool -i eth1
driver: i40e
version: 4.18.0-305.12.1.el7.centos.x86_
firmware-version: 6.01 0x8000354e 1.1747.0
expansion-rom-version:
bus-info: 0000:5e:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

[root@localhost openonload]# lspci -s 5e:00.1 -v
5e:00.1 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02)
Subsystem: Intel Corporation Ethernet Network Adapter XXV710
Physical Slot: 3
Flags: bus master, fast devsel, latency 0, IRQ 657, NUMA node 0
Memory at c3000000 (64-bit, prefetchable) [size=16M]
Memory at c5800000 (64-bit, prefetchable) [size=32K]
Expansion ROM at c5e00000 [disabled] [size=512K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
Capabilities: [a0] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Device Serial Number 78-d4-1c-ff-ff-b7-a6-40
Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
Capabilities: [1a0] Transaction Processing Hints
Capabilities: [1b0] Access Control Services
Kernel driver in use: i40e
Kernel modules: i40e

[root@localhost openonload]# rpm -qi nginx
Name : nginx
Epoch : 1
Version : 1.16.0
Release : 1.el7.ngx
Architecture: x86_64
Install Date: Tue 04 Jan 2022 04:04:30 PM CST
Group : System Environment/Daemons
Size : 2811760
License : 2-clause BSD-like license
Signature : RSA/SHA1, Tue 23 Apr 2019 11:13:55 PM CST, Key ID abf5bd827bd9bf62
Source RPM : nginx-1.16.0-1.el7.ngx.src.rpm
Build Date : Tue 23 Apr 2019 10:36:28 PM CST
Build Host : centos74-amd64-builder-builder.gnt.nginx.com
Relocations : (not relocatable)
Vendor : Nginx, Inc.
URL : http://nginx.org/
Summary : High performance web server
Description :
nginx [engine x] is an HTTP and reverse proxy server, as well as
a mail proxy server.

[root@localhost openonload]# cat /usr/libexec/onload/profiles/latency-af-xdp.opf
onload_set EF_POLL_USEC 100000
onload_set EF_AF_XDP_ZEROCOPY 0
onload_set EF_TCP_SYNRECV_MAX 8192
onload_set EF_MAX_ENDPOINTS 8192
onload_set EF_TCP_FASTSTART_INIT 0
onload_set EF_TCP_FASTSTART_IDLE 0

[root@localhost openonload]# cat /etc/nginx/nginx-proxy-node0-4-worker.conf
user root root;
worker_processes 4;
worker_rlimit_nofile 8388608;
worker_cpu_affinity 01 010 0100 01000 ;
pid /var/run/nginx-node0_4.pid;
events {
    multi_accept off;
    accept_mutex off;
    use epoll;
    worker_connections 200000;
}
error_log /var/log/error-node0_4.log debug;
http {
    default_type application/octet-stream;
    access_log off;
    error_log /dev/null crit;
    sendfile on;
    proxy_buffering off;
    keepalive_timeout 300s;
    keepalive_requests 1000000;
    server {
        listen 10.19.1.43:80 reuseport;
        listen 10.19.1.43:81 reuseport;
        server_name localhost;
        error_page 500 502 503 504 /50x.html;
        location = /50x.html {
            root html;
        }
        location / {
            proxy_pass http://backend;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }
    upstream backend {
        server 10.96.10.21:80 ;
        keepalive 500;
    }
}

steps to reproduce the problem:

  1. load the onload driver
    [root@localhost openonload]# onload_tool reload
    onload_tool: /sbin/modprobe sfc
    onload_tool: /sbin/modprobe onload

  2. register the NIC

[root@localhost openonload]# ethtool -K eth1 ntuple on
[root@localhost openonload]# ethtool -k eth1 | grep ntuple
[root@localhost openonload]# echo eth1 > /sys/module/sfc_resource/afxdp/register

  3. start nginx with onload

[root@localhost openonload]# /bin/onload -p latency-af-xdp /sbin/nginx -c /etc/nginx/nginx-proxy-node0-4-worker.conf
oo:nginx[32964]: Using Onload 20211221 [7]
oo:nginx[32964]: Copyright 2019-2021 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
oo:nginx[33000]: onload_setrlimit64: RLIMIT_NOFILE: hard limit requested 8388608, but set to 655360
oo:nginx[33000]: onload_setrlimit64: RLIMIT_NOFILE: soft limit requested 8388608, but set to 655360
oo:nginx[33002]: onload_setrlimit64: RLIMIT_NOFILE: hard limit requested 8388608, but set to 655360
oo:nginx[33002]: onload_setrlimit64: RLIMIT_NOFILE: soft limit requested 8388608, but set to 655360
oo:nginx[33004]: onload_setrlimit64: RLIMIT_NOFILE: hard limit requested 8388608, but set to 655360
oo:nginx[33004]: onload_setrlimit64: RLIMIT_NOFILE: soft limit requested 8388608, but set to 655360
oo:nginx[33007]: onload_setrlimit64: RLIMIT_NOFILE: hard limit requested 8388608, but set to 655360
oo:nginx[33007]: onload_setrlimit64: RLIMIT_NOFILE: soft limit requested 8388608, but set to 655360

[root@localhost openonload]# ps -ef | grep nginx
root 32999 1 0 11:28 ? 00:00:00 nginx: master process /sbin/nginx -c /etc/nginx/nginx-proxy-node0-4-worker.conf
root 33000 32999 1 11:28 ? 00:00:00 nginx: worker process
root 33002 32999 2 11:28 ? 00:00:00 nginx: worker process
root 33004 32999 2 11:28 ? 00:00:00 nginx: worker process
root 33007 32999 1 11:28 ? 00:00:00 nginx: worker process
root 33013 55380 0 11:28 pts/1 00:00:00 grep --color=auto nginx

[root@localhost openonload]# oo:nginx[33007]: Using Onload 20211221 [0]
oo:nginx[33007]: Copyright 2019-2021 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks

  4. start wrk on another machine

[root@localhost benchmark]# wrk -c 3200 -d 60 -t 32 --latency http://10.19.1.43/1kb.bin

  5. top shows that only one nginx process is busy

[root@localhost openonload]# top
top - 11:19:29 up 1 day, 19:13, 3 users, load average: 1.44, 1.31, 1.25
Tasks: 771 total, 2 running, 389 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.6 us, 0.8 sy, 0.0 ni, 98.1 id, 0.0 wa, 0.0 hi, 0.4 si, 0.0 st
KiB Mem : 26350249+total, 22556315+free, 27453884 used, 10485452 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 23326888+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
33007 root 20 0 335736 270784 151832 R 81.7 0.1 0:23.72 nginx <================
27999 root 0 -20 0 0 0 I 11.6 0.0 0:03.19 kworker/u132:0-
14505 root 20 0 90520 5144 4276 S 0.3 0.0 0:17.66 rngd
28034 root 20 0 162716 5208 3804 R 0.3 0.0 0:00.12 top

Can someone point out what is wrong with my setup, or share some information regarding AF_XDP + XXV710? Any help is appreciated.

Best Regards
Ligong

@maciejj-xilinx
Contributor

maciejj-xilinx commented Mar 24, 2022

Hello Ligong,

Thanks for your interest and the detailed info.
The whitepaper is based on using a Solarflare NIC, which gives a fairly smooth experience.
Also note that the nginx-proxy profile given in the whitepaper had quite a few specific options enabled.

With your set-up, do you use two NICs? One for upstream and one for downstream?
And eth1 - the NIC you are accelerating with onload - is it upstream or downstream (or both)?

Accelerating upstream and downstream are somewhat separate problems, and it is best to tackle them one by one.
It might take a few steps to get where you want to be.

Firstly, on the downstream side, RSS support is required to be able to receive traffic into multiple stacks.
We have not tried using RSS on non-Solarflare NICs.
And i40e ntuple filtering has a limitation - no RSS.
The workaround is to not use the ntuple filter but rely on the existing kernel MAC filter with RSS enabled, as sketched below.

Before jumping to RSS, though, I'd advise tuning nginx in single-worker mode to see whether you get appropriate performance.
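A rough sketch of the ethtool side of that workaround (assuming eth1 and four planned Onload instances; the Onload-side flow-filter switch is covered later in this thread):

# do not install ntuple filters; rely on the kernel MAC filter plus RSS for RX spreading
ethtool -K eth1 ntuple off
# check how many RX channels (queues) the NIC currently exposes
ethtool -l eth1
# optionally adjust the combined channel count, e.g. one channel per planned Onload stack
ethtool -L eth1 combined 4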

Best Regards,
Maciej

@shirshen12

shirshen12 commented Mar 24, 2022

Hi @maciejj-xilinx, this explains the problem we see with memcached in multi-threaded mode on Mellanox NICs as well. When memcached is offloaded via Onload-on-AF_XDP, we see only one thread processing traffic while all the others stay idle. On deeper inspection, ethtool -S <ifname> | grep xdp_redirect shows the xdp_redirect counters increasing on only one queue; the rest are all zero.

Is there any patch you can apply quickly to use RSS on non-SFC NICs, or can you give instructions for using kernel MAC filters with RSS enabled?

@ligong234
Author

(quoting @maciejj-xilinx's reply above)

Hi Maciej,

Thanks for your quick reply.
Basically we want to get an Onload number quickly. My setup is pretty simple: there are two machines, both equipped with
two Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz processors and 256 GB of RAM. Machine one acts as the Onload nginx proxy, and machine two acts
as the nginx origin and runs wrk. The nginx proxy machine has an Intel Corporation Ethernet Controller XXV710 for
25GbE SFP28 (rev 02) dual-port adapter, and the nginx origin machine has a Mellanox Technologies Stand-up ConnectX-4 Lx EN
25GbE dual-port SFP28 adapter. Each pair of adapter ports is connected through a dedicated network switch. Both machines
run CentOS 7.6 with Linux kernel 4.18.0-305.12.1 and the latest version of nginx.

 +-----------------------+                                     +-----------------------+
 |                       |-eth0  ------- switch 1 ------- eth0-| nginx origin (CPU 0~4)|
 |  onload nginx proxy   |                                     |                       |
 |       (CPU 0~4)       |-eth1 -------- switch 2 ------- eth1-| wrk (CPU 32~63)       |
 +-----------------------+                                     +-----------------------+
         machine one                                                  machine two

For this setup the Onload nginx proxy uses two XXV710 NIC ports: eth1 is the nginx proxy downstream and eth0 is the upstream. wrk runs on
CPUs 32~63 so it can generate more requests, both nginx instances run on CPUs 0~4, and Onload runs on machine one only.

One thing to mention: before this test I also ran a single-worker nginx test, and the Onload number is pretty good, outperforming
the Linux kernel by a lot.

For this setup I ran two tests: one where Onload accelerates both eth0 and eth1, and one where it accelerates eth1 only. Both tests show that
only one Onload-accelerated nginx worker is busy (its CPU usage in top is high), so now I am focusing on how to make the nginx downstream spread traffic
to multiple workers.

For the i40e NIC workaround you mentioned, relying on the kernel MAC filter with RSS enabled, what configure option or
Linux command can I use to activate it?

I have taken a look at the Onload AF_XDP support code (point it out if I am wrong): on the kernel side, efhw/af_xdp.c initializes the NIC hardware,
loads an XDP program and attaches it to the NIC; af_xdp_init creates an AF_XDP socket on behalf of the Onload-accelerated process, registers
the UMEM and rings, and grabs their kernel mapping addresses; then the user-space part, libonload.so, mmaps the rings into user space, so both the
kernel module and the user-space process can operate on the AF_XDP rings. The AF_XDP socket binds to one NIC and one queue id, and relies
on NIC hardware filters to redirect ingress traffic to that queue.

The Linux kernel Documentation/networking/af_xdp.rst mentions that the ring structures are single-consumer/single-producer.
For the nginx single-worker case, I believe Onload takes care of this and allows either user space or the kernel to touch the rings
without interfering with each other. The problem arises when nginx forks multiple worker processes: then there are multiple copies of
the rings. I want to know how Onload coordinates the concurrent ring access without breaking the Linux kernel's
single-consumer/single-producer ring assumption. Also, from a hardware point of view, only one NIC queue is utilized - will this become
a bottleneck? And how can we utilize multiple NIC queues, or do we not need to care about that?
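(A quick way to see from the host whether only one NIC queue is actually receiving traffic - a sketch assuming eth1; per-queue statistic names vary between drivers, so the grep pattern is an assumption that may need adjusting:)

# watch per-queue receive counters while wrk is running; on i40e the per-queue stats
# usually look like rx-<N>.packets, but the exact names differ between drivers
watch -n 1 "ethtool -S eth1 | grep -E 'rx-[0-9]+\.packets'"
# the XDP redirect counters give a similar per-queue view
ethtool -S eth1 | grep xdp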

Best Regards,
Ligong

@shirshen12

Hi @maciejj-xilinx

Can you please respond to this question:

from a hardware point of view, only one NIC queue is utilized - will this become a bottleneck? And how can we utilize multiple NIC queues, or do we not need to care about that?

and

to this question:

for the kernel MAC filter with RSS enabled, what configure option or Linux command can I use to activate it?

@shirshen12

@ligong234 I think @maciejj-xilinx may be talking about this:
#28 (comment)

@shirshen12

shirshen12 commented Mar 29, 2022

Well, it looks like echo 0 > /sys/module/sfc_resource/parameters/enable_af_xdp_flow_filters does not add filters, but ethtool -S ens5 | grep xdp shows that the redirect counter is still being triggered on only one queue.

ethtool -S ens5 | grep xdp
[screenshot of the ethtool -S ens5 | grep xdp output]

@maciejj-xilinx
Contributor

I have given this some thought but was not able to put together all the steps and test it; this might take a bit of time.
Worth noting that giving a smoother experience would take some code changes. In the meantime we can try out how feasible this would be.

@shirshen12, is this with a single Onload stack? To capture traffic on multiple queues, multiple stacks are needed.

This is a simple test on the rx side using RSS, with enable_af_xdp_flow_filters=0:

echo 0 > /sys/module/sfc_resource/parameters/enable_af_xdp_flow_filters
sudo env PATH="$PATH" EF_IRQ_CHANNEL=0 onload /home/maciejj/bin/ext-simple/interactive -c 'socket;setsockopt SOL_SOCKET SO_REUSEADDR 1; setsockopt SOL_SOCKET SO_REUSEPORT 1; bind 12345;listen;' &
sudo env PATH="$PATH" EF_IRQ_CHANNEL=1 onload /home/maciejj/bin/ext-simple/interactive -c 'socket;setsockopt SOL_SOCKET SO_REUSEADDR 1; setsockopt SOL_SOCKET SO_REUSEPORT 1; bind 12345;listen;' &
EF_IRQ_CHANNEL=2 ...

It is best to start as many instances as your device has RX channels configured - not less, not more - this is crucial to cover the entire traffic. The number of channels can be adjusted, e.g. ethtool -L <ifname> combined 2 for just two channels.
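(A small sketch of matching the instance count to the channel count, assuming eth1; the awk expression just picks the current "Combined" value out of the ethtool -l output:)

# ethtool -l prints the maximum first and the current setting second; keep the last match
CHANNELS=$(ethtool -l eth1 | awk '/Combined:/ {v=$2} END {print v}')
echo "start ${CHANNELS} Onload instances, with EF_IRQ_CHANNEL set to 0..$((CHANNELS-1))"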

I would expect this to work for @ligong234 with the Intel NIC; with Mellanox ... probably - it is easy to check.

The test is to make several connection attempts from the peer host and see whether both onload stacks show some traffic.

for((i=0;i<16;++i)); do wget host:12345 & done

In my case each stack accepted about half of the connections:

onload_stackdump lots | grep listen2synrecv
listen2synrecv: 2
listen2synrecv: 2

@ligong234
Author

@maciejj-xilinx, @shirshen12, thank you both for sharing the valuable information. I will give it a try and post the result shortly.
Ligong

@ligong234
Author

Unfortunately, my setup does not have the parameter "enable_af_xdp_flow_filters", and the onload source code does not contain this string. My setup is derived from "2021-12-15 [Jamal Mulla] ON-13728: Fixes the missing CTPIO ptr issue (#856)". @maciejj-xilinx, which onload version are you running?

[root@localhost ~]# lsmod | grep onload
onload                827392  4 
sfc_char              118784  1 onload
sfc_resource          192512  2 onload,sfc_char
[root@localhost ~]#
[root@localhost ~]# ll /sys/module/sfc_resource/parameters/
total 0
-rw-r--r-- 1 root root 4096 Mar 30 10:40 enable_accel_by_default
-rw-r--r-- 1 root root 4096 Mar 30 10:40 enable_driverlink
-r--r--r-- 1 root root 4096 Mar 30 10:40 force_ev_timer
-r--r--r-- 1 root root 4096 Mar 30 10:40 pio
[root@localhost ~]#
[root@localhost ~]# find /sys/module/ -name enable_af_xdp_flow_filters
[root@localhost ~]# 

[root@localhost onload]# grep -rn enable_af_xdp_flow_filters .
[root@localhost onload]#

@shirshen12

Have you registered your NIC with AF_XDP, @ligong234?

@ligong234
Author

@shirshen12 Yes, I have.

[root@localhost openonload]# onload_tool reload
onload_tool: /sbin/modprobe sfc
onload_tool: /sbin/modprobe onload
[root@localhost openonload]#

[root@localhost openonload]# cat do-register-nic 
dmesg -c

ethtool -K eth0 ntuple on
ethtool -K eth1 ntuple on
ethtool -k eth0 | grep ntuple
ethtool -k eth1 | grep ntuple

#echo eth0 > /sys/module/sfc_resource/afxdp/register
echo eth1 > /sys/module/sfc_resource/afxdp/register

dmesg -c

[root@localhost openonload]# . do-register-nic 
[ 3554.208521] Efx driverlink unregistering resource driver
[ 3554.224503] Solarflare driverlink driver unloading
[ 3564.259164] Solarflare driverlink driver v5.3.12.1008 API v33.0
[ 3564.263516] Solarflare NET driver v5.3.12.1008
[ 3564.286037] Efx driverlink registering resource driver
[ 3564.327892] [onload] Onload 20211221
[ 3564.327915] [onload] Copyright 2019-2021 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
[ 3564.432495] onload_cp_server[57195]: Spawned daemon process 57225
ntuple-filters: on
ntuple-filters: on
[ 3578.875681] [sfc efrm] efrm_nondl_register_device: register eth1
[ 3578.875889] [sfc efrm] eth1 type=4:
[ 3579.103249] irq 749: Affinity broken due to vector space exhaustion.
[ 3579.103262] irq 750: Affinity broken due to vector space exhaustion.
[ 3579.103276] irq 751: Affinity broken due to vector space exhaustion.
[ 3579.103289] irq 752: Affinity broken due to vector space exhaustion.
[ 3579.103302] irq 753: Affinity broken due to vector space exhaustion.
[ 3579.103314] irq 754: Affinity broken due to vector space exhaustion.
[ 3579.103326] irq 755: Affinity broken due to vector space exhaustion.
[ 3579.103339] irq 756: Affinity broken due to vector space exhaustion.
[ 3579.103352] irq 757: Affinity broken due to vector space exhaustion.
[ 3579.103364] irq 758: Affinity broken due to vector space exhaustion.
[ 3579.103377] irq 759: Affinity broken due to vector space exhaustion.
[ 3579.103389] irq 760: Affinity broken due to vector space exhaustion.
[ 3579.103402] irq 761: Affinity broken due to vector space exhaustion.
[ 3579.103415] irq 762: Affinity broken due to vector space exhaustion.
[ 3579.103428] irq 763: Affinity broken due to vector space exhaustion.
[ 3579.103441] irq 764: Affinity broken due to vector space exhaustion.
[ 3579.103633] irq 781: Affinity broken due to vector space exhaustion.
[ 3579.103644] irq 782: Affinity broken due to vector space exhaustion.
[ 3579.103657] irq 783: Affinity broken due to vector space exhaustion.
[ 3579.103670] irq 784: Affinity broken due to vector space exhaustion.
[ 3579.103683] irq 785: Affinity broken due to vector space exhaustion.
[ 3579.103696] irq 786: Affinity broken due to vector space exhaustion.
[ 3579.103708] irq 787: Affinity broken due to vector space exhaustion.
[ 3579.103721] irq 788: Affinity broken due to vector space exhaustion.
[ 3579.103733] irq 789: Affinity broken due to vector space exhaustion.
[ 3579.103748] irq 790: Affinity broken due to vector space exhaustion.
[ 3579.103761] irq 791: Affinity broken due to vector space exhaustion.
[ 3579.103775] irq 792: Affinity broken due to vector space exhaustion.
[ 3579.103786] irq 793: Affinity broken due to vector space exhaustion.
[ 3579.103800] irq 794: Affinity broken due to vector space exhaustion.
[ 3579.103814] irq 795: Affinity broken due to vector space exhaustion.
[ 3579.103826] irq 796: Affinity broken due to vector space exhaustion.
[ 3579.104432] [sfc efrm] eth1 index=0 ifindex=3
[ 3579.104438] [onload] oo_nic_add: ifindex=3 oo_index=0
[root@localhost openonload]# 
[root@localhost openonload]# ll /sys/module/sfc_resource/parameters/ 
total 0
-rw-r--r-- 1 root root 4096 Mar 30 11:29 enable_accel_by_default
-rw-r--r-- 1 root root 4096 Mar 30 11:29 enable_driverlink
-r--r--r-- 1 root root 4096 Mar 30 11:29 force_ev_timer
-r--r--r-- 1 root root 4096 Mar 30 11:29 pio
[root@localhost openonload]# 

@shirshen12

shirshen12 commented Mar 30, 2022

Is it an Intel NIC or a Mellanox NIC? If it's an Intel NIC, you need to enable Flow Director. I can give you exact instructions for your NIC make.

@ligong234
Author

@shirshen12 it is Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02)

@shirshen12

shirshen12 commented Mar 30, 2022

Please see the instructions below and follow them as is. They are for Ubuntu 21.04 LTS and cover both the ixgbe and i40e drivers.

Drivers: ixgbe, i40e
OS: Ubuntu 21.04 LTS

upgrade to latest OS kernel

apt update -y
apt upgrade -y
apt full-upgrade -y

reboot into new kernel
reboot

Install dependencies

apt install build-essential net-tools unzip libcap-dev linux-tools-common linux-tools-generic netperf libevent-dev libnl-route-3-dev tk bison tcl libnl-3-dev flex libnl-route-3-200 dracut python2 libpcap-dev -y
apt install initramfs-tools -y

build the intel driver, ixgbe

wget https://downloadmirror.intel.com/682680/ixgbe-5.13.4.tar.gz
tar zxf ixgbe-5.13.4.tar.gz
cd ixgbe-5.13.4/src/
make install

build the intel driver, i40e

wget http://downloadmirror.intel.com/709707/i40e-2.17.15.tar.gz
tar zxf i40e-2.17.15.tar.gz
cd i40e-2.17.15/src/
make install

The binary will be installed as:
/lib/modules/<KERNEL VER>/updates/drivers/net/ethernet/intel/ixgbe/ixgbe.ko

Load the ixgbe module using the modprobe command.
rmmod ixgbe; modprobe ixgbe

Load the i40e module using the modprobe command.
rmmod i40e; modprobe i40e

update the initrd/initramfs file to prevent the OS loading old versions of the ixgbe driver.
update-initramfs -u

reboot again, just for safety
reboot

Install Onload:

git clone https://github.com/Xilinx-CNS/onload.git
cd onload
scripts/onload_mkdist --release
cd onload-<version>/scripts/
./onload_install
./onload_tool reload

register the NIC with the AF_XDP driver interface
echo enp1s0 > /sys/module/sfc_resource/afxdp/register

turn on Intel Flow Director
ethtool --features enp1s0 ntuple on

You are set!

@ligong234
Author

@shirshen12 Thanks for sharing the detailed instructions; that is exactly what I did. One small difference is that my system is CentOS and uses the in-tree kernel i40e driver, and the single-worker Onload nginx works fine. The problem I have right now is that my onload resource driver does not have the "enable_af_xdp_flow_filters" parameter that maciejj-xilinx pointed out, which prevents me from running multiple instances of Onload nginx.

This is a simple test on the rx side using RSS, with enable_af_xdp_flow_filters=0:

echo 0 > /sys/module/sfc_resource/parameters/enable_af_xdp_flow_filters
sudo env PATH="$PATH" EF_IRQ_CHANNEL=0 onload /home/maciejj/bin/ext-simple/interactive -c 'socket;setsockopt SOL_SOCKET SO_REUSEADDR 1; setsockopt SOL_SOCKET SO_REUSEPORT 1; bind 12345;listen;' &
sudo env PATH="$PATH" EF_IRQ_CHANNEL=1 onload /home/maciejj/bin/ext-simple/interactive -c 'socket;setsockopt SOL_SOCKET SO_REUSEADDR 1; setsockopt SOL_SOCKET SO_REUSEPORT 1; bind 12345;listen;' &

My environment and instructions are as below:

[root@localhost openonload]# ethtool -i eth1
driver: i40e
version: 4.18.0-305.12.1.el7.centos.x86_
firmware-version: 6.01 0x8000354e 1.1747.0
expansion-rom-version: 
bus-info: 0000:5e:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

[root@localhost openonload]# lspci -s 0000:5e:00.1 -v
5e:00.1 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02)
        Subsystem: Intel Corporation Ethernet Network Adapter XXV710
        Physical Slot: 3
        Flags: bus master, fast devsel, latency 0, IRQ 582, NUMA node 0
        Memory at c3000000 (64-bit, prefetchable) [size=16M]
        Memory at c5800000 (64-bit, prefetchable) [size=32K]
        Expansion ROM at c5e00000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 78-d4-1c-ff-ff-b7-a6-40
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [1a0] Transaction Processing Hints
        Capabilities: [1b0] Access Control Services
        Kernel driver in use: i40e
        Kernel modules: i40e

[root@localhost openonload]# cat /etc/redhat-release 
CentOS Linux release 7.6.1810 (Core) 

[root@localhost openonload]# uname -a
Linux localhost 4.18.0-305.12.1.el7.centos.x86_64 #1 SMP Wed Aug 25 14:27:38 CST 2021 x86_64 x86_64 x86_64 GNU/Linux

[root@localhost openonload]# modinfo i40e
filename:       /lib/modules/4.18.0-305.12.1.el7.centos.x86_64/kernel/drivers/net/ethernet/intel/i40e/i40e.ko.xz
version:        4.18.0-305.12.1.el7.centos.x86_64
license:        GPL v2
description:    Intel(R) Ethernet Connection XL710 Network Driver
author:         Intel Corporation, <[email protected]>
rhelversion:    8.4
srcversion:     78E81CDBAAC80E980F550F5
alias:          pci:v00008086d0000158Bsv*sd*bc*sc*i*
...
alias:          pci:v00008086d00001572sv*sd*bc*sc*i*
depends:        
intree:         Y
name:           i40e
vermagic:       4.18.0-305.12.1.el7.centos.x86_64 SMP mod_unload modversions 
parm:           debug:Debug level (0=none,...,16=all), Debug mask (0x8XXXXXXX) (uint)
[root@localhost openonload]# 

[root@localhost openonload]# ethtool -K eth1 ntuple on
[root@localhost openonload]# onload_tool reload
onload_tool: /sbin/modprobe sfc
onload_tool: /sbin/modprobe onload
[root@localhost openonload]# echo eth1 > /sys/module/sfc_resource/afxdp/register
[root@localhost openonload]# ls -l  /sys/module/sfc_resource/parameters/enable_af_xdp_flow_filters
ls: cannot access /sys/module/sfc_resource/parameters/enable_af_xdp_flow_filters: No such file or directory
[root@localhost openonload]#

@shirshen12

Can you move to Red Hat 8 or CentOS 8? I know you are on 4.18+, but it looks like the eBPF VM is somehow not fully baked into CentOS 7.6 (it was in preview mode last I knew, not production grade).
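(A quick way to check whether the running kernel was built with the relevant eBPF/AF_XDP support - a sketch assuming the usual CentOS location of the kernel config file:)

# both options should be =y for Onload's AF_XDP mode to have something to work with
grep -E 'CONFIG_BPF_SYSCALL|CONFIG_XDP_SOCKETS' /boot/config-$(uname -r)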

@abower-amd
Collaborator

Hi @ligong234,

The problem I have right now is my onload resource driver does not have the "enable_af_xdp_flow_filters" parameter

You need to update your source tree. This feature got added with b8ba4e2 on 28 Feb 2022.
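(A rough sketch of picking that change up, reusing the build and install steps from earlier in this thread:)

cd onload
git checkout master
git pull
# rebuild and reinstall with the same onload_mkdist / onload_install steps shown above,
# then reload the drivers
onload_tool reload
# the new module parameter should now be present
ls /sys/module/sfc_resource/parameters/enable_af_xdp_flow_filters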

Andy

@ligong234
Author

@abower-xilinx Thank you, I will give it a try.

@shirshen12

Did it work for you, @ligong234? I thought you were using the latest master branch of onload.

@ligong234
Author

@shirshen12 I have not tried the latest master branch yet; my test is based on the Onload 2021-12-15 commit. I will try the latest master and report the result.

@shirshen12

(quoting @maciejj-xilinx's earlier comment about the enable_af_xdp_flow_filters=0 RSS test)

Hi @maciejj-xilinx, it's for multithreaded memcached: memcached -t 4

@ligong234
Author

Update: today I tried the latest Onload master branch and hit another error. Multiple nginx instances start successfully, but when I start wrk, nginx complains that it failed to allocate a stack, and dmesg shows it is out of VI instances; the error code is -EBUSY.

oo:nginx[21345]: netif_tcp_helper_alloc_u: ERROR: Failed to allocate stack (rc=-16)
[ 1582.957667] [sfc efrm] efrm_vi_alloc: Out of VI instances with given attributes (-16)

# cat /usr/include/asm-generic/errno-base.h
...
#define EBUSY           16      /* Device or resource busy */

Steps to reproduce: I turned off XDP zerocopy; when it is on, onload fails to allocate the UMEM:

i40e 0000:5e:00.1: Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring 1
i40e 0000:5e:00.1: Failed to allocate some buffers on UMEM enabled Rx ring 1 (pf_q 66)

[root@localhost benchmark]# cat /usr/libexec/onload/profiles/latency-af-xdp.opf 
# SPDX-License-Identifier: BSD-2-Clause
# X-SPDX-Copyright-Text: (c) Copyright 2010-2019 Xilinx, Inc.

# OpenOnload low latency profile.

# Enable polling / spinning.  When the application makes a blocking call
# such as recv() or poll(), this causes Onload to busy wait for up to 100ms
# before blocking.
#
onload_set EF_POLL_USEC 100000

# enable AF_XDP for Onload
#onload_set EF_AF_XDP_ZEROCOPY 1
onload_set EF_AF_XDP_ZEROCOPY 0
onload_set EF_TCP_SYNRECV_MAX 8192
onload_set EF_MAX_ENDPOINTS 8192


# Disable FASTSTART when connection is new or has been idle for a while.
# The additional acks it causes add latency on the receive path.
onload_set EF_TCP_FASTSTART_INIT 0
onload_set EF_TCP_FASTSTART_IDLE 0

[root@localhost benchmark]# cat start-nginx-proxy-onload-workaround.sh 
#!/bin/bash
sysctl -w net.ipv4.ip_local_port_range='9000 65000';
sysctl -w vm.nr_hugepages=10000;
sysctl -w fs.file-max=8388608;
sysctl -w fs.nr_open=8388608;
ulimit -n 8388608;

echo 0 > /sys/module/sfc_resource/parameters/enable_af_xdp_flow_filters

# Start Nginx proxy
function start_nginx() {
        local cpu=$1
        export EF_IRQ_CHANNEL=$cpu
        taskset -c $cpu /bin/onload -p latency-af-xdp \
                /sbin/nginx -c /etc/nginx/nginx-proxy-one-worker.conf \
                        -g "pid /var/run/nginx-worker-${cpu}.pid; error_log /var/log/nginx-worker-${cpu}-error.log; " \
        &
}

nginx_works=4
for ((i=0; i<$nginx_works; i++)) ; do
        start_nginx $i
done

ps -ef | grep nginx

[root@localhost benchmark]# . start-nginx-proxy-onload-workaround.sh
net.ipv4.ip_local_port_range = 9000 65000
vm.nr_hugepages = 10000
fs.file-max = 8388608
fs.nr_open = 8388608
root     21212 17533  0 09:03 pts/0    00:00:00 /sbin/nginx -c /etc/nginx/nginx-proxy-one-worker.conf -g pid /var/run/nginx-worker-0.pid; error_log /var/log/nginx-worker-0-error.log;
root     21213 17533  0 09:03 pts/0    00:00:00 /sbin/nginx -c /etc/nginx/nginx-proxy-one-worker.conf -g pid /var/run/nginx-worker-1.pid; error_log /var/log/nginx-worker-1-error.log;
root     21214 17533  0 09:03 pts/0    00:00:00 /sbin/nginx -c /etc/nginx/nginx-proxy-one-worker.conf -g pid /var/run/nginx-worker-2.pid; error_log /var/log/nginx-worker-2-error.log;
root     21215 17533  0 09:03 pts/0    00:00:00 /sbin/nginx -c /etc/nginx/nginx-proxy-one-worker.conf -g pid /var/run/nginx-worker-3.pid; error_log /var/log/nginx-worker-3-error.log;
root     21217 17533  0 09:03 pts/0    00:00:00 grep --color=auto nginx
[root@localhost benchmark]# oo:nginx[21215]: Using Onload 20220330 [0]
oo:nginx[21215]: Copyright 2019-2022 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
oo:nginx[21212]: Using Onload 20220330 [1]
oo:nginx[21212]: Copyright 2019-2022 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
oo:nginx[21214]: Using Onload 20220330 [2]
oo:nginx[21214]: Copyright 2019-2022 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
oo:nginx[21213]: Using Onload 20220330 [3]
oo:nginx[21213]: Copyright 2019-2022 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks

[root@localhost benchmark]# onload_stackdump netstat
TCP 0 0 10.19.231.43:80 0.0.0.0:0 LISTEN
TCP 0 0 10.19.231.43:80 0.0.0.0:0 LISTEN
TCP 0 0 10.19.231.43:80 0.0.0.0:0 LISTEN
TCP 0 0 10.19.231.43:80 0.0.0.0:0 LISTEN
[root@localhost benchmark]#
[root@localhost benchmark]# oo:nginx[21345]: netif_tcp_helper_alloc_u: ERROR: Failed to allocate stack (rc=-16)
See kernel messages in dmesg or /var/log/syslog for more details of this failure
oo:nginx[21345]: netif_tcp_helper_alloc_u: ERROR: Failed to allocate stack (rc=-16)
See kernel messages in dmesg or /var/log/syslog for more details of this failure
oo:nginx[21345]: netif_tcp_helper_alloc_u: ERROR: Failed to allocate stack (rc=-16)
See kernel messages in dmesg or /var/log/syslog for more details of this failure


[root@localhost benchmark]# dmesg -c
...
[ 1582.954945] [sfc efrm] efrm_vi_alloc: Out of VI instances with given attributes (-16)
[ 1582.955855] [sfc efrm] efrm_vi_alloc: Out of VI instances with given attributes (-16)
[ 1582.956723] [sfc efrm] efrm_vi_alloc: Out of VI instances with given attributes (-16)
[ 1582.957667] [sfc efrm] efrm_vi_alloc: Out of VI instances with given attributes (-16)
[ 1582.958605] [sfc efrm] efrm_vi_alloc: Out of VI instances with given attributes (-16)
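(One thing that may be worth checking here - an assumption, not a confirmed diagnosis: with enable_af_xdp_flow_filters=0 each Onload stack is tied to its own RX channel via EF_IRQ_CHANNEL, so the NIC should have at least as many combined channels as Onload instances.)

# how many combined channels does the accelerated NIC currently have? (assuming eth1)
ethtool -l eth1
# if that is fewer than the number of Onload nginx instances, raise it, e.g. to 4
ethtool -L eth1 combined 4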

@shirshen12

By default, onload does not fall back to XDP's generic mode when driver support is not available. Please don't turn off ZC support for AF_XDP. I also used to get this error, @ligong234.

@shirshen12

I tested the SO_REUSEPORT approach and it works! But yeah, multithreaded apps with automatic sensing of RSS are not there yet, so we can work around it this way.
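(For anyone reproducing this workaround: the per-instance nginx-proxy-one-worker.conf referenced in the start script above is not shown in this thread. A minimal hypothetical sketch, derived from the 4-worker config earlier in the thread and assuming the same listen and upstream addresses, might look like this:)

cat <<'EOF' > /etc/nginx/nginx-proxy-one-worker.conf
# hypothetical single-worker variant of the 4-worker proxy config shown earlier;
# pid and error_log are passed on the command line via -g in the start script
user root root;
worker_processes 1;
events {
    accept_mutex off;
    use epoll;
    worker_connections 200000;
}
http {
    default_type application/octet-stream;
    access_log off;
    sendfile on;
    proxy_buffering off;
    keepalive_timeout 300s;
    keepalive_requests 1000000;
    upstream backend {
        server 10.96.10.21:80;
        keepalive 500;
    }
    server {
        listen 10.19.1.43:80 reuseport;   # reuseport lets each instance bind its own listener
        server_name localhost;
        location / {
            proxy_pass http://backend;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }
}
EOF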
