How to Configure Lossless RoCE

This page discusses a real-life configuration of Mellanox Spectrum-based Ethernet switches for Lossless RoCE alongside ordinary TCP traffic. PFC and ECN will be enabled on the switch.

Note: This page is a translation of an article titled How to Configure Mellanox Spectrum Switch for Lossless RoCE into the language of Linux Switch.

Table of Contents
  1. Overview of Configuration
  2. Configuring Shared Buffer Pools
  3. Configuring Traffic Prioritization
  4. Configuring Traffic Scheduling
  5. Configuring Priority Group Buffers
  6. Configuring the Mapping of Traffic to Pools
  7. Configuring ECN
  8. Configuring PFC
  9. Further Resources

Overview of Configuration

There are three principal traffic flows: RDMA, CNP, and everything else. The table below summarizes how each of these traffic types will be treated. Traffic prioritization is based on trust DSCP.

| Type | DSCP | Priority   | PG buffer | Pools (ingress / egress) | Scheduling    |
|------|------|------------|-----------|--------------------------|---------------|
| TCP  | 0    | 0, PFC off | PG0       | 0 / 4                    | TC0, WRR      |
| RDMA | 24   | 3, PFC on  | PG3       | 1 / 5                    | TC3, WRR, ECN |
| CNP  | 48   | 6, PFC off | PG6       | 1 / 5                    | TC6, strict   |

Configuring Shared Buffer Pools

See the QoS page for details about configuration of shared buffer pools for lossless traffic and in general.

Pools 0 (ingress) and 4 (egress) will be used for lossy traffic, and pools 1 (ingress) and 5 (egress) for lossless traffic. Pools 0, 1 and 4 will each use half of the available chip memory:

$ devlink -j sb show pci/0000:03:00.0 | jq '.sb[][0].size / 2'
7012352
$ devlink sb pool set pci/0000:03:00.0 pool 0 size 7012352 thtype dynamic # ingress lossy
$ devlink sb pool set pci/0000:03:00.0 pool 4 size 7012352 thtype dynamic # egress lossy
$ devlink sb pool set pci/0000:03:00.0 pool 1 size 7012352 thtype dynamic # ingress lossless
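
The configuration can be read back to verify it took effect. Only the commands are shown here; the exact output format depends on the devlink version:

$ devlink sb pool show pci/0000:03:00.0 pool 0 # expect size 7012352, thtype dynamic
$ devlink sb pool show pci/0000:03:00.0 pool 1
$ devlink sb pool show pci/0000:03:00.0 pool 4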

Pool 5 (egress lossless) will be as large as the chip permits:

$ devlink -j sb show pci/0000:03:00.0 | jq '.sb[][0].size'
14024704
$ devlink sb pool set pci/0000:03:00.0 pool 5 size 14024704 thtype dynamic # egress lossless

Note: In practice, the limit just needs to be large enough not to be a limiting factor. Setting the pool to the full chip memory is a simple way of making sure that it is.

Finally, configure the port-pool quotas of pools 1 (ingress lossless) and 5 (egress lossless) so that they are not a limiting factor either:

$ devlink sb port pool set swp1 pool 1 th 16 # ingress lossless
$ devlink sb port pool set swp1 pool 5 th 16 # egress lossless
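
The commands above configure a single port. In practice the same quotas would typically be set on every front-panel port. A simple loop along these lines takes care of that, assuming the ports are named swp1, swp2, etc.:

$ for p in /sys/class/net/swp*; do
>     swp=${p##*/}
>     devlink sb port pool set $swp pool 1 th 16 # ingress lossless
>     devlink sb port pool set $swp pool 5 th 16 # egress lossless
> done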

Configuring Traffic Prioritization

See the QoS page for details about configuration of traffic prioritization in general and Trust DSCP in particular.

Use iproute2 dcb to install prioritization rules for DSCP values of 0, CS3 (DSCP 24) and CS6 (DSCP 48), matching the table above:

$ dcb app flush dev swp1 dscp-prio
$ dcb app add dev swp1 dscp-prio 0:0
$ dcb app add dev swp1 dscp-prio CS3:3
$ dcb app add dev swp1 dscp-prio CS6:6
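
The resulting APP table can be read back with the same tool:

$ dcb app show dev swp1 dscp-prio # expect entries for DSCP 0, 24 (CS3) and 48 (CS6)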

Configuring Traffic Scheduling

To switch the port headroom to TC mode, and thus permit manual configuration of TC mapping and buffer sizes, a qdisc needs to be installed first. See the Queues Management page for details about qdiscs in general, and ETS in particular.

Add the ETS qdisc to configure strict and WRR TCs. The configuration described above needs TC6 (and therefore band 1) to be strict. Since strict bands must precede the WRR bands, TC7 (band 0) needs to be strict as well, and thus two strict bands are needed. The priority map should direct all priorities to TC0 (band 7), except for priority 3, which should go to TC3 (band 4), and priority 6, which should go to TC6 (band 1).

$ tc qdisc replace dev swp1 root handle 1: \
        ets bands 8 strict 2 quanta 250 250 2000 250 250 2000 \
        priomap 7 7 7 4 7 7 1 7

The quanta shown above are to a degree arbitrary. Only bands 4 and 7, both with a quantum of 2000, will see WRR traffic. The 2000 / 2000 split simply means that both bands should have the same weight. The HW is however configured using percentages, not quanta, so the quanta for the remaining bands are chosen such that each band's quantum ends up being a nice even percentage of the total: the quanta sum to 5000, so each 250 maps to 5% and each 2000 to 40%, and the HW will be configured 5% : 5% : 40% : 5% : 5% : 40%. With no traffic hitting the 5% traffic classes, the split among the relevant ones is 1:1.

In this particular case, the quanta configuration could have been left out altogether. By default, each DRR band gets a quantum of one MTU. But because 100% does not split evenly among 6 bands, the HW configuration becomes 16% : 17% : 17% : 16% : 17% : 17%. The relevant bands are then again weighted 1:1, but this relies on knowing how the algorithm distributes the rounding error among the bands, so the result is not as self-evident as above.
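
The qdisc and per-band configuration that tc ended up with can be inspected as usual:

$ tc qdisc show dev swp1
$ tc class show dev swp1 # one class per ETS band, with its quantum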

Configuring Priority Group Buffers

See the QoS page for details about configuration of priority group buffers.

Traffic with priority 0 should go to PG buffer 0, priority 3 to PG3, and priority 6 to PG6. This has to be configured using iproute2 dcb. PG3 needs to be given a non-zero size to cover traffic that arrives after Xoff is transmitted. The other PGs can be set to zero, so that mlxsw autoconfigures them according to the port MTU at the time the command is issued.

To determine the size of the buffer, one needs to take into consideration all the traffic to be accommodated after the need to emit a PAUSE or PFC frame is identified by the chip, but before the frame is actually emitted, received, and acted upon. This has to take into account e.g. line rate, cable length, MTU, and various delays and latencies. There is correspondingly no "rule of thumb" value to use. The necessary size can be determined using the hdroom_sz tool:

$ hdroom_sz --asic spc1 --linerate 100G --mtu 9000 --cable-length 1.0
xon_thresh	19456
xoff_thresh	19456
headroom_size	96432
$ dcb buffer set dev swp1 prio-buffer all:0 3:3
$ dcb buffer set dev swp1 buffer-size all:0 3:97K
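
Note that dcb accepts suffixed sizes: 97K is 99328 bytes, i.e. the computed headroom_size of 96432 rounded up. The per-port buffer configuration can be read back as well:

$ dcb buffer show dev swp1 prio-buffer buffer-size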

Configuring the Mapping of Traffic to Pools

See the QoS page for details about configuration of pool binding.

Lossy traffic from PG0 and TC0 should go to pools 0 and 4, respectively:

$ devlink sb tc bind set swp1 tc 0 type ingress pool 0 th 11 # ingress lossy
$ devlink sb tc bind set swp1 tc 0 type egress pool 4 th 13  # egress lossy

Lossless PG3 and TC3 should go to, respectively, pool 1 (ingress lossless) and pool 5 (egress lossless). The egress pool quota should again be effectively infinite:

$ devlink sb tc bind set swp1 tc 3 type ingress pool 1 th 11 # ingress lossless
$ devlink sb tc bind set swp1 tc 3 type egress pool 5 th 16  # egress lossless

CNP traffic from PG6 and TC6 is lossy, but will likewise go to the lossless pools:

$ devlink sb tc bind set swp1 tc 6 type ingress pool 1 th 13  # ingress lossless
$ devlink sb tc bind set swp1 tc 6 type egress pool 5 th 13   # egress lossless
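
Again, the bindings can be read back to double-check, one TC and direction at a time:

$ devlink sb tc bind show swp1 tc 3 type ingress # expect pool 1, th 11
$ devlink sb tc bind show swp1 tc 3 type egress  # expect pool 5, th 16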

Configuring ECN

See the description of RED qdisc for more details.

RED / ECN should be configured on TC3 (band 4, parent 1:5). In this example, we choose the minimum and maximum such that:

  • When the queue length reaches 150KB, some packets will randomly be marked as having experienced congestion in the ECN bits of the IP header.
  • When the queue length reaches 1500KB, all packets will be so marked.

$ tc qdisc replace dev swp1 parent 1:5 handle 15: \
     red ecn limit 2M avpkt 1000 probability 0.1 min 150K max 1.5M
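
Once traffic is flowing, the RED qdisc statistics show whether ECN marking actually takes place; the "marked" counter under the 15: qdisc counts packets marked with CE:

$ tc -s qdisc show dev swp1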

Configuring PFC

See the QoS page for details about configuration of lossless traffic.

Use iproute2 dcb to enable PFC for priority 3:

$ dcb pfc set dev swp1 prio-pfc all:off 3:on
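
The PFC state, as well as counters of PFC frames sent (requests) and received (indications), can be queried through dcb as well:

$ dcb pfc show dev swp1 prio-pfc
$ dcb pfc show dev swp1 requests indications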

Further Resources

  1. How to Configure Mellanox Spectrum Switch for Lossless RoCE
  2. mlxsw Quality of Service
  3. mlxsw Queues Management