
Queues Management

Ido Schimmel edited this page Dec 25, 2021 · 18 revisions
Table of Contents
  1. Qdisc
    1. Features by Version
    2. Qdiscs: Brief Introduction
    3. Creating a Qdisc
    4. Listing Qdiscs and Statistics
  2. DCB Incompatibility
  3. Qevents
    1. Configuring Qevents
  4. ETS
    1. ETS Bands
    2. Priomap
    3. Band Number Mapping
    4. Statistics
    5. TC Queue Depth
    6. Example
  5. PRIO
    1. Example
  6. Leaf Qdiscs
    1. Statistics
    2. Combining Leaf Qdiscs
  7. pFIFO, bFIFO
    1. FIFO Parameters
    2. UC and BUM Traffic Classes
    3. Statistics
    4. Example
  8. RED
    1. RED Parameters
    2. UC and BUM Traffic Classes
    3. Statistics
    4. Qevents
    5. Example
  9. TBF
    1. TBF Parameters
    2. UC and BUM Traffic Classes
    3. Port Shaper
    4. Statistics
    5. Example

Qdisc

Traffic control in Linux is managed by the TC subsystem. Documentation can be found in the TC man page and in the other resources listed under Further Resources.

Features by Version

Kernel Version  Feature
4.15            RED as root qdisc (ECN supported)
4.16            PRIO qdisc as root qdisc
4.17            RED as child of PRIO
5.6             ETS qdisc as root, RED and TBF as children of ETS or PRIO
5.7             FIFO stats offload, RED nodrop mode
5.9             early_drop qevent with actions mirred and trap
5.16            RED mark qevent with action mirred
5.16            Multi-level qdisc offload, Port Shaper

Qdiscs: Brief Introduction

Qdiscs, for "queuing disciplines", are entities that take care of queuing up and later scheduling of traffic to be transmitted by a network interface. From this point of view, a qdisc has two interesting operations: enqueue requests that a packet be queued up for later transmission; dequeue requests that one of the queued-up packets be chosen for immediate transmission. Since there is no one right way to manage packet queues, a number of qdiscs of different types exist.

Qdiscs conceptually form a tree: one qdisc is at the root of the tree, and it may have zero or more children, which in turn can have children of their own. How many children any given qdisc permits, if any at all, depends on the qdisc in question. The points where children can be attached are called classes, and qdiscs that can have a non-zero number of classes are called classful.

Classful qdiscs do not store any packets themselves. Instead, they pass enqueue and dequeue requests down to one of their children, according to criteria specific to the qdisc itself. Eventually this recursive message passing ends up at one of the leaves, where the packets are actually stored. (Or where the packets are picked up from in case of dequeuing.)

(To be fully correct: qdiscs actually form a DAG, directed acyclic graph. Some qdiscs can be attached at multiple classes. However most of the time the simple tree structure is all that is needed, and is the only one that mlxsw is capable of offloading.)

Each qdisc is identified by its handle, which is a 16-bit hexadecimal number with a colon attached, such as 1: or abcd:. That number is called the qdisc major number. If a qdisc has any classes, their identifiers are formed as a pair of two numbers: <major>:<minor>, such as abcd:1. The numbering scheme for the minor numbers depends on the qdisc type. Sometimes the numbering is systematic, where the first class has the ID <major>:1, the second one <major>:2, and so on. Some qdiscs allow the user to set the class minor number arbitrarily as the class is created.
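Because both parts of a class ID are hexadecimal, a decimal major number of 16 is written as 10:. The following is a minimal sketch of that conversion; the helper name classid is hypothetical and is not part of tc.

```shell
# Hypothetical helper, not part of tc: format a class ID from decimal
# major and minor numbers. Both parts are written in hexadecimal.
classid() {
    printf '%x:%x\n' "$1" "$2"
}

classid 16 1      # prints 10:1
classid 43981 1   # prints abcd:1
```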

Creating a Qdisc

To create a new qdisc and attach it at a given point, use the "add" or "replace" command:

# tc qdisc [add | replace] dev <dev name> \
    [root | parent <parent ID>] [handle <handle ID>] \
    <qdisc type> [<qdiscs params>]

Handle ID is a 16 bit number, written as <major>:. If the user does not specify a handle by hand, a new one is picked automatically.

A qdisc can be set as a root qdisc or as a child of another qdisc. In the latter case, parent ID is the ID of the class where the qdisc should be attached.

The difference between "add" and "replace" operations is in handling of qdiscs that exist at the attachment point prior to the creation. When a new class is created, it always comes with either a pFIFO or bFIFO qdisc attached by default. "add" allows detaching this implicit qdisc and attaching a new one instead. As soon as a qdisc has been added explicitly, the "replace" command has to be used to replace that qdisc. In practice it is possible to just always use "replace".

Listing Qdiscs and Statistics

To list qdiscs at a given interface, use the "show" command:

# tc qdisc show dev swp1
qdisc ets 10: root refcnt 2 offloaded bands 8 strict 8 priomap 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
qdisc tbf 101: parent 10:1 offloaded rate 400Mbit burst 131050b lat 18.4ms

Pass the -s flag to see the statistics as well:

# tc -s qdisc show dev swp1
qdisc ets 10: root refcnt 2 offloaded bands 8 strict 8 priomap 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc tbf 101: parent 10:1 offloaded rate 400Mbit burst 131050b lat 18.4ms
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

The meaning of the individual statistical counters is elaborated at each qdisc's section.

DCB Incompatibility

The TC subsystem allows configuration of traffic classification, of individual TCs, and traffic shaping. These aspects are configurable also through the DCB subsystem. Therefore configuring qdiscs may overwrite the corresponding DCB configuration present at the time, and vice versa: DCB configuration will overwrite any preexisting qdisc configuration.

In particular, if the egress path should be configured through qdiscs, it is important to make sure that lldpad is stopped and possibly disabled:

# systemctl disable --now lldpad.service

Qevents

Qdiscs may invoke user-configured actions when certain interesting events take place in the qdisc. The object through which these actions are configured is called a "qevent". Each qevent can either be unused, or can have a shared block attached to it. The filters at this block are executed when the corresponding interesting event takes place.

As an example, the RED qdisc supports an early_drop qevent. Packets that are early-dropped due to the RED algorithm are then passed through the filters at the block that is configured for this qevent.

mlxsw is capable of offloading filters added to qevent blocks as long as the following conditions are satisfied:

  • The switch ASIC is Spectrum-2 or above.
  • The qevent is supported (see below).
  • Only a single filter shall be attached at the configured block, at chain 0 (the default), and its classifier shall be matchall.
  • The filter shall have hw_stats set to disabled.
  • The filter shall have a single action, which shall be supported (see below).

The following qevents are supported:

  • RED early_drop
  • RED mark

The following actions are supported:

  • mirred egress mirror, which configures a SPAN, RSPAN or ERSPAN session to which the matching packets are directed.

  • trap, which transfers the impacted packet to the CPU. This action is only offloaded on drop-like qevents, which currently means the RED early_drop qevent. The trap under which the packet is reported depends on the qevent.

Configuring Qevents

Qevents are configured when a qdisc is created. The general form is as follows:

# tc qdisc add dev swp1 root handle 1:             \
  <qdisc_kind> <qdisc parameters>                  \
  qevent <qevent_name> block <block-index>

This way, a shared block with a given index is bound to the given qevent. Then filters added to this block are considered for offloading:

# tc qdisc replace dev swp1 root handle 1: \
     red limit 2M avpkt 1000 probability 0.1 min 500K max 1.5M \
     qevent early_drop block 10

# tc filter add block 10 matchall skip_sw \
     action vlan pop hw_stats disabled
Error: Unsupported action.
We have an error talking to the kernel

# tc filter add block 10 matchall skip_sw \
     action mirred egress mirror dev swp6
Error: HW counters not supported on qevents.
We have an error talking to the kernel

# tc filter add block 10 matchall skip_sw \
     action mirred egress mirror dev swp6 hw_stats disabled

ETS

The ETS qdisc describes mapping of packets to traffic classes based on their priority, and scheduling of individual traffic classes relative to one another. There are correspondingly two components that are present in every ETS qdisc: bands describe the traffic classes, and priomap describes the classification function.

ETS will be offloaded when configured on a front panel port interface as a root qdisc or as a child of a port shaper TBF.

ETS Bands

Each ETS qdisc has a set of classes, called "bands". Each band represents one logical traffic class. Since each band is a qdisc class, a qdisc can be attached at each of them.

ETS bands are split into two groups: a (possibly empty) set of strict bands, followed by a (possibly empty) set of DWRR bands. The strict bands, if any, are always the lower-numbered ones.

The way ETS dequeues packets is that it first tries to dequeue traffic from the strict bands, if there are any. It proceeds in order of band number, first band 0, then band 1, and so on, until all strict bands are tried.

An ETS qdisc with just strict bands can be created this way:

# tc qdisc add dev swp1 root handle 1: \
     ets bands 8 strict 8
# tc qdisc show dev swp1
qdisc ets 1: root refcnt 2 offloaded bands 8 strict 8 priomap 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

DWRR bands are tried next. A value, quantum, is assigned to each DWRR band. The value is the number of bytes that the band is allowed to dequeue before it yields to the next DWRR band in line.

When creating DWRR bands, instead of specifying the number of bands, as was the case above with strict bands, one lists the quanta for the individual bands.

This dequeuing algorithm is respected when the ETS qdisc is offloaded. Thus strict bands in the SW datapath represent strict TCs in the HW one, and DWRR bands represent DWRR TCs. For purposes of offloading, the quanta at DWRR bands are converted to percentages of available bandwidth, and the ASIC then aims to split the available bandwidth according to these percentages. At most 8 bands can be offloaded; if the qdisc has more bands, mlxsw will not be able to offload it.

The following example creates an ETS qdisc with 4 strict bands and 4 DWRR ones, where bandwidth is split 25% : 25% : 25% : 25%:

# tc qdisc add dev swp1 root handle 1: \
     ets bands 8 strict 4 quanta 2000 2000 2000 2000
# tc qdisc show dev swp1
qdisc ets 1: root refcnt 2 offloaded bands 8 strict 4 quanta 2000 2000 2000 2000 priomap 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
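The conversion from quanta to bandwidth shares can be sketched as follows. The helper name quanta_to_percent is hypothetical, and rounding down to whole percentages is an assumption made for illustration; how mlxsw rounds is not stated here.

```shell
# Hypothetical helper: print each quantum's share of the quanta total
# as a whole percentage (integer division, so shares are rounded down).
quanta_to_percent() {
    total=0
    for q in "$@"; do
        total=$((total + q))
    done
    out=""
    for q in "$@"; do
        out="$out$((q * 100 / total)) "
    done
    # Strip the trailing space before printing.
    echo "${out% }"
}

quanta_to_percent 2000 2000 2000 2000   # prints 25 25 25 25
quanta_to_percent 4000 3000 2000 1000   # prints 40 30 20 10
```

Only the ratios between quanta matter for the split, so picking quanta that sum to a round number keeps the shares easy to read.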

Priomap

ETS supports several traffic classification algorithms, but the only one offloaded by mlxsw is called priomap. The priomap is a list of numbers, one for each priority. Each number is the number of the band that packets with that priority should go to: 0 for the first band, 1 for the second, and so on:

                        p7 ----------------.
                        ..                 |
                        p2 ------.         |
                        p1 ----. |   ...   |
                        p0 --. | |         |
                             | | |         |
                             v v v         v
# tc ... ets bands 4 priomap 3 3 2 2 1 1 0 0
                             | | |   ...   |
                             | | |         '-> band 0
                             | | |              ...
                             | | '-----------> band 2
                             | '-------------> band 3
                             '---------------> band 3

For details on how priority is assigned to packets, see Quality of Service.

Note that ETS supports up to 16 priorities in a priomap. For purposes of offloading, the only relevant priorities are 0-7. Priorities 8-15 are ignored and can be omitted when configuring ETS.
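The priomap lookup amounts to indexing the list by priority, starting at priority 0. A minimal sketch, with a hypothetical helper name prio_to_band:

```shell
# Hypothetical helper: look up the band for a priority in a priomap.
# Usage: prio_to_band <priority> <priomap entries...>
prio_to_band() {
    prio=$1
    shift
    # Entries are listed starting with priority 0, so skip $prio of them.
    shift "$prio"
    echo "$1"
}

prio_to_band 0 3 3 2 2 1 1 0 0   # prints 3
prio_to_band 2 3 3 2 2 1 1 0 0   # prints 2
prio_to_band 7 3 3 2 2 1 1 0 0   # prints 0
```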

Band Number Mapping

mlxsw uses bands to denote logical traffic classes. Each band is mapped in the ASIC to a pair of TCs, one for known unicast traffic, the other one for BUM traffic (for broadcast, unknown unicast, multicast). The UC TC has strict priority over the BUM TC.

The unicast TC is derived from the band number as follows: band 0 maps to TC 7, band 1 to TC 6, etc., until band 7 maps to TC 0. The TC for BUM traffic is then the number of unicast TC + 8. The TC numbers are important for checking some per-TC ethtool counters and for shared buffer binding configuration.

For purposes of attaching a child to a band, the qdisc class ID of the band is its band number + 1.

The following table summarizes the band mapping described above:

Band no. Class ID UC TC BUM TC Priority
0 X:1 7 15 Highest
1 X:2 6 14
2 X:3 5 13
3 X:4 4 12
4 X:5 3 11
5 X:6 2 10
6 X:7 1 9
7 X:8 0 8 Lowest
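The same mapping can be computed rather than looked up. The sketch below uses a hypothetical helper name band_info and simply applies the rules stated above: class minor is band + 1, UC TC is 7 - band, and BUM TC is UC TC + 8.

```shell
# Hypothetical helper: print class ID, UC TC and BUM TC for a band (0-7).
band_info() {
    band=$1
    uc_tc=$((7 - band))
    echo "class X:$((band + 1)) uc_tc $uc_tc bum_tc $((uc_tc + 8))"
}

band_info 0   # prints class X:1 uc_tc 7 bum_tc 15
band_info 7   # prints class X:8 uc_tc 0 bum_tc 8
```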

Statistics

The tc -s show command will list the current ETS configuration including the full priomap and statistics:

$ tc -s qdisc show dev swp1
qdisc ets 1: root refcnt 2 offloaded bands 8 strict 8 priomap 7 6 5 4 3 2 1 0 7 7 7 7 7 7 7 7
 Sent 30510403042 bytes 20289261 pkt (dropped 5199870, overlimits 0 requeues 0)
 backlog 222720b 0p requeues 0

The statistics represent the sum of the statistics of all the bands. If RED is configured on any of the ETS classes, the child drops will be counted by the parent as well.

When using strict bands, packets in lower priority bands will not be sent until all the higher-priority bands are empty. In this situation, packets in the HW datapath might be dropped due to switch lifetime timeouts. These drops are not counted towards the number of dropped packets.

TC Queue Depth

The backlog value reported on an offloaded qdisc is composed of the values of the two constituent TCs. To find the value for an individual TC, inspect the ethtool counter tc_transmit_queue_tc_<TC>, which shows the number of bytes queued up at that traffic class:

$ ethtool -S swp1 | grep tc_transmit_queue_tc
	tc_transmit_queue_tc_0: 0     \
	tc_transmit_queue_tc_1: 0      | UC TCs
	[...]                          |
	tc_transmit_queue_tc_7: 0     /
	tc_transmit_queue_tc_8: 0     \
	tc_transmit_queue_tc_9: 0      | BUM TCs
	[...]                          |
	tc_transmit_queue_tc_15: 0    /
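To read the two counters behind one band in a single step, the ethtool output can be filtered by the TC numbers derived from the band. This is a sketch only: band_queue_depth is a hypothetical helper, and treating the sum of the two counters as the band total is an assumption.

```shell
# Hypothetical helper: read "ethtool -S" output on stdin and print the
# queue depth in bytes of the UC TC and the BUM TC behind a band.
band_queue_depth() {
    uc_tc=$((7 - $1))
    bum_tc=$((uc_tc + 8))
    awk -v uc="tc_transmit_queue_tc_$uc_tc:" \
        -v bum="tc_transmit_queue_tc_$bum_tc:" '
        $1 == uc  { u = $2 }
        $1 == bum { b = $2 }
        END { printf "uc %d bum %d total %d\n", u, b, u + b }'
}

# On a switch: ethtool -S swp1 | band_queue_depth 0
# Here the input is canned sample data so the sketch runs anywhere.
printf '%s\n' \
    '     tc_transmit_queue_tc_7: 1000' \
    '     tc_transmit_queue_tc_15: 200' | band_queue_depth 0
# prints uc 1000 bum 200 total 1200
```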

Example

Add an ETS qdisc with handle 10:, with 8 bands, 4 of which are strict, and the remaining 4 split traffic 40% : 30% : 20% : 10%. The quanta sum up to 10000, which makes it easy to mentally map from the per-band quantum to the corresponding percentage.

Traffic is mapped to bands in a reversed 1:1 manner to make priority-0 traffic the least prioritized and priority-7 traffic the most prioritized. That means that priority 0 goes to TC 0, 1 goes to TC 1, and so on. (Except BUM traffic, which goes to TC 8, TC 9, and so on instead.)

# tc qdisc replace dev swp1 root handle 10: \
     ets bands 8 strict 4 quanta 4000 3000 2000 1000 \
     priomap 7 6 5 4 3 2 1 0

As indicated in the table above, band 0 has the class ID X:1, band 1 X:2, and so on. In order to attach a child qdisc to a band, use that ID as parent reference when creating a new qdisc. E.g. to attach RED to the first band and TBF to the second one:

# tc qdisc replace dev swp1 parent 10:1 handle 101: \
     red limit 2M avpkt 1000 probability 0.1 min 500K max 1.5M
# tc qdisc replace dev swp1 parent 10:2 handle 102: \
     tbf rate 400Mbit burst 128K limit 1M

PRIO

The PRIO qdisc is in most ways the same as ETS qdisc configured with only strict bands. The only differences are:

  • PRIO cannot be configured with fewer than 3 bands.
  • PRIO does not permit DWRR bands, all bands are always strict. Correspondingly, there is no strict keyword when creating the qdisc, just bands.
  • PRIO has different priomap defaults.

One can learn about how PRIO works by reading the ETS section above and focusing only on the parts that deal with strict priority.

Example

Create a PRIO qdisc that configures 8 (strict) bands, and maps traffic to bands in reversed 1:1 fashion.

# tc qdisc replace dev swp1 root handle 1: \
     prio bands 8 priomap 7 6 5 4 3 2 1 0

Leaf Qdiscs

Where ETS and PRIO describe assignment of traffic to TCs and relation of individual TCs to each other, qdiscs attached to individual bands configure the TC itself.

  • A RED child will enable RED or ECN for the UC TC associated with the band.
  • A TBF child will configure shaper for the pair of BUM and UC TCs associated with the band.
  • A FIFO child does not enable any functionality, but will be offloaded for the purpose of providing per-band statistics.

Additionally, each PRIO and ETS class, unless overridden, contains a FIFO qdisc child. Those are not shown by default in the "tc qdisc show" output, but can be shown by passing the invisible flag. Alternatively, it is possible to replace them explicitly with qdiscs that have non-null handles.

Statistics

The show command will show the current configuration including statistics. For offloaded qdiscs the values include the traffic in the HW datapath.

Note: On kernels older than 5.7, FIFO is not offloaded, and therefore counters on the bands that do not contain another qdisc do not include HW datapath numbers.

# tc -s qdisc show dev swp1
qdisc ets 10: root refcnt 2 offloaded [...]
qdisc tbf 101: parent 10:1 offloaded rate 400Mbit burst 131050b lat 18.4ms
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

The reported counters are:

  • bytes, pkt - The number of bytes and packets, respectively, that were sent through the qdisc.

    The Spectrum ASICs do not track the number of sent bytes and packets on per-TC basis, but rather on per-packet-priority basis. For child qdiscs, mlxsw deduces the per-TC counters from the priorities mapped to a given band by PRIO or ETS priomap. If the priomap is changed in a way that impacts bands that contain offloaded child qdiscs, these child qdiscs will lose HW stats accumulated prior to the change.

  • dropped - The number of packets dropped on either the unicast TC, or the BUM TC corresponding to the band that the qdisc is in.

  • overlimits - The meaning depends on the qdisc.

  • requeues - The meaning depends on the qdisc, and mlxsw does not currently set this counter.

  • backlog - The number of bytes and packets waiting in queue. In offloaded qdiscs, the number of bytes includes the HW datapath queue depth. However the number of packets always includes only the SW datapath, because the corresponding counters are not available on Spectrum ASICs.
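The way per-band byte and packet counters are deduced from per-priority counters can be illustrated with a small calculation. The helper name band_counter and the sample counter values are hypothetical; the sketch only shows the summing step, not how mlxsw implements it.

```shell
# Hypothetical helper: sum per-priority counters for every priority that
# the priomap sends to the given band.
# Usage: band_counter <band> "<priomap>" "<per-priority counters>"
band_counter() {
    awk -v band="$1" -v map="$2" -v cnt="$3" 'BEGIN {
        n = split(map, m, " ")
        split(cnt, c, " ")
        for (i = 1; i <= n; i++)
            if (m[i] == band)
                sum += c[i]
        print sum + 0
    }'
}

# Priorities 6 and 7 map to band 0; priorities 0, 1 and 2 map to band 2.
band_counter 0 "2 2 2 1 1 1 0 0" "10 20 30 40 50 60 70 80"   # prints 150
band_counter 2 "2 2 2 1 1 1 0 0" "10 20 30 40 50 60 70 80"   # prints 60
```

This also shows why changing the priomap invalidates earlier numbers: the set of priorities contributing to a band changes, so the previously accumulated sum no longer corresponds to it.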

Combining Leaf Qdiscs

RED and TBF qdiscs are classful, with each instance having one child. By default, that child is an invisible FIFO qdisc, but it is possible to replace the child with another qdisc. This way, it is possible to configure a complex of several qdiscs, where all the qdiscs in the complex affect and are affected by the traffic on the same TC.

Beginning with Linux 5.16, the following configurations are offloadable, so long as each qdisc in the complex is offloadable in isolation:

  • FIFO
  • RED-FIFO
  • TBF-FIFO
  • RED-TBF-FIFO
  • TBF-RED-FIFO

These qdisc complexes can then appear in one of three contexts:

  • On their own under root. Thus e.g. RED would be a root qdisc, TBF its child, and FIFO the ultimate leaf. In that case, the complex configures traffic on TC 0 (and BUM TC 8), and qdiscs therein show statistics for TC 0 as well.

  • Attached under a port shaper TBF, which is the same case as above.

  • Attached to an ETS or PRIO band. In that case, the complex configures traffic on the TCs corresponding to the band where it is attached. Individual qdiscs then show statistics for that TC as well.

For example, consider the following configuration:

# tc qdisc replace dev swp1 root handle 1: ets strict 3 priomap 2 2 2 1 1 1 0 0
# tc qdisc replace dev swp1 parent 1:1 handle 11: red limit 1M avpkt 1K
# tc qdisc replace dev swp1 parent 11:1 handle 111: tbf limit 1M burst 128K rate 400Mbit
# tc qdisc replace dev swp1 parent 111:1 handle 1111: bfifo limit 1M
# tc qdisc show dev swp1
qdisc ets 1: root refcnt 2 offloaded bands 3 strict 3 priomap 2 2 2 1 1 1 0 0 2 2 2 2 2 2 2 2
qdisc red 11: parent 1:1 offloaded limit 1Mb min 87381b max 256Kb
qdisc tbf 111: parent 11:1 offloaded rate 400Mbit burst 131050b lat 18.4ms
qdisc bfifo 1111: parent 111:1 offloaded limit 1Mb

Which can be illustrated as follows:

+ swp1
  + ETS
    + band 0
      + RED
        + TBF
          + FIFO
    + band 1
    + band 2

In this case, both RED and TBF are configured on band 0, and correspondingly RED is configured on TC 7, and a shaper on TCs 7 and 15. RED, TBF and FIFO also all show statistics from TCs 7 and 15.
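When scripting a configuration, a planned chain of leaf qdiscs can be checked against the list of offloadable combinations before any tc command is issued. A minimal sketch, assuming the chain is written top-down and joined by dashes; is_offloadable_complex is a hypothetical helper, and it does not check whether each qdisc is offloadable in isolation.

```shell
# Hypothetical helper: check a leaf qdisc chain, written top-down and
# joined by dashes, against the combinations listed in this section.
is_offloadable_complex() {
    case "$1" in
        FIFO|RED-FIFO|TBF-FIFO|RED-TBF-FIFO|TBF-RED-FIFO) return 0 ;;
        *) return 1 ;;
    esac
}

is_offloadable_complex RED-TBF-FIFO && echo "RED-TBF-FIFO: listed"
is_offloadable_complex RED-RED-FIFO || echo "RED-RED-FIFO: not listed"
```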

Note: Up until and including Linux 5.15, only the root qdisc and its children (if any) were offloaded.

pFIFO, bFIFO

pFIFO or bFIFO qdiscs are by default attached to newly-created qdisc classes and classful qdiscs. These default qdiscs have a handle of 0:0, and will only be shown by tc qdisc show when the invisible flag is passed. A pFIFO or bFIFO qdisc can also be created explicitly with a non-null handle, in which case it is shown normally.

Note: Linux also supports the related but distinct qdiscs pfifo_head_drop and pfifo_fast. These are not offloaded.

FIFO Parameters

No FIFO parameters are offloaded and there are no mandatory SW-datapath parameters.

In the software datapath, the parameter limit is used for configuration of the queue size. In the hardware datapath, queue limits are configured through shared buffer.

UC and BUM Traffic Classes

As described above, each ETS (and PRIO) band represents a pair of traffic classes.

When a FIFO qdisc is offloaded, its stats reflect HW datapath traffic flowing through the corresponding traffic class. Both UC and BUM traffic is counted.

Statistics

The show command will show the current configuration including FIFO's statistics. Pass the invisible flag in order to see qdiscs with a null handle.

# tc -s q sh dev swp7 invisible
qdisc ets 10: root refcnt 2 offloaded bands 3 strict 3 priomap 0 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 Sent 62736398886 bytes 7793372 pkt (dropped 29552253, overlimits 0 requeues 0)
 backlog 814464b 0p requeues 0
qdisc pfifo 0: parent 10:3 offloaded limit 1000p
 Sent 636 bytes 6 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc pfifo 0: parent 10:2 offloaded limit 1000p
 Sent 10215514 bytes 1270 pkt (dropped 22780159, overlimits 0 requeues 0)
 backlog 411264b 0p requeues 0
qdisc pfifo 0: parent 10:1 offloaded limit 1000p
 Sent 62726182736 bytes 7792096 pkt (dropped 6772094, overlimits 0 requeues 0)
 backlog 403200b 0p requeues 0

FIFO reports all the usual counters.

Example

In the following example, the FIFO at ETS band 0 (class 10:1) is replaced with one that has a non-zero handle, so that it is visible in normal tc dumps:

# tc qdisc add dev swp1 parent 10:1 handle 101: pfifo

RED

RED is a queuing discipline designed for congestion avoidance. It can run in one of three modes:

  • In RED mode, the qdisc drops packets according to a simple linear probability function described below.

  • In ECN mode, it uses the same probability function, but marks ECN-capable packets with ECN-CE (congestion encountered) tags instead of dropping them. Non-ECN-capable packets are still early-dropped. Unlike in RED mode, in ECN mode the queue can be filled completely, and excess packets are then tail-dropped.

  • ECN nodrop mode is like pure ECN mode, but does not drop non-ECN-capable packets. ECN nodrop therefore never early-drops, but can still tail-drop packets if the queue grows too large.

The probability to drop or mark a packet is zero until the queue's average size reaches the minimum limit. From there, the probability will rise linearly until it reaches the maximum probability at a point where the queue's average size reaches the maximum limit. When the queue's average size is above the maximum, the probability to drop a packet is 1 (See figure below).

Figure 1: Probability of dropping or marking a packet as a function of the average queue size.

The drops due to the RED algorithm are called early drops. They differ from tail drops, which are caused by shared buffer quota exhaustion.
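The linear probability function described above can be written out as a short calculation. This is a sketch of the textual description only, not of the ASIC's implementation; red_probability is a hypothetical helper, and all sizes are given in bytes.

```shell
# Hypothetical helper: drop-or-mark probability for a given average queue
# size, following the linear description in this section.
# Usage: red_probability <avg> <min> <max> <max probability>
red_probability() {
    awk -v avg="$1" -v min="$2" -v max="$3" -v pmax="$4" 'BEGIN {
        if (avg < min)       p = 0
        else if (avg > max)  p = 1
        else                 p = pmax * (avg - min) / (max - min)
        printf "%.3f\n", p
    }'
}

# min 500K (512000 bytes), max 1.5M (1572864 bytes), probability 0.1
red_probability 400000 512000 1572864 0.1    # prints 0.000
red_probability 1042432 512000 1572864 0.1   # prints 0.050
red_probability 1600000 512000 1572864 0.1   # prints 1.000
```

The middle call uses the midpoint between min and max, where the probability is half of the configured maximum.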

RED Parameters

The following parameters are offloaded:

  • min - The minimum limit.
  • max - The maximum limit.
  • probability - The probability to drop a packet when the average queue size is at maximum limit. 1.0 means 100%.
  • ecn - If set, puts the qdisc into ECN mode.
  • nodrop - If set together with ecn, puts the qdisc into "ECN nodrop" mode.

The following parameters are not offloaded, but are mandatory for the software datapath qdisc:

  • limit - Hard limit for the queue's size.
  • avpkt - Average queue size calculation parameter. 1000 is recommended.

UC and BUM Traffic Classes

As described above, each ETS (and PRIO) band represents a pair of traffic classes. Adding RED at a band configures RED only on the UC TC, not on BUM one. There is currently no way to configure RED on BUM traffic classes.

Statistics

The show command will show the current configuration including RED's statistics.

$ tc -s qdisc show dev swp1
qdisc ets 10: root refcnt 2 offloaded [...]
qdisc red 101: parent 10:1 offloaded limit 2Mb min 500Kb max 1536Kb
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  marked 0 early 0 pdrop 0 other 0

Besides the usual counters, RED supports the following values:

  • dropped - On RED this counter includes the number of packets early-dropped on the UC TC. (No packets are early-dropped on BUM TCs.)

  • marked - Since Linux kernel 5.16, and only on Spectrum-3 and above, this statistic reports the number of packets that were ECN-marked on the TC where the RED qdisc is installed.

    Up until and including Linux kernel 5.6 this counter reports global number of all ECN-marked packets (despite being reported at a particular band). In 5.7 this global counter was moved to ethtool's ecn_marked counter.

  • overlimits - The number of packets that were early-dropped or ECN-marked (with the caveat mentioned at marked above).

  • early - The number of packets that were early-dropped.

  • pdrop - The number of packets that were tail-dropped. Note that the tail-dropped count does include the numbers from BUM TC.

  • other - This counter is not used for HW datapath.

Qevents

The following qevents can be offloaded for RED qdisc:

  • early_drop - Filters at the configured block are invoked on packets that are early-dropped by the RED algorithm. Packets trapped by this qevent are reported under the early_drop trap.
  • mark - Filters at the mark qevent are invoked on ECN-marked packets.

Example

The following example attaches a RED qdisc under band 0 of an ETS parent whose handle is 10:. Between the queue depths of 500KiB and 1.5MiB, the dropping probability will gradually rise from 0 to 10%.

# tc qdisc add dev swp1 parent 10:1 handle 101: \
     red limit 2M avpkt 1000 probability 0.1 min 500K max 1.5M

The following example creates a RED qdisc with the same configuration, but puts it into "ecn nodrop" mode:

# tc qdisc add dev swp1 parent 10:1 handle 101: \
     red ecn nodrop limit 2M avpkt 1000 probability 0.1 min 500K max 1.5M

TBF

The TBF queuing discipline implements a shaper based on Token Bucket algorithm.

TBF Parameters

The following parameters are offloaded:

  • rate - The speed with which the queued traffic will be sent. The guaranteed granularity is 200Mbps.

  • burst - The number of bytes of traffic that is dequeued before the shaper rate takes effect. The value needs to be a power of 2. The range of valid values depending on system type is summarized below.

    Switch ASIC Valid range
    Spectrum-1 2K .. 2G
    Spectrum-2 128K .. 2G
    Spectrum-3 128K .. 2G
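A burst value can be checked against these constraints before it is passed to tc. The following is a minimal sketch: burst_is_valid is a hypothetical helper, the ASIC is selected by its generation number, and K and G are assumed to be binary units (1024 and 1024^3 bytes).

```shell
# Hypothetical helper: check whether a burst size in bytes is a power of
# two within the range listed for the ASIC (1, 2 or 3 for Spectrum-1/2/3).
# Assumes K and G are binary units (1024 and 1024^3 bytes).
burst_is_valid() {
    burst=$1
    case "$2" in
        1) min=$((2 * 1024)) ;;
        2|3) min=$((128 * 1024)) ;;
        *) return 2 ;;
    esac
    max=$((2 * 1024 * 1024 * 1024))
    # A power of two has exactly one bit set, so x & (x - 1) is zero.
    [ "$burst" -ge "$min" ] && [ "$burst" -le "$max" ] &&
        [ $((burst & (burst - 1))) -eq 0 ]
}

burst_is_valid 131072 2 && echo "128K is valid on Spectrum-2"
burst_is_valid 65536 2  || echo "64K is below the Spectrum-2 minimum"
burst_is_valid 65536 1  && echo "64K is valid on Spectrum-1"
burst_is_valid 196608 3 || echo "192K is not a power of 2"
```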

The following parameter is not offloaded, but is mandatory for the software datapath:

  • limit - Hard limit for the queue's size.

UC and BUM Traffic Classes

As described above, each ETS (and PRIO) band represents a pair of traffic classes. Configuring TBF at a band sets up a shaper that applies to both UC and BUM traffic together.

Note that besides the shaper configured through TBF, mlxsw also automatically adds a minimum shaper of 200Mbps at a BUM TC. Thus any BUM traffic is guaranteed to get at least 200Mbps. Only on top of that does the TBF shaper apply to the combination of both traffic types.

Port Shaper

Beginning with Linux 5.16, when TBF is installed as a root qdisc, it acts as a port shaper, and thus configures the traffic flowing through the whole port, instead of affecting individual TCs.

When in a root position, TBF does not limit installation of other qdiscs. It is therefore possible to configure ETS as a child of a port-shaper TBF:

+ swp1
  + TBF (port shaper)
    + ETS
      + band 0
        + RED
          + TBF (TC shaper)
            + FIFO
      + band 1
      + band 2

Statistics

The show command will show the current configuration including TBF's statistics.

# tc -s qdisc show dev swp1
qdisc ets 10: root refcnt 2 offloaded [...]
qdisc tbf 101: parent 10:1 offloaded rate 400Mbit burst 131050b lat 18.4ms
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

TBF reports all the usual counters.

Example

The following example attaches a TBF qdisc under band 0 of an ETS parent whose handle is 10:. It configures a 400Mbps shaper with a burst size of 128KiB.

# tc qdisc add dev swp1 parent 10:1 handle 101: \
     tbf rate 400Mbit burst 128K limit 1M

Further Resources

  1. man tc
  2. man tc-ets
  3. man tc-prio
  4. man tc-pfifo, man tc-bfifo
  5. man tc-red
  6. man tc-tbf
  7. Traffic Control HOWTO