Ido Schimmel edited this page Jul 13, 2023 · 57 revisions
Table of Contents
  1. TC Filtering
    1. Features by Version
    2. Filters
    3. Filter Management
    4. SW and HW Datapath
    5. Statistics
    6. Suppressing Per-Filter Statistics
    7. Filter Preference
    8. Matching on Protocol
    9. Specifying Multiple Actions
  2. Flower Classifier
    1. Masked Matches
    2. Matching on Ingress Device
    3. Matching on Layer 2 Miss
    4. Matching on VLAN ID
    5. Matching on L3 Protocol
    6. Matching on L4 Protocol
  3. Flower Actions
    1. Dropping Packets
    2. Trapping Packets to CPU
    3. Accepting Packets
    4. Forwarding Packets
    5. Mirroring Packets
    6. Sampling Packets
    7. VLAN Modify
    8. Chain Goto
    9. Priority Assignment
    10. Editing Packet Headers
    11. Policing
      1. Policing Limitations
  4. Matchall Classifier
  5. Filter Chains
    1. Chain Management
    2. Chain Templates
  6. Shared Blocks
    1. Qevents
  7. ACLs Prior to 4.14
  8. Further Resources

TC Filtering

The Linux TC subsystem takes care of policing, classifying, scheduling and shaping of forwarded traffic. The fundamental elements of the TC architecture are qdiscs, which are discussed in some detail on the Queues Management page. Filters are closely related to qdiscs.

Features by Version

Kernel Version   Features
4.11             Matching on protocol (EtherType)
                 Flower keys src_mac and dst_mac, src_ip and dst_ip (both IPv4 and IPv6), ip_proto ("tcp" and "udp"), src_port and dst_port
                 Actions drop and mirred egress redirect
4.12             Flower keys vlan_prio, vlan_id. Action vlan modify
4.13             Flower key tcp_flags. Action trap
4.14             Flower keys ip_ttl, ip_tos. Action goto chain
4.15             Action pass
4.16             Action mirred egress mirror
5.3              Flower key indev
5.7              Action skbedit priority, pedit TOS / traffic_class
5.8              Action pedit tcp / udp sport / dport
5.9              Action police
5.13             Action sample
5.18             Action pedit ip / ip6 src / dst
6.5              Flower key l2_miss
6.6              Flower port range matching

Filters

Each TC filter has two main parts: a classifier and an action. The classifier describes a class of packets, depending on type of filter and its individual configuration. The action is what happens when a packet falls into the class described by the classifier, again depending on individual configuration.

When attached to a general classful qdisc, one possible action is to select a certain qdisc class to enqueue the packet to (the class_id action). However mlxsw does not offload this action currently.

Besides the class_id action, there is a broad range of programmed and control actions, some of which mlxsw may be able to offload. A qdisc specifically meant for attaching and evaluating filters is clsact.

Note: The clsact qdisc was not available until kernel 4.14. See below for how to configure ACLs on older kernels.

When added, the clsact qdisc allows attaching filters to egress and ingress of a netdevice. The ingress filters are run just after the packet ingresses the host. The egress filters run just before the packet is handed to the root qdisc of the egress device.

mlxsw will offload filters if:

  • The netdevice corresponds to a front panel port.

    That does NOT include uppers of a front-panel port netdevice, such as bridges, VLAN soft devices and others, only the front-panel port netdevices themselves.

  • The qdisc that the filter is added to is a clsact qdisc.

Filter Management

The following example first adds the clsact qdisc and then attaches at the ingress of a netdevice a filter that drops all packets:

# tc qdisc add dev swp6 clsact
# tc filter add dev swp6 ingress flower action drop

The flower keyword introduces the classifier. This example uses the flower classifier, which allows matching on packet headers using symbolic names. In this example the classifier did not get any arguments and will match on all packets.

The action keyword specifies the action that should take place on matched packets. drop means that the packet should be removed from the forwarding path.

To see the list of inserted filters, run one of these two commands, depending on which direction you are interested in:

# tc filter show dev swp6 ingress
# tc filter show dev swp6 egress

E.g.:

# tc filter show dev swp6 ingress
filter protocol all pref 49152 flower chain 0
filter protocol all pref 49152 flower chain 0 handle 0x1
  in_hw in_hw_count 2
        action order 1: gact action drop
         random type none pass val 0
         index 1 ref 1 bind 1 installed 10 sec used 0 sec
        used_hw_stats immediate

The example output shows a number of attributes of the filter that are assigned implicitly. The following sections will go through the interesting ones and discuss them.

A filter can be deleted using a delete command:

# tc filter del dev swp6 ingress pref 49152

SW and HW Datapath

When a filter is offloaded, an in_hw flag is shown in the dump (as in the example above). Offloaded filters affect packets in both the HW datapath and the SW datapath. If the filter should exist only in the HW datapath, or only in the SW datapath, pass the classifier the skip_sw or the skip_hw flag, respectively.

E.g. to insert a HW datapath-only filter:

# tc filter add dev swp6 ingress flower skip_sw action drop

Adding a SW datapath-only filter may make sense for classifiers or actions that are not supported by the device. See trapping for the details about how to get packets to the SW datapath.

Statistics

In order to observe statistics related to packets, bytes transmitted, or last time used, which are maintained on a per filter basis, add the -s flag to the filter show command:

# tc -s filter show dev swp6 ingress
filter protocol all pref 49152 flower chain 0
filter protocol all pref 49152 flower chain 0 handle 0x1
  in_hw in_hw_count 2
        action order 1: gact action drop
         random type none pass val 0
         index 1 ref 1 bind 1 installed 15 sec used 1 sec
        Action statistics:
        Sent 1456 bytes 18 pkt (dropped 18, overlimits 0 requeues 0)
        Sent software 0 bytes 0 pkt
        Sent hardware 1456 bytes 18 pkt
        backlog 0b 0p requeues 0
        used_hw_stats immediate

The individual statistics shown are:

  • installed -- How long ago the filter was installed.
  • used -- How long ago the filter last matched a packet.
  • Sent software -- Number of bytes and packets matched in the SW datapath.
  • Sent hardware -- Likewise for the HW datapath.
  • used_hw_stats -- Shows what type of statistics are used for this action. By default this is immediate, in which case the statistics are always up to date. Alternatively, the statistics can be disabled.

Technically, statistics are reported per-action, not per-filter. In offloaded filters, mlxsw by default allocates one counter for the whole filter. Therefore, if multiple actions are attached to the same filter, they will have identical packets and bytes statistics.
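When monitoring offloaded filters from a script, it can be handy to pull just the hardware counters out of the dump. The following is a minimal sketch, assuming the output layout shown above; hw_counters is a hypothetical helper name, not part of tc. It reads a `tc -s filter show` dump on standard input and prints the bytes and packets from each "Sent hardware" line.

```shell
#!/bin/sh
# hw_counters (hypothetical helper): print "<bytes> <packets>" for every
# "Sent hardware" line found on stdin. Assumes the dump layout shown above.
hw_counters() {
    awk '$1 == "Sent" && $2 == "hardware" { print $3, $5 }'
}

# Example input copied from the dump above; on a real system one would pipe
# in the output of "tc -s filter show dev swp6 ingress" instead.
hw_counters <<'EOF'
        action order 1: gact action drop
        Sent 1456 bytes 18 pkt (dropped 18, overlimits 0 requeues 0)
        Sent software 0 bytes 0 pkt
        Sent hardware 1456 bytes 18 pkt
EOF
```

If the dump contains several filters or actions, one line is printed per action.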

Suppressing Per-Filter Statistics

The number of counters is limited and potentially lower than the number of TC filters that can be programmed to the device. It is possible to disable the allocation of hardware counters by passing the hw_stats option to the action when adding the filter.

# tc filter add dev swp6 ingress flower skip_sw \
     action drop hw_stats disabled

Disabling the per-flow counter only affects the bytes and packets counters. When disabled, they always report zeroes. The installed and used times are still valid.

When the hw_stats directive is not used, the default is to allocate an immediate counter. This behavior can be requested explicitly by passing the immediate type:

# tc filter add dev swp6 ingress flower skip_sw \
     action drop hw_stats immediate

The current occupancy of counters in HW can be queried using "devlink-resource":

# devlink resource show $(devlink dev) | grep 'name flow'
      name flow size 24576 occ 12 unit entry dpipe_tables none
# tc filter add dev swp7 egress flower action trap
# devlink resource show $(devlink dev) | grep 'name flow'
      name flow size 24576 occ 14 unit entry dpipe_tables none
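To track how much headroom is left, the size and occ fields can be subtracted. The sketch below assumes the single-line "name ... size ... occ ..." layout shown above; resource_free is a hypothetical helper name. It reads `devlink resource show` output on standard input and prints the number of free entries for the named resource.

```shell
#!/bin/sh
# resource_free (hypothetical helper): print size - occ for the resource
# whose name is given as $1. Lines without an occ field are skipped.
resource_free() {
    awk -v name="$1" '
        {
            n = ""; size = ""; occ = ""
            for (i = 1; i < NF; i++) {
                if ($i == "name") n = $(i + 1)
                if ($i == "size") size = $(i + 1)
                if ($i == "occ")  occ = $(i + 1)
            }
            if (n == name && size != "" && occ != "")
                print size - occ
        }'
}

# Example input copied from the output above; on a real system one would
# pipe in the output of "devlink resource show" instead.
echo "      name flow size 24576 occ 12 unit entry dpipe_tables none" |
    resource_free flow
```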

Filter Preference

Preference is the filter attribute that determines the order in which the filters are evaluated. Filters with lower preference are evaluated before filters with higher preference. If preference is not specified on the command line, the kernel assigns one, starting at 49152 and decreasing by one for each filter inserted without explicitly specified preference. To specify the preference, use the pref option:

# tc filter add dev swp6 ingress pref 123 prot ipv6 flower action drop

When several filters have the same preference, they are evaluated in the order of their addition.

To reduce the number of lookups, it is recommended to configure filters that share the same mask with the same preference. For example, if N flower filters that match on the destination IP address are configured with N different preferences, a packet can incur up to N lookups despite the fact that only a single filter can match. When all the filters are configured with the same preference, a packet will incur a single lookup.
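As an illustration of the recommendation above, the following sketch prints (it does not execute) one filter command per destination address, all with the same preference and the same full-length mask. gen_same_pref_filters is a hypothetical helper name, and the port, preference and addresses are example values.

```shell
#!/bin/sh
# gen_same_pref_filters (hypothetical helper): print one "tc filter add"
# command per address, all sharing the preference given as $2.
# Usage: gen_same_pref_filters DEV PREF IP...
gen_same_pref_filters() {
    dev=$1
    pref=$2
    shift 2
    for ip in "$@"; do
        echo "tc filter add dev $dev ingress pref $pref prot ip flower dst_ip $ip action drop"
    done
}

# Three host-address matches sharing preference 100.
gen_same_pref_filters swp6 100 192.0.2.1 192.0.2.2 192.0.2.3
```

The printed commands can be reviewed and then piped to a root shell if they look right.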

Matching on Protocol

Unless otherwise specified, the added filters match on packets regardless of their EtherType. To match on packets with a specific EtherType, the filter needs to be added to the filter tree dedicated to that protocol, through a prot argument. E.g. to drop only IPv6 packets:

# tc filter add dev swp6 ingress prot ipv6 flower action drop

These protocol-specific filter trees exist independently of each other. Since flower does not have a way of matching on EtherType, there is no way to match on a packet from one protocol, and only if that match fails, proceed to a match on another protocol.

A protocol selector all can be used to explicitly select matching on packets regardless of their EtherType:

# tc filter add dev swp6 ingress prot all flower action drop

Note: In the SW datapath, the indicated protocol matches on the outermost EtherType. If the packet is VLAN tagged, the protocol value needs to be 802.1q, not ip, even if IP is what is inside the VLAN tag. Matching on the inner IP is then done through flower vlan_ethtype key. This is unlike the HW datapath, where both protocol ip and protocol 802.1q would match.

Specifying Multiple Actions

One filter can perform several actions on the matched packets. For some of the tc actions, such as tc-vlan, the default control action is pipe, which means that when no control action is specified, listing the actions one after another is all that needs to be done. For example, to change VLAN and redirect the packet, one would do:

# tc filter add dev swp6 ingress flower     \
     action vlan modify id 85               \
     action mirred egress redirect dev swp8

For other tc actions, such as tc-pedit, the default control action is pass. Therefore, to chain several actions together, the pipe control action needs to be specified between every two actions. For example, to set both the source and destination IP of all packets egressing swp6 and destined to 223.0.2.2, one would do:

# tc filter add dev swp6 egress prot ip flower dst_ip 223.0.2.2 skip_hw	\
     action pedit ex munge ip src set 1.1.1.1 pipe			\
     action pedit ex munge ip dst set 8.8.8.8

In order to avoid relying on the default behavior of various tc actions, it is recommended to always specify the pipe control action when the intention is to stitch actions together.
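One way to follow this recommendation consistently is to build the action list programmatically. The sketch below only assembles a string; join_actions is a hypothetical helper name, and each argument is assumed to be a complete action specification without a trailing control action.

```shell
#!/bin/sh
# join_actions (hypothetical helper): prefix every argument with "action"
# and append "pipe" to every action except the last one.
join_actions() {
    out=""
    while [ "$#" -gt 0 ]; do
        out="$out action $1"
        if [ "$#" -gt 1 ]; then
            out="$out pipe"
        fi
        shift
    done
    # Strip the leading space left by the first concatenation.
    echo "${out# }"
}

join_actions "pedit ex munge ip src set 1.1.1.1" \
             "pedit ex munge ip dst set 8.8.8.8"
```

The resulting string can be appended to a `tc filter add ... flower ...` command line.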

Flower Classifier

Flower is the main classifier that mlxsw is capable of offloading, as long as the keys used for matching are supported. The list of supported keys is as follows:

  • indev -- Match on the port that the packet ingressed through.
  • l2_miss -- Match on whether the packet encountered a layer 2 miss in the bridge.
  • src_mac, dst_mac -- Match on the MAC address.
  • vlan_ethtype, vlan_prio, vlan_id -- Match on 802.1Q header.
  • src_ip, dst_ip -- Match on source resp. destination IPv4 or IPv6 address.
  • ip_ttl -- Match on IPv4 TTL or IPv6 hop limit.
  • ip_tos -- Match on IPv4 TOS or IPv6 traffic class.
  • ip_proto -- Match on L4 protocol or IPv6 next header.
  • src_port, dst_port -- Match on TCP or UDP ports, including range matching.
  • tcp_flags -- Match on TCP flags.

Masked Matches

The flower classifier allows matching on just part of the selected field. For example, to match just the DSCP part of the TOS field:

# tc filter add dev swp6 ingress prot ip \
     flower ip_tos $((dscp << 2))/0xfc   \
     action drop
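The shift in the example above reflects that DSCP occupies the upper six bits of the TOS byte. A minimal sketch of a helper that produces the value/mask argument follows; flower_dscp_match is a hypothetical name, and 46 is just an example DSCP value.

```shell
#!/bin/sh
# flower_dscp_match (hypothetical helper): print the VALUE/MASK argument for
# the flower ip_tos key that matches the given DSCP (0-63) and ignores ECN.
flower_dscp_match() {
    dscp=$1
    if [ "$dscp" -lt 0 ] || [ "$dscp" -gt 63 ]; then
        echo "DSCP must be in the range 0..63" >&2
        return 1
    fi
    # DSCP sits in bits 7..2 of the TOS byte; 0xfc masks out the ECN bits.
    echo "$((dscp << 2))/0xfc"
}

flower_dscp_match 46   # prints 184/0xfc
```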

For the IP addresses, flower supports the usual address/length notation:

# tc filter add dev swp6 ingress prot ip \
     flower src_ip 192.0.2.16/28 \
     action drop

Matching on Ingress Device

The key indev matches packets that entered the switch through the indicated netdevice. A filter is not offloaded unless the netdevice corresponds to a front panel port.

# tc filter add dev swp7 egress flower indev swp6 action drop

Matching on Layer 2 Miss

The key l2_miss can be used to match on layer 2 miss in the bridge driver's FDB / MDB. When 1, match on packets that encountered a layer 2 miss. When 0, match on packets that were forwarded using an FDB / MDB entry. Note that broadcast packets do not encounter a miss since a lookup is not performed for them. The key can be used to implement non-DF (Designated Forwarder) filtering in EVPN multi-homing, as explained here.

# tc filter add dev swp7 egress flower l2_miss 1 action drop

Matching on VLAN ID

The flower key vlan_id matches on the VID in the 802.1q header:

# tc filter add dev swp1 ingress protocol 802.1q \
     flower vlan_id 95 skip_sw action drop

Note: Packets arriving without 802.1q TCI, or ones which are only priority-tagged, are assigned a bridge PVID by the hardware. Thus, a flower match on a VID equal to PVID will match untagged packets as well.

Matching on L3 Protocol

The keys src_ip and dst_ip are used for matching on the source and destination address, respectively. Both IPv4 and IPv6 addresses are supported. The IP version depends on the matched EtherType, which is selected either by matching on protocol or by using the vlan_ethtype flower key.

For example:

# tc filter add dev swp1 ingress protocol 802.1q pref 10   \
     flower skip_sw vlan_ethtype ipv4 dst_ip 192.0.2.16/28 \
     action drop
# tc filter add dev swp1 ingress protocol ipv6 pref 10     \
     flower skip_sw dst_ip fe01::3                         \
     action drop

The same holds for other L3 headers. For example ip_ttl is not available unless the protocol is IP or IPv6, and ip_tos with IPv6 really matches on traffic class.

Note that matching partial IP addresses is possible using the usual address/length notation. See above for more details.

Matching on L4 Protocol

The key ip_proto allows matching on the IPv4 L4 protocol and IPv6 next header. It also enables further matching on L4-specific keys. E.g. matching on the keys src_port and dst_port is not allowed unless there is also a match on ip_proto tcp or udp. For example:

# tc filter add dev swp1 ingress protocol ipv6 pref 10 \
     flower skip_sw ip_proto tcp dst_port 3333         \
     action drop

Note that matching on ip_proto itself is not possible unless the packet is otherwise matched as IPv4 or IPv6, either through matching on protocol or by using the vlan_ethtype flower key.

It is possible to match on a range of source or destination ports by specifying the value of the src_port and dst_port keys as a range. For example:

# tc filter add dev swp1 ingress protocol ipv6 pref 10 \
     flower skip_sw ip_proto tcp dst_port 3333-4444    \
     action drop

Port range matching is implemented in the device using dedicated port range registers, which are limited in number. To overcome this limitation, the driver reuses a port range register across different filters if the filters match on the same range ({min, max}) and the same port type (source / destination).

The maximum number of port range registers as well as their current occupancy can be queried using "devlink-resource":

# devlink resource show $(devlink dev) | grep 'port_range'
  name port_range_registers size 16 occ 2 unit entry dpipe_tables none
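The register sharing described above can be modeled to estimate how many registers a planned set of filters needs. This is only a sketch of the stated rule (one register per unique combination of port type and range), not a driver interface; count_port_range_registers is a hypothetical helper name.

```shell
#!/bin/sh
# count_port_range_registers (hypothetical model): stdin carries one line per
# filter in the form "<src|dst> <min>-<max>". Identical lines share a
# register, so the result is the number of unique lines.
count_port_range_registers() {
    sort -u | awk 'END { print NR }'
}

# Four filters, but two of them use the same destination range: 3 registers.
printf '%s\n' "dst 3333-4444" "dst 3333-4444" "src 3333-4444" "dst 80-90" |
    count_port_range_registers
```

The result can be compared against the size reported for port_range_registers before installing the filters.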

Note: ip_proto match for IPv6 is not supported for the following next header values: routing, fragment, destination, authentication, esp, mobility, hop_by_hop, host_identity_protocol, shim6. This is due to the HW parser architecture.

Flower Actions

mlxsw can offload the flower classifier with a number of actions.

Dropping Packets

The action drop causes matched packets to be removed from the pipeline.

Trapping Packets to CPU

The action trap removes matched packets from the HW pipeline and moves them to the CPU, where the normal Linux SW datapath takes over. Such packets can be observed through e.g. tcpdump or wireshark, unlike the normal HW-datapath packets.

Note that in the SW datapath, the trap action drops the packet. Thus the action likely does not make sense unless specified as skip_sw.

The trapped packets will appear to ingress through the netdevice that corresponds to the front panel port through which the packet entered the switch. I.e. even if the trap is on egress, the packet will appear on ingress again.

In the following example, UDP/IP packets with destination port of 1234 are trapped to slow path for further processing by SW-only U32 classifier:

# tc filter add dev swp1 ingress prot ip       \
     flower skip_sw ip_proto udp dst_port 1234 \
     action trap
# tc filter add dev swp1 ingress prot ip       \
     u32 skip_hw ...

Accepting Packets

The action pass accepts matched packets for further HW pipeline forwarding. Processing of more filters is thus avoided.

Forwarding Packets

The mirred egress redirect action serves to redirect a packet to the egress of a specified port. This action is not offloaded unless the following holds:

  • The filter is attached to the ingress of a netdevice.
  • The destination netdevice corresponds to a front panel port.

In the following example, packets that arrive to swp1 are forwarded to swp2:

# tc filter add dev swp1 ingress flower \
     action mirred egress redirect dev swp2

Mirroring Packets

The mirred egress mirror action causes packets to be copied to the egress of a specified port. The Port Mirroring page discusses the mirred offload in more detail.

Sampling Packets

The sample action samples packets according to a configured sampling rate (i.e., 1 out of N packets). Sampled packets are forwarded by the data path (software or hardware), but a copy can be sent to higher layers (e.g., user space) for inspection. The Packet Sampling page discusses sampling in more detail.

VLAN Modify

The action vlan modify allows changing of the VLAN ID:

# tc filter add dev swp1 ingress \
     flower action vlan modify id 85

Note: Packets which arrive without 802.1q TCI, or which are only priority-tagged, are assigned a bridge PVID by the hardware. Thus, a vlan modify to a non-PVID tag effectively pushes a VLAN tag on such a packet, and likewise a vlan modify to a PVID tag pops it. That is unlike the software pipeline, where vlan modify is only meaningful on packets which are already 802.1q-tagged.

Chain Goto

This action invokes further filters at a specified chain. See Filter Chains for further details.

Priority Assignment

Action skbedit priority is offloaded to assign priority to a packet. See ACL-Based Priority Assignment for more details.

Editing Packet Headers

The action pedit is offloaded to allow changing of some packet header fields. The following pseudocode example gives an idea of the syntax:

# tc filter add dev swp6 ingress prot <prot> flower skip_sw \
     action pedit ex munge <pedit-prot> <field> set <value> retain <mask>

Or, if the <mask> should cover the whole field:

# tc filter add dev swp6 ingress prot <prot> flower skip_sw \
     action pedit ex munge <pedit-prot> <field> set <value>

Note that for purposes of protocol matching (<prot> above), IPv6 is called ipv6, whereas for purposes of pedit (<pedit-prot> above) it is called ip6.

The following protocols and fields are offloaded:

  • IPv4 and IPv6 fields tos resp. traffic_class. The supported masks are 0xff (for the whole TOS / traffic class field), 0xfc (for just the DSCP subfield) and 0x03 (for the ECN subfield).

    The DSCP rewrite is covered on the Quality of Service page.

    As an example, to remove ECN marking of an IPv4 packet without touching the rest of the TOS field:

    # tc filter add dev swp6 ingress prot ip flower skip_sw \
         action pedit ex munge ip dsfield set 0 retain 0x3
    

    To change IPv6 DSCP:

    # tc filter add dev swp6 ingress prot ipv6 flower skip_sw \
         action pedit ex munge ip6 traffic_class set $((dscp << 2)) retain 0xfc
    
  • IPv4 and IPv6 src and dst fields on Spectrum-2 and above. Only full mask (e.g. 0xffffffff for IPv4 addresses) is supported.

    # tc filter add dev swp6 ingress prot ip flower skip_sw \
         action pedit ex munge ip src set 198.51.100.1
    
  • TCP and UDP fields sport resp. dport on Spectrum-2 and above. Only full mask (0xffff) is supported.

    # tc filter add dev swp6 ingress prot ip flower skip_sw \
         action pedit ex munge udp sport set 1
    

Policing

The action police is offloaded to allow policing of ingress or egress bandwidth. For example:

# tc filter add dev swp1 ingress prot ip pref 1 \
	flower skip_sw src_ip 192.0.2.1 \
	action police rate 1gbit burst 16k conform-exceed drop/ok

To query the number of packets that were dropped by the policer, run:

# tc -s filter show dev swp1 ingress prot ip pref 1
filter flower chain 0
filter flower chain 0 handle 0x1
  eth_type ipv4
  src_ip 192.0.2.1
  skip_sw
  in_hw in_hw_count 1
        action order 1:  police 0x1 rate 1Gbit burst 16250b mtu 2Kb action drop overhead 0b
        ref 1 bind 1 installed 54 sec used 0 sec
        Action statistics:
        Sent 6670013310 bytes 828985 pkt (dropped 365018, overlimits 0 requeues 0)
        Sent software 0 bytes 0 pkt
        Sent hardware 6670013310 bytes 828985 pkt
        backlog 0b 0p requeues 0
        used_hw_stats immediate

In the above example, 365018 packets were dropped by the policer.

Conforming packets can be piped to subsequent actions using the pipe control action. The exceed action must always be set to drop. For example, in order to mirror policed packets to a different port, run:

# tc filter add dev swp1 ingress prot ip \
	flower skip_sw src_ip 192.0.2.1 \
	action police rate 1gbit burst 16k conform-exceed drop/pipe \
	action mirred egress mirror dev swp2

Policers are a global resource and they can be shared by multiple filters. To do so, assign an index to a policer and then re-use it when installing more filters:

# tc filter add dev swp1 ingress prot ip \
	flower skip_sw src_ip 192.0.2.1 \
	action police rate 1gbit burst 16k conform-exceed drop/ok index 10

# tc filter add dev swp2 ingress prot ip \
	flower skip_sw src_ip 198.51.100.1 \
	action police index 10

The maximum number of supported policers and their current usage can be read via the single_rate_policers resource in devlink resource. Example:

# devlink resource show pci/0000:06:00.0
pci/0000:06:00.0:
...
  name global_policers size 2040 unit entry dpipe_tables none
    resources:
      name single_rate_policers size 1984 occ 0 unit entry dpipe_tables none

Policing Limitations

  1. Only the rate, burst and conform-exceed options are supported. The rest are ignored.

  2. While conforming packets can be piped to other actions, packets that exceed the policer's rate or burst size must be dropped.

  3. For optimal results, the configured burst size should be at least:

    min_burst = 0.4 * rate [bits]
    

    Where rate is the configured rate in kilobits per second. For example, if the configured rate is 5gbit, the minimum burst size should be:

    min_burst = 0.4 * 5000000 = 2000000 [bits] = 250000 [bytes] = 250 [kb]
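The burst calculation above can be scripted. The sketch below follows the stated formula with the rate given in kilobits per second, and rounds the result up to whole bytes; min_burst_bytes is a hypothetical helper name.

```shell
#!/bin/sh
# min_burst_bytes (hypothetical helper): given a rate in kbit/s, print the
# recommended minimum burst size in bytes, i.e. 0.4 * rate bits, divided by
# 8 and rounded up.
min_burst_bytes() {
    rate_kbit=$1
    bits=$(( rate_kbit * 4 / 10 ))
    echo $(( (bits + 7) / 8 ))
}

min_burst_bytes 5000000   # 5gbit: prints 250000
```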
    

Matchall Classifier

The matchall classifier simply matches all packets. It is offloaded only under the following restrictions:

  • No protocol matching is allowed.
  • Only sample and mirred egress mirror actions are supported.

The Port Mirroring and Packet Sampling pages discuss the mirred and sample offload, respectively.

Ingress matchall rules are executed in the device before ingress flower rules. Similarly, egress matchall rules are executed in the device after egress flower rules. This ordering is enforced by the driver:

# tc filter add dev swp1 ingress prot ip pref 2 \
	flower skip_sw src_ip 192.0.2.1 \
	action drop
# tc filter add dev swp1 ingress prot all pref 3 \
	matchall skip_sw \
	action sample rate 100 group 1 trunc 64
Error: Failed to add behind existing flower rules.
We have an error talking to the kernel
# tc filter add dev swp1 ingress prot all pref 1 \
	matchall skip_sw \
	action sample rate 100 group 1 trunc 64

Filter Chains

TC filters are put together into chains by order of priority (pref). Each chain can be looked at as a table of classifier-action rules.

To insert a filter into a specific chain, one has to use the chain parameter:

# tc filter add dev swp1 ingress chain 100 flower action drop

In this example, we added a filter into chain 100. If the chain parameter is omitted, the default chain 0 is assumed. Chain 0 is also the chain which is always processed first. If other chains should be processed, the action goto chain needs to be invoked.

# tc filter add dev swp1 ingress protocol ip \
     flower skip_sw dst_ip 192.168.101.1     \
     action goto chain 100

Chain Management

If a chain does not exist before a filter is added, it is implicitly created. Similarly, after the last filter is removed, implicitly created chains are destroyed. It is also possible to explicitly create and destroy chains:

# tc chain add dev swp1 ingress chain 11
# tc chain del dev swp1 ingress chain 11

If a chain contains filters when it is deleted, they are deleted as well. The delete command can be used for both implicitly and explicitly created chains.

To list existing chains, run:

# tc chain show dev swp1 ingress
chain parent ffff: chain 11

Note: There is a limit of 255 goto jumps that the HW can process for a single packet. If more goto jumps are configured, the packet gets dropped.

Chain Templates

As a chain is created (whether the implicit chain 0 or any other), mlxsw needs to guess which keys the user will want to match on in the filters that will be on this chain. If the guess proves to be too narrow, insertion of certain filters might fail, depending on the order in which they are added. If the guess proves to be too broad, some TCAM space will be wasted, which impacts the number of filters that can be offloaded.

The user often knows in advance which keys they will want to use on a given chain. For example, they may only need matching on a destination IP address.

Chain templates allow the user to specify the shape that filters on this chain are going to have. mlxsw can then leverage this knowledge to configure the HW optimally to support the requested matching keys.

The template is configured during explicit chain creation, like this:

# tc chain add dev swp1 ingress proto ip chain 11 \
     flower dst_ip 0.0.0.0/16

The template is then shown when listing chains:

# tc chain show dev swp1 ingress
chain parent ffff: flower chain 11
  eth_type ipv4
  dst_ip 0.0.0.0/16

Addition of filters that fit the template will be successful:

# tc filter add dev swp1 ingress proto ip chain 11 \
     flower dst_ip 10.0.0.0/8                      \
     action drop

Addition of filters that do not fit the template will fail:

# tc filter add dev swp1 ingress proto ip chain 11 \
     flower dst_ip 10.0.0.0/24                     \
     action drop
Error: cls_flower: Mask does not fit the template.
We have an error talking to the kernel, -1
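The two outcomes above are consistent with a simple rule: a filter fits when every bit set in its mask is also set in the template's mask. The following sketch models that rule for IPv4 prefix lengths only. It is an assumption drawn from the examples, and prefix_mask and mask_fits_template are hypothetical helper names.

```shell
#!/bin/sh
# prefix_mask (hypothetical helper): print the 32-bit mask for a prefix length.
prefix_mask() {
    len=$1
    if [ "$len" -eq 0 ]; then
        echo 0
    else
        echo $(( (0xffffffff << (32 - len)) & 0xffffffff ))
    fi
}

# mask_fits_template (hypothetical helper): succeed if the filter's prefix
# mask ($1) has no bits set outside the template's prefix mask ($2).
mask_fits_template() {
    fmask=$(prefix_mask "$1")
    tmask=$(prefix_mask "$2")
    [ $(( fmask & ~tmask & 0xffffffff )) -eq 0 ]
}

# The two cases from the examples above, against a /16 template.
for len in 8 24; do
    if mask_fits_template "$len" 16; then
        echo "/$len fits the /16 template"
    else
        echo "/$len does not fit the /16 template"
    fi
done
```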

Shared Blocks

By default, each qdisc has its own group of chains. This group of chains is called a block. Therefore two clsact qdiscs, each on a different device, will each have their own suite of filter chains, even if the filters themselves are otherwise exactly the same. mlxsw currently does not attempt to deduplicate such cases automatically. So not only is such a setup harder to configure, it also wastes more TCAM resources, which may limit the scale of the solution.

Block sharing is a way to resolve the above issues. When creating a qdisc, it is possible to request a particular block that should be used for the ingress and egress chains:

# tc qdisc add dev swp1 ingress_block 22 egress_block 23 clsact
# tc qdisc add dev swp2 ingress_block 22 egress_block 23 clsact

These two commands added clsact qdiscs to two netdevices. The ingress_block and egress_block options indicate which shared block should be used in the respective direction. Since both qdiscs use the same numbers, the qdiscs end up using identical filter sets. The numbers are arbitrary and it is up to the user to keep track of which number corresponds to which block.
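When many ports should share the same pair of blocks, the commands can be generated instead of typed by hand. The sketch below prints (it does not execute) one command per port; gen_shared_block_qdiscs is a hypothetical helper name, and the block numbers and port names are example values.

```shell
#!/bin/sh
# gen_shared_block_qdiscs (hypothetical helper): print one "tc qdisc add"
# command per device, binding each to the same ingress and egress blocks.
# Usage: gen_shared_block_qdiscs INGRESS_BLOCK EGRESS_BLOCK DEV...
gen_shared_block_qdiscs() {
    ingress_block=$1
    egress_block=$2
    shift 2
    for dev in "$@"; do
        echo "tc qdisc add dev $dev ingress_block $ingress_block egress_block $egress_block clsact"
    done
}

gen_shared_block_qdiscs 22 23 swp1 swp2 swp3
```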

If you list the existing qdiscs, you see the block sharing info in the output:

# tc qdisc show dev swp1
qdisc clsact ffff: parent ffff:fff1 ingress_block 22 egress_block 23
# tc qdisc show dev swp2
qdisc clsact ffff: parent ffff:fff1 ingress_block 22 egress_block 23

To make it more visual, the situation looks like this:

       swp1 clsact qdisc            swp2 clsact qdisc
            (ing)(egr)                   (egr)(ing)
              |    |                       |    |
              |    +----->  block 23 <-----+    |
              |               + chain 0         |
              |                 + flower        |
              |                 + ...           |
              |                                 |
              +---------->  block 22  <---------+
                              + chain 0
                              + ...

There is no limit on the number of qdiscs that can share the same block.

Once the qdisc block is shared, it is no longer possible to manipulate the filters using the qdisc handle. Instead, one has to use the block index as a handle:

# tc filter add block 22 flower action drop

In order to implement device-specific filters in shared blocks, the indev flower key may be useful:

# tc filter add block 22 flower indev swp1 action drop

Qevents

Another feature that uses shared blocks is qevents. As above, a block attached to a qevent is implicitly created, and does not disappear until it is no longer referenced, whether by a clsact qdisc or by a qevent.

Formally qevent blocks are simply shared blocks, and filters can be attached to them in the same way as to any other block. A single shared block can even be used by both a clsact instance and a qevent. However such configurations are unlikely to be really useful, because the set of filters permissible in both positions is very limited.

ACLs Prior to 4.14

On recent kernels, a single clsact qdisc holds both ingress and egress rules. On kernels prior to 4.14, one would instead use an ingress qdisc for ingress rules, and an arbitrary egress qdisc for egress rules. E.g.:

# tc qdisc add dev swp1 handle ffff: ingress
# tc qdisc add dev swp1 handle 1: root prio

Attaching the matchall classifier was then done as follows for ingress:

# tc filter add dev swp1 parent ffff:           \
        matchall skip_sw                        \
        action mirred egress mirror dev swp2

And for egress:

# tc filter add dev swp1 parent 1:              \
        matchall skip_sw                        \
        action mirred egress mirror dev swp2

Further Resources

  1. man tc
  2. man tc-flower
  3. man tc-actions
  4. QoS in Linux with TC and Filters by Phil Sutter (part of iproute documentation)
  5. Linux Traffic Control Classifier-Action Subsystem Architecture
  6. man tc-police
  7. man devlink-resource
  8. man tc-sample