Static Routing
- Basic Settings
- Router Interfaces
- Nexthop Routes
- Neighbours
- ECMP Routes
- IPv6 Source-Specific Routing
- Abort Mechanism
- Further Resources
In order for routing to work on a Linux system, forwarding must be enabled. To check if forwarding is enabled, run:
$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 0
$ sysctl net.ipv6.conf.all.forwarding
net.ipv6.conf.all.forwarding = 0
In this case, IPv4/IPv6 forwarding is disabled. To enable it, run:
$ sysctl -w net.ipv4.ip_forward=1
net.ipv4.ip_forward = 1
$ sysctl -w net.ipv6.conf.all.forwarding=1
net.ipv6.conf.all.forwarding = 1
To enable it permanently across reboots, run:
$ echo "net.ipv4.ip_forward = 1" > /etc/sysctl.d/forward.conf
$ echo "net.ipv6.conf.all.forwarding = 1" >> /etc/sysctl.d/forward.conf
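The two echo commands above overwrite the file on the first call and append on the second, so repeating them with different values needs some care. A minimal sketch of an idempotent alternative follows. The helper name sysctl_conf_set is hypothetical and not part of any package; it only assumes the "key = value" file format used above.

```shell
# sysctl_conf_set FILE KEY VALUE   (hypothetical helper, not a system tool)
# Ensure FILE ends up with exactly one "KEY = VALUE" line: any earlier
# assignment of KEY is dropped, all other lines are kept as they are.
sysctl_conf_set() {
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || return 1
    if [ -f "$file" ]; then
        # Copy every line whose key (text before "=") differs from KEY.
        awk -v k="$key" '{
            line = $0
            sub(/^[ \t]+/, "", line)
            split(line, kv, /[ \t]*=/)
            if (kv[1] != k) print
        }' "$file" > "$tmp"
    fi
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

As root, it could be called as sysctl_conf_set /etc/sysctl.d/forward.conf net.ipv4.ip_forward 1. Running sysctl --system afterwards reloads the sysctl configuration files without a reboot.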
In a similar way, the other IP sysctls can be adjusted. For example,
the maximum number of IPv4 neighbour entries can be adjusted via
net.ipv4.neigh.default.gc_thresh3.
Whenever an IP address is assigned to a port netdevice or one of its uppers
(e.g. bridge, team, VLAN), a router interface is automatically created in the
hardware. In the following example, two router interfaces are created, one for
sw1p1 and one for sw1p2:
$ ip addr add 192.168.0.1/24 dev sw1p1
$ ip link set dev sw1p1 up
$ ip addr add 192.168.1.1/24 dev sw1p2
$ ip link set dev sw1p2 up
For each address and its broadcast and network addresses, traps are inserted into the hardware which cause the appropriate packets to be delivered to the kernel.
Note: MAC addresses of all the router interfaces must have the same 38 MSBs.
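The MAC constraint in the note above can be checked before assigning addresses. The sketch below assumes a MAC address is read as a 48-bit number in the order it is written, so that sharing the 38 most significant bits means only the lowest 10 bits may differ. The function names are hypothetical.

```shell
# mac_prefix38 MAC   (hypothetical helper)
# Print the upper 38 bits of a 48-bit MAC address as a decimal number.
mac_prefix38() {
    hex=$(printf '%s' "$1" | tr -d ':')
    # Dropping the low 10 bits leaves the 38 most significant bits.
    echo $(( 0x$hex >> 10 ))
}

# same_mac_prefix MAC1 MAC2   (hypothetical helper)
# Succeed if both addresses share their 38 most significant bits.
same_mac_prefix() {
    [ "$(mac_prefix38 "$1")" = "$(mac_prefix38 "$2")" ]
}
```

For instance, same_mac_prefix e4:1d:2d:45:a9:f1 e4:1d:2d:45:a8:00 succeeds, because the two addresses differ only in their lowest 10 bits.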
By default, the kernel flushes all the IPv6 addresses upon interface down:
$ ip -6 address show dev sw1p1
28: sw1p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
inet6 2001:db8::1/32 scope global
valid_lft forever preferred_lft forever
inet6 fe80::e61d:2dff:fe45:a9f1/64 scope link
valid_lft forever preferred_lft forever
$ ip link set dev sw1p1 down
$ ip -6 address show dev sw1p1
To be consistent with IPv4 and keep static global addresses with no expiration time upon interface down, run:
$ sysctl -w net.ipv6.conf.sw1p1.keep_addr_on_down=1
$ ip -6 address show dev sw1p1
inet6 2001:db8::1/32 scope global
valid_lft forever preferred_lft forever
inet6 fe80::e61d:2dff:fe45:a9f1/64 scope link
valid_lft forever preferred_lft forever
$ ip link set dev sw1p1 down
$ ip -6 address show dev sw1p1
28: sw1p1: <BROADCAST,MULTICAST> mtu 1500 state DOWN qlen 1000
inet6 2001:db8::1/32 scope global tentative
valid_lft forever preferred_lft forever
Note that the global address 2001:db8::1/32 is still configured on the
interface, while the link-local address fe80::e61d:2dff:fe45:a9f1/64 was
flushed.
To make this option the default for all the netdevices on the system upon boot,
set net.ipv6.conf.all.keep_addr_on_down=1 in the sysctl configuration files.
As previously stated, it is possible to create router interfaces on top of bridge netdevs by assigning them an IP address. In the case of the VLAN-aware bridge, a router interface can be created for each of its upper VLAN devices.
To create a router interface for the bridge netdev itself, run:
$ ip link add name br0 type bridge vlan_filtering 1
...
$ ip addr add 192.168.0.1/24 dev br0
And for one of its upper VLAN devices, run:
$ ip link add link br0 name br0.10 type vlan id 10
$ bridge vlan add dev br0 vid 10 self
$ ip addr add 192.168.1.1/24 dev br0.10
Once the router interface is created, it is possible to add routes:
$ ip route add 192.168.2.0/24 via 192.168.0.2 dev sw1p1
$ ip route add 192.168.3.0/24 via 192.168.1.2 dev sw1p2
To list the routes, run:
$ ip route
192.168.0.0/24 dev sw1p1 proto kernel scope link src 192.168.0.1 offload
192.168.1.0/24 dev sw1p2 proto kernel scope link src 192.168.1.1 offload
192.168.2.0/24 via 192.168.0.2 dev sw1p1 offload
192.168.3.0/24 via 192.168.1.2 dev sw1p2 offload
The offload flag indicates that the route is offloaded to hardware.
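Since the offload flag appears as a plain word in the listing, routes that were not offloaded can be picked out with a small filter. This is only a sketch: the function name is hypothetical, and it handles single-line entries like the ones above, not multipath routes printed across several lines.

```shell
# routes_not_offloaded   (hypothetical helper)
# Read single-line `ip route` entries on stdin and print those that do
# not carry the "offload" flag.
routes_not_offloaded() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i == "offload") next   # offloaded: skip this entry
        print
    }'
}
```

It would be used as ip route | routes_not_offloaded. An empty result means every listed route carries the flag.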
A neighbour entry is created for each nexthop. To list the neighbour entries, run:
$ ip neigh
192.168.0.2 dev sw1p1 INCOMPLETE
192.168.1.2 dev sw1p2 INCOMPLETE
After neighbour discovery takes place, the output changes:
$ ip neigh
192.168.0.2 dev sw1p1 52:54:00:aa:bb:01 REACHABLE
192.168.1.2 dev sw1p2 52:54:00:aa:bb:02 REACHABLE
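To spot nexthops whose neighbour has not been resolved, the listing can be filtered by state. The sketch below uses a hypothetical function name and assumes the state is the last word on each line, as in the output above. It reports entries in the INCOMPLETE or FAILED state.

```shell
# neigh_unresolved   (hypothetical helper)
# Read `ip neigh` output on stdin and print "ADDRESS DEVICE STATE" for
# entries whose state is INCOMPLETE or FAILED.
neigh_unresolved() {
    awk '
        $NF == "INCOMPLETE" || $NF == "FAILED" {
            dev = ""
            # The device name is the word that follows "dev".
            for (i = 2; i < NF; i++)
                if ($i == "dev") dev = $(i + 1)
            print $1, dev, $NF
        }
    '
}
```

It would be used as ip neigh | neigh_unresolved.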
To add routes with multiple nexthops, run:
$ ip route add 192.168.5.0/24 nexthop via 192.168.0.2 dev sw1p1 weight 1 nexthop via 192.168.1.2 dev sw1p2 weight 1
For IPv6, unlike IPv4, the kernel allows individual nexthops to be added and removed without deleting the entire ECMP route and re-adding it with a modified nexthop configuration.
To add an ECMP route, run:
$ ip -6 route add 2001:db81::/32 \
nexthop via fe80::e61d:2dff:fea5:f341 dev sw1p1 \
nexthop via fe80::e61d:2dff:fea5:f365 dev sw1p2
$ ip -6 route show 2001:db81::/32
2001:db81::/32 metric 1024
nexthop via fe80::e61d:2dff:fea5:f341 dev sw1p1 weight 1 offload
nexthop via fe80::e61d:2dff:fea5:f365 dev sw1p2 weight 1 offload
To delete the first nexthop, run:
$ ip -6 route del 2001:db81::/32 nexthop via fe80::e61d:2dff:fea5:f341 dev sw1p1
$ ip -6 route show 2001:db81::/32
2001:db81::/32 via fe80::e61d:2dff:fea5:f365 dev sw1p2 metric 1024 offload pref medium
Alternatively, another nexthop can be added to the route.
Note: Beginning with kernel 4.14, the offload indication is reported
on a per-nexthop basis and a matching iproute2 version is required in
order to display it.
By default, when the carrier of a netdevice goes down, the routing
subsystem does not invalidate the nexthops using it and therefore
continues to try to forward packets through them. Such nexthops
are marked using the linkdown flag. For example:
$ ip route show 192.168.100.0/24
192.168.100.0/24
nexthop via 192.168.0.1 dev sw1p17 weight 1 offload linkdown
nexthop via 192.168.1.1 dev sw1p18 weight 1 offload
It is possible to make the kernel exclude such nexthops from its ECMP groups by setting the following sysctl:
$ sysctl -w net.ipv4.conf.sw1p17.ignore_routes_with_linkdown=1
With this sysctl set, when the carrier of sw1p17 goes down the kernel
starts forwarding packets via the nexthop using sw1p18 as its nexthop
device. It also marks the nexthop using sw1p17 as dead for various
listeners in the user space:
$ ip route show 192.168.100.0/24
192.168.100.0/24
nexthop via 192.168.0.1 dev sw1p17 weight 1 dead linkdown
nexthop via 192.168.1.1 dev sw1p18 weight 1 offload
To make this option the default for all the netdevices on the system
upon boot, set net.ipv4.conf.default.ignore_routes_with_linkdown=1 in
the sysctl configuration files.
Note: The above is not reflected to the device in kernel versions prior to 4.11.
Note: The mlxsw driver currently does not support this functionality in IPv6.
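The dead flag shown above can also be extracted by a script, for example for monitoring. The following sketch uses a hypothetical function name and assumes the multipath layout shown above: a line holding the prefix, followed by one line per nexthop.

```shell
# dead_nexthops   (hypothetical helper)
# Read `ip route show` output on stdin and print "PREFIX GATEWAY DEVICE"
# for every nexthop line that carries the "dead" flag.
dead_nexthops() {
    awk '
        # Any line that is not a nexthop line starts a new route entry.
        $1 != "nexthop" { prefix = $1; next }
        {
            gw = dev = ""; dead = 0
            for (i = 2; i <= NF; i++) {
                if ($i == "via") gw = $(i + 1)
                else if ($i == "dev") dev = $(i + 1)
                else if ($i == "dead") dead = 1
            }
            if (dead) print prefix, gw, dev
        }
    '
}
```

It would be used as ip route show 192.168.100.0/24 | dead_nexthops.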
When forwarding packets, the device performs the multi-path hash in accordance with the kernel's policy.
The packet fields used for the multi-path hash are controlled by the
net.ipv4.fib_multipath_hash_policy sysctl. By default, it is set to 0,
which means only the source and destination IP addresses are used.
If the sysctl is set to 1, a 5-tuple is used: the source and
destination IP addresses, the source and destination ports, and the IP
protocol.
Note: When the sysctl is set to 0, the kernel performs the
multi-path hash for ICMP error packets according to the inner IP
addresses. Currently, this is not supported by the device.
Note: Layer 4 fields are not considered for fragmented packets.
For IPv6, unlike IPv4, the kernel always performs the multi-path hash according to the same set of fields: the source and destination IP addresses, the flow label, and the next header field.
The kernel supports IPv6 source-specific routing, which allows packets to be forwarded according to both the destination and source addresses. If a packet matches two routes on the destination address and neither is more specific than the other, the route with the most specific source prefix can be used to route the packet.
However, without resorting to ACLs, the ASIC performs routing solely based on the destination address. Therefore, upon the insertion of source-specific routes the abort mechanism is invoked and forwarding is performed by the kernel.
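Because inserting a source-specific route triggers the abort mechanism, it can help to check whether any such routes are present. The sketch below assumes that ip -6 route prints the source prefix after the word "from". The function name and the addresses in the usage example are hypothetical.

```shell
# source_specific_routes   (hypothetical helper)
# Read `ip -6 route` output on stdin and print "DESTINATION from SOURCE"
# for every route that carries a source prefix.
source_specific_routes() {
    awk '{
        for (i = 2; i < NF; i++)
            if ($i == "from") { print $1, "from", $(i + 1); next }
    }'
}
```

It would be used as ip -6 route | source_specific_routes. A line such as "2001:db8:1::/48 from 2001:db8:2::/48 via fe80::1 dev sw1p1" would be reported, while routes without a source prefix produce no output.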
It is possible for the insertion of a prefix route into the hardware to fail,
for example because the maximum capacity was reached or a required feature is
missing. In that case, the abort mechanism is initiated in the kernel, which
removes all routes from the hardware so that all packets are processed in the
kernel.
Note: Currently, this process is irreversible. One has to reboot the system to re-enable offloading of routes to hardware.
Note: There is an ongoing discussion regarding fixing the abort
mechanism and making the behaviour more sane for the end user.
Please refer to Routing intro in order to get essential information about routing setup in Linux.
- man ip
- man sysctl.d