'exit' in route-map section is causing frr reload and restart to fail in scale setups #15706

karthikeyav · 2024-04-08T17:40:48Z

Description

Problem:
Every route-map section contains an exit in the config. In a scale setup where there are many route-maps, both frr reload and frr restart takes minutes to process all of them and bgpd process is busy 100% in vtysh_read and it fails.

RCA:
Issue # 1:
Initially frr.conf begins with multiple route-map's without any exit clause. when frr-reload is invoked it appends exits to every route-map section (Code changes are from this link) . This leads to northbound not being batched and hinders the libyang performance.

Issue # 2:
On a show run command the routemap_cli.c code would add in an exit() to every route-map… cli section. Code changes are from this link

void route_map_instance_show_end(struct vty *vty, struct lyd_node *dnode)
{
      + vty_out(vty, "exit\n");
	vty_out(vty, "!\n");
}

Version

switch# show ver
FRRouting 8.4.3 (switch) on Linux(6.1.0-cl-1-amd64).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

How to reproduce

Configure 500+ route-map from test_1 to test_500 like below

switch#

route-map test_1 permit 200
 match ip address prefix-list test_1
 set community 64912:1 64912:100 64912:800 64912:3000 additive
!
route-map test_2 permit 200
 match ip address prefix-list test_2
 set community 64912:1 64912:100 64912:800 64912:3001 additive
!
..
..
...
route-map test_500 permit 200
 match ip address prefix-list test_500
 set community 64912:1 64912:100 64912:800 64912:3500 additive
!

switch:~# date ; systemctl reload frr.service ; date
Mon 25 Mar 2024 11:36:48 AM UTC
Mon 25 Mar 2024 11:45:34 AM UTC

Expected behavior

When frr reload or restart is done on a config which has many route-map's it should converge soon without hogging the CPU.

Actual behavior

Both frr reload and restart takes minutes to converge and some time fails because bgpd process is hogging the CPU 100% of the time trying to process exit for each route-map in scale setup.

Additional context

No response

Checklist

I have searched the open issues for this bug.
I have not included sensitive information in this report.

The text was updated successfully, but these errors were encountered:

ahmdzaki18 · 2024-04-08T23:48:08Z

#13412

Also happened on my machine, using 2000++ lines config with lot of route-map. Temporary solved (no permanent solution) with increase timeout for watchfrr and ignore bgpd for stucking 100%.

Changing with newer Ryzen cpu, vpp-dpdk (dedicated core for control plane) and now almost 4000 lines without crash or start-loop-reload again.

idryzhov · 2024-04-09T16:49:21Z

I think this should be fixed in 10.0 in this commit: 2574f03. 10.0 should be out in a couple of days. If it helps, we can probably backport the fix into previous versions.

donaldsharp · 2024-04-09T19:20:19Z

10.0 fixes the startup issue but leaves the reload issue in play still.

karthikeyav · 2024-04-09T23:00:38Z

@idryzhov I ported this commit: 2574f03 and I do not see any difference in performance with or without it.

With 600+ route-map, I did frr restart and I see bgpd taking 1 min 9 seconds to read the configuration. If I remove the exit's for route-map's, then frr restart takes only 1 sec.

switch# cat /var/log/frr/frr.log | grep Configuration
2024-04-09T22:08:05.186241+00:00 tor-1 zebra[15802]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:08
2024-04-09T22:09:06.246636+00:00 tor-1 bgpd[15830]: [VTVCM-Y2NW3] Configuration Read in Took: 00:01:09

Can you please test 2574f03 with 600+ route-map and try frr restart?

If a command is not marked as `YANG`-converted, the current command batching buffer is flushed before executing the command. We shouldn't flush the buffer when executing an `exit` command. It should only be flushed if the next command is not `YANG`-converted, which is checked by the command itself, not the previous `exit`. Fixes FRRouting#15706. Signed-off-by: Igor Ryzhov <[email protected]>

idryzhov · 2024-04-16T19:19:03Z

I pushed a proper fix - #15770. Please check if it works for you.

If a command is not marked as `YANG`-converted, the current command batching buffer is flushed before executing the command. We shouldn't flush the buffer when executing an `exit` command. It should only be flushed if the next command is not `YANG`-converted, which is checked by the command itself, not the previous `exit`. Fixes #15706. Signed-off-by: Igor Ryzhov <[email protected]> (cherry picked from commit 57811a5)

karthikeyav added the triage Needs further investigation label Apr 8, 2024

ton31337 added regression and removed triage Needs further investigation labels Apr 15, 2024

idryzhov mentioned this issue Apr 16, 2024

lib, zebra: fix exit commands #15770

Merged

Jafaral closed this as completed in #15770 Apr 18, 2024

mergify bot mentioned this issue Apr 18, 2024

lib, zebra: fix exit commands (backport #15770) #15785

Merged

donaldsharp mentioned this issue Apr 19, 2024

BGP CPU issue during route-map / community-list configuration / yang issue #15790

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

'exit' in route-map section is causing frr reload and restart to fail in scale setups #15706

'exit' in route-map section is causing frr reload and restart to fail in scale setups #15706

karthikeyav commented Apr 8, 2024 •

edited

Loading

ahmdzaki18 commented Apr 8, 2024

idryzhov commented Apr 9, 2024

donaldsharp commented Apr 9, 2024

karthikeyav commented Apr 9, 2024

idryzhov commented Apr 16, 2024

'exit' in route-map section is causing frr reload and restart to fail in scale setups #15706

'exit' in route-map section is causing frr reload and restart to fail in scale setups #15706

Comments

karthikeyav commented Apr 8, 2024 • edited Loading

Description

Version

How to reproduce

Expected behavior

Actual behavior

Additional context

Checklist

ahmdzaki18 commented Apr 8, 2024

idryzhov commented Apr 9, 2024

donaldsharp commented Apr 9, 2024

karthikeyav commented Apr 9, 2024

idryzhov commented Apr 16, 2024

karthikeyav commented Apr 8, 2024 •

edited

Loading