Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'exit' in route-map section is causing frr reload and restart to fail in scale setups #15706

Closed
2 tasks done
karthikeyav opened this issue Apr 8, 2024 · 5 comments · Fixed by #15770
Closed
2 tasks done

Comments

@karthikeyav
Copy link

karthikeyav commented Apr 8, 2024

Description

Problem:
Every route-map section contains an exit in the config. In a scale setup where there are many route-maps, both frr reload and frr restart takes minutes to process all of them and bgpd process is busy 100% in vtysh_read and it fails.

RCA:
Issue # 1:
Initially frr.conf begins with multiple route-map's without any exit clause. when frr-reload is invoked it appends exits to every route-map section (Code changes are from this link) . This leads to northbound not being batched and hinders the libyang performance.

Issue # 2:
On a show run command the routemap_cli.c code would add in an exit() to every route-map… cli section. Code changes are from this link

void route_map_instance_show_end(struct vty *vty, struct lyd_node *dnode)
{
      + vty_out(vty, "exit\n");
	vty_out(vty, "!\n");
}

Version

switch# show ver
FRRouting 8.4.3 (switch) on Linux(6.1.0-cl-1-amd64).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

How to reproduce

Configure 500+ route-map from test_1 to test_500 like below

switch#

route-map test_1 permit 200
 match ip address prefix-list test_1
 set community 64912:1 64912:100 64912:800 64912:3000 additive
!
route-map test_2 permit 200
 match ip address prefix-list test_2
 set community 64912:1 64912:100 64912:800 64912:3001 additive
!
..
..
...
route-map test_500 permit 200
 match ip address prefix-list test_500
 set community 64912:1 64912:100 64912:800 64912:3500 additive
!
switch:~# date ; systemctl reload frr.service ; date
Mon 25 Mar 2024 11:36:48 AM UTC
Mon 25 Mar 2024 11:45:34 AM UTC

Expected behavior

When frr reload or restart is done on a config which has many route-map's it should converge soon without hogging the CPU.

Actual behavior

Both frr reload and restart takes minutes to converge and some time fails because bgpd process is hogging the CPU 100% of the time trying to process exit for each route-map in scale setup.

Additional context

No response

Checklist

  • I have searched the open issues for this bug.
  • I have not included sensitive information in this report.
@karthikeyav karthikeyav added the triage Needs further investigation label Apr 8, 2024
@ahmdzaki18
Copy link

#13412

Also happened on my machine, using 2000++ lines config with lot of route-map. Temporary solved (no permanent solution) with increase timeout for watchfrr and ignore bgpd for stucking 100%.

Changing with newer Ryzen cpu, vpp-dpdk (dedicated core for control plane) and now almost 4000 lines without crash or start-loop-reload again.

@idryzhov
Copy link
Contributor

idryzhov commented Apr 9, 2024

I think this should be fixed in 10.0 in this commit: 2574f03. 10.0 should be out in a couple of days. If it helps, we can probably backport the fix into previous versions.

@donaldsharp
Copy link
Member

10.0 fixes the startup issue but leaves the reload issue in play still.

@karthikeyav
Copy link
Author

@idryzhov I ported this commit: 2574f03 and I do not see any difference in performance with or without it.

With 600+ route-map, I did frr restart and I see bgpd taking 1 min 9 seconds to read the configuration. If I remove the exit's for route-map's, then frr restart takes only 1 sec.

switch# cat /var/log/frr/frr.log | grep Configuration
2024-04-09T22:08:05.186241+00:00 tor-1 zebra[15802]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:08
2024-04-09T22:09:06.246636+00:00 tor-1 bgpd[15830]: [VTVCM-Y2NW3] Configuration Read in Took: 00:01:09

Can you please test 2574f03 with 600+ route-map and try frr restart?

@ton31337 ton31337 added regression and removed triage Needs further investigation labels Apr 15, 2024
idryzhov added a commit to idryzhov/frr that referenced this issue Apr 16, 2024
If a command is not marked as `YANG`-converted, the current command
batching buffer is flushed before executing the command. We shouldn't
flush the buffer when executing an `exit` command. It should only be
flushed if the next command is not `YANG`-converted, which is checked by
the command itself, not the previous `exit`.

Fixes FRRouting#15706.

Signed-off-by: Igor Ryzhov <[email protected]>
@idryzhov
Copy link
Contributor

I pushed a proper fix - #15770. Please check if it works for you.

mergify bot pushed a commit that referenced this issue Apr 18, 2024
If a command is not marked as `YANG`-converted, the current command
batching buffer is flushed before executing the command. We shouldn't
flush the buffer when executing an `exit` command. It should only be
flushed if the next command is not `YANG`-converted, which is checked by
the command itself, not the previous `exit`.

Fixes #15706.

Signed-off-by: Igor Ryzhov <[email protected]>
(cherry picked from commit 57811a5)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants