[boot performance] SAI discovery process running after switch creation in fast/warm boot causes delays #13768

Open
stepanblyschak opened this issue Feb 10, 2023 · 2 comments
Labels: NVIDIA, Triaged

Comments

@stepanblyschak
Collaborator

stepanblyschak commented Feb 10, 2023

Description

Steps to reproduce the issue:

  1. Perform fast/warm-reboot

Observe that after create_switch() the SAI discovery process runs, and in this case it takes 1.02 sec:

Feb 10 11:38:20.926013 r-panther-13 NOTICE syncd#SDK: :- discover: discover took 0.203495 sec
Feb 10 11:38:20.926309 r-panther-13 NOTICE syncd#SDK: :- discover: discovered objects count: 1386
Feb 10 11:38:20.926489 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_PORT: 33
Feb 10 11:38:20.926597 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_VIRTUAL_ROUTER: 1
Feb 10 11:38:20.926722 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_STP: 1
Feb 10 11:38:20.926823 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_HOSTIF_TRAP_GROUP: 1
Feb 10 11:38:20.926943 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_QUEUE: 512
Feb 10 11:38:20.927045 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_SCHEDULER_GROUP: 512
Feb 10 11:38:20.927165 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_INGRESS_PRIORITY_GROUP: 256
Feb 10 11:38:20.927267 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_HASH: 2
Feb 10 11:38:20.927387 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_SWITCH: 1
Feb 10 11:38:20.927520 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_VLAN: 1
Feb 10 11:38:20.927711 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_VLAN_MEMBER: 32
Feb 10 11:38:20.927813 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_BRIDGE: 1
Feb 10 11:38:20.928017 r-panther-13 NOTICE syncd#SDK: :- discover: SAI_OBJECT_TYPE_BRIDGE_PORT: 33
Feb 10 11:38:20.928882 r-panther-13 NOTICE syncd#SDK: :- helperSaveDiscoveredObjectsToRedis: objects in ASIC state table present: 0
Feb 10 11:38:20.929008 r-panther-13 NOTICE syncd#SDK: :- helperSaveDiscoveredObjectsToRedis: putting ALL discovered objects to redis
Feb 10 11:38:21.601662 r-panther-13 NOTICE syncd#SDK: :- helperSaveDiscoveredObjectsToRedis: save discovered objects to redis took 0.673484 sec
Feb 10 11:38:21.602082 r-panther-13 NOTICE syncd#SDK: :- redisSaveInternalOids: put switch internal discovered rid oid:0x1 to Asic View and COLDVIDS
Feb 10 11:38:21.602592 r-panther-13 NOTICE syncd#SDK: :- redisSaveInternalOids: put switch internal discovered rid oid:0x100000026 to Asic View and COLDVIDS
Feb 10 11:38:21.603029 r-panther-13 NOTICE syncd#SDK: :- redisSaveInternalOids: put switch internal discovered rid oid:0x10 to Asic View and COLDVIDS
Feb 10 11:38:21.603480 r-panther-13 NOTICE syncd#SDK: :- redisSaveInternalOids: put switch internal discovered rid oid:0x3 to Asic View and COLDVIDS
Feb 10 11:38:21.603693 r-panther-13 WARNING syncd#SDK: [SAI_UTILS.WARNING] mlnx_sai_utils.c[1691]- check_attribs_metadata: Not implemented attribute SAI_SWITCH_ATTR_DEFAULT_OVERRIDE_VIRTUAL_ROUTER_ID (vendor data not found)
Feb 10 11:38:21.603769 r-panther-13 WARNING syncd#SDK: [SAI_UTILS.WARNING] mlnx_sai_utils.c[2060]- sai_get_attributes: Failed attribs check, key:Switch ID 1
Feb 10 11:38:21.603861 r-panther-13 WARNING syncd#SDK: :- helperGetSwitchAttrOid: failed to get SAI_SWITCH_ATTR_DEFAULT_OVERRIDE_VIRTUAL_ROUTER_ID: SAI_STATUS_ATTR_NOT_IMPLEMENTED_0
Feb 10 11:38:21.604488 r-panther-13 NOTICE syncd#SDK: :- redisSaveInternalOids: put switch internal discovered rid oid:0x10010039 to Asic View and COLDVIDS
Feb 10 11:38:21.605191 r-panther-13 NOTICE syncd#SDK: :- redisSaveInternalOids: put switch internal discovered rid oid:0x11 to Asic View and COLDVIDS
Feb 10 11:38:21.608251 r-panther-13 NOTICE syncd#SDK: :- redisSaveInternalOids: put switch internal discovered rid oid:0x1c to Asic View and COLDVIDS
Feb 10 11:38:21.608999 r-panther-13 NOTICE syncd#SDK: :- redisSaveInternalOids: put switch internal discovered rid oid:0x10000001c to Asic View and COLDVIDS
Feb 10 11:38:21.738942 r-panther-13 NOTICE syncd#SDK: :- helperLoadColdVids: read 1386 COLD VIDS
Feb 10 11:38:21.739078 r-panther-13 NOTICE syncd#SDK: :- SaiSwitch: constructor took 1.018046 sec

Describe the results you received:

The SAI discovery process took 1.02 sec here, but we have seen different results on other platforms/configurations (up to 4 sec).
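
To see where the 1.02 sec goes, the timings reported in the log above can be broken down with a few lines of Python. This is only a rough sketch: the constants are copied from the log lines in this issue, and the names (`DISCOVER_SEC`, `breakdown`, and so on) are made up for illustration. It suggests that saving the discovered objects to Redis accounts for roughly two thirds of the constructor time, while the discovery GETs themselves are about a fifth.

```python
# Figures copied from the syncd log in this issue (names are illustrative).
DISCOVER_SEC = 0.203495        # "discover: discover took"
SAVE_TO_REDIS_SEC = 0.673484   # "save discovered objects to redis took"
CONSTRUCTOR_SEC = 1.018046     # "SaiSwitch: constructor took"

# Per-type object counts from the "discover:" log lines.
OBJECT_COUNTS = {
    "SAI_OBJECT_TYPE_PORT": 33,
    "SAI_OBJECT_TYPE_VIRTUAL_ROUTER": 1,
    "SAI_OBJECT_TYPE_STP": 1,
    "SAI_OBJECT_TYPE_HOSTIF_TRAP_GROUP": 1,
    "SAI_OBJECT_TYPE_QUEUE": 512,
    "SAI_OBJECT_TYPE_SCHEDULER_GROUP": 512,
    "SAI_OBJECT_TYPE_INGRESS_PRIORITY_GROUP": 256,
    "SAI_OBJECT_TYPE_HASH": 2,
    "SAI_OBJECT_TYPE_SWITCH": 1,
    "SAI_OBJECT_TYPE_VLAN": 1,
    "SAI_OBJECT_TYPE_VLAN_MEMBER": 32,
    "SAI_OBJECT_TYPE_BRIDGE": 1,
    "SAI_OBJECT_TYPE_BRIDGE_PORT": 33,
}


def breakdown():
    """Return each phase's duration and its share of the constructor time."""
    other = CONSTRUCTOR_SEC - DISCOVER_SEC - SAVE_TO_REDIS_SEC
    phases = {
        "discover": DISCOVER_SEC,
        "save_to_redis": SAVE_TO_REDIS_SEC,
        "other": other,  # everything the log does not time separately
    }
    return {name: (sec, sec / CONSTRUCTOR_SEC) for name, sec in phases.items()}


if __name__ == "__main__":
    # The per-type counts should add up to "discovered objects count: 1386".
    print("total objects:", sum(OBJECT_COUNTS.values()))
    for name, (sec, share) in breakdown().items():
        print(f"{name:14s} {sec:.6f} s  {share:6.1%}")
```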

Describe the results you expected:

From a fast/warm reboot design standpoint, performing a lot of GET operations in the middle of switch boot delays the replay of configuration. Syncd could instead replay the configuration blindly, as fast as possible, and discover the default objects afterwards.
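
The effect of the proposed reordering on the critical path can be illustrated with a toy timeline model. This is not syncd code: `schedule` is a hypothetical helper, and the `CREATE_SWITCH_SEC` and `REPLAY_SEC` durations are placeholders invented for the example. Only the discovery time is taken from the log above. The sketch also says nothing about whether discovery can safely be deferred, which is a separate question.

```python
def schedule(phases):
    """Run phases back to back and return {name: (start_sec, end_sec)}."""
    timeline = {}
    clock = 0.0
    for name, duration in phases:
        timeline[name] = (clock, clock + duration)
        clock += duration
    return timeline


DISCOVER_SEC = 1.018046   # "SaiSwitch: constructor took" from the log
CREATE_SWITCH_SEC = 2.0   # placeholder, not measured
REPLAY_SEC = 5.0          # placeholder, not measured

# Current behaviour: discovery sits between switch creation and replay.
current = schedule([
    ("create_switch", CREATE_SWITCH_SEC),
    ("discover", DISCOVER_SEC),
    ("replay", REPLAY_SEC),
])

# Proposed behaviour: replay first, discover default objects afterwards.
deferred = schedule([
    ("create_switch", CREATE_SWITCH_SEC),
    ("replay", REPLAY_SEC),
    ("discover", DISCOVER_SEC),
])

# How much earlier the configuration replay finishes in the proposed order.
gain = current["replay"][1] - deferred["replay"][1]

if __name__ == "__main__":
    print(f"replay finishes {gain:.3f} s earlier when discovery is deferred")
```

In this model the total boot time is unchanged; only the point at which configuration replay completes moves earlier, by exactly the discovery duration.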

Output of show version:

SONiC Software Version: SONiC.master.0-3c1c7e23b
Distribution: Debian 11.6
Kernel: 5.10.0-18-2-amd64
Build commit: 3c1c7e23b
Build date: Fri Feb 10 10:44:37 UTC 2023
Built by: stepanb@r-build-sonic03

Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2020T04244
Model Number: MSN2700-CS2FO
Hardware Revision: A2
Uptime: 11:52:06 up 14 min,  1 user,  load average: 1.35, 1.14, 0.90
Date: Fri 10 Feb 2023 11:52:06

Docker images:
REPOSITORY                    TAG                  IMAGE ID       SIZE
docker-syncd-mlnx             latest               c74a06aaefef   775MB
docker-syncd-mlnx             master.0-3c1c7e23b   c74a06aaefef   775MB
docker-orchagent              latest               e5c15be2b372   385MB
docker-orchagent              master.0-3c1c7e23b   e5c15be2b372   385MB
docker-fpm-frr                latest               ca590f8456b1   403MB
docker-fpm-frr                master.0-3c1c7e23b   ca590f8456b1   403MB
docker-teamd                  latest               63ae7f6b11fe   374MB
docker-teamd                  master.0-3c1c7e23b   63ae7f6b11fe   374MB
docker-macsec                 latest               b8b8f1ec35f8   376MB
docker-platform-monitor       latest               8a99e1c0d338   778MB
docker-platform-monitor       master.0-3c1c7e23b   8a99e1c0d338   778MB
docker-eventd                 latest               aa19834be0ed   357MB
docker-eventd                 master.0-3c1c7e23b   aa19834be0ed   357MB
docker-dhcp-relay             latest               ffb03963b964   366MB
docker-sonic-p4rt             latest               e6a0d2d3030c   927MB
docker-sonic-p4rt             master.0-3c1c7e23b   e6a0d2d3030c   927MB
docker-snmp                   latest               224060c4595c   397MB
docker-snmp                   master.0-3c1c7e23b   224060c4595c   397MB
docker-sonic-telemetry        latest               0ff254ca62cf   655MB
docker-sonic-telemetry        master.0-3c1c7e23b   0ff254ca62cf   655MB
docker-lldp                   latest               513b87c9af84   399MB
docker-lldp                   master.0-3c1c7e23b   513b87c9af84   399MB
docker-database               latest               707b19896280   357MB
docker-database               master.0-3c1c7e23b   707b19896280   357MB
docker-mux                    latest               b22673d61bd1   405MB
docker-mux                    master.0-3c1c7e23b   b22673d61bd1   405MB
docker-router-advertiser      latest               b9dfac24aae3   357MB
docker-router-advertiser      master.0-3c1c7e23b   b9dfac24aae3   357MB
docker-nat                    latest               f2a2c73a6f56   351MB
docker-nat                    master.0-3c1c7e23b   f2a2c73a6f56   351MB
docker-sflow                  latest               9038485e9854   349MB
docker-sflow                  master.0-3c1c7e23b   9038485e9854   349MB
docker-sonic-mgmt-framework   latest               bec28867667e   477MB
docker-sonic-mgmt-framework   master.0-3c1c7e23b   bec28867667e   477MB

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

sonic_dump_r-panther-13_20230210_114940.tar.gz

@stepanblyschak stepanblyschak changed the title [boot performance] SAI discovery process running after switch creation causes delays [boot performance] SAI discovery process running after switch creation in fast/warm boot causes delays Feb 10, 2023
@kcudnik
Contributor

kcudnik commented Mar 9, 2023

Discovery must be done right after switch creation to see which objects exist; it cannot be delayed until later.

@arfeigin
Contributor

Hi @kcudnik,

We are currently working on optimizations to the fast-reboot flow for switches with a high number of ports. We saw that for 256 ports, the SAI discovery run for each port consumes more than 8 seconds, during which orchagent is idle, waiting for syncd to finish creating the ports.
Is SAI discovery after port creation required in the fast-reboot init flow?
In the fast-reboot flow there is no comparison logic, since the current view is empty (https://github.com/sonic-net/SONiC/blob/4ab89a9fdba3ced17f4e4d7f97892f93045905d1/doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md#42-syncd-point-of-view---initapply-view-framework).
We tried skipping the SAI discovery that follows port creation in the fast-reboot flow on Nvidia platforms, running the community fast-reboot test multiple times. At least in that case, this saved about 6.5 seconds of dataplane downtime, which is more than 20% of the allowed disruption length. The system was also stable and no issues were observed.
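
As a quick sanity check of the numbers above, the "more than 20%" claim can be turned into arithmetic. The comment does not state the allowed disruption length, so the 30-second budget below is an assumption (a commonly quoted fast-reboot target), and the function names are purely illustrative.

```python
SAVED_SEC = 6.5              # dataplane downtime saved, from the comment above
CLAIMED_MIN_SHARE = 0.20     # "more than 20% of the allowed disruption length"
ASSUMED_BUDGET_SEC = 30.0    # ASSUMPTION: not stated anywhere in this thread


def implied_max_budget(saved_sec, min_share):
    """Largest budget for which saved_sec is still at least min_share of it."""
    return saved_sec / min_share


def share_of_budget(saved_sec, budget_sec):
    """Fraction of the disruption budget that the saving represents."""
    return saved_sec / budget_sec


if __name__ == "__main__":
    limit = implied_max_budget(SAVED_SEC, CLAIMED_MIN_SHARE)
    share = share_of_budget(SAVED_SEC, ASSUMED_BUDGET_SEC)
    print(f"claim holds for any budget below {limit:.1f} s")
    print(f"with an assumed {ASSUMED_BUDGET_SEC:.0f} s budget: {share:.1%}")
```

Under that assumed budget the saving works out to a little under 22%, which is consistent with the figure quoted in the comment.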
