
[action] [PR:8363] fix for mocked T0 DToR TC failures due to config push delta #11213

Merged

1 commit merged into sonic-net:202305 on Jan 8, 2024

Conversation

mssonicbld (Collaborator)


Description of PR

Summary:
Fixes # (issue)
(A) The test fails because execution of the mux toggle JSON file fails while the swss container is not running.
(B) The test fails because the trigger happens before the mux toggle config, pushed from orchagent for all 36 ports, has taken effect in sairedis. Ports are selected randomly, so the issue is intermittent (a run passes only if the config for its selected ports has taken effect in sairedis by the time the trigger happens). Across ~10 runs, it took 18-21 s from the execution of the ansible command for the JSON file until the config finished in sairedis for all 36 ports. In the mocked T0 DToR case we cannot check the mux status, so we rely on a sleep for the config to finish.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Backport request

  • 201911
  • 202012
  • 202205

Approach

What is the motivation for this PR?

(A) The test fails because execution of the mux toggle JSON file fails.
The JSON file execution fails because swss is not running.
swss is not running because it is allowed to restart only 3 times in a 20-minute interval, and it hits that limit.
The restart limit is hit because, for ASIC type "gb", this test restarts swss 4 times (twice each for v4 and v6).
reset-failed is called for swss before each restart, but it does not appear to flush the systemd restart rate counter for swss.
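The chain above can be sketched with a minimal model of systemd's start rate limiting (`StartLimitBurst` / `StartLimitIntervalSec`). This is illustrative only, not systemd's actual implementation; the 3-restarts-in-20-minutes values mirror the behavior described above:

```python
from collections import deque


class StartLimiter:
    """Toy model of systemd's start rate limiting: at most `burst`
    starts within a sliding `interval_sec` window."""

    def __init__(self, burst=3, interval_sec=20 * 60):
        self.burst = burst
        self.interval_sec = interval_sec
        self.starts = deque()  # timestamps of recent start attempts

    def try_start(self, now):
        # Drop start records that have aged out of the window.
        while self.starts and now - self.starts[0] >= self.interval_sec:
            self.starts.popleft()
        if len(self.starts) >= self.burst:
            return False  # systemd would report "start-limit-hit"
        self.starts.append(now)
        return True


limiter = StartLimiter()
# Four restarts in quick succession, as the test does for ASIC type "gb":
results = [limiter.try_start(t) for t in (0, 60, 120, 180)]
# The fourth restart inside the 20-minute window is refused.
```

Skipping one of the four restarts (fix A) keeps the count inside the window at or below the burst limit.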
Log excerpts for issue (A):

Apr 20 14:55:42.355098 t0-yy38 INFO systemd[1]: swss.service: Scheduled restart job, restart counter is at 3.
Apr 20 14:55:42.355384 t0-yy38 INFO systemd[1]: Stopped switch state service.
Apr 20 14:55:42.355605 t0-yy38 WARNING systemd[1]: swss.service: Start request repeated too quickly.
Apr 20 14:55:42.355796 t0-yy38 WARNING systemd[1]: swss.service: Failed with result 'start-limit-hit'.
Apr 20 14:55:42.355978 t0-yy38 ERR systemd[1]: Failed to start switch state service.
Apr 20 14:55:42.356166 t0-yy38 WARNING systemd[1]: Dependency failed for SNMP container.
Apr 20 14:55:42.356353 t0-yy38 NOTICE systemd[1]: snmp.service: Job snmp.service/start failed with result 'dependency'.
Apr 20 14:55:42.356546 t0-yy38 WARNING systemd[1]: swss.service: Start request repeated too quickly.
**Apr 20 14:55:42.356724 t0-yy38 WARNING systemd[1]: swss.service: Failed with result 'start-limit-hit'.**
Apr 20 14:55:42.356900 t0-yy38 ERR systemd[1]: Failed to start switch state service.
Apr 20 14:55:42.357047 t0-yy38 WARNING systemd[1]: Dependency failed for SNMP container.
Apr 20 14:55:42.357168 t0-yy38 NOTICE systemd[1]: snmp.service: Job snmp.service/start failed with result 'dependency'.
Apr 20 14:56:34.321778 t0-yy38 INFO dockerd[565]: time="2023-04-20T14:56:34.321374481Z" level=error msg="Error setting up exec command in container swss: Container d11306c4041ad1a0bf7d15f81a6ad1066e3879745d726234fc12e162164a7b33 is not running"

(B) In a failing (NOK) run, step 3 happens before step 2:

1) When the ansible command was executed (syslog):

syslog.1:Jun 6 **14:07:24.093000** mth-t0-64 INFO python[596206]: ansible-command Invoked with _uses_shell=True _raw_params=docker exec swss sh -c "swssconfig /muxactive.json" warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None

**Last config push from orchagent:**
syslog.1:Jun 6 **14:07:43.668501** mth-t0-64 NOTICE swss#orchagent: :- addOperation: Mux State set to active for port Ethernet96

2) When it took effect in sairedis (sairedis.rec):

sairedis.rec.1:2023-06-06.14:07:44.241430|c|SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:{"ip":"192.168.0.26","rif":"oid:0x600000000099d","switch_id":"oid:0x21000000000000"}|SAI_NEIGHBOR_ENTRY_ATTR_DST_MAC_ADDRESS=40:A6:B7:43:75:27

sairedis.rec.1:2023-06-06.14:07:44.242265|c|SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000ae9|SAI_NEXT_HOP_ATTR_TYPE=SAI_NEXT_HOP_TYPE_IP|SAI_NEXT_HOP_ATTR_IP=192.168.0.26|SAI_NEXT_HOP_ATTR_ROUTER_INTERFACE_ID=oid:0x600000000099d
**Last entry:**
sairedis.rec.1:2023-06-06.**14:07:44.278459**|c|SAI_OBJECT_TYPE_NEXT_HOP:oid:0x4000000000afb|SAI_NEXT_HOP_ATTR_TYPE=SAI_NEXT_HOP_TYPE_IP|SAI_NEXT_HOP_ATTR_IP=192.168.0.9|SAI_NEXT_HOP_ATTR_ROUTER_INTERFACE_ID=oid:0x600000000099d

3) When the trigger happened (test log):

06/06/2023 **14:07:45** testutils.verify_packet L2400 DEBUG | Checking for pkt on device 0, port 39
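Working from the timestamps in the excerpts above, the config-push delta for this run can be computed directly (the test-log trigger line has only second precision, so only the ansible-to-sairedis span is computed here):

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"

# Timestamps copied from the log excerpts above (same day, same timezone).
ansible_cmd = datetime.strptime("14:07:24.093000", FMT)    # swssconfig invoked via ansible
last_sairedis = datetime.strptime("14:07:44.278459", FMT)  # last NEXT_HOP entry in sairedis.rec

# Time for the mux toggle config to finish in sairedis for all 36 ports.
config_delta = (last_sairedis - ansible_cmd).total_seconds()
# ~20.19 s for this run, consistent with the 18-21 s range observed over ~10 runs.
```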

How did you do it?

(A) config.bcm generation is not required for the Cisco "gb" platform, so one swss restart is skipped to avoid hitting the restart-limit error.
(B) Introduced a 30 s delay between the mux toggle on the DUT and sending packets from the T1 (PTF).
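A minimal sketch of fix (B), with hypothetical helper and parameter names (not the actual sonic-mgmt API): push the mux toggle JSON on the DUT, then sleep before the PTF sends packets, since the mocked T0 DToR setup offers no mux status to poll:

```python
import time

# 30 s comfortably exceeds the 18-21 s observed for the config to finish
# in sairedis for all 36 ports.
MUX_CONFIG_SETTLE_SEC = 30


def toggle_mux_and_wait(duthost, json_file, settle_sec=MUX_CONFIG_SETTLE_SEC):
    """Apply a mux toggle JSON via swssconfig, then wait a fixed interval
    for the config to take effect in sairedis (illustrative sketch)."""
    duthost.command('docker exec swss sh -c "swssconfig {}"'.format(json_file))
    # No mux status is available in the mocked DToR case, so a fixed
    # sleep stands in for polling.
    time.sleep(settle_sec)
```

A fixed sleep is cruder than polling, but with no observable mux state on the mocked topology it is the only available synchronization point.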

How did you verify/test it?

Verified that the mux config JSON executes successfully, that packets are sent to the DUT only after the config has finished, and that the test case passes.

Any platform specific information?

While applying the DToR mock config to the DUT, both swss restarts are not needed on Cisco platforms: one of the restarts exists only to generate config.bcm, which is Broadcom-specific.
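The platform check for fix (A) can be sketched as follows; the function name and return values reflect only the description above, not the real test code:

```python
def swss_restarts_needed(asic_type):
    """Restarts of swss needed per mock-config pass (illustrative).

    On Cisco "gb" ASICs config.bcm is never generated, so the restart
    that exists only to regenerate it is skipped.
    """
    return 1 if asic_type == "gb" else 2


# Two passes (v4 and v6): 4 restarts on Broadcom would hit the
# 3-per-20-minute systemd start limit; "gb" stays under it.
total_gb = 2 * swss_restarts_needed("gb")
total_broadcom = 2 * swss_restarts_needed("broadcom")
```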

Supported testbed topology if it's a new test case?

Documentation

…t#8363)

* fix for failures in orchagent_standby_tor_downstream script

* Update test_orchagent_standby_tor_downstream.py

* fix for mocked T0 DToR TC failures due to config push delta
@mssonicbld (Collaborator, Author)

Original PR: #8363

@mssonicbld mssonicbld merged commit 354b592 into sonic-net:202305 Jan 8, 2024
12 checks passed
@mssonicbld mssonicbld deleted the cherry/202305/8363 branch February 4, 2024 09:42