[fast-reboot] ARP table is not restored after fast-reboot #5580

Closed
bingwang-ms opened this issue Oct 10, 2020 · 4 comments

@bingwang-ms
Contributor

bingwang-ms commented Oct 10, 2020

Description
While debugging test_fast_reboot, I noticed that the ARP table is not restored correctly after fast-reboot. I believe swssconfig did load /arp.json, because the following logs appear in syslog:

Oct  9 10:55:25.725825 str-7260cx3-acs-2 NOTICE swss#swssconfig: :- main: Loading config from JSON file:/fdb.json...
Oct  9 10:55:26.058036 str-7260cx3-acs-2 NOTICE swss#swssconfig: :- main: Loading config from JSON file:/arp.json...
Oct  9 10:55:26.223676 str-7260cx3-acs-2 NOTICE swss#swssconfig: :- main: Loading config from JSON file:/default_routes.json...
Oct  9 10:55:26.716437 str-7260cx3-acs-2 NOTICE swss#swssconfig: :- main: Loading config from JSON file:/etc/swss/config.d/00-copp.config.json...
Oct  9 10:55:27.726715 str-7260cx3-acs-2 NOTICE swss#swssconfig: :- main: Loading config from JSON file:/etc/swss/config.d/ipinip.json...
Oct  9 10:55:28.572902 str-7260cx3-acs-2 INFO swss#supervisord 2020-10-09 10:55:25,504 INFO spawned: 'swssconfig' with pid 67
Oct  9 10:55:28.572957 str-7260cx3-acs-2 INFO swss#supervisord 2020-10-09 10:55:25,504 INFO success: swssconfig entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Oct  9 10:55:28.739325 str-7260cx3-acs-2 NOTICE swss#swssconfig: :- main: Loading config from JSON file:/etc/swss/config.d/ports.json...
Oct  9 10:55:29.868372 str-7260cx3-acs-2 NOTICE swss#swssconfig: :- main: Loading config from JSON file:/etc/swss/config.d/switch.json...
Oct  9 10:55:38.582075 str-7260cx3-acs-2 INFO swss#supervisord 2020-10-09 10:55:30,871 INFO exited: swssconfig (exit status 0; expected)
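
To confirm from a saved syslog that swssconfig reported loading /arp.json, a small filter like the following could be used. It is only a sketch: the function name `loaded_json_files` is made up for illustration, and the regular expression assumes the message format shown in the excerpt above.

```python
import re

# Assumes the swssconfig message format seen in the syslog excerpt above.
LOAD_RE = re.compile(r"swssconfig: :- main: Loading config from JSON file:(\S+?)\.\.\.$")


def loaded_json_files(syslog_lines):
    """Return the JSON paths swssconfig reported loading, in log order."""
    paths = []
    for line in syslog_lines:
        match = LOAD_RE.search(line.rstrip())
        if match:
            paths.append(match.group(1))
    return paths


sample = [
    "Oct  9 10:55:25.725825 str-7260cx3-acs-2 NOTICE swss#swssconfig: :- main: Loading config from JSON file:/fdb.json...",
    "Oct  9 10:55:26.058036 str-7260cx3-acs-2 NOTICE swss#swssconfig: :- main: Loading config from JSON file:/arp.json...",
    "Oct  9 10:55:28.572902 str-7260cx3-acs-2 INFO swss#supervisord 2020-10-09 10:55:25,504 INFO spawned: 'swssconfig' with pid 67",
]
print("/arp.json" in loaded_json_files(sample))
```

Note that this only shows the file was read, not that its entries survived in the database afterwards.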

However, no restored ARP entries appear in the DB or on the system.

admin@str-7260cx3-acs-2:~$ redis-cli -n 0 keys "NEIGH_TABLE*"
 1) "NEIGH_TABLE:PortChannel0004:fc00::2e"
 2) "NEIGH_TABLE:eth0:10.64.247.30"
 3) "NEIGH_TABLE:PortChannel0002:fc00::26"
 4) "NEIGH_TABLE:eth0:10.64.247.234"
 5) "NEIGH_TABLE:eth0:10.64.247.12"
 6) "NEIGH_TABLE:PortChannel0001:fc00::22"
 7) "NEIGH_TABLE:PortChannel0002:10.0.0.35"
 8) "NEIGH_TABLE:PortChannel0001:10.0.0.33"
 9) "NEIGH_TABLE:PortChannel0003:10.0.0.37"
10) "NEIGH_TABLE:PortChannel0003:fc00::2a"
11) "NEIGH_TABLE:PortChannel0004:10.0.0.39"
12) "NEIGH_TABLE:eth0:10.64.247.11"
13) "NEIGH_TABLE:eth0:10.64.246.1"
admin@str-7260cx3-acs-2:~$ show arp
Address        MacAddress         Iface            Vlan
-------------  -----------------  ---------------  ------
10.0.0.33      52:54:00:a5:78:8b  PortChannel0001  -
10.0.0.35      52:54:00:d1:f3:cf  PortChannel0002  -
10.0.0.37      52:54:00:e1:7e:6c  PortChannel0003  -
10.0.0.39      52:54:00:ca:7f:6b  PortChannel0004  -
10.64.246.1    00:e0:ec:83:b8:0f  eth0             -
10.64.247.11   28:99:3a:6b:71:f0  eth0             -
10.64.247.12   28:99:3a:16:ee:24  eth0             -
10.64.247.30   80:3f:5d:08:0d:8e  eth0             -
10.64.247.234  28:99:3a:a2:64:28  eth0             -
10.64.247.239  28:99:3a:17:19:f9  eth0             -
Total number of entries 10 
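
The comparison above can be automated. The sketch below lists the NEIGH_TABLE keys that a backed-up arp.json expects but that are absent from the `redis-cli keys` output. It rests on assumptions: the arp.json layout (a list of objects, each holding one `NEIGH_TABLE:...` key plus an `OP` field) is how I understand the swssconfig input format and should be checked against a real dump, and the `Vlan1000` entry in the sample is hypothetical, not taken from this DUT.

```python
import json
import re


def expected_neigh_keys(arp_json_text):
    """Collect NEIGH_TABLE keys from swssconfig-style JSON (assumed layout)."""
    keys = set()
    for entry in json.loads(arp_json_text):
        for key in entry:
            if key.startswith("NEIGH_TABLE:"):
                keys.add(key)
    return keys


def parse_redis_keys(cli_output):
    """Parse `redis-cli keys` output lines such as ` 1) "NEIGH_TABLE:..."`."""
    keys = set()
    for line in cli_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop the optional `N) ` index prefix, then the surrounding quotes.
        line = re.sub(r"^\d+\)\s*", "", line)
        keys.add(line.strip('"'))
    return keys


def missing_neighbors(arp_json_text, cli_output):
    """Keys present in the backup but not in APPL_DB."""
    return sorted(expected_neigh_keys(arp_json_text) - parse_redis_keys(cli_output))


# Hypothetical backup entry, for illustration only.
sample_arp_json = json.dumps([
    {"NEIGH_TABLE:Vlan1000:192.168.0.2": {"neigh": "00:11:22:33:44:55", "family": "IPv4"}, "OP": "SET"},
    {"NEIGH_TABLE:PortChannel0001:10.0.0.33": {"neigh": "52:54:00:a5:78:8b", "family": "IPv4"}, "OP": "SET"},
])
sample_redis = ' 1) "NEIGH_TABLE:PortChannel0004:fc00::2e"\n 2) "NEIGH_TABLE:PortChannel0001:10.0.0.33"\n'
print(missing_neighbors(sample_arp_json, sample_redis))
```

An empty result would mean every backed-up neighbor is present in the database.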

The issue is possibly caused by /usr/bin/restore_neighbors.py, because the following logs appear in syslog:

Oct  9 10:55:33.480411 str-7260cx3-acs-2 INFO swss#restore_neighbor: restore_neighbors service is started
Oct  9 10:55:33.481690 str-7260cx3-acs-2 INFO swss#restore_neighbor: restore_neighbors service is skipped as warm restart not enabled

Steps to reproduce the issue:

  1. Run the test case test_fast_reboot, which fast-reboots the DUT after the FDB and ARP tables are fully built.
  2. After the fast-reboot, check the ARP table on the DUT with show arp. The ARP table is not restored, so the test case fails: repopulating the ARP table takes a relatively long time (about 100 seconds).

Describe the results you received:
The test case test_fast_reboot fails because the ARP table is not restored after fast-reboot.

Describe the results you expected:
The test case test_fast_reboot passes, and the ARP table is completely restored after fast-reboot.

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**
 SONiC Software Version: SONiC.20191130.50
Distribution: Debian 9.13
Kernel: 4.9.0-11-2-amd64
Build commit: a3a83a5ff
Build date: Wed Sep 30 10:53:09 UTC 2020
Built by: sonicbld@jenkins-slave-phx-2

Platform: x86_64-arista_7260cx3_64
HwSKU: Arista-7260CX3-D108C8
ASIC: broadcom
Serial Number: SSJ17432414
Uptime: 01:45:32 up 30 min,  1 user,  load average: 2.25, 2.77, 2.61

Docker images:
REPOSITORY                 TAG                 IMAGE ID            SIZE
docker-snmp-sv2            20191130.50         84501f20a868        348MB
docker-snmp-sv2            latest              84501f20a868        348MB
docker-fpm-frr             20191130.50         3d4f0939d315        335MB
docker-fpm-frr             latest              3d4f0939d315        335MB
docker-acms                20191130.50         ec9d71a4c1d6        182MB
docker-acms                latest              ec9d71a4c1d6        182MB
docker-lldp-sv2            20191130.50         aa9ab71441fe        312MB
docker-lldp-sv2            latest              aa9ab71441fe        312MB
docker-orchagent           20191130.50         39883f44a24a        333MB
docker-orchagent           latest              39883f44a24a        333MB
docker-teamd               20191130.50         e3d666566daa        314MB
docker-teamd               latest              e3d666566daa        314MB
docker-syncd-brcm          20191130.50         5f62566533cd        436MB
docker-syncd-brcm          latest              5f62566533cd        436MB
docker-platform-monitor    20191130.50         f6d63e4e5335        357MB
docker-platform-monitor    latest              f6d63e4e5335        357MB
docker-sonic-telemetry     20191130.50         a94d370a4037        353MB
docker-sonic-telemetry     latest              a94d370a4037        353MB
docker-database            20191130.50         6508e2881177        289MB
docker-database            latest              6508e2881177        289MB
docker-router-advertiser   20191130.50         26f8342ff859        289MB
docker-router-advertiser   latest              26f8342ff859        289MB
docker-dhcp-relay          20191130.50         94448cdcc144        299MB
docker-dhcp-relay          latest              94448cdcc144        299MB
k8s.gcr.io/pause           3.2                 80d28bedfe5d        683kB

Attach debug file sudo generate_dump:
syslog.zip

@anshuv-mfst

Fix merged, issue can be closed. @bingwang-ms

@bingwang-ms
Contributor Author

> Fix merged, issue can be closed. @bingwang-ms

I don't think this issue is fixed by sonic-net/sonic-utilities#1164: there is no unicode issue in fast-reboot-dump.py on the 201911 branch, and arp.json was backed up successfully. I suspect the problem is caused by incorrect logic in restoring ARP, so I am reopening the issue.

bingwang-ms reopened this Oct 15, 2020
@bingwang-ms
Contributor Author

This issue is also observed on the master branch.

qiluo-msft pushed a commit to sonic-net/sonic-swss that referenced this issue Nov 18, 2020
…s enable. (#1498)

This commit is to address the issue that the NEIGH_TABLE loaded by swssconfig
after fast-reboot is cleared by neighsyncd.

**What I did**
Fix sonic-net/sonic-buildimage#5841 and sonic-net/sonic-buildimage#5580

We found that the neighbor table loaded by `swssconfig` from `arp.json` after `fast-reboot` is mistakenly cleared by `neighsyncd` at the initial stage. This PR adds a check for `WarmStart` before cleaning up, so the cleanup runs only if `WarmStart` is enabled.
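
The effect of the added check can be illustrated with a toy model. This is not the neighsyncd code (which is C++); the function names and the dictionary standing in for NEIGH_TABLE are hypothetical, and `clear()` is only a stand-in for whatever cleanup neighsyncd actually performs.

```python
def cleanup_before_fix(neigh_table):
    """Toy model of the old behavior: cleanup always runs at startup."""
    neigh_table.clear()  # stand-in for the real cleanup
    return neigh_table


def cleanup_after_fix(neigh_table, warm_start_enabled):
    """Toy model of the fixed behavior: cleanup runs only for warm start."""
    if warm_start_enabled:
        neigh_table.clear()  # stand-in for the real cleanup
    return neigh_table


def preloaded():
    # Hypothetical entry, as if swssconfig had just loaded it from arp.json.
    return {"NEIGH_TABLE:PortChannel0001:10.0.0.33": {"neigh": "52:54:00:a5:78:8b"}}


# Fast-reboot: warm start is not enabled.
print(len(cleanup_before_fix(preloaded())))        # restored entries are lost
print(len(cleanup_after_fix(preloaded(), False)))  # restored entries survive
```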

**Why I did it**
This PR fixes the issue that the ARP table is not recovered after fast-reboot.

**How I verified it**
Verified on an Arista-7260 running a 201911 image.
1. Run a test that populates ARP entries on the DUT, such as `test_fast_reboot`.
2. Issue a fast-reboot.
3. Verify that the `arp.json` backed up by `fast-reboot-dump.py` is loaded and NEIGH_TABLE is restored.
@bingwang-ms
Contributor Author

Fixed in sonic-net/sonic-swss#1498

abdosi pushed a commit to sonic-net/sonic-swss that referenced this issue Dec 4, 2020
…s enable. (#1498)

daall pushed a commit to daall/sonic-swss that referenced this issue Dec 7, 2020
…s enable. (sonic-net#1498)
