Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Routeorch calls SAI API set attribute for a routing entry which was just removed from SAI in the same bulk operation. Should recreate it in such case. #9434

Closed
stephenxs opened this issue Dec 2, 2021 · 9 comments · Fixed by sonic-net/sonic-swss#2071

Comments

@stephenxs
Copy link
Collaborator

stephenxs commented Dec 2, 2021

Description

Orchagent exited due to setting a non-existing routing entry during bgp speaker test with the latest swss
The issue was observed during azure pipeline kvm t0 bgp speaker test of PR #9397.

Steps to reproduce the issue:

Describe the results you received:

The following error observed during bgp speaker test

Nov 30 18:18:03.998387 vlab-01 ERR swss#orchagent: :- meta_sai_validate_route_entry: object key SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"193.9.80.128/25","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000022"} doesn't exist
Nov 30 18:18:03.998979 vlab-01 ERR swss#orchagent: :- flush_setting_entries: EntityBulker.flush set entry attribute failed, number of entries to set: 1000, status: SAI_STATUS_INVALID_PARAMETER
Nov 30 18:18:03.999571 vlab-01 ERR swss#orchagent: :- meta_sai_validate_route_entry: object key SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"192.236.72.128/25","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000022"} doesn't exist
Nov 30 18:18:04.000069 vlab-01 ERR swss#orchagent: :- flush_setting_entries: EntityBulker.flush set entry attribute failed, number of entries to set: 1000, status: SAI_STATUS_INVALID_PARAMETER
Nov 30 18:18:04.000690 vlab-01 ERR swss#orchagent: :- meta_sai_validate_route_entry: object key SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"192.254.80.128/25","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000022"} doesn't exist
Nov 30 18:18:04.001128 vlab-01 ERR swss#orchagent: :- flush_setting_entries: EntityBulker.flush set entry attribute failed, number of entries to set: 1000, status: SAI_STATUS_INVALID_PARAMETER
Nov 30 18:18:04.001549 vlab-01 ERR swss#orchagent: :- meta_sai_validate_route_entry: object key SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"192.188.120.128/25","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000022"} doesn't exist
Nov 30 18:18:04.001984 vlab-01 ERR swss#orchagent: :- flush_setting_entries: EntityBulker.flush set entry attribute failed, number of entries to set: 321, status: SAI_STATUS_INVALID_PARAMETER
Nov 30 18:18:04.003547 vlab-01 ERR swss#orchagent: :- addRoutePost: Failed to set route 0.0.0.0/0 with packet action forward, -23
Nov 30 18:18:04.003869 vlab-01 ERR swss#orchagent: :- handleSaiSetStatus: Encountered failure in set operation, exiting orchagent, SAI API: SAI_API_ROUTE, status: SAI_STATUS_NOT_EXECUTED

route orchagent operating flow on the routing entries:

  • set,
  • then remove
  • and then set again.

So for the second set operation, route orchagent should call "create" instead of "set".
However, according to sairedis.rec, it called "set" for the second set operation when the routing entries didn't exist in SAI, which caused SAIredis failure.

ROUTE_TABLE in swss.rec:

2021-11-30.18:17:48.627923|ROUTE_TABLE:192.188.120.128/25|SET|nexthop:10.0.0.59,10.0.0.61,10.0.0.63|ifname:PortChannel0002,PortChannel0003,PortChannel0004
2021-11-30.18:17:49.928608|ROUTE_TABLE:192.236.72.128/25|SET|nexthop:10.0.0.59,10.0.0.61,10.0.0.63|ifname:PortChannel0002,PortChannel0003,PortChannel0004
2021-11-30.18:17:50.119089|ROUTE_TABLE:192.254.80.128/25|SET|nexthop:10.0.0.59,10.0.0.61,10.0.0.63|ifname:PortChannel0002,PortChannel0003,PortChannel0004
2021-11-30.18:17:50.380784|ROUTE_TABLE:193.9.80.128/25|SET|nexthop:10.0.0.59,10.0.0.61,10.0.0.63|ifname:PortChannel0002,PortChannel0003,PortChannel0004
2021-11-30.18:17:58.852071|ROUTE_TABLE:192.188.120.128/25|DEL
2021-11-30.18:17:58.872891|ROUTE_TABLE:192.254.80.128/25|DEL
2021-11-30.18:17:58.901643|ROUTE_TABLE:193.9.80.128/25|DEL
2021-11-30.18:17:58.941370|ROUTE_TABLE:192.236.72.128/25|DEL
2021-11-30.18:18:00.723525|ROUTE_TABLE:192.236.72.128/25|SET|nexthop:10.0.0.57|ifname:PortChannel0001
2021-11-30.18:18:00.997692|ROUTE_TABLE:193.9.80.128/25|SET|nexthop:10.0.0.57|ifname:PortChannel0001
2021-11-30.18:18:01.004276|ROUTE_TABLE:192.254.80.128/25|SET|nexthop:10.0.0.57|ifname:PortChannel0001
2021-11-30.18:18:01.455867|ROUTE_TABLE:192.188.120.128/25|SET|nexthop:10.0.0.57|ifname:PortChannel0001

Describe the results you expected:

Should not be error.

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

The flow should be like this.
During the previous test, it announced the routing entries at the beginning and withdrew them after test finished. And then the next test started, the same set of routing entries were announced again.
So for each routing entry P, fpmsyncd issued the next operations:

  • add route P
  • remove route P
  • add route P

routeorch should handle the notifications as below:

  • add route P: call SAI API create_route in bulk mode and then insert P into m_syncdRoutes after bulk call returned
  • remove route P: call SAI API remove_route in bulk mode and then remove it from m_syncdRoutes after bulk call returned
  • add route P: call SAI API create_route in bulk mode and then insert P into m_syncdRoutes after bulk call returned

However, since there were a large amount of routing entries being notified, the notifications of P for remove and the second add were packed and handled in the same bulk call.

Then the flow was:

  • add route P: called SAI API gRouteBulker.create_entry in bulk mode and then inserted P into m_syncdRoutes after bulk call returned
  • remove route P: called SAI API gRouteBulker.remove_entry in bulk mode, the call was pending.
  • add route P: called SAI API gRouteBulker.set_entry_attribute since P had not been removed from m_syncdRoutes, which is problematic because when this operation would be handled by SAI P had been removed.
  • called bulk.flush and then handled the result:

When SAI handled the operations,

  • remove_entry: notified vendor SAI to remove P and removed P from m_saiObjectCollection as well.
  • set_entry_attribute: reported error since P had been removed from m_saiObjectCollection when remove_entry was called for P. Therefore, it notified orchagent to exit.

I think this is a bug in the routeorch for a while but it occurs only in the very rare scenarios.
It is observed after PR sonic-net/sonic-swss#1992 was merged because in that PR there is an optimization that pushes notifications from redis-db table to m_toSync as many as possible:
The old logic:

        std::deque<KeyOpFieldsValuesTuple> entries;
        subTable->pops(entries);
        return addToSync(entries);

The new logic:

        do
        {
            std::deque<KeyOpFieldsValuesTuple> entries;
            subTable->pops(entries);
            update_size = addToSync(entries);
            total_size += update_size;
        } while (update_size != 0);
        return total_size;

So in the new logic, there is a higher probability for orchagent to merge all notifications together and for remove and the second set to be bundled in one bulk, and therefore, it is more likely to trigger the bug.

A solution can be:

  • Introduce a flag indicating whether a prefix is pending removal.
  • In add route process, if a prefix with the pending removal flag is about to be added, don't handle it until the bulk has been flushed.

WAs can be:

  • Revert the optimization
  • Add delay between two tests where orchagent was aborted.
@prsunny
Copy link
Contributor

prsunny commented Dec 2, 2021

Can you check if the failure is due to a regression in code or test? Not sure if there was any recent change in route_orch for this flow.

@stephenxs
Copy link
Collaborator Author

stephenxs commented Dec 4, 2021

@prsunny Looks like it's a production issue in route orch introduced by bulk operation but was hardly to be trigerred. Now with sonic-net/sonic-swss#1992 merged it is much easier to be.

@stephenxs
Copy link
Collaborator Author

This issue is reproducible by using gdb.

@stephenxs stephenxs changed the title Orchagent exited due to setting a non-existing routing entry during bgp speaker test with the latest swss Routeorch call SAI API set attribute for a routing entry which was was just removed from SAI in the same bulk operation. Should recreate it in such case. Dec 6, 2021
@stephenxs stephenxs changed the title Routeorch call SAI API set attribute for a routing entry which was was just removed from SAI in the same bulk operation. Should recreate it in such case. Routeorch calls SAI API set attribute for a routing entry which was was just removed from SAI in the same bulk operation. Should recreate it in such case. Dec 6, 2021
@stephenxs stephenxs changed the title Routeorch calls SAI API set attribute for a routing entry which was was just removed from SAI in the same bulk operation. Should recreate it in such case. Routeorch calls SAI API set attribute for a routing entry which was just removed from SAI in the same bulk operation. Should recreate it in such case. Dec 6, 2021
@prsunny
Copy link
Contributor

prsunny commented Dec 7, 2021

@shi-su , could you please take a look?

@liat-grozovik
Copy link
Collaborator

@shi-su could you please update?
this is blocking swss submodule pr to get into sonic master

@shi-su
Copy link
Contributor

shi-su commented Dec 7, 2021

I'll check it today.

@abdosi
Copy link
Contributor

abdosi commented Dec 8, 2021

@abdosi following this.

@shi-su
Copy link
Contributor

shi-su commented Dec 8, 2021

@stephenxs I drafted a PR sonic-net/sonic-swss#2071 that would potentially resolve the issue. Would you mind adding a test case for the issue?

@abdosi Let me know if you have any thoughts or comments.

@liat-grozovik
Copy link
Collaborator

@stephenxs I drafted a PR Azure/sonic-swss#2071 that would potentially resolve the issue. Would you mind adding a test case for the issue?

@abdosi Let me know if you have any thoughts or comments.

Not sure i understand your request to add test.
The issue has been raised while test was running so we have a test. "azure pipeline kvm t0 bgp speaker test."
If you have new flow in your draft PR i suggest the test will be enhanced in that PR

shi-su added a commit to sonic-net/sonic-swss that referenced this issue Dec 14, 2021
…set operations (#2071)

What I did
Check if there are items pending removal in bulk before calling bulk set API.

Fixes sonic-net/sonic-buildimage#9434

Why I did it
When there are items pending removal in bulk before calling set API, it means the item will be removed before the set and it should do create instead.
preetham-singh pushed a commit to preetham-singh/sonic-swss that referenced this issue Aug 6, 2022
…set operations (sonic-net#2071)

What I did
Check if there are items pending removal in bulk before calling bulk set API.

Fixes sonic-net/sonic-buildimage#9434

Why I did it
When there are items pending removal in bulk before calling set API, it means the item will be removed before the set and it should do create instead.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants