Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fabricportsorch] Add fabric support #1459

Merged
merged 8 commits into from
May 14, 2021
Merged

Conversation

ngoc-do
Copy link
Contributor

@ngoc-do ngoc-do commented Oct 7, 2020

This code is to add support for fabric asics and NPU with fabric ports enabled.

What I did

  1. Create FabricOrchDaemon for fabric asics, instead of using OrchDaemon which is used for NPU.

  2. Create FabricPortsOrch to manage fabric ports. It collects information about port state, peer switch id and peer lane (stored in STATE_DB), sets up port stats and queue states. In future, it will also be used to enable/disable erroneous fabric ports.

  3. Fabric port and queue stats are setup to be collected via FlexCounters.

One thing to note here: The code is waiting for #1431 to start fabric flow for fabric asic/NPU with enabled fabric ports. So right now, there is no code that creates objects for FabricOrchDaemon/FabricPortsOrch yet. We need to know asic type (and if asic needs fabric ports to be enabled) to do that. These information should come from orchagent/main.cpp, and will be available after #1431 merges.

Why I did it

Initialize fabric asics and support traffic across multiple NPUs.

How I verified it

Tested in systems with fabric asics (BCM 88790) and NPU with fabric ports (BCM 88690).

Details if related

This code will require to update a few other things:

(1) syncd: to ask FlexCounter to collect fabric counters (sonic-net/sonic-sairedis#669)

(2) docker-orchagent: disable orchagent-portsyncd dependency. In this pull request, after fabric asic is initialized (by SAI), all fabric ports will be enabled by default. So portsyncd is not used for fabric asic at all, and orchagent should not wait for portsyncd (sonic-net/sonic-buildimage#5569)

This pull request doesn't include:

  • Cli commands to show portstat and queuestat for fabric ports. Typically fabric is used in multi-asic environment, and portstat, queuestat seem not to be supported for multi asic systems yet. We will add them later in next requests.

The implementation is based on the design doc proposed in sonic-net/SONiC#668.

@ghost
Copy link

ghost commented Oct 7, 2020

CLA assistant check
All CLA requirements met.

@lgtm-com
Copy link

lgtm-com bot commented Oct 7, 2020

This pull request introduces 1 alert and fixes 1 when merging 6162bbe into e4dfb37 - view on LGTM.com

new alerts:

  • 1 for Non-virtual destructor in base class

fixed alerts:

  • 1 for Resource not released in destructor

@lguohan lguohan requested a review from rlhui October 8, 2020 13:35

if (attr.value.booldata)
{
attr.id = SAI_PORT_ATTR_FABRIC_ATTACHED_SWITCH_ID;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the crc error collection here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we detect if a link flapped between the polling period? Is that a count that can be maintained in STATE_DB?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add counter list, one counter for one port, that counts the number of times the link flaps. I also maintain the last time the link flaps.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CRC error collection will be added in another PR. It's because sonic-sairedis now doesn't understand attribute type which is used for getting fabric port error list. So we will need to modify sonic-sairedis, and then add CRC in sonic-swss.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Sounds good to me.

@anshuv-mfst
Copy link

@minionatwork (Suresh) - please add as reviewer.

@lguohan
Copy link
Contributor

lguohan commented Nov 4, 2020

wonder why this pr does not have swss vs test cases added?

@lguohan lguohan added the chassis label Nov 4, 2020
Copy link
Contributor

@lguohan lguohan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add swss vs tests.

@ngoc-do
Copy link
Contributor Author

ngoc-do commented Nov 5, 2020

please add swss vs tests.

Sure, I will look into that.

@lgtm-com
Copy link

lgtm-com bot commented Nov 10, 2020

This pull request introduces 1 alert and fixes 1 when merging 4770917 into b9084a7 - view on LGTM.com

new alerts:

  • 1 for Non-virtual destructor in base class

fixed alerts:

  • 1 for Resource not released in destructor

@lgtm-com
Copy link

lgtm-com bot commented Feb 8, 2021

This pull request introduces 1 alert and fixes 1 when merging 91df50a into 189a964 - view on LGTM.com

new alerts:

  • 1 for Non-virtual destructor in base class

fixed alerts:

  • 1 for Resource not released in destructor

@ngoc-do
Copy link
Contributor Author

ngoc-do commented Feb 9, 2021

please add swss vs tests.

Hello @lguohan: vstest for fabric ports has been updated now.

There are PRs in sonic-sairedis which merged to add support fabric ports in vslib - sonic-net/sonic-sairedis#737 and sonic-net/sonic-sairedis#769.

The above PRs merged, but their code doesn't get invoked until sonic-net/sonic-buildimage#6185 merges. That is to set up fabric lane mapping, and so to create fabric ports. Because if we merge it, the following tests will fail:

test_virtual_chassis.TestVirtualChassis.test_connectivity
test_virtual_chassis.TestVirtualChassis.test_voq_switch
test_virtual_chassis.TestVirtualChassis.test_chassis_app_db_sync
test_virtual_chassis.TestVirtualChassis.test_chassis_system_interface
test_virtual_chassis.TestVirtualChassis.test_chassis_system_neigh

Why is that? Because now vslib is fed with fabric lane mapping, it will create 16 fabric ports for VOQ asic. Together with 32 front panel ports, VOQ asic will have 32 + 16 = 48 ports (type SAI_OBJECT_TYPE_PORT). But the tests expect only 32 ports only. So they fail.

conftest.py:490: in check_swss_ready
    asic_db.wait_for_n_keys("ASIC_STATE:SAI_OBJECT_TYPE_PORT", num_ports + 1)  # +1 CPU Port

In this PR's swss tests, I updated the expected number of ports with fabric ports. But since sonic-net/sonic-buildimage#6185 can't merge, this PR also has the above test failures (in the opposite way: it expects 48 ports, but only 32 ports created - no fabric).

So one possible solution is that in this PR, I won't update the expected number with fabric ports (keep 32 only for front panel ones). So the tests won't fail. And we will merge it up first. Once it merges, we disable the tests, merge sonic-net/sonic-buildimage#6185, and then make another PR to reenable the tests. sonic-net/sonic-buildimage#6185 is fairly simple, so it won't take much down time for the tests.

Please let me know what you think?

@abdosi
Copy link
Contributor

abdosi commented Mar 1, 2021

can you add design document link to this PR in description

};

static const vector<sai_queue_stat_t> queue_stat_ids =
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these valid for Fabric ports ?

Copy link
Contributor Author

@ngoc-do ngoc-do Mar 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, those are valid for fabric ports. But currently we are not able to query those stats yet until jupiter-sai completely supports them. I keep them here though.

Those stats will be setup in generateQueueStats(), which is currently empty. For reference purpose, generateQueueStats() will do the below thing.

For each port:

            // Maintain queue map and install flex counters for queue stats
            vector<FieldValueTuple> laneQueueMap;
            for (size_t q = 0; q < m_queue_ids.size(); q ++)
            {
                std::ostringstream name;
                name << "FabricPort" << lane << ":" << q;
                laneQueueMap.emplace_back(name.str(), sai_serialize_object_id(m_queue_ids[q]));

                std::unordered_set<string> counter_stats;
                for (const auto& it: queue_stat_ids)
                {
                    counter_stats.emplace(sai_serialize_queue_stat(it));
                }
                queue_stat_manager.setCounterIdList(m_queue_ids[q], CounterType::QUEUE, counter_stats);
            }
            m_laneQueueCounterTable->set("", laneQueueMap);

I'll update generateQueueStats() to use queue_stat_ids once we have fully support from jupiter-sai.

Orch(appl_db, tableNames),
port_stat_manager(FABRIC_PORT_STAT_COUNTER_FLEX_COUNTER_GROUP, StatsMode::READ,
FABRIC_PORT_STAT_FLEX_COUNTER_POLLING_INTERVAL_MS, true),
queue_stat_manager(FABRIC_QUEUE_STAT_COUNTER_FLEX_COUNTER_GROUP, StatsMode::READ,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i assume this is for queue_stat_ids ? So again is this needed for fabric port.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, once jupiter-sai supports queue stats query, queue_stat_manager is used to set up queue_stat_ids in FLEX_COUNTER_DB. syncd periodically does its job by reading those counters and updating COUNTERS_DB.

m_stateTable = unique_ptr<Table>(new Table(m_state_db.get(), FABRIC_PORT_TABLE));

m_counter_db = shared_ptr<DBConnector>(new DBConnector("COUNTERS_DB", 0));
m_laneQueueCounterTable = unique_ptr<Table>(new Table(m_counter_db.get(), COUNTERS_QUEUE_NAME_MAP));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does COUNTERS_QUEUE_NAME_MAP and COUNTERS_QUEUE_PORT_MAP looks like for fabric port ?
Can you add comment in the description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once jupiter-sai supports port and queue stats completely for fabric ports on both fabric and voq asic, this will look like below in COUNTERS_DB:

127.0.0.1:6379[2]> select 2
OK
127.0.0.1:6379[2]> HGETALL "COUNTERS_QUEUE_PORT_MAP"
  1) "FabricPort0"
  2) "oid:0x401000000000006"
127.0.0.1:6379[2]> HGETALL "COUNTERS:oid:0x401000000000006"
 1) "SAI_PORT_STAT_IF_OUT_FABRIC_DATA_UNITS"
 2) "0"
 3) "SAI_PORT_STAT_IF_IN_ERRORS"
 4) "0"
 5) "SAI_PORT_STAT_IF_IN_OCTETS"
 6) "0"
 7) "SAI_PORT_STAT_IF_IN_FABRIC_DATA_UNITS"
 8) "0"
 9) "SAI_PORT_STAT_IF_IN_FEC_SYMBOL_ERRORS"
10) "0"
11) "SAI_PORT_STAT_IF_OUT_OCTETS"
12) "0"
13) "SAI_PORT_STAT_IF_IN_FEC_CORRECTABLE_FRAMES"
14) "0"
15) "SAI_PORT_STAT_IF_IN_FEC_NOT_CORRECTABLE_FRAMES"
16) "0"
127.0.0.1:6379[2]> HGETALL "COUNTERS_QUEUE_NAME_MAP"
  1) "FabricPort0:0"
  2) "oid:0x4150000000000c6"
127.0.0.1:6379[2]> HGETALL "COUNTERS:oid:0x4150000000000c6"
1) "SAI_QUEUE_STAT_CURR_OCCUPANCY_LEVEL"
2) "0"
3) "SAI_QUEUE_STAT_WATERMARK_LEVEL"
4) "24"
5) "SAI_QUEUE_STAT_CURR_OCCUPANCY_BYTES"
6) "0"

@@ -49,6 +50,7 @@ OrchDaemon::OrchDaemon(DBConnector *applDb, DBConnector *configDb, DBConnector *
m_chassisAppDb(chassisAppDb)
{
SWSS_LOG_ENTER();
m_select = new Select();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are we doing here ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are we doing here ?

ok got it. we have moved to base class for select object initialization

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I had to move it to base class since it's used for both npu and fabric.

@@ -68,6 +70,7 @@ OrchDaemon::~OrchDaemon()
for(; it != m_orchList.rend(); ++it) {
delete(*it);
}
delete m_select;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as previous what is this change for ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok got it. we have moved to base class for select object initialization

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I had to move it to base class since it's used for npu and fabric too.


if (m_getFabricPortListDone)
{
updateFabricPortState();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add link to the PR description where these State DB fields are define. Also looks like current VS test case added as part of this do not cover state db changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I updated the description with the link.

The VS test already verifies the number of fabric ports in STATE_DB which is 16 fabric ports created for voq asic. Please look at the test part:

        # Verify that fabric ports are monitored in STATE_DB
        if metadata.get('switch_type', 'npu') in ['voq', 'fabric']:
            self.get_state_db()
            self.state_db.wait_for_n_keys("FABRIC_PORT_TABLE", 16)

Below is what we see from a container of a linecard created by a vs test.

root@a5959e6e0a4a:/# sonic-db-cli STATE_DB HGETALL 'FABRIC_PORT_TABLE|PORT10'
{'STATUS': 'down'}

In addition to the change in the test that verifies STATE_DB, sonic-net/sonic-sairedis#737 merged to vslib to create fabric ports for voq asics.

Note that we don't test fabric asics because it will require significant changes in the tests - all tests now assume testing asic is NPU with front panel ports.

Copy link
Contributor

@abdosi abdosi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Have some comments for clarification.

@ngoc-do
Copy link
Contributor Author

ngoc-do commented Mar 10, 2021

Hello @abdosi, could you please check my responses, and see if you can approve?

One question is actually here that I will need to comment out 5 failing tests. Are you OK with that?

The reason is that there are PRs in sonic-sairedis which merged to add support fabric ports in vslib - sonic-net/sonic-sairedis#737 and sonic-net/sonic-sairedis#769.

The above PRs merged, but their code doesn't get invoked until sonic-net/sonic-buildimage#6185 merges. That is to set up fabric lane mapping, and so to create fabric ports. Because if we merge it, the following tests will fail:

test_virtual_chassis.TestVirtualChassis.test_connectivity
test_virtual_chassis.TestVirtualChassis.test_voq_switch
test_virtual_chassis.TestVirtualChassis.test_chassis_app_db_sync
test_virtual_chassis.TestVirtualChassis.test_chassis_system_interface
test_virtual_chassis.TestVirtualChassis.test_chassis_system_neigh

Why is that? Because now vslib is fed with fabric lane mapping, it will create 16 fabric ports for VOQ asic. Together with 32 front panel ports, VOQ asic will have 32 + 16 = 48 ports (type SAI_OBJECT_TYPE_PORT). But the tests expect only 32 ports only. So they fail.

conftest.py:490: in check_swss_ready
asic_db.wait_for_n_keys("ASIC_STATE:SAI_OBJECT_TYPE_PORT", num_ports + 1) # +1 CPU Port

In this PR's swss tests, I updated the expected number of ports with fabric ports. But since sonic-net/sonic-buildimage#6185 can't merge, this PR also has the above test failures (in the opposite way: it expects 48 ports, but only 32 ports created - no fabric).

So the solution is I will disable the above tests, merge this PR, and then merge sonic-net/sonic-buildimage#6185, and then reenable the tests.

Does it sound good to you? If yes, I will add change to disable the tests in this PR.

@ngoc-do
Copy link
Contributor Author

ngoc-do commented Mar 31, 2021

retest vs please

@lgtm-com
Copy link

lgtm-com bot commented Apr 2, 2021

This pull request introduces 1 alert and fixes 1 when merging e368411 into cba6576 - view on LGTM.com

new alerts:

  • 1 for Non-virtual destructor in base class

fixed alerts:

  • 1 for Resource not released in destructor

@ngoc-do
Copy link
Contributor Author

ngoc-do commented Apr 2, 2021

OK, so with the latest commit, I just disabled checking fabric port in vs test, instead of disabling all failing tests. Voq code is still under testing.

When sonic-net/sonic-buildimage#6185 merges, I will just make another PR that changes FABRIC_NUM_PORTS = 16, and fabric code testing will be covered.

@lguohan @abdosi please review. Thanks.

@lgtm-com
Copy link

lgtm-com bot commented Apr 28, 2021

This pull request introduces 1 alert and fixes 1 when merging cbff2cb into ee7a735 - view on LGTM.com

new alerts:

  • 1 for Non-virtual destructor in base class

fixed alerts:

  • 1 for Resource not released in destructor

@ngoc-do
Copy link
Contributor Author

ngoc-do commented Apr 29, 2021

/azpw run

@lgtm-com
Copy link

lgtm-com bot commented Apr 30, 2021

This pull request introduces 1 alert and fixes 1 when merging 03d9e26 into ee7a735 - view on LGTM.com

new alerts:

  • 1 for Non-virtual destructor in base class

fixed alerts:

  • 1 for Resource not released in destructor

eswaranb
eswaranb previously approved these changes May 3, 2021
@anshuv-mfst
Copy link

Hi @lguohan - could you please review again, comments taken care off. Thanks

lguohan
lguohan previously approved these changes May 10, 2021
@lgtm-com
Copy link

lgtm-com bot commented May 11, 2021

This pull request introduces 1 alert and fixes 1 when merging f054f8b into e29d566 - view on LGTM.com

new alerts:

  • 1 for Non-virtual destructor in base class

fixed alerts:

  • 1 for Resource not released in destructor

@rlhui rlhui removed their request for review May 11, 2021 02:59
@ngoc-do
Copy link
Contributor Author

ngoc-do commented May 11, 2021

retest vs please

@anshuv-mfst
Copy link

Hi @abdosi - could you please initiate azp run.

@ngoc-do
Copy link
Contributor Author

ngoc-do commented May 13, 2021

@abdosi

@abdosi Could you please help rerun the tests?

@ngoc-do ngoc-do dismissed stale reviews from lguohan and eswaranb via 38ddc7a May 14, 2021 16:09
@lgtm-com
Copy link

lgtm-com bot commented May 14, 2021

This pull request introduces 1 alert and fixes 1 when merging 38ddc7a into 73ffd5f - view on LGTM.com

new alerts:

  • 1 for Non-virtual destructor in base class

fixed alerts:

  • 1 for Resource not released in destructor

@ngoc-do
Copy link
Contributor Author

ngoc-do commented May 14, 2021

@lguohan I just resolved conflict one more time, and the tests got run again. All passed.

Could you please approach it again? I think it's great if we can merge this PR asap. This PR recently keeps getting conflicts (twice last week).

@abdosi abdosi merged commit fb06c32 into sonic-net:master May 14, 2021
@abdosi
Copy link
Contributor

abdosi commented May 14, 2021

@ngoc-do

When Azure/sonic-buildimage#6185 merges, I will just make another PR that changes FABRIC_NUM_PORTS = 16, and fabric code testing will be covered.

have you done this or will do ?

@ngoc-do
Copy link
Contributor Author

ngoc-do commented May 14, 2021

@ngoc-do

When Azure/sonic-buildimage#6185 merges, I will just make another PR that changes FABRIC_NUM_PORTS = 16, and fabric code testing will be covered.

have you done this or will do ?

I will do this. Thanks for merging it up.

jleveque pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Jun 23, 2021
…7629)

This PR is actually #6185 that was merged and then reverted to avoid test failure in swss. 

Now sonic-net/sonic-swss#1459 merged, and so this PR could merge to enable fabric test in swss.

Signed-off-by: ngocdo <[email protected]>
carl-nokia pushed a commit to carl-nokia/sonic-buildimage that referenced this pull request Aug 7, 2021
…onic-net#7629)

This PR is actually sonic-net#6185 that was merged and then reverted to avoid test failure in swss. 

Now sonic-net/sonic-swss#1459 merged, and so this PR could merge to enable fabric test in swss.

Signed-off-by: ngocdo <[email protected]>
raphaelt-nvidia pushed a commit to raphaelt-nvidia/sonic-swss that referenced this pull request Oct 5, 2021
This code is to add support for fabric asics and NPU with fabric ports enabled.

What I did

Create FabricOrchDaemon for fabric asics, instead of using OrchDaemon which is used for NPU.

Create FabricPortsOrch to manage fabric ports. It collects information about port state, peer switch id and peer lane (stored in STATE_DB), sets up port stats and queue states. In future, it will also be used to enable/disable erroneous fabric ports.

Fabric port and queue stats are setup to be collected via FlexCounters.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants