Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Chassis][LAG_ID] Address the same lagid been used in two different LCs issue #3303

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

mlok-nokia
Copy link

@mlok-nokia mlok-nokia commented Sep 28, 2024

What I did
Create SYSTEM_LAG_IDS_FREE_LIST for assign lagId for all portchannel creation.

  1. Portchannel creation
    a) If Portchannel is created with a valid plagid
    * check if plagid is in free list, use plagid and remove it from SYSTEM_LAG_IDS_FREE_LIST. Add this lagid to the SYSTEM_LAG_ID_SET for debug info
    * If plagid is not in the FREE_LIST, lpop and use the first lagid from the SYSTEM_LAG_IDS_FREE_LIST. Add this lagid to the SYSTEM_LAG_ID_SET for debug info
    b) If Portchannel is created with invalid plagid or without any lagid
    • lpop and use the first lagid from the SYSTEM_LAG_IDS_FREE_LIST. Add this lagid to the SYSTEM_LAG_ID_SET for debug info
  2. Portchannel delection
    • Append the lagid to the end of SYSTEM_LAG_IDS_FREE_LIST. Also remove it from SYSTEM_LAG_ID_SET.

This PR works with the following 2 PRs:
sonic-net/sonic-platform-daemons#542
sonic-net/sonic-buildimage#20369

Why I did it
To address the issue of the same lagid could be used by two Portchannels in two different linecards. This issue occurs when reboot many Linecards together with 20 seconds delay in each LC reboot.

How I verified it

Details if related

local lagid = redis.call("lpop", "SYSTEM_LAG_IDS_FREE_LIST")
redis.call("hset", "SYSTEM_LAG_ID_TABLE", pcname, lagid)
redis.call("sadd", "SYSTEM_LAG_ID_SET", lagid)
if dblagid then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why use dblag here?

Copy link
Author

@mlok-nokia mlok-nokia Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why use dblag here?

This is just for handling the out of sync cases. As normal operation, this should not happen.
In the old method, when there is a valid dblagid number (it is from a broken entry in SYSTEM_LAG_ID_TABLE), if a new lagId is assigned, the dblagid needs to be removed from the SYSTEM_LAG_ID_SET.
With the new modification, we need to handle it as the same -- if old lagId (dblagId) is valid lag ID, we need to it to the free_list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should do

redis.call("srem", "SYSTEM_LAG_ID_SET", tostring(dblagid))

here too ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. In case of the old lagId (dblagid) still in the SYSTEM_LAG_ID_SET in a broken scenario, we just remove it. Anyway, this SYSTEM_LAG_ID_SET is just keeped for debugging purposes. We don't use it for any functionality. SYSTEM_LAG_ID_SET and SYSTM_LAG_IDS_FREE_LIST are mutual exclusive.

@arlakshm
Copy link
Contributor

can you update the UT as well.

@mlok-nokia mlok-nokia force-pushed the lag_id_conflict_issue branch 2 times, most recently from 80f2a59 to e9938d5 Compare October 1, 2024 17:33
@mlok-nokia
Copy link
Author

can you update the UT as well.

UT has been updated.

local lagid = redis.call("lpop", "SYSTEM_LAG_IDS_FREE_LIST")
redis.call("hset", "SYSTEM_LAG_ID_TABLE", pcname, lagid)
redis.call("sadd", "SYSTEM_LAG_ID_SET", lagid)
if dblagid then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code under "if dblagid then" is duplicated in both if and else

Copy link
Author

@mlok-nokia mlok-nokia Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are the same in the if and else. Not able to simplify them in one because of "if" case returns "tonumber(lagid)" while the "else" case returns "plagid"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants