
Warmboot Manager HLD for warm reboot improvement #1485

Merged (11 commits) on Jun 4, 2024

Conversation

akarshgupta25
Contributor

No description provided.

linux-foundation-easycla bot commented Sep 23, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@akarshgupta25 marked this pull request as draft September 23, 2023 00:52
@akarshgupta25 marked this pull request as ready for review September 23, 2023 00:52

@zhangyanzhao changed the title from "Add NSF Manager HLD" to "Add NSF Manager HLD for warm reboot improvement" on Oct 12, 2023
@zhangyanzhao
Collaborator

202311 release is only for HLD review, no code PR

@zhangyanzhao changed the title from "Add NSF Manager HLD for warm reboot improvement" to "Add Non-Stop Forwarding (NSF) Manager HLD for warm reboot improvement" on Oct 12, 2023



* Separate binaries need to be developed for each component to send shutdown related notifications.
Contributor

Does this mean adding new scripts to handle shutdown per service?

The current approach does not require a separate script for all services. There is a templated way for all services. The build process auto-generates the scripts per service. A new script is needed only when a service requires special handling for the shutdown path.

Templates:
Start and stop sequence: https://github.com/sonic-net/sonic-buildimage/blob/master/files/scripts/service_mgmt.sh (warm reboot is handled in a generic fashion there).

Services only need to add a soft reference to this template; a separate binary is not needed. E.g.:
radv: https://github.com/sonic-net/sonic-buildimage/blob/master/files/scripts/radv.sh
snmp: https://github.com/sonic-net/sonic-buildimage/blob/master/files/scripts/snmp.sh

Contributor Author

This is referring to the separate binaries that we currently use to send shutdown notifications to different components, for example orchagent_restart_check for the Orchagent freeze notification and syncd_request_shutdown for Syncd notifications.

We aren't proposing to have separate service management scripts.



* Separate binaries need to be developed for each component to send shutdown related notifications.
* The rich set of Redis DB features provided by swss-common library cannot be utilized.
Contributor

Please elaborate here. Perhaps w/ some examples?

Contributor Author

swss-common enables us to create a pub/sub model wherein we can send notifications and subscribe to warm-boot state updates from multiple applications in Redis DB, as opposed to polling individual components for this information.
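For illustration, a minimal Python sketch of that pub/sub pattern using the swsscommon bindings. The channel name WARMBOOT_NOTIFICATION_CHANNEL and the notification payload are assumptions for the example and are not names fixed by this HLD:

```python
from swsscommon import swsscommon

appl_db = swsscommon.DBConnector("APPL_DB", 0)
state_db = swsscommon.DBConnector("STATE_DB", 0)

# Publish: one notification on a shared channel reaches every subscriber,
# instead of invoking a dedicated binary per component.
producer = swsscommon.NotificationProducer(appl_db, "WARMBOOT_NOTIFICATION_CHANNEL")
producer.send("freeze", "all", swsscommon.FieldValuePairs([("scope", "all")]))

# Subscribe: warm-boot state updates are pushed by the applications through
# STATE_DB, instead of being polled one component at a time from a bash script.
subscriber = swsscommon.SubscriberStateTable(state_db, "WARM_RESTART_TABLE")
sel = swsscommon.Select()
sel.addSelectable(subscriber)
(ret, _) = sel.select(1000)              # wait up to 1s for an update
if ret == swsscommon.Select.OBJECT:
    key, op, fvs = subscriber.pop()      # e.g. ("orchagent", "SET", [("state", "quiescent")])
```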


* Separate binaries need to be developed for each component to send shutdown related notifications.
* The rich set of Redis DB features provided by swss-common library cannot be utilized.
* A framework to send notifications to components and wait for updates cannot be developed.
Contributor

Is this not already happening in the current bash script? Notifications are sent and waited upon with a timeout?

I do not fully understand the concern here and may need your help to elaborate on how the current script limits us in managing the shutdown process.

Contributor Author

The shutdown state machine for different components (proposed in this HLD) can be better monitored by a daemon that leverages the swss-common libraries, as opposed to polling the warm-boot states of individual components from a bash script.


Additionally, the current SONiC warm shutdown algorithm has custom shutdown-related notifications for different components that trigger their warm shutdown logic. For example, Orchagent uses a _freeze_ notification and reports its ready status via a notification channel. Syncd uses _pre-shutdown_ and _shutdown_ notifications and reports its status via Redis DB.

The proposal is to introduce a new daemon called NSF Manager that will be responsible for both shutdown orchestration and reconciliation monitoring during warm reboot. It will leverage Redis DB features provided by swss-common to create a common framework for warm-boot orchestration. As a result, there will be a unified framework for warm-boot orchestration for all switch stack components.
Contributor

Does this mean everything that happens today will still happen w/ this new design, only that a wrapper around the current process will unify/abstract the different ways in which notifications are sent/received today?

Contributor Author
@akarshgupta25 Oct 27, 2023

This framework will co-exist with the existing shutdown orchestration. This framework introduces additional notification channels for the newer unified shutdown notifications. So components can choose to take actions based on the notification channel from which they receive the shutdown notification. Our proposal is to be compatible with the existing SONiC warm-boot orchestration.




NSF Manager will replace the existing [fast-reboot](https://github.com/sonic-net/sonic-utilities/blob/master/scripts/fast-reboot) and [finalize\_warmboot.sh](https://github.com/sonic-net/sonic-buildimage/blob/master/files/image_config/warmboot-finalizer/finalize-warmboot.sh) scripts to perform shutdown orchestration and reconciliation monitoring during warm reboot. It will use a registration based mechanism to determine the switch stack components that need to perform an orchestrated shutdown and bootup. This ensures that the NSF Manager design is generic, flexible and also avoids any hardcoding of component information in NSF Manager.
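As an illustration of how such a registration-based mechanism could look, a component could announce itself in a registration table that the manager reads at runtime instead of hardcoding participants. This is only a sketch; the table name WARMBOOT_REGISTRATION_TABLE and the field names are assumptions, not part of the HLD:

```python
from swsscommon import swsscommon

state_db = swsscommon.DBConnector("STATE_DB", 0)
reg_tbl = swsscommon.Table(state_db, "WARMBOOT_REGISTRATION_TABLE")  # hypothetical table

# A component registers itself and declares which phases apply to it.
reg_tbl.set("teamd", swsscommon.FieldValuePairs([
    ("freeze", "true"),       # wait for this component's quiescence in Phase 1
    ("checkpoint", "true"),   # send it an explicit checkpoint notification in Phase 3
]))

# The manager discovers participants at runtime instead of hardcoding them.
registered_components = reg_tbl.getKeys()
```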
Contributor

Just calling out the concern that was raised in the meeting: we would still like to keep the ability to fix the shutdown path on the fly. The script-based design enabled us to add ad hoc fixes (after testing) at runtime.

Contributor Author

Can you please elaborate on the type of fixes that are introduced at runtime?


![alt_text](img/freeze.png)

NSF Manager will send freeze notification to all registered components and will wait for the quiescence of only those components that have set _freeze = true_. Upon receiving the freeze notification, the component will complete processing the current request queue and stop generating new intents. Stopping new intents from being generated means that boundary components should stop processing requests from external components (external events) and all components should stop their periodic timers that generate new requests (internal events). For example:
Contributor

Doing this simultaneously to all registered components may not be a good idea. We do want some components to keep operating:

  1. until a dependency is active (e.g., syncd should work as long as teamd is working)
  2. as long as possible (e.g., we do not want to freeze teamd even if OA is frozen or swss is stopped).

Contributor Author

Freeze notification means that components should stop generating events for which they are the source. All components will continue to serve requests that they receive but will stop generating events that originate from them. This is similar to saying "I will not ask questions, but I will continue to answer questions". This means that components will continue to serve internal requests. After some time, all components will have stopped generating events and thus the switch will become quiescent.

This means that during the freeze phase, Orchagent and Syncd will continue to serve requests from other components such as Teamd.
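A rough Python sketch of these freeze semantics from the component's side: stop self-sourced events, keep answering requests, and report quiescence whenever the local queue drains. The table name, field names, and the hooks passed in are placeholders for illustration, not fixed by the HLD:

```python
from swsscommon import swsscommon

state_db = swsscommon.DBConnector("STATE_DB", 0)
wb_tbl = swsscommon.Table(state_db, "WARM_RESTART_TABLE")   # assumed state table

def handle_freeze(component_name, request_queue, stop_self_sourced_events, process_request):
    """Component-side freeze handling (sketch). request_queue is a queue.Queue;
    the two callables are the component's own hooks."""
    stop_self_sourced_events()          # stop periodic timers / external listeners
    while True:
        if request_queue.empty():
            # Queue drained: report quiescence so the manager can proceed.
            wb_tbl.set(component_name, swsscommon.FieldValuePairs([("state", "quiescent")]))
        request = request_queue.get()   # keep serving requests from other components
        if request is None:             # sentinel: manager has moved on
            break
        # New work arrived from another component: not quiescent while processing.
        wb_tbl.set(component_name, swsscommon.FieldValuePairs([("state", "frozen")]))
        process_request(request)
```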

* Syncd will stop listening to link state events from the BCM chip.
* Orchagent will stop periodic internal timers.
* BGP will stop exchanging packets with the peers.
* Teamd will stop exchanging LACP PDUs with the peers.
Contributor

IMO, this is not a good idea. Having teamd stop exchanging LACPDUs in phase 1 would put a lot of pressure on reconciling within the standard LAG session timeout of 90s. If we spend too much time in phase 2+ then LAGs are guaranteed to flap in the recovery path.

Contributor

Same for syncd - since we want teamd to continue sending LACPDUs as long as possible, we should not request syncd to reach quiescence at the same time as Orchagent does.

Contributor

Thus, I don't think we can have all components reaching quiescence as a prerequisite for going into Phase 2.

Contributor Author

I understand the concern here. There are multiple ways to tackle this problem:

  1. After the freeze notification, Teamd continues to exchange LACP PDUs but does not forward any requests to Orchagent/Syncd, so that their quiescence doesn't change.
  2. Set up a control plane assistant that serves as a proxy for Teamd.
  3. Increase the LAG session timeout before stopping LACP PDU exchange.

* All timers that generate new events have been stopped.
* All components have completed processing their pending requests and thus there are no in-flight messages.

After receiving the freeze notification, the components will update their quiescent state in STATE DB when they receive a new request (i.e. they are no longer quiescent) and when they complete processing their current request queue (i.e. they become quiescent). NSF Manager will monitor the quiescent state of all components in STATE DB to determine that the switch has become quiescent and thus further state changes won’t occur in the switch. If all components are in quiescent state then NSF Manager will declare that the switch has become quiescent and thus the switch has attained its final state. NSF Manager will wait for a period of time for the switch to become quiescent after which it will determine that warm reboot failed and abort the warm reboot operation.
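A minimal sketch of the quiescence-monitoring loop described above, assuming components publish a state field in a STATE_DB table; the table name, field values, and the timeout are illustrative only:

```python
import time
from swsscommon import swsscommon

QUIESCENCE_TIMEOUT_SEC = 60   # assumed value; the HLD only says "a period of time"

def wait_for_switch_quiescence(components):
    state_db = swsscommon.DBConnector("STATE_DB", 0)
    wb_tbl = swsscommon.Table(state_db, "WARM_RESTART_TABLE")
    subscriber = swsscommon.SubscriberStateTable(state_db, "WARM_RESTART_TABLE")
    sel = swsscommon.Select()
    sel.addSelectable(subscriber)

    def is_quiescent(component):
        ok, fvs = wb_tbl.get(component)
        return ok and dict(fvs).get("state") == "quiescent"

    deadline = time.time() + QUIESCENCE_TIMEOUT_SEC
    while time.time() < deadline:
        if all(is_quiescent(c) for c in components):
            return True                          # switch has reached its final state
        # Block until some component publishes a state change (or 1s elapses),
        # then re-evaluate the whole set above.
        (ret, _) = sel.select(1000)
        if ret == swsscommon.Select.OBJECT:
            subscriber.pop()                     # drain the pending update
    return False                                 # timed out: abort the warm reboot
```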
Contributor

NSF Manager will wait for a period of time for the switch to become quiescent after which it will determine that warm reboot failed and abort the warm reboot operation.

The design should also account for restoration of component state (unfreeze) in the event of failure.

Contributor Author

Does SONiC currently support unfreeze of applications in case of failures?


![alt_text](img/state-verification.png)

Since the switch is in quiescent state, this will be the final state of the switch before reboot. NSF Manager will trigger state verification to ensure that the switch is in a consistent state. Reconciling from an inconsistent state might cause traffic loss and thus it is important to ensure that the switch is in a consistent state before warm rebooting. NSF Manager will abort warm reboot operation if state verification fails.
Contributor

How is this phase different from what was done as the last step in Phase 1? Also, please define "consistent state".

Contributor Author
@akarshgupta25 Oct 27, 2023

Phase 1 ensures that the switch reaches a stable state. Phase 2 (optional) audits the switch state. State verification enables auditing of the state of different components and ensuring that they are in an expected state. Consistency depends on the component. It can mean that the internal cache matches the switch state, or the switch intent matches the switch state etc. For example, BGP kernel routes match APPL DB (or ASIC DB) information, oper-status in Linux host interface matches APPL DB information etc.
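As one concrete, hypothetical example of such a consistency check, the oper-status comparison mentioned above could look roughly like this: it compares the kernel's view of a port with the oper_status field of APPL_DB's PORT_TABLE (field name and "Ethernet0" are illustrative):

```python
from swsscommon import swsscommon

def port_oper_status_consistent(port_name):
    """Return True if the kernel and APPL_DB agree on the port's oper status."""
    appl_db = swsscommon.DBConnector("APPL_DB", 0)
    port_tbl = swsscommon.Table(appl_db, "PORT_TABLE")
    ok, fvs = port_tbl.get(port_name)
    if not ok:
        return False
    appl_oper = dict(fvs).get("oper_status")              # e.g. "up" / "down"
    with open("/sys/class/net/{}/operstate".format(port_name)) as f:
        kernel_oper = f.read().strip()                     # e.g. "up" / "down"
    return appl_oper == kernel_oper

# e.g. port_oper_status_consistent("Ethernet0")
```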


![alt_text](img/shutdown-optimization.png)

Higher layer applications are generally independent and their shutdown is not dependent on the quiescence of other switch stack components. For example, upon receiving a freeze notification Teamd can stop exchanging LACP packets, perform state verification (optional) and checkpoint LACP states, i.e. transition from Phase 1 to Phase 3 without waiting for NSF Manager to send further notifications. The design allows applications to transition through these phases independently as long as they continue to update their warmboot state in STATE DB. NSF Manager will monitor these states and will handle the applications' phase transitions. In such a scenario, applications need to set _freeze = true_ and _checkpoint = false_, and thus the freeze notification will result in the application transitioning from Phase 1 to Phase 3.
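A small sketch of how the manager side might use these per-component flags, reusing the hypothetical registration table from the earlier sketch (field names are assumptions): components registered with checkpoint = false checkpoint on their own during the freeze phase, so only the remaining ones receive an explicit checkpoint notification in Phase 3.

```python
def checkpoint_candidates(reg_tbl):
    """Components that still need an explicit checkpoint notification in Phase 3.

    Components registered with checkpoint = "false" already checkpointed while
    handling the freeze notification (Phase 1 -> Phase 3), so the manager only
    tracks their warmboot state in STATE DB and does not notify them again.
    """
    candidates = []
    for component in reg_tbl.getKeys():
        ok, fvs = reg_tbl.get(component)
        if ok and dict(fvs).get("checkpoint") == "true":
            candidates.append(component)
    return candidates
```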
Contributor

The design allows applications to transition through these phases independently

But in warm shutdown we do need dependent transitions.
Independent transitions can cause confusion and open the gates for multiple race conditions.

Contributor Author

Phase 1 ensures that the switch stack components continue to serve requests from other components. Also, having a state machine during shutdown helps alleviate the race conditions. I would be happy to look at individual examples of race conditions and check if this design is able to handle them.

@@ -0,0 +1,280 @@
# NSF Manager HLD
Contributor
@qiluo-msft Oct 12, 2023

Could you avoid using the terminology "NSF" in this design doc? There is another active HLD related to the gNOI API, and we are discussing using the gNOI NSF terminology to represent SONiC fast-reboot instead of SONiC warm-reboot, since there is already a gNOI WARM method. ref: https://github.com/openconfig/gnoi/blob/98d6b81c6dfe3c5c400209f82a228cf3393ac923/system/system.proto#L128

May I propose "Warm Reboot Manager" in this HLD? #Closed

Contributor Author

I also added a comment in that design: I am not sure why WARM and NSF reboot methods are defined separately in the gNOI spec. In my opinion they both refer to the same operation. I would propose adding a new reboot method for fast reboot since it doesn't map to any of the current gNOI reboot methods.

Contributor

Could you avoid using the terminology "NSF" in this design doc?

Contributor

I somewhat prefer using the gNOI WARM terminology over NSF. Would you mind changing all NSF -> WARM in this HLD, including the PR title and markdown headers? I will not block this PR if WARM is used in the HLD.

Contributor Author

Replaced NSF with Warmboot. Updated the document title, headings and diagrams.

Contributor

Thanks!

Contributor
@qiluo-msft left a comment

As comment.



### SAI API


Could you add a note that this feature/these new changes are independent of any SAI changes? There could be possible SAI changes needed for further improvement.

Contributor Author

Added a note in SAI API section.

- [Restrictions/Limitations](#restrictions-limitations)
- [Testing Requirements/Design](#testing-requirements-design)
- [Open/Action items - if any](#open-action-items---if-any)


A couple of sections could be added: one on the use cases covered (LACP, etc.), and another on which NPU types are tested or supported after testing.

Contributor Author

Added NPU type details in the Testing Requirements/Design section.

- [Definitions/Abbreviations](#definitions-abbreviations)
- [Overview](#overview)
- [Requirements](#requirements)
- [Architecture Design](#architecture-design)


Also add a section to cover where the manager runs and how it is started. Will it run in a docker container or on the host image?

Contributor Author

Added details in the Architecture Design section.

+ [Phase 1: Freeze Components & Wait for Switch Quiescence](#phase-1--freeze-components---wait-for-switch-quiescence)
+ [Phase 2: State Verification (Optional)](#phase-2--state-verification--optional-)
+ [Phase 3: Trigger Checkpointing](#phase-3--trigger-checkpointing)
+ [Phase 4: Prepare and Perform Reboot](#phase-4--prepare-and-perform-reboot)


Could you add sections on the transition of applications from the old to the new reboot flow, and on backward compatibility?

Contributor Author

Added a Backward Compatibility section with these details.

Contributor Author
@akarshgupta25 left a comment

Addressed Srideep's comments.



* UMF will stop listening to new gRPC requests from the controller.
* P4RT will stop listening to new gRPC requests from the controller and stop packet I/O.


Please add a Step 4 under the P4RT block stating that the controller needs to be informed that warmboot has been initiated. Also, can you please add details on how, and what, the P4RT server needs to communicate to the P4RT client when warmboot is initiated?

Contributor Author

This design introduces the high-level principles for the new warm-boot orchestration framework. I wanted to limit the scope of this design to these core principles. That is why this section talks about the actions of different components at a very high-level, thereby providing a brief idea of stopping self-sourced events.

From the P4RT perspective, the P4Runtime spec doesn't talk about reboots in general. During cold reboot, the client observes the same behavior as a server shutdown. Ideally, the client shouldn't observe different behavior during warm reboot. This is because, through the P4Runtime lens, the server was shut down (thus unavailable for some time) and came back up in the same state as before, thereby maintaining read-write symmetry. I would be happy to discuss this further in the System Orchestration workgroup or in a separate thread.

An additional flag will be added in Redis DB which will be used by NSF Manager to determine whether this phase is required during warm-boot orchestration.


##### Phase 3: Trigger Checkpointing


Can NSF Manager merge Phase 2 and Phase 3 together and let each component complete state verification and save its internal state together, so that we reduce the turnaround time added by an extra phase?
Each component could do state verification and/or save its internal state based on the component configuration.

Contributor Author

This is an interesting point :)

Some context:
For some switch stack components such as Syncd, checkpoint notification results in saving the internal state, disconnecting from the ASIC and exiting. This is similar to processing the shutdown notification being sent by /usr/bin/syncd_request_shutdown binary in the existing warm-boot orchestration.
Additionally, we are exploring whether we should support unfreeze during warm shutdown. One possible flow is that we send freeze, then run state verification, and unfreeze if state verification fails.

With this background, if we support state verification and checkpoint as a single phase then Syncd can exit after checkpointing and state verification can fail for some other application. In such a case, we wouldn't be able to freeze --> state verification failure --> unfreeze unless we ensure that checkpointing isn't a point of no return for a switch stack component.

So we need to analyze the pros and cons of merging these 2 phases. Let me discuss it internally and get back.


![alt_text](img/freeze.png)

NSF Manager will send freeze notification to all registered components and will wait for the quiescence of only those components that have set _freeze = true_. Upon receiving the freeze notification, the component will complete processing the current request queue and stop generating new intents. Stopping new intents from being generated means that boundary components should stop processing requests from external components (external events) and all components should stop their periodic timers that generate new requests (internal events). For example:


Do we expect all components to finish Phase 1 before NSF Manager triggers Phase 2?

Contributor Author

No, boundary applications that are independent of state changes to other switch components such as UMF, P4RT etc. can aggregate Phase 1, 2 and 3 as a part of the freeze notification (see Application Shutdown Optimization section). But this is applicable for only a certain subset of the switch stack components. NSF Manager needs to wait for the entire switch to become quiescent before a global checkpointing can take place. For example, Syncd should only save internal states after all intent has been programmed i.e. the switch is quiescent.


![alt_text](img/warmboot-states.png)

Upon receiving a freeze notification, a component will transition to _frozen_ state when it has stopped generating new intents for which it is the source. Subsequently, it will transition to _quiescent_ state when it has completed processing all requests. A component may move back from _quiescent_ to _frozen_ when it receives a new request from some other switch component, and return to _quiescent_ once it has completed processing all requests, i.e. a component may oscillate between _frozen_ and _quiescent_ if it processes intents in freeze mode. If a component fails to perform its freeze routine, i.e. fails to stop generating new intents for which it is the source, then it will transition to _failed_ state. Upon receiving a checkpoint notification, a component will transition to _checkpointed_ state after it has completed checkpointing, or to _failed_ state if checkpointing failed.
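A compact sketch of the warm-boot states and transitions described in this paragraph; the INITIALIZED name for the pre-freeze state is an assumption, the rest follow the text:

```python
from enum import Enum

class WarmbootState(Enum):
    INITIALIZED  = "initialized"    # assumed name for the pre-freeze state
    FROZEN       = "frozen"
    QUIESCENT    = "quiescent"
    CHECKPOINTED = "checkpointed"
    FAILED       = "failed"

# Transitions allowed by the paragraph above. A component may oscillate between
# FROZEN and QUIESCENT while it keeps serving requests in freeze mode.
ALLOWED_TRANSITIONS = {
    WarmbootState.INITIALIZED:  {WarmbootState.FROZEN, WarmbootState.FAILED},
    WarmbootState.FROZEN:       {WarmbootState.QUIESCENT, WarmbootState.FAILED},
    WarmbootState.QUIESCENT:    {WarmbootState.FROZEN,
                                 WarmbootState.CHECKPOINTED,
                                 WarmbootState.FAILED},
    WarmbootState.CHECKPOINTED: set(),
    WarmbootState.FAILED:       set(),
}

def transition(current, new):
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError("illegal warm-boot transition: {} -> {}".format(current, new))
    return new
```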


Is the _failed_ warmboot state just a state, or will it trigger any event to restart?

Contributor Author

Since this is an initial NSF Manager design, the current proposal is to abort warm reboot if an application reports failed warm-boot state. However, we are exploring unfreeze on warm-boot failure as well. For example, if Phase 1 (freeze) fails then we can unfreeze the switch stack and return that warm reboot failed.

Contributor Author
@akarshgupta25 left a comment

Addressed Ayyappa's comments.

@zhangyanzhao
Collaborator

@akarshgupta25 can you please help to add the code PR to this HLD by referring to #806? Thanks.

@zhangyanzhao
Collaborator

The code PR is not ready; moving to backlog for a future release.

vaibhavhd previously approved these changes May 15, 2024
Contributor
@vaibhavhd left a comment

We need a test plan HLD too. You had mentioned that you are working on it, do you have it ready for review now?
I wouldn't say that the test plan is out of scope for this design. SONiC works on TDD, so I would like to have the test plan ready before we start merging code changes.

Just making a note: this feature is backward compatible by design. But that also needs to be reflected by the design changes needed in critical containers - swss, bgp, syncd, teamd etc. If the change in these services is not backward compatible, then the NSF manager design is not backward compatible as a whole.

@akarshgupta25 changed the title from "Add Non-Stop Forwarding (NSF) Manager HLD for warm reboot improvement" to "Warmboot Manager HLD for warm reboot improvement" on May 16, 2024
@qiluo-msft dismissed their stale review May 17, 2024 00:44

Not blocking anymore

@anuthalmrvl left a comment

Thanks @akarshgupta25 for multiple detailed review sessions and addressing the comments.

@bhagatgit
Contributor

Received approvals from Microsoft, Dell and Marvell

@bhagatgit merged commit ca736b9 into sonic-net:master Jun 4, 2024
1 check passed
@liat-grozovik
Collaborator

@bhagatgit is there a plan to have the code PRs in 202411?
All vendors need to be ready to test it before it is merged and ensure no regression with this approach.
I don't see any reference to the app extension. Any reason why the new approach is not taking it into account, at least in the design part?
