Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gNOI Reboot HLD. #1489

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
251 changes: 251 additions & 0 deletions doc/gnoi-reboot/gnoi_reboot.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,251 @@

# **gNOI Reboot**


## Table of Content


<!-- TOC start (generated with https://github.com/derlin/bitdowntoc) -->

- [Revision](#revision)
- [Scope](#scope)
- [Definitions/Abbreviations](#definitionsabbreviations)
- [Overview](#overview)
- [Requirements](#requirements)
- [Architecture Design](#architecture-design)
* [Framework Container](#framework-container)
- [High-Level Design](#high-level-design)
* [Telemetry &lt;-> Reboot BE API](#telemetry-lt-reboot-be-api)
* [Reboot Request Notification Schema](#reboot-request-notification-schema)
* [Reboot Response Notification Schema](#reboot-response-notification-schema)
* [CLI](#cli)
* [Sonic Host Services](#sonic-host-services)
* [SAI API {#sai-api}](#sai-api-sai-api)
* [Configuration and management {#configuration-and-management}](#configuration-and-management-configuration-and-management)
* [Warmboot and Fastboot Design Impact {#warmboot-and-fastboot-design-impact}](#warmboot-and-fastboot-design-impact-warmboot-and-fastboot-design-impact)
* [Restrictions/Limitations {#restrictions-limitations}](#restrictionslimitations-restrictions-limitations)
* [Testing Requirements/Design {#testing-requirements-design}](#testing-requirementsdesign-testing-requirements-design)

<!-- TOC end -->



<!-- TOC --><a name="revision"></a>



## Revision


Rev | RevDate | Author(s) | Change Description
---- | ---------- | ----------- | ------------------
v0.1 | 09/22/2023 | John Hamrick (Google) | Initial Version



## Scope

_This document describes the high level design for SONiC gNOI Reboot RPC support._


## Definitions/Abbreviations

API: Application Programming Interface

BE: Backend

FE: Frontend

gNOI: gRPC Network Operations Interface

RPC: remote procedure call

SAI: Switch Abstraction Interface

SONiC: Software for Open Networking in the Cloud

UMF: Unified Management Framework


## Overview

_Openconfig system.proto ([https://github.com/openconfig/gnoi/blob/main/system/system.proto](https://github.com/openconfig/gnoi/blob/main/system/system.proto)) defines reboot, cancel reboot and reboot status rpc’s within the system service. These rpc’s can be used to start, cancel and check status of reboot activities._


## Requirements

_NA_


## Architecture Design

The proposal is to create a Reboot Backend (BE) daemon in a new “Framework” container to handle a reboot request and forward it to the platform via DBUS.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this backend daemon? Can we use GNOI server to invoke DBUS?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1485 proposes daemonizing the warm-boot orchestration and thus will be the backend for warm-boot requests. Additionally, we want a common backend daemon for all types of reboots (cold, warm etc.). If we don't have this daemon then gNOI server would need to figure out the back-end based on the reboot request type. In my opinion, having a common backend provides a cleaner architecture, hence this backend daemon has been proposed.


This daemon provides a location to implement application level reboot functionality before forwarding a reboot request to the platform (in the spirit of “[Docker to Host Communications HLD](https://github.com/sonic-net/SONiC/blob/master/doc/mgmt/Docker%20to%20Host%20communication.md)”.

The reboot request contains a delay parameter, method and subcomponent fields. For subcomponents we can have support for:
github76543 marked this conversation as resolved.
Show resolved Hide resolved



* system reboot (method = warm, cold, fast, etc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

system reboot

Could you detailed the reboot design for all the methods? There is no straightforward mapping between these methods with gNOI enum RebootMethod. We prposed below mapping:

enum RebootMethod {
  UNKNOWN = 0;     // Invalid default method.
  COLD = 1;        // Shutdown and restart OS and all hardware.
  POWERDOWN = 2;   // Halt and power down, if possible.
  HALT = 3;        // Halt, if possible.
  WARM = 4;        // Reload configuration but not underlying hardware.
  NSF = 5;         // Non-stop-forwarding reboot, if possible.
  // RESET method is deprecated in favor of the gNOI FactoryReset.Start().
  reserved 6;
  POWERUP = 7;     // Apply power, no-op if power is already on.
}

For SONiC cold reboot, we can use COLD method.
For SONiC warm reboot, we can use WARM method.
For SONiC fast reboot, we can use NSF method.
For SONiC config reload, we can use reserved method.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why WARM and NSF reboot methods are defined separately in the gNOI spec. In my opinion they both refer to the same operation. I would propose adding a new reboot method for fast reboot since it doesn't map to any of the current gNOI reboot methods.

* optic reboot
* etc

The immediate short term use case for the reboot backend daemon is to support system warm boot. During warmboot applications will be requested and confirmed to:



* freeze/quiesce
* verify state
* checkpoint (save) state before shutdown
* reconcile after boot
* verify state
* unfreeze





![drawing](./images/image3.png)

The controller can send a reboot request to the UMF gNOI Reboot Front End. This request will be forwarded to a Reboot Back End daemon executing in a container. The reboot back end daemon will then request host/platform level reboot actions via DBUS and a reboot sonic host service.

### Framework Container

There is no obvious container home for a reboot backend daemon: SWSS and PMON were considered. SWSS has the primary role of processing CONFIG_DB inputs (or kernel events) to APP_DB and ASIC_DB outputs. PMON, by name, exists to monitor the platform. Further, there are plans for a state verification daemon which also needs a home.

The Framework container is envisioned to provide a home for foundational services that interact with multiple containers/applications.

State verification daemon periodically (or by request) asks applications to verify their state. For example:



* SWSS Cfgmgr confirms CONFIG DB matches APPL DB.
github76543 marked this conversation as resolved.
Show resolved Hide resolved
* Syncd verifies ASIC DB matches SDK cache.

State verification daemon aggregates application results into a single state and writes overall state to the STATE DB.

It is proposed to create a new “Framework” container with reboot as its first daemon and state verification as its second.



![drawing](./images/image1.png)

One step further: the Reboot BE Daemon can have multiple state machines for different subcomponents. For example:



* System reboot
* Optic reboot
* Container reboot






![drawing](./images/image2.png)


## High-Level Design





![drawing](./images/image4.png)

The gNOI System service ([system.proto](https://github.com/openconfig/gnoi/blob/main/system/system.proto)) defines Reboot, RebootStatus and CancelReboot rpc’s. These rpc’s can be used to reboot a switch (or subcomponent) via gNOI.


### Telemetry &lt;-> Reboot BE API

There are three reboot rpc’s defined in [system.proto](https://github.com/openconfig/gnoi/blob/main/system/system.proto): Reboot, RebootStatus and CancelReboot.

JSON encoded RebootRequest, RebootStatusRequest and CancelRebootRequest will be send to the Reboot BE via a redis notification channel named: “`Reboot_Request_Channel".`

JSON encoded RebootResponse, RebootStatusResponse and CancelRebootResponse will be send to the gNOI Reboot FE via a redis notification channel named: “`Reboot_Response_Channel".`


### Reboot Request Notification Schema

Data is JSON encoded [RebootRequest](https://github.com/openconfig/gnoi/blob/98d6b81c6dfe3c5c400209f82a228cf3393ac923/system/system.proto#L103C13-L103C13), [RebootStatusRequest](https://github.com/openconfig/gnoi/blob/98d6b81c6dfe3c5c400209f82a228cf3393ac923/system/system.proto#L145) or [CancelRebootRequest](https://github.com/openconfig/gnoi/blob/98d6b81c6dfe3c5c400209f82a228cf3393ac923/system/system.proto#L137) (from [system.proto](http://google3/third_party/openconfig/gnoi/system/system.proto)) indicated by the Key value.


### Reboot Response Notification Schema

Where StatusCode is defined in [sonic-swss-common/common/status_code_util.h](https://github.com/sonic-net/sonic-swss-common/blob/master/common/status_code_util.h) and can be populated via swss::statusCodeToStr(StatusCode::SWSS_RC_SUCCESS).

Data is JSON encoded RebootRequestResponse, RebootStatusResponse or CancelRebootResponse (from [system.proto](https://github.com/openconfig/gnoi/blob/main/system/system.proto)) indicated by the Key value.


### CLI

A cli based tool/script can write/read to/from the “Reboot_Request_Channel” and “Reboot_Response_Channel” notification channels to provide cli based reboot support.


### Sonic Host Services

We propose adding a new sonic host service named gnoi_reboot. This host service will have 3 methods defined: reboot, reboot_status and cancel_reboot.

These three methods will take as input json formatted RebootRequest, RebootStatusRequest and CancelRebootRequest (defined in system.proto as input).

Similarly, these three methods will return json formatted RebootResponse, RebootStatusResponse and CancelRebootResponse.

Reboot function definition:

@host_service.method(host_service.bus_name(MOD_NAME), in_signature='as', out_signature='is')
def reboot(self, options):
request = json.loads(options[0])
reboot_response = self.do_reboot(request)
return 0, json.dumps(response)

Reboot status function definition:

@host_service.method(host_service.bus_name(MOD_NAME), in_signature='as', out_signature='is')
def reboot_status(self):
request = json.loads(options[0])
response = self.get_reboot_status(request)
return 0, json.dumps(response)

Cancel reboot function defintion:

@host_service.method(host_service.bus_name(MOD_NAME), in_signature='as', out_signature='is')
def cancel_reboot(self):
request = json.loads(options[0])
response = self.cancel_reboot(request)
return 0, json.dumps(response)


### SAI API {#sai-api}

_NA_


### Configuration and management {#configuration-and-management}

_NA_


### Warmboot and Fastboot Design Impact {#warmboot-and-fastboot-design-impact}

_No changes to existing functionality._

_This architecture provides a foundation for a gNOI based warm reboot with in switch software upgrade support. _


### Restrictions/Limitations {#restrictions-limitations}


### Testing Requirements/Design {#testing-requirements-design}

_Ondatra based gnoi reboot test will be designed/implemented._

_Ondatra is an open-source testing framework for network devices. It provides a way to write and run tests against network devices. Ondatra is written in Go and uses the gRPC protocol to communicate with the devices under test._

_More details about Ondatra can be found here [https://github.com/openconfig/ondatra/blob/main/README.md](https://github.com/openconfig/ondatra/blob/main/README.md)_


###
Open/Action items - if any {#open-action-items-if-any}

Binary file added doc/gnoi-reboot/images/image1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added doc/gnoi-reboot/images/image2.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added doc/gnoi-reboot/images/image3.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added doc/gnoi-reboot/images/image4.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.