Dash-Sonic - Update for Scaling/Underlay Routing/ST/PL encoding #309
Changes from all commits
d48e670
60b0cad
53ef147
3214eae
863d7b4
859fdce
0acdcfa
204c918
e5a3272
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,6 @@ | ||
# SONiC-DASH HLD | ||
## High Level Design Document | ||
### Rev 1.0 | ||
### Rev 1.1 | ||
|
||
# Table of Contents | ||
|
||
|
@@ -40,6 +40,7 @@ | |
| 0.6 | 04/20/2022 | Marian Pritsak | APP_DB to SAI mapping | | ||
| 0.8 | 09/30/2022 | Prabhat Aravind | Update APP_DB table names | | ||
| 1.0 | 10/10/2022 | Prince Sunny | ST and PL scenarios | | ||
| 1.1 | 01/09/2023 | Prince Sunny | Underlay Routing and ST/PL clarifications | | ||
|
||
# About this Manual | ||
This document provides more detailed design of DASH APIs, DASH orchestration agent, Config and APP DB Schemas and other SONiC buildimage changes required to bring up SONiC image on an appliance card. General DASH HLD can be found at [dash_hld](./dash-high-level-design.md). | ||
|
@@ -95,20 +96,25 @@ Warm-restart support is not considered in Phase 1. TBD | |
Following are the minimal scaling requirements | ||
| Item | Expected value | | ||
|--------------------------|-----------------------------| | ||
| VNETs | 1024 | | ||
| VNETs | 1024* | | ||
| ENI | 64 Per Card | | ||
| Routes per ENI | 100k | | ||
| Outbound Routes per ENI | 100k | | ||
| Inbound Routes per ENI | 10k** | | ||
| NSGs per ENI | 6 | | ||
| ACLs per ENI | 6x100K prefixes | | ||
| ACLs per ENI | 6x10K SRC/DST ports | | ||
| CA-PA Mappings | 10M | | ||
| CA-PA Mappings | 10M Per Card | | ||
| Active Connections/ENI | 1M (Bidirectional TCP or UDP) | | ||
| Metering Buckets per ENI | 4000 | | ||
|
||
\* The number of VNETs is a software limit, since a VNET by itself does not consume hardware resources. It shall be limited to the number of VNIs the hardware can support. | ||
|
||
\** Support 10K peering in-region/cross-region | ||
|
||
## 1.5 Metering requirements | ||
Metering is essential for billing the customers and below are the high-level requirements. Metering/Bucket in this context is related to byte counting for billing purposes and not related to traffic policer or shaping. | ||
- Billing shall be at per ENI level and shall be able to query metering packet bytes per ENI | ||
- All metering buckets must be UINT64 size and start from value 0 and shall be counting number of bytes. A bucket contains 2 counters; 1 inbound (Rx) and 1 outbound (Tx). | ||
- All metering buckets must be UINT64 size and start from value 0 and shall be counting number of bytes. A bucket contains 2 counters; 1 inbound (Rx) and 1 outbound (Tx) from an ENI perspective. | ||
- Implementation (a.k.a H/W pipeline implementation) must support metering at the following levels: | ||
- Policy based metering. - E.g. For specific destinations (prefix) that must be billed separately, say action_type 'direct' | ||
- Route table based metering - E.g. For Vnet peering cases. | ||
|
@@ -125,10 +131,11 @@ Metering is essential for billing the customers and below are the high-level req | |
- All outbound metered traffic from an ENI | ||
- All inbound metered traffic towards an ENI | ||
- Customer is billed based on number of bytes sent/received separately. A distinct counter must be supported for outbound vs inbound traffic of each category. | ||
- Outbound and Inbound bytes are from ENI perspective and not based on where the traffic is initiated. Any traffic from ENI to outbound is treated as TX bytes and towards ENI inbound is RX bytes. | ||
- For outbound flow and associated metering bucket, created as part of VM initiated traffic, the metering bucket shall account for outbound (Tx) bytes. Based on this outbound flow, pipeline shall also create a unified inbound flow. The same metering bucket shall account for the inbound (Rx) bytes for the return traffic to VM that matches this flow. | ||
- Application shall utilize the metering hardware resource in an optimized manner by allocating meter id and deallocating when not-in-use | ||
- Application shall bind all associated metering buckets to an ENI. During ENI deletion, all associated metering bucket binding should be auto-removed. | ||
- A route rule table can also have a metering bucket association for explicitly accounting the inbound traffic for an ENI. | ||
- Inbound metering: It is similar to outbound pipeline. A route rule table can have a metering bucket or a meter policy association for explicitly accounting the inbound traffic for an ENI. If inbound route rule points to a vnet, and mapping has a bucket id, it should be used for metering while creating the unified flow. | ||
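
The per-bucket counting rules above can be sketched as follows. This is a minimal illustration and not the DASH implementation: the `MeteringBucket` class, the `Direction` enum and the wrap-around at 2^64 are assumptions introduced here. The requirements only state that each bucket holds one Rx and one Tx UINT64 byte counter, both starting at 0 and counted from the ENI's perspective.

```python
from dataclasses import dataclass
from enum import Enum

UINT64_MASK = (1 << 64) - 1  # counters are UINT64 per the requirement


class Direction(Enum):
    TX = "outbound"  # traffic leaving the ENI
    RX = "inbound"   # traffic towards the ENI


@dataclass
class MeteringBucket:
    """Hypothetical bucket: one Tx and one Rx byte counter, both starting at 0."""
    bucket_id: int
    tx_bytes: int = 0
    rx_bytes: int = 0

    def count(self, direction: Direction, packet_len: int) -> None:
        # Direction is relative to the ENI, not to who initiated the flow.
        # Wrapping at 2^64 is an assumption; the text does not specify overflow.
        if direction is Direction.TX:
            self.tx_bytes = (self.tx_bytes + packet_len) & UINT64_MASK
        else:
            self.rx_bytes = (self.rx_bytes + packet_len) & UINT64_MASK


# A VM-initiated flow and its return traffic share the same bucket.
bucket = MeteringBucket(bucket_id=7)
bucket.count(Direction.TX, 1500)  # request sent by the VM
bucket.count(Direction.RX, 400)   # reply matching the unified inbound flow
```

Keeping both counters in one object mirrors the requirement that the bucket created for an outbound flow also accounts for the return traffic of that flow.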
|
||
_Open Items_ | ||
- Can we avoid explicit dependency between ENI's and mappings? | ||
|
@@ -183,14 +190,17 @@ It is worth noting that CA-PA mapping table shall be used for both encap and dec | |
|
||
## 2.3 Service Tunnel (ST) and Private Link (PL) packet processing pipelines | ||
|
||
ST/PL is employed for scenarios like multiple different customers want to access a common shared resource (e.g storage). This shall not fall into the regular Vnet packet path or Vnet peering path and hence a Private Endpoint is assigned for such accesses, as part of ENI routing or VNET's mapping tables. The lookup happens as described in the above sections, but actions are different. For ST/PL, actions include IPv4 to IPv6 transpositions and special routing/mapping lookups for encapsulation. Based on the outbound flow, inbound flows are created for return traffic. By having packet transpositions, Service Tunnel feature provides the capability of encoding “region id”, “vnet id”, “subnet id” etc via packet transformation. IPv6 transformation includes last 32 bits of the IPv6 packet as IPv4 address, while the remaining 96 bits of the IPv6 packet is used for encoding. Private Link feature is an extension to Service Tunnel feature and enables customers to access public facing shared services via their private IP addresses within their vnet. More details on traffic flow is captured in the example section. | ||
ST/PL is employed for scenarios where multiple customers need to access a common shared resource (e.g., storage). Such traffic does not fall into the regular Vnet packet path or Vnet peering path; hence a Private Endpoint is assigned for such accesses, as part of ENI routing or the VNET's mapping tables. The lookup happens as described in the above sections, but the actions are different. For ST/PL, actions include IPv4 to IPv6 transpositions and special routing/mapping lookups for encapsulation. Through packet transpositions, the Service Tunnel feature can encode “region id”, “vnet id”, “subnet id” etc. via packet transformation. The IPv6 transformation carries the IPv4 address in the last 32 bits of the IPv6 address, while the remaining 96 bits are used for encoding. The Private Link feature is an extension of the Service Tunnel feature and enables customers to access public-facing shared services via their private IP addresses within their vnet. More details on the traffic flow are captured in the example section. | ||
**ST/PL Inbound flow**: Using the outbound unified flow, the reverse transposition (inbound unified flow) is created. If no inbound flow is created, the packet shall be dropped if it does not match any existing inbound routing rule. There is no inbound policy based lookup expected for ST/PL scenarios. When FastPath kicks in, the respective outbound and inbound unified flows shall be modified accordingly. | ||
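
The 4to6 transposition described in this section can be sketched as below. This is only an illustration of "IPv4 address in the last 32 bits, encoding in the remaining 96 bits": the helper names `transpose_4to6` and `extract_ipv4` are hypothetical, and the prefixes and addresses are the ones used in the Service Tunnel example of this document.

```python
import ipaddress


def transpose_4to6(prefix96: str, ipv4: str) -> ipaddress.IPv6Address:
    """Place an IPv4 address in the last 32 bits of an IPv6 address.

    prefix96 supplies the upper 96 bits (the encoding space); whatever it
    carries in the lower 32 bits is discarded.
    """
    upper = int(ipaddress.IPv6Address(prefix96)) & ~0xFFFFFFFF
    lower = int(ipaddress.IPv4Address(ipv4))
    return ipaddress.IPv6Address(upper | lower)


def extract_ipv4(ipv6: str) -> ipaddress.IPv4Address:
    """Reverse step: recover the IPv4 address from the last 32 bits."""
    return ipaddress.IPv4Address(int(ipaddress.IPv6Address(ipv6)) & 0xFFFFFFFF)


# Values taken from the Service Tunnel example: CA 10.1.1.1 towards 50.1.2.1.
sip = transpose_4to6("fd00:108:0:d204:0:200::", "10.1.1.1")
dip = transpose_4to6("2603:10e1:100:2::", "50.1.2.1")
```

`extract_ipv4` shows the reverse transposition that an inbound unified flow would need to restore the original IPv4 address.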
|
||
# 3 Modules Design | ||
|
||
The following are the schema changes. The NorthBound APIs shall be defined as sonic-yang in compliance to [yang-guideline](https://github.com/Azure/SONiC/blob/master/doc/mgmt/SONiC_YANG_Model_Guidelines.md) | ||
The following are the schema changes. The NorthBound APIs shall be defined as sonic-yang in compliance to [yang-guideline](https://github.com/Azure/SONiC/blob/master/doc/mgmt/SONiC_YANG_Model_Guidelines.md). | ||
|
||
For DASH objects, the proposal is to use the existing APP_DB instance and objects are prefixed with "DASH". DASH APP_DB objects are preserved only during warmboots and isolated from regular configurations that are persistent in the appliance across reboots. All the DASH objects are programmed by SDN and hence treated differently from the existing Sonic L2/L3 'switch' DB objects. Status of the configured objects shall be reflected in the corresponding STATE_DB entries. | ||
|
||
Reference Yang model for DASH Vnet is [here](https://github.com/sonic-net/sonic-buildimage/blob/master/src/sonic-yang-models/yang-models/sonic-dash.yang). | ||
|
||
## 3.1 Config DB | ||
|
||
### 3.1.1 DEVICE Metadata Table | ||
|
@@ -260,8 +270,9 @@ qos = Associated Qos profile | |
underlay_ip = PA address for Inbound encapsulation to VM | ||
admin_state = Enabled after all configurations are applied. | ||
vnet = Vnet that ENI belongs to | ||
pl_sip_encoding = Private Link encoding for IPv6 SIP transpositions; Format "field:<bit_offset>:<size_in_bits>:<value in hex>:field:<bit_offset>:<size_in_bits>:<value in hex>" | ||
pl_underlay_sip = Underlay SIP to be used for all private link transformation for this ENI. | ||
pl_sip_encoding = Private Link encoding for IPv6 SIP transpositions; Format "0xfield_value/0xfull_mask". field_value must be used as a replacement to the | ||
first len(full_mask) bits of pl_sip. Last 32 bits are reserved for the IPv4 CA. Logic: ((pl_sip & !full_mask) | field_value). | ||
pl_underlay_sip = Underlay SIP (ST GW VIP) to be used for all private link transformation for this ENI | ||
``` | ||
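
The `pl_sip_encoding` logic can be sketched as follows. This is a literal reading of the stated formula, under these assumptions: `!` denotes bitwise NOT, the mask length is taken from the number of hex digits written in `full_mask`, and the function names are hypothetical.

```python
import ipaddress


def parse_pl_sip_encoding(encoding: str):
    """Split "0xfield_value/0xfull_mask" into (field_value, full_mask, mask_bits)."""
    value_str, mask_str = encoding.split("/")
    digits = mask_str[2:] if mask_str.lower().startswith("0x") else mask_str
    mask_bits = 4 * len(digits)  # assumption: mask length = hex digits written
    return int(value_str, 16), int(mask_str, 16), mask_bits


def apply_pl_sip_encoding(pl_sip: str, encoding: str, ca: str) -> ipaddress.IPv6Address:
    """Compute ((pl_sip & !full_mask) | field_value), then OR in the IPv4 CA."""
    field_value, full_mask, mask_bits = parse_pl_sip_encoding(encoding)
    shift = 128 - mask_bits               # the mask covers the first mask_bits bits
    sip = int(ipaddress.IPv6Address(pl_sip))
    top = ((sip >> shift) & ~full_mask) | field_value
    low = sip & ((1 << shift) - 1)        # expected to be 0 in the mapping
    result = (top << shift) | low
    result |= int(ipaddress.IPv4Address(ca))  # last 32 bits carry the IPv4 CA
    return ipaddress.IPv6Address(result)


overlay_sip = apply_pl_sip_encoding(
    "fd40:108:0:d204:0:200:0:0",
    "0x0020000000000a0b0c0d0a0b/0x002000000000ffffffffffff",
    "10.1.1.1",
)
```

Applied literally to the inputs quoted in the example section (pl_sip `fd40:108:0:d204:0:200::0`, CA `10.1.1.1`), this sketch yields `fd60:108:0:a0b:c0d:a0b:a01:101`, because the mask and field value only touch bit `0x0020` of the first group. Treat it as an illustration of the formula rather than a reference for the exact addresses.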
### 3.2.4 ACL | ||
|
||
|
@@ -366,12 +377,12 @@ DASH_ROUTE_TABLE:{{eni}}:{{prefix}} | |
key = DASH_ROUTE_TABLE:eni:prefix ; ENI route table with CA prefix for packet Outbound | ||
; field = value | ||
action_type = routing_type ; reference to routing type | ||
vnet = vnet name ; destination vnet name if routing_type is {vnet, vnet_direct} | ||
vnet = vnet name ; destination vnet name if routing_type is {vnet, vnet_direct}, a vnet other than eni's vnet means vnet peering | ||
appliance = appliance id ; appliance id if routing_type is {appliance} | ||
overlay_ip = ip_address ; overlay_ip to lookup if routing_type is {vnet_direct}, use dst ip from packet if not specified | ||
overlay_sip = ip_address ; overlay ipv6 src ip if routing_type is {servicetunnel}, transform last 32 bits from packet (src ip) | ||
overlay_dip = ip_address ; overlay ipv6 dst ip if routing_type is {servicetunnel}, transform last 32 bits from packet (dst ip) | ||
underlay_sip = ip_address ; underlay ipv4 src ip if routing_type is {servicetunnel}; this is the ST VIP | ||
underlay_sip = ip_address ; underlay ipv4 src ip if routing_type is {servicetunnel}; this is the ST GW VIP (for ST traffic) or custom VIP | ||
underlay_dip = ip_address ; underlay ipv4 dst ip to override if routing_type is {servicetunnel}, use dst ip from packet if not specified | ||
metering_bucket = bucket_id ; metering and counter | ||
``` | ||
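
The outbound lookup against `DASH_ROUTE_TABLE` can be sketched as a longest-prefix match followed by a check of the `vnet` field. This is a hedged illustration: the in-memory `routes` dictionary, the `Vnet2` entry and both helper functions are hypothetical, and only the field names and the peering rule ("a vnet other than eni's vnet means vnet peering") come from the schema above.

```python
import ipaddress

# Hypothetical in-memory view of DASH_ROUTE_TABLE for one ENI: prefix -> fields.
routes = {
    "10.1.0.0/16": {"action_type": "vnet", "vnet": "Vnet1"},
    "10.2.0.0/16": {"action_type": "vnet", "vnet": "Vnet2"},  # assumed peered vnet
    "10.2.0.6/32": {"action_type": "vnet", "vnet": "Vnet1"},
}


def lpm_lookup(table, dst_ip):
    """Return (network, fields) for the longest matching prefix, or None."""
    dst = ipaddress.ip_address(dst_ip)
    best = None
    for prefix, fields in table.items():
        net = ipaddress.ip_network(prefix)
        if dst in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, fields)
    return best


def is_vnet_peering(fields, eni_vnet):
    """A destination vnet different from the ENI's own vnet means vnet peering."""
    return (fields.get("action_type") in ("vnet", "vnet_direct")
            and fields.get("vnet") != eni_vnet)


hit = lpm_lookup(routes, "10.2.0.6")    # matches the /32 entry
other = lpm_lookup(routes, "10.2.0.8")  # falls back to the /16 entry
```

A linear scan is used only for readability; a hardware pipeline would use an LPM table.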
|
@@ -386,6 +397,7 @@ DASH_ROUTE_RULE_TABLE:{{eni}}:{{vni}}:{{prefix}} | |
"vnet":{{vnet_name}} (OPTIONAL) | ||
"pa_validation": {{bool}} (OPTIONAL) | ||
"metering_bucket": {{bucket_id}} (OPTIONAL) | ||
"region": {{region_id}} (OPTIONAL) | ||
``` | ||
|
||
``` | ||
|
@@ -397,6 +409,7 @@ protocol = INT32 value ; protocol value of incomin | |
vnet = vnet name ; mapped VNET for the key vni/pa | ||
pa_validation = true/false ; perform PA validation in the mapping table belonging to vnet_name. Default is set to true | ||
metering_bucket = bucket_id ; metering and counter | ||
region = region_id ; optional region_id which the vni/prefix belongs to as a string for any vendor optimizations | ||
``` | ||
|
||
### 3.2.8 VNET MAPPING TABLE | ||
|
@@ -408,6 +421,7 @@ DASH_VNET_MAPPING_TABLE:{{vnet}}:{{ip_address}} | |
"mac_address":{{mac_address}} (OPTIONAL) | ||
"metering_bucket": {{bucket_id}} (OPTIONAL) | ||
"use_dst_vni": {{bool}} (OPTIONAL) | ||
"use_pl_sip_eni": {{bool}} (OPTIONAL) | ||
"overlay_sip":{{ip_address}} (OPTIONAL) | ||
"overlay_dip":{{ip_address}} (OPTIONAL) | ||
``` | ||
|
@@ -574,6 +588,9 @@ SONiC for DASH shall have a lite swss initialization without the heavy-lift of e | |
| Nexthop | SAI_NEXT_HOP_ATTR_IP | | ||
| | SAI_NEXT_HOP_ATTR_ROUTER_INTERFACE_ID | | ||
| | SAI_NEXT_HOP_ATTR_TYPE | | ||
| Nexthop Group | SAI_NEXT_HOP_GROUP_TYPE_ECMP | | ||
| | SAI_NEXT_HOP_GROUP_MEMBER_ATTR_NEXT_HOP_ID | | ||
| | SAI_NEXT_HOP_GROUP_MEMBER_ATTR_NEXT_HOP_GROUP_ID | | ||
| Packet | SAI_PACKET_ACTION_FORWARD | | ||
| | SAI_PACKET_ACTION_TRAP | | ||
| | SAI_PACKET_ACTION_DROP | | ||
|
@@ -628,7 +645,10 @@ SONiC for DASH shall have a lite swss initialization without the heavy-lift of e | |
| | SAI_SWITCH_ATTR_VXLAN_DEFAULT_ROUTER_MAC | | ||
|
||
### 3.3.5 Underlay Routing | ||
DASH Appliance shall establish BGP session with the connected ToR and advertise the prefixes (VIP PA). In turn, the ToR shall advertise default route to appliance. With two ToRs connected, the appliance shall have route with gateway towards both ToRs and does ECMP routing. Orchagent install the route and resolves the neighbor (GW) mac and programs the underlay route/nexthop and neighbor. In the absence of a default-route, appliance shall send the packet back on the same port towards the receiving ToR and can derive the underlay dst mac from the src mac of the received packet or from the neighbor entry (IP/MAC) associated with the port. | ||
DASH Appliance shall establish a BGP session with the connected Peer and advertise the prefixes (VIP PA). In turn, the Peer (e.g., a network device or SmartSwitch) shall advertise a default route to the appliance. With two Peers connected, the appliance shall have a route with gateways towards both Peers and perform ECMP routing. Orchagent installs the route, resolves the neighbor (GW) MAC, and programs the underlay route/nexthop and neighbor. | ||
Underlay attributes on a DASH appliance shall be programmed similarly to a SONiC switch. RIF entries shall be created first using SAI_ROUTER_INTERFACE APIs, with IP2ME routes installed using SAI_ROUTE_ENTRY APIs. Based on the neighbors learned from the peer (e.g., a network device or SmartSwitch), neighbor and next-hop entries shall be programmed using SAI_NEIGHBOR_ENTRY and SAI_NEXT_HOP APIs. Finally, underlay routes learned via BGP shall be programmed with regular or ECMP next-hops via the SAI underlay APIs mentioned above. | ||
|
||
Note that *only* a default route is expected from the BGP peer, and the appliance is _not_ expected to allocate an LPM resource for the underlay. An implementation can choose whether to forward the packet on the same port on which it was received or to forward based on the route and next-hop entry. The same applies to ECMP, where the implementation can perform 5-tuple hashing or forward the "return" traffic on the same port on which it received the original packet. | ||
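
The two forwarding choices mentioned above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the next-hop names, the function names and the use of CRC32 as the hash are assumptions made for the example.

```python
import zlib

# Hypothetical next hops: one gateway per connected peer.
NEXT_HOPS = ["peer-1", "peer-2"]


def ecmp_next_hop(src_ip, dst_ip, protocol, src_port, dst_port, next_hops=NEXT_HOPS):
    """Pick a next hop by hashing the 5-tuple (CRC32 is an arbitrary choice here)."""
    key = f"{src_ip}|{dst_ip}|{protocol}|{src_port}|{dst_port}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]


def same_port_next_hop(ingress_peer):
    """Alternative allowed by the text: return traffic leaves via the receiving port."""
    return ingress_peer


# Example underlay flow; the addresses reuse values from the ST example,
# the protocol and port values are placeholders.
flow = ("40.1.2.1", "50.1.2.1", 47, 0, 0)
chosen = ecmp_next_hop(*flow)
```

Hashing keeps every packet of a flow on the same next hop, while the same-port option needs no route lookup at all for return traffic.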
|
||
### 3.3.6 Memory footprints | ||
|
||
|
@@ -878,7 +898,7 @@ For the example configuration above, the following is a brief explanation of loo | |
c. First Action for "servicetunnel" is 4to6 transposition | ||
d. Packet gets transformed as: Overlay SIP fd00:108:0:d204:0:200:a01:101, Overlay DIP 2603:10e1:100:2::3201:201 | ||
e. Second Action is Static NVGRE encap. | ||
f. Since underlay dip is not specified in the LPM table, It shall use Dst IP from packet, i.e 50.1.2.1 and underlay Src IP as 40.1.2.1 | ||
f. Since underlay dip is not specified in the LPM table, It shall use Dst IP (overlay) from packet, i.e 50.1.2.1 and underlay Src IP as 40.1.2.1 | ||
|
||
2. Packet destined to 60.1.2.1 from 10.1.1.1: | ||
a. LPM lookup hits for entry 60.1.2.1/32 | ||
|
@@ -921,7 +941,7 @@ For the example configuration above, the following is a brief explanation of loo | |
"underlay_ip": "25.1.1.1", | ||
"admin_state": "enabled", | ||
"vnet": "Vnet1", | ||
"pl_sip_encoding": "field:11:1:0x1:field:48:48:0x0a0b0d0a0b", | ||
"pl_sip_encoding": "0x0020000000000a0b0c0d0a0b/0x002000000000ffffffffffff", | ||
"pl_underlay_sip": "55.1.2.3" | ||
}, | ||
"OP": "SET" | ||
|
@@ -944,10 +964,9 @@ For the example configuration above, the following is a brief explanation of loo | |
"OP": "SET" | ||
}, | ||
{ | ||
"DASH_ROUTE_TABLE:F4939FEFC47E:10.2.0.0/16": { | ||
"action_type":"vnet_direct", | ||
"vnet":"Vnet1", | ||
"overlay_ip":"10.2.0.6" | ||
"DASH_ROUTE_TABLE:F4939FEFC47E:10.2.0.6/32": { | ||
"action_type":"vnet", | ||
"vnet":"Vnet1" | ||
}, | ||
"OP": "SET" | ||
}, | ||
|
@@ -972,18 +991,21 @@ For the example configuration above, the following is a brief explanation of loo | |
c. Next lookup is in the mapping table and mapping table action here is "privatelink" | ||
d. First Action for "privatelink" is 4to6 transposition | ||
e. Packet gets transformed as: | ||
For Overlay SIP, using ENI's "pl_sip_encoding": "field:11:1:0x1:field:48:48:0x0a0b0c0d0a0b" -> Overlay SIP fd30:108:0:0a0b:0c0d0:0a0b:a01:101; | ||
For Overlay SIP, using ENI's "pl_sip_encoding": "0x0020000000000a0b0c0d0a0b/0x002000000000ffffffffffff" -> Overlay SIP fd30:108:0:0a0b:0c0d:0a0b:a01:101 using the following logic: | ||
1. fv = (fd40:108:0:d204:0:200::0 & !0x002000000000ffffffffffff) (first 96 bits based on provided mask length) | ||
2. result = fv | 0x0020000000000a0b0c0d0a0b (first 96 bits based on the provided mask length) | ||
3. result = result | ca (last 32 bits if its set to 0 in mapping, implicit conversion) | ||
Overlay DIP 2603:10e1:100:2::3401:203 (No transformation, provided as part of mapping) | ||
f. Second Action is Static NVGRE encap with GRE key '100'. | ||
g. Underlay DIP shall be 50.1.2.3 (from mapping), Underlay SIP shall be 55.1.2.3 (from ENI) | ||
|
||
2. Packet destined to 10.2.0.8 from 10.1.1.2: | ||
a. LPM lookup hits for entry 10.2.0.0/16 | ||
b. The action in this case is "vnet_direct" with mapping lookup key as 10.2.0.6 | ||
2. Packet destined to 10.2.0.6 from 10.1.1.2: | ||
a. LPM lookup hits for entry 10.2.0.6/32 | ||
b. The action in this case is "vnet" | ||
c. Next lookup is in the mapping table and mapping table action here is "privatelink" | ||
d. First Action for "privatelink" is 4to6 transposition | ||
e. Packet gets transformed as: | ||
For Overlay SIP, using ENI's "pl_sip_encoding": "field:11:1:0x1:field:48:48:0x0a0b0c0d0a0b" -> Overlay SIP fd30:108:0:0a0b:0c0d0:0a0b:a01:102; | ||
For Overlay SIP, using ENI's "pl_sip_encoding": "0x0020000000000a0b0c0d0a0b/0x002000000000ffffffffffff" -> Overlay SIP fd30:108:0:0a0b:0c0d:0a0b:a01:102; | ||
Overlay DIP 2603:10e1:100:2::3402:206 (No transformation, provided as part of mapping) | ||
f. Second Action is Static NVGRE encap with GRE key '100'. | ||
g. Underlay DIP shall be 50.2.2.6 (from mapping), Underlay SIP shall be 55.1.2.3 (from ENI) | ||
All of this explains processing in the outbound direction. As discussed in the community meeting, please add the inbound processing details for both the ST and PL. Thanks!

Yes, it is updated in

@prsunny Sorry that this inbound processing explanation doesn't answer my question. I was looking more from the point of view how you describe the VNET-to-VNET. Specifically, in order to determine the packet-direction, we look at the "VNI" and then to the inner MAC to find the ENI to which the packet belongs to. For ST/PL, in your example, if we are using NVGRE then the "key" is perhaps used to determine whether the packet is from Host or Network? Correct? If that is the case, the VNET definition should be modified to introduce the concept of NVGRE "key". Currently it only talks about VNI. Should we make those clarifications? Thanks!

@mhanif , it is already captured in section 2.1 "The pipeline shall parse the VNI, and for VM traffic, the VNI shall be a special reserved VNI. Everything else shall be treated as as network traffic(RX)."
||
|
In this example, since the mapping is vnet_direct, all packets destinations in 10.2.0.0/16 subnet will use the same mapping entry - 10.2.0.6, with the same DIPi overwrite. Is this a valid case for PL ?
Hi @prsunny - this came up (also) in bmv2 meeting today w/ @vijasrin . Is this a valid case for Private Link too?
addressed. its not a valid case for PL.