Update data/control plane split design #2729

Draft · wants to merge 1 commit into main
Conversation

@sjberman (Contributor) commented on Oct 28, 2024

With the upcoming agent v3 release, the design for our control plane and data plane separation needs an update. Much of this is based on discoveries from a recent PoC, as well as on predicted design and implementation details that may be necessary. This adds a new proposal that supersedes the old design.

Closes #2655

github-actions bot added the documentation label (Improvements or additions to documentation) on Oct 28, 2024
design/control-data-plane-separation/design.md (outdated)
configuring nginx by writing files to a shared volume, it will send the configuration to the agent via gRPC.
- _Control Plane Service_: Exposes the control plane via a Kubernetes Service of type `ClusterIP`. The data plane will
use the DNS name of the Service to connect to the control plane.
- _Data Plane DaemonSet/Deployment_: The data plane can be deployed as either a DaemonSet or Deployment. The data
Contributor:

Where would the user specify this? ParametersRef? Probably worth mentioning that here.

Contributor Author:

This is discussed in the bullet list below around how a user specifies deployment options. It's a bit fuzzy right now.
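
For illustration, a minimal sketch of the agent-side connection described in the excerpt above: the agent dials the control plane's ClusterIP Service by DNS name over gRPC. The Service name, namespace, and port are assumptions, and the agent's actual gRPC services are omitted.

```go
package main

import (
	"crypto/tls"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	// DNS name of the control plane's ClusterIP Service (hypothetical name, namespace, and port).
	target := "ngf-control-plane.nginx-gateway.svc.cluster.local:443"

	// TLS between the agent and the control plane; certificate loading is elided here.
	creds := credentials.NewTLS(&tls.Config{MinVersion: tls.VersionTLS13})

	conn, err := grpc.NewClient(target, grpc.WithTransportCredentials(creds))
	if err != nil {
		log.Fatalf("failed to create gRPC client: %v", err)
	}
	defer conn.Close()

	// The agent would now register itself and receive configuration over this
	// connection via the agent's gRPC services (omitted here).
}
```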

- Both deployments should have the minimal permissions required to perform their functions.
- The nginx deployment should be configurable via the helm chart.
- The downside of this is that these options will apply to all nginx instances.
- We could introduce a CRD, but where would it attach? We already have NginxProxy which controls dynamic data plane configuration, and this may eventually attach to the Gateway instead of just the GatewayClass. Would a Deployment configuration fit in there, and would it be dynamic? That would require us to completely redeploy nginx if a user changes those settings.
Contributor:

What about the infrastructure API in the Gateway: https://gateway-api.sigs.k8s.io/reference/spec/#gateway.networking.k8s.io/v1.GatewayInfrastructure? It has a ParametersRef field now. I remember that field being somewhat controversial, but I guess it was pushed through.

Contributor:

We could also look into other implementations that do provisioning and see how they are handling this.

Contributor Author:

I could foresee us potentially supporting NginxProxy at the Gateway level in the future. Since parametersRef only supports one object, we need to be thoughtful about putting something there.
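
For reference, a rough sketch of how the control plane could read `spec.infrastructure.parametersRef`, assuming a recent Gateway API version that includes the field in its Go types. The `gateway.nginx.org` group and `NginxProxy` kind are only illustrative of what NGF might attach there.

```go
package main

import (
	"fmt"

	gatewayv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// nginxProxyRefFrom returns the name of the resource referenced by the
// Gateway's spec.infrastructure.parametersRef, if it points at the
// (illustrative) gateway.nginx.org/NginxProxy kind.
func nginxProxyRefFrom(gw *gatewayv1.Gateway) (string, bool) {
	infra := gw.Spec.Infrastructure
	if infra == nil || infra.ParametersRef == nil {
		return "", false
	}
	ref := infra.ParametersRef
	if ref.Group == "gateway.nginx.org" && ref.Kind == "NginxProxy" {
		return string(ref.Name), true
	}
	return "", false
}

func main() {
	gw := &gatewayv1.Gateway{} // in practice this comes from the cluster
	if name, ok := nginxProxyRefFrom(gw); ok {
		fmt.Println("Gateway references NginxProxy:", name)
	}
}
```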

- This could also involve creating a ConfigMap that the control plane consumes on startup and contains all nginx Deployment/Daemonset configuration, including NGINX Plus usage configuration.
- nginx Service should have configurable labels and annotations via the GatewayInfrastructure field in the Gateway resource.
- Control plane creates the nginx deployment and service when a Gateway resource is created, in the same namespace as the Gateway resource. When the Gateway is deleted, the control plane deletes the nginx deployment and service.
- Control plane should label the nginx service and deployment with something related to the name of the Gateway so it can easily be linked.
Contributor:

Is the API opinionated about this?

Contributor Author:

I don't believe so.
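
As a rough sketch of the provisioning flow described in the excerpt above, the control plane could create the nginx Deployment in the Gateway's namespace and label it with the Gateway name so the two can be linked. The resource name, label key, and image are placeholders, not the final NGF convention.

```go
package provisioner

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	gatewayv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// provisionNginx creates an nginx Deployment in the same namespace as the
// Gateway, labeled with the Gateway's name so the two can be linked.
func provisionNginx(ctx context.Context, c kubernetes.Interface, gw *gatewayv1.Gateway) error {
	labels := map[string]string{
		// Illustrative label key; the key NGF actually uses may differ.
		"gateway.nginx.org/gateway-name": gw.Name,
	}

	replicas := int32(1)
	dep := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("nginx-%s", gw.Name),
			Namespace: gw.Namespace,
			Labels:    labels,
		},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "nginx",
						Image: "nginx:stable", // placeholder image
					}},
				},
			},
		},
	}

	_, err := c.AppsV1().Deployments(gw.Namespace).Create(ctx, dep, metav1.CreateOptions{})
	return err
}
```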

- We could start with the helm chart option, and rely on user feedback to see if we need to get more granular.
- nginx Service should have configurable labels and annotations via the GatewayInfrastructure field in the Gateway resource.
Contributor:

According to the spec, these labels and annotations should be applied to any resources created in response to the Gateway, not just the Service.
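
Following the spec point above, a rough sketch of propagating `spec.infrastructure` labels and annotations onto every object the control plane creates for a Gateway, not just the Service. The helper name is hypothetical.

```go
package provisioner

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	gatewayv1 "sigs.k8s.io/gateway-api/apis/v1"
)

// applyInfrastructureMetadata copies labels and annotations from the Gateway's
// spec.infrastructure onto an object the control plane is about to create
// (Deployment, DaemonSet, Service, etc.).
func applyInfrastructureMetadata(obj *metav1.ObjectMeta, gw *gatewayv1.Gateway) {
	if gw.Spec.Infrastructure == nil {
		return
	}
	if obj.Labels == nil {
		obj.Labels = map[string]string{}
	}
	if obj.Annotations == nil {
		obj.Annotations = map[string]string{}
	}
	for k, v := range gw.Spec.Infrastructure.Labels {
		obj.Labels[string(k)] = string(v)
	}
	for k, v := range gw.Spec.Infrastructure.Annotations {
		obj.Annotations[string(k)] = string(v)
	}
}
```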

- Whenever the control plane sees an nginx instance become Ready, we send its config to it (it doesn't matter if this is a new pod or a restarted pod).
- If no nginx instances exist, the control plane should not send any configuration.
- The control plane should check if a connection exists first before sending the config (all replicas, even non-leaders, should be able to send the config because the agent may connect to a non-leader).
- Different pods within an nginx deployment may connect to different NGF control plane pods if the NGF control plane has also been scaled. So each NGF control plane pod may have a different list of connected agents that it sends configuration to. However, each NGF pod should have the exact same internal graph of nginx configuration.
Contributor:

> However, each NGF pod should have the exact same internal graph of nginx configuration

Are we guaranteeing this to be true at any instance in time? Or are we just guaranteeing it will be eventually consistent?

Contributor Author:

Eventually consistent in this case.
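
A sketch of the connection check mentioned in the excerpt above: each control plane replica tracks the agents currently connected to it and only pushes configuration to those, so a replica with no connected agents sends nothing. The type and method names are hypothetical.

```go
package agentgrpc

import "sync"

// connTracker records which nginx agent pods currently hold a gRPC
// connection to this control plane replica. Names are hypothetical.
type connTracker struct {
	mu    sync.RWMutex
	conns map[string]chan []byte // pod name -> channel used to push config
}

func newConnTracker() *connTracker {
	return &connTracker{conns: make(map[string]chan []byte)}
}

func (t *connTracker) add(pod string, ch chan []byte) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.conns[pod] = ch
}

func (t *connTracker) remove(pod string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.conns, pod)
}

// send pushes the rendered nginx config only to agents that are currently
// connected to this replica; if none are connected, nothing is sent.
func (t *connTracker) send(config []byte) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	for _, ch := range t.conns {
		select {
		case ch <- config:
		default: // drop if the agent's send buffer is full; it will be resynced later
		}
	}
}
```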

- Each graph that the control plane builds internally should be directly tied to an nginx deployment.
- The control plane should check if a connection exists first before sending the config (all replicas, even non-leaders, should be able to send the config because the agent may connect to a non-leader).
Contributor:

I'm assuming the leader will be in charge of writing status. Is it possible for that status to become out of sync with the nginx configuration for a particular deployment?

For example, let's say we have two data plane pods and two control plane pods:

  • dp1 connects to cp1
  • dp2 connects to cp2
  • cp2 is leader

Can cp1 fall behind due to network latencies/errors causing dp1 to be running version 1 of the config and cp2 to write the status corresponding to version 2 of the config?

Or, can cp2 fall behind retrying status and cp1 is now x number of versions ahead?

Contributor Author:

Yeah I'm not sure what the cleanest answer is here...

If the leader is the only one that can write status, then it seems like the replicas would need to share information with each other.

Contributor Author (@sjberman, Nov 5, 2024):

There's another option of only allowing the leader pod to accept traffic, so all nginx agents would connect only to the leader. I think this could be done by having the NGF readiness probe only return true if it's the leader. Non-leader pods would be marked as Unready, which would remove their Endpoint from the Service endpoint pool.

This would reduce complexity of sending config and writing status updates. Only the leader would be doing anything. Maybe this should be our approach instead, assuming no downsides.
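
A minimal sketch of the leader-only readiness idea from the previous comment: the readiness endpoint reports ready only while this pod holds leadership, so non-leader pods drop out of the Service endpoints and agents can only connect to the leader. How the leadership state is exposed is an assumption here.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// isLeader would be flipped by the leader-election callbacks (for example,
// OnStartedLeading / OnStoppedLeading in client-go's leaderelection package).
var isLeader atomic.Bool

// readyz returns 200 only while this pod is the leader, so non-leader pods
// are marked Unready and removed from the control plane Service endpoints.
func readyz(w http.ResponseWriter, _ *http.Request) {
	if isLeader.Load() {
		w.WriteHeader(http.StatusOK)
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
}

func main() {
	http.HandleFunc("/readyz", readyz)
	_ = http.ListenAndServe(":8081", nil)
}
```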

```proto
// DataplaneSoftwareDetails contains details for additional software running on the dataplane that pertains
// to NGINX Agent
DataplaneSoftwareDetails dataplane_software_details = 13 [(gogoproto.jsontag) = "dataplane_software_details"];
```
In the current implementation using NGINX Plus, when only upstream servers change, NGF writes the upstream servers in the nginx config but does not reload. It then calls the NGINX Plus API to update the servers. This process allows us to update the upstream servers using the API without having to reload nginx, while still having the upstream servers exist in the nginx config for easy debugging and consistency. However, when using the agent, any config write will result in a reload. To preserve the ability to update upstreams with the API without needing a reload, we'll have to use a `state` file instead of writing the servers directly into the nginx config. This way the list of servers is still available on the filesystem for debugging, but it is written by nginx when the API call is made instead of by the control plane directly.
Contributor:

I thought there was a plan to expose N+ API calls through the agent API. Has that changed? How does the agent know that the file we are sending is a state file and it shouldn't trigger a reload?

Contributor Author:

We don't write that state file; nginx writes it when we call the API.
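
To make the `state` file approach concrete, a rough sketch of what the generated upstream block and the NGINX Plus API call might look like. The template text, state file path, and API version in the URL are illustrative, not NGF's actual implementation.

```go
package config

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Illustrative upstream block: no server directives are rendered into the
// config; nginx loads and persists the server list from the state file.
const upstreamTemplate = `
upstream %[1]s {
    zone %[1]s 512k;
    state /var/lib/nginx/state/%[1]s.conf;
}
`

// renderUpstream renders the upstream block for a given upstream name.
func renderUpstream(name string) string {
	return fmt.Sprintf(upstreamTemplate, name)
}

// addUpstreamServer adds a server to an upstream through the NGINX Plus API,
// which also persists it to the state file without reloading nginx.
func addUpstreamServer(plusAPI, upstream, addr string) error {
	body, err := json.Marshal(map[string]string{"server": addr})
	if err != nil {
		return err
	}
	url := fmt.Sprintf("%s/api/9/http/upstreams/%s/servers", plusAPI, upstream)
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("unexpected status adding upstream server: %s", resp.Status)
	}
	return nil
}
```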

- JWT Secret for running NGINX Plus
- Docker Registry Secret for pulling NGINX Plus
- Client cert/key Secret for NIM connection
- CA cert Secret for NIM connection

Contributor:

What about a Secret for NGF -> Agent connection?

Contributor Author:

Yeah, I'm wavering on this one as to whether a user should create it before they create their Gateway, in that namespace, or whether they create it once and we duplicate it. If it's managed by something like cert-manager, I'm trying to think what the best option would be.

This would apply for all users, not just NGINX Plus.
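
A sketch of the "create it once and we duplicate it" option discussed above: the control plane copies the agent TLS Secret from its own namespace into the namespace where it provisions an nginx deployment. Names and namespaces are placeholders, and rotation (e.g. via cert-manager) would still need separate handling.

```go
package provisioner

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// copyAgentTLSSecret duplicates the agent TLS Secret from the control plane's
// namespace into the namespace where an nginx deployment is being provisioned.
func copyAgentTLSSecret(ctx context.Context, c kubernetes.Interface, srcNS, dstNS, name string) error {
	src, err := c.CoreV1().Secrets(srcNS).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}

	dst := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: dstNS,
			Labels:    src.Labels,
		},
		Type: src.Type,
		Data: src.Data,
	}

	_, err = c.CoreV1().Secrets(dstNS).Create(ctx, dst, metav1.CreateOptions{})
	return err
}
```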

- NGF can scale to X number of data plane pods (we need to figure out what X is)

[performance]: https://github.com/nginx/agent/blob/main/test/performance/user_workflow_test.go
Our NFR tests will help ensure that the performance of scaling and configuration has not degraded. We may also want to enhance these tests to include scaling nginx deployments.
Contributor:

I think we are missing some details around status and scaling. I'm not clear on how we are going to write meaningful status to resources when there are multiple agents and control plane pods.


codecov bot commented Nov 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.40%. Comparing base (82a8a23) to head (c32889e).

Additional details and impacted files
```
@@           Coverage Diff           @@
##             main    #2729   +/-   ##
=======================================
  Coverage   89.40%   89.40%
=======================================
  Files         110      110
  Lines       10913    10913
  Branches       50       50
=======================================
  Hits         9757     9757
  Misses       1098     1098
  Partials       58       58
```


Labels: documentation (Improvements or additions to documentation)
Project status: 🏗 In Progress
Successfully merging this pull request may close: Design for Data and Control Plane Separation
3 participants