Update data/control plane split design #2729
base: main
Conversation
configuring nginx by writing files to a shared volume, it will send the configuration to the agent via gRPC.
- _Control Plane Service_: Exposes the control plane via a Kubernetes Service of type `ClusterIP`. The data plane will
  use the DNS name of the Service to connect to the control plane.
- _Data Plane DaemonSet/Deployment_: The data plane can be deployed as either a DaemonSet or Deployment. The data
Where would the user specify this? ParametersRef? Probably worth mentioning that here.
This is discussed in the bullet list below around how a user specifies deployment options. It's a bit fuzzy right now.
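As a side note on the excerpt above: the data plane resolves the control plane's `ClusterIP` Service by DNS, and configuration is pushed to the agent over gRPC. Below is a minimal Go sketch of that connection from the agent side; the Service name, namespace, port, and use of insecure credentials are placeholder assumptions, not the actual NGF or agent code.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Hypothetical DNS name of the control plane's ClusterIP Service.
	const controlPlaneAddr = "ngf-control-plane.nginx-gateway.svc.cluster.local:8443"

	// A real deployment would use TLS; insecure credentials keep the sketch short.
	conn, err := grpc.Dial(controlPlaneAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("failed to dial control plane: %v", err)
	}
	defer conn.Close()

	// The agent would register itself on this connection and then receive
	// configuration pushes from the control plane over gRPC.
	log.Printf("dialed %s (state: %s)", controlPlaneAddr, conn.GetState())
}
```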
- Both deployments should have the minimal permissions required to perform their functions.
- The nginx deployment should be configurable via the helm chart.
- Downside of this is that these options will apply to all nginx instances.
- We could introduce a CRD, but where would it attach? We already have NginxProxy which controls dynamic data plane configuration, and this may eventually attach to the Gateway instead of just the GatewayClass. Would a Deployment configuration fit in there, and would it be dynamic? That would require us to completely redeploy nginx if a user changes those settings.
What about the infrastructure API in the Gateway: https://gateway-api.sigs.k8s.io/reference/spec/#gateway.networking.k8s.io/v1.GatewayInfrastructure? It has a ParametersRef field now. I remember that field being somewhat controversial, but I guess it was pushed through.
We could also look into other implementations that do provisioning and see how they are handling this.
I could foresee us potentially supporting NginxProxy at the Gateway level in the future. Since parametersRef only supports one object, we need to be thoughtful about putting something there.
- This could also involve creating a ConfigMap that the control plane consumes on startup and contains all nginx Deployment/Daemonset configuration, including NGINX Plus usage configuration.
- nginx Service should have configurable labels and annotations via the GatewayInfrastructure field in the Gateway resource.
- Control plane creates the nginx deployment and service when a Gateway resource is created, in the same namespace as the Gateway resource. When the Gateway is deleted, the control plane deletes the nginx deployment and service.
- Control plane should label the nginx service and deployment with something related to the name of the Gateway so it can easily be linked.
Is the API opinionated about this?
I don't believe so.
- We could introduce a CRD, but where would it attach? We already have NginxProxy which controls dynamic data plane configuration, and this may eventually attach to the Gateway instead of just the GatewayClass. Would a Deployment configuration fit in there, and would it be dynamic? That would require us to completely redeploy nginx if a user changes those settings.
- We could start with the helm chart option, and rely on user feedback to see if we need to get more granular.
- This could also involve creating a ConfigMap that the control plane consumes on startup and contains all nginx Deployment/Daemonset configuration, including NGINX Plus usage configuration.
- nginx Service should have configurable labels and annotations via the GatewayInfrastructure field in the Gateway resource.
According to the spec, these labels and annotations should be applied to any resources created in response to the Gateway. Not just the service.
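To make that provisioning behavior concrete, here is a rough Go sketch of building the metadata for the nginx Deployment/Service that the control plane creates for a Gateway: it applies the Gateway's `infrastructure` labels and annotations to every generated resource and adds a label tying the object back to the Gateway. The label key, naming convention, and helper shape are hypothetical, not NGF's actual implementation.

```go
package provisioner

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// objectMetaForGateway builds the metadata for an nginx object (Deployment or
// Service) provisioned for a Gateway. infraLabels and infraAnnotations are
// assumed to have already been read from the Gateway's spec.infrastructure.
func objectMetaForGateway(gatewayName, gatewayNamespace string, infraLabels, infraAnnotations map[string]string) metav1.ObjectMeta {
	labels := map[string]string{
		// Hypothetical label key linking the generated object back to its Gateway.
		"gateway.nginx.org/gateway-name": gatewayName,
	}
	// Per the spec comment above, infrastructure labels/annotations apply to
	// every resource created in response to the Gateway, not just the Service.
	for k, v := range infraLabels {
		labels[k] = v
	}
	annotations := make(map[string]string, len(infraAnnotations))
	for k, v := range infraAnnotations {
		annotations[k] = v
	}
	return metav1.ObjectMeta{
		Name:        gatewayName + "-nginx", // hypothetical naming convention
		Namespace:   gatewayNamespace,       // same namespace as the Gateway
		Labels:      labels,
		Annotations: annotations,
	}
}
```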
- Whenever the control plane sees an nginx instance become Ready, we send its config to it (it doesn't matter if this is a new pod or a restarted pod).
- If no nginx instances exist, control plane should not send any configuration.
- The control plane should check if a connection exists first before sending the config (all replicas, even non-leaders, should be able to send the config because the agent may connect to a non-leader).
- Different pods within an nginx deployment may connect to a different NGF control plane if the NGF control plane has also been scaled. So each NGF control plane may have a different list of connected agents that they would be sending to. However, each NGF pod should have the exact same internal graph of nginx configuration.
> However, each NGF pod should have the exact same internal graph of nginx configuration

Are we guaranteeing this to be true at any instant in time? Or are we just guaranteeing it will be eventually consistent?
Eventually consistent in this case.
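A simplified Go sketch of the connection-tracking behavior described in the excerpt above: config is sent only when an agent connection exists, from whichever replica holds that connection. The `AgentConn` interface and `ConnTracker` type are hypothetical stand-ins, not real NGF or agent types.

```go
package controlplane

import (
	"context"
	"sync"
)

// AgentConn is a hypothetical stand-in for a gRPC connection from an agent.
type AgentConn interface {
	SendConfig(ctx context.Context, cfg []byte) error
}

// ConnTracker tracks which nginx pods currently have an agent connected to
// this control plane replica.
type ConnTracker struct {
	mu    sync.RWMutex
	conns map[string]AgentConn // keyed by nginx pod name
}

// OnPodReady is called whenever an nginx pod becomes Ready (new or restarted).
// Any replica, leader or not, may run this, since the agent may have connected
// to a non-leader.
func (t *ConnTracker) OnPodReady(ctx context.Context, podName string, cfg []byte) error {
	t.mu.RLock()
	conn, ok := t.conns[podName]
	t.mu.RUnlock()
	if !ok {
		// No connection from this pod (or no nginx instances at all), so no
		// configuration is sent.
		return nil
	}
	return conn.SendConfig(ctx, cfg)
}
```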
- Each graph that the control plane builds internally should be directly tied to an nginx deployment.
- Whenever the control plane sees an nginx instance become Ready, we send its config to it (it doesn't matter if this is a new pod or a restarted pod).
- If no nginx instances exist, control plane should not send any configuration.
- The control plane should check if a connection exists first before sending the config (all replicas, even non-leaders, should be able to send the config because the agent may connect to a non-leader).
I'm assuming the leader will be in charge of writing status. Is it possible for that status to become out of sync with the nginx configuration for a particular deployment?
For example, let's say we have two data plane pods and two control plane pods:
- dp1 connects to cp1
- dp2 connects to cp2
- cp2 is leader
Can cp1 fall behind due to network latencies/errors, so that dp1 is running version 1 of the config while cp2 writes the status corresponding to version 2 of the config?
Or can cp2 fall behind retrying status writes while cp1 is now x number of versions ahead?
Yeah I'm not sure what the cleanest answer is here...
If the leader is the only one that can write status, then it seems like the replicas would need to share information with each other.
There's another option of only allowing the leader pod to accept traffic, so all nginx agents would connect only to the leader. I think this could be done by having the NGF readiness probe only return true if it's the leader. Non-leader pods would be marked as Unready, which would remove their Endpoint from the Service endpoint pool.
This would reduce complexity of sending config and writing status updates. Only the leader would be doing anything. Maybe this should be our approach instead, assuming no downsides.
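For illustration, a minimal Go sketch of that leader-only readiness idea: the readiness endpoint reports Ready only on the leader, so non-leader pods drop out of the Service's endpoints and agents can only connect to the leader. How leadership is detected (`SetLeader` below) is an assumption; this is not the actual NGF probe.

```go
package health

import (
	"net/http"
	"sync/atomic"
)

// LeaderAwareChecker backs the readiness endpoint. The pod is only reported
// Ready while it holds leadership.
type LeaderAwareChecker struct {
	leader atomic.Bool
}

// SetLeader is called when this pod wins or loses leader election (a
// placeholder for however leadership would actually be tracked).
func (c *LeaderAwareChecker) SetLeader(isLeader bool) {
	c.leader.Store(isLeader)
}

// ServeHTTP implements the readiness probe: non-leaders return 503, so their
// Endpoints are removed from the control plane Service and agents only reach
// the leader.
func (c *LeaderAwareChecker) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	if c.leader.Load() {
		w.WriteHeader(http.StatusOK)
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
}
```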
// DataplaneSoftwareDetails contains details for additional software running on the dataplane that pertains
// to NGINX Agent
DataplaneSoftwareDetails dataplane_software_details = 13 [(gogoproto.jsontag) = "dataplane_software_details"];

In the current implementation using NGINX Plus, when only upstream servers change, NGF writes the upstream servers in the nginx config, but does not reload. It then calls the NGINX Plus API to update the servers. This process allows us to update the upstream servers using the API and not have to reload nginx, while still having the upstream servers exist in the nginx config for easy debugging and consistency. However, when using agent, any config write will result in a reload. To preserve the ability to update upstreams with the API without needing a reload, we'll have to utilize a `state` file instead of writing the servers directly in the nginx config. This way the list of servers is still available on the filesystem for debugging, but is written by nginx when the API call is made instead of by the control plane directly.
I thought there was a plan to expose N+ API calls through the agent API. Has that changed? How does the agent know that the file we are sending is a `state` file and it shouldn't trigger a reload?
We don't write that state file, nginx writes it when we call the API.
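For context, a hedged Go sketch of the NGINX Plus flow described in the excerpt above: the control plane adds an upstream server through the NGINX Plus API rather than rewriting the config and reloading, and when the upstream declares a `state` file, nginx persists the updated server list there itself. The API version, listen address, and upstream name below are placeholders.

```go
package main

import (
	"bytes"
	"log"
	"net/http"
)

func main() {
	// Hypothetical local NGINX Plus API endpoint and upstream name; the API
	// version (9 here) depends on the NGINX Plus release.
	const apiURL = "http://127.0.0.1:8765/api/9/http/upstreams/backend/servers"

	// Add a server to the upstream group via the API -- no config write, no reload.
	body := bytes.NewBufferString(`{"server": "10.0.0.5:8080"}`)
	resp, err := http.Post(apiURL, "application/json", body)
	if err != nil {
		log.Fatalf("NGINX Plus API call failed: %v", err)
	}
	defer resp.Body.Close()

	// If the upstream block is configured with a `state` file, nginx persists
	// the updated server list to that file itself; the control plane never
	// writes it.
	log.Println("status:", resp.Status)
}
```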
- JWT Secret for running NGINX Plus
- Docker Registry Secret for pulling NGINX Plus
- Client cert/key Secret for NIM connection
- CA cert Secret for NIM connection
What about a Secret for NGF -> Agent connection?
Yeah, I'm wavering on this one as to whether a user should create it before they create their Gateway, in that namespace, or if they create it once and we duplicate it. If it's managed by something like cert-manager, trying to think what the best option would be.
This would apply for all users, not just NGINX Plus.
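If the NGF -> agent connection is secured with a Secret as discussed here, one possible shape (purely a sketch, with assumed mount paths and port, and server-side TLS only for brevity) is for the control plane's gRPC server to load the cert/key from the mounted Secret:

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	// Hypothetical mount path of the TLS Secret inside the control plane pod.
	creds, err := credentials.NewServerTLSFromFile(
		"/etc/ngf/agent-tls/tls.crt",
		"/etc/ngf/agent-tls/tls.key",
	)
	if err != nil {
		log.Fatalf("loading TLS credentials: %v", err)
	}

	lis, err := net.Listen("tcp", ":8443")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}

	// Agents reach this server through the control plane's ClusterIP Service.
	srv := grpc.NewServer(grpc.Creds(creds))
	log.Println("control plane gRPC server listening on :8443")
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```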
- NGF can scale to X number of data plane pods (we need to figure out what X is)

[performance]: https://github.com/nginx/agent/blob/main/test/performance/user_workflow_test.go

Our NFR tests will help ensure that the performance of scaling and configuration has not degraded. We also may want to enhance these tests to include scaling nginx deployments.
I think we are missing some details around status and scaling. I'm not clear on how we are going to write meaningful status to resources when there are multiple agents and control plane pods.
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@           Coverage Diff           @@
##             main    #2729   +/-  ##
=======================================
  Coverage   89.40%   89.40%
=======================================
  Files         110      110
  Lines       10913    10913
  Branches       50       50
=======================================
  Hits         9757     9757
  Misses       1098     1098
  Partials       58       58

☔ View full report in Codecov by Sentry.
With the upcoming agent v3 release, the design for our control plane and data plane separation needs an update. Much of this is based on discoveries from a recent PoC, as well as predicted design and implementation details that may be necessary. Added a new proposal that supersedes the old design.
Closes #2655