This document proposes a design for separating the control and data planes.
Issue: #292
Note: I use data plane and agent interchangeably in this document.
NGF composes its control plane and data plane containers into a single Kubernetes Pod. The control plane uses OS signals and a shared file system to configure and reload nginx. This architecture is problematic because the same RBAC policy governs both planes and they share the same CVE exposure: a compromised control plane may impact the customer's traffic, and a compromised data plane may affect the Kubernetes API server. In addition to these security concerns, this architecture does not allow the control plane and data plane to scale independently.
- Data plane and control plane containers run in separate Pods
- The communication channel between the control and data planes can be encrypted
- Data planes can register with the control plane
- Data plane can scale independently of the control plane
- RBAC policy for data plane follows the principle of least privilege. The data plane should not have access to the Kubernetes API server.
- RBAC policy for control plane follows the principle of least privilege.
- Control plane can scale
- Support for multiple control planes per GatewayClass or Namespace.
Since the choice of a data plane agent will inform the rest of the design, I will start by proposing the agent.
I propose using the nginx agent as our data plane agent.
The following list outlines all of NGF's requirements for an agent and whether the nginx agent meets them:
- It is open source.
- It supports both OSS and Plus versions of nginx.
- It can be deployed in Kubernetes.
- It supports readiness and liveness probes.
- It supports logging to stderr/stdout.
- It handles SIGTERM gracefully.
- Its container base image is scratch or something minimal.
- It supports a read-only root file system.
- It can run as a non-root user.
- It can configure an nginx instance running in the same Pod.
- It is loosely coupled to the nginx version. The nginx version can be updated independently of the agent.
- A user can update it independently of the control plane.
- It can accept configuration from a control plane running in a different Pod.
- It can report the status of configuration attempts to the control plane.
- It should not crash because of bad config.
- It can authenticate with the control plane.
- It supports TLS between the control plane and the agent.
- It registers itself with the control plane.
- Many agents can register to a single control plane.
- It can scale separately from the control plane.
- A user can configure it independently of other agents.
- It does not need access to the Kubernetes apiserver.
- It has a REST endpoint for metrics in Prometheus format.
- It should be lightweight.
- It can handle frequent configuration changes (~ 1 change/second).
- It can handle large configurations.
- It allows the control plane to access the N+ API securely.
The nginx agent is missing a few of these requirements, so we will need to add the corresponding features for our use case.
Features needed (in priority order, more or less):
- Add support for TLS CA certificate rotation for the agent <-> control plane gRPC channel
- Deterministically confirm that an nginx reload succeeds (e.g. check that new worker processes are running)
- Add an option to configure the server's token via a file
- Add an option to refresh server token from a file
- Add readiness and liveness endpoints
- Produce a container image as a release artifact
- This image should be non-root
- This image should be as minimal as possible
- Allow the control plane to access the N+ API to configure upstreams and the key-value store.
- Add support for metrics enrichment. Metrics can be enriched with Kubernetes meta-information such as namespace, pod name, etc.
Agent features/plugins that we'd like to disable:
- Metrics service client
- Data plane status updates
- Config upload feature
- This is the feature that uploads the config to the control plane
- The nginx-counting feature
- The activity-events feature
Using the nginx agent has the following benefits:
- Saves us development time. We can drop the agent in and start using it immediately (with some caveats).
- Requires limited changes to the control plane.
- It runs `nginx -t` before reloading nginx, which will improve our error reporting.
- No need to reinvent the wheel and contribute to the plethora of nginx agents.
- It supports the core features that we need.
- It is open source, but has a dedicated team from NGINX maintaining it.
- It supports additional features that we may need in the future.
- It has an established community of users (NMS) that are putting it through its paces.
- It is built to be modular and configurable. We should be able to disable most or all of the features we don’t need in the future.
Using the nginx agent has the following drawbacks:
- It is not custom-built for our use case. It contains more features than we need.
- It was not built to run in Kubernetes.
- It violates some of the best practices for running in Kubernetes. For example, it runs two processes in a single container.
- Metrics do not include Kubernetes meta-information.
- It is a dependency that we do not control.
- It does not support dynamic configuration of upstreams and the key-value store.
- We may need to replace it in the future if we run into performance issues or encounter a blocker.
We could write our own agent. This would give us the most control over the design and implementation. There are a few different approaches we could take to design an agent.
Approach 1: Use the nginx agent as a library and build our own agent on top of it
Rather than start from scratch, we could use nginx agent packages to build a new agent. This would allow us to reuse the agent code and build a more lightweight agent that meets only our needs.
Benefits:
- We can reuse some nginx agent code.
- We can pick and choose what we want to include in our agent.
- We can make the agent more Kubernetes-friendly.
Drawbacks:
- I'm not sure how feasible this is. The nginx agent is not designed to be used as a library. Some features may be hard to tease apart.
- We would have to maintain the agent ourselves.
- We wouldn't have as much say in the roadmap of the agent.
Approach 2: Develop an API for nginx
We could develop an API for nginx. This would allow us to configure nginx on a more granular level than the nginx agent. Rather than sending the entire nginx configuration on every update, we could update only what has changed, such as endpoints, servers, or locations. We could use the Nginx Controller API, the N+ API, and the xDS API as inspiration for the design.
Benefits:
- We can configure nginx on a more granular level.
- We have control over the design and implementation.
- We can add dynamic configuration wherever possible.
Drawbacks:
- Time. It would take a lot of time to develop, implement, and maintain this agent.
- We would add yet another nginx agent to the ecosystem.
Approach 3: Use the xDS API and build an agent that translates xDS to nginx config
The xDS API was developed for Envoy, however there is a CNCF xDS API Working Group that is working to create a universal data plane API based on the xDS APIs.
Benefits:
- We can participate in the xDS Working Group and help drive the evolution of the APIs
- We don't have to design our own API
- We could focus on the data plane and bring in an xDS control plane
- An agent that translates xDS to nginx config could be useful for other projects
Drawbacks:
- Steep learning curve. We would need to learn xDS APIs
- Translating xDS to nginx config is not a trivial task. Some features will not map easily to nginx config.
- We would either need to completely re-write our control plane or adopt an xDS control plane
We could modify the nginx agent to meet our needs. This would require us to maintain a fork of the nginx agent. I see this option as a last resort that we can decide to adopt if we hit a roadblock with the nginx agent.
- Control Plane Deployment: The control plane is a Kubernetes Deployment with one container running the NGF controller. Initially, the control plane will be limited to a single Pod. Once we add leader election, the control plane will be able to scale. The control plane will perform the same functions as it does today, but instead of configuring nginx by writing files to a shared volume, it will send the configuration to the agent via gRPC.
- Control Plane Service: Exposes the control plane via a Kubernetes Service of type `ClusterIP`. The data plane will use the DNS name of the Service to connect to the control plane.
- Data Plane DaemonSet/Deployment: A user can deploy the data plane as either a DaemonSet or Deployment. The data plane contains a single container running both the agent and nginx processes. The agent will download the configuration from the control plane over a streaming RPC.
- NGINX Service: Exposes nginx via a Kubernetes Service of type `LoadBalancer`. This is the entry point for the customer's traffic. Initially, this Service will only expose ports 80 and 443. In the future, if we add support for additional listener ports, this Service will expose all the listener ports. Note that this Service should not expose any of the agent's ports.
The user will deploy the control plane and data plane via Kubernetes manifests. The control plane supports a single `GatewayClass` and `Gateway` resource. Following this model, it makes sense for the control plane to support a single data plane Deployment/DaemonSet. Each agent+nginx Pod in the data plane will receive the same configuration from the control plane. In the future, we may reconsider this model if there's a use case for the control plane to support multiple `Gateway` resources. For example, in that model, it might make sense for the control plane to be responsible for provisioning a separate data plane for each Gateway.
Users can run multiple NGF controllers in the same cluster by leveraging the `GatewayClass` resource. Multiple NGF controllers will be able to coexist in the same cluster as long as they each have a unique corresponding `GatewayClass` resource. In this case, each installation of NGF will contain a unique `GatewayClass` resource, a control plane Deployment, and a data plane Deployment/DaemonSet.
The control plane and agent will communicate over gRPC. The agent will establish a gRPC connection to the control plane on start-up and will gracefully retry connecting, so the start order of the containers is not an issue. The gRPC runtime will handle connection establishment and management. If an error occurs or the stream or connection is dropped, the connection must be reestablished.
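As a rough illustration, the sketch below shows how a gRPC client can be configured to reconnect with exponential backoff and keepalives so that container start order and transient failures are tolerated. This is not the agent's actual implementation; the package and function names are illustrative, and the real dial options come from the agent's configuration.

```go
// Sketch of a gRPC client connection with exponential reconnect backoff.
// The address and TLS config are supplied by the caller.
package agentconn

import (
	"crypto/tls"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/keepalive"
)

func dialControlPlane(addr string, tlsCfg *tls.Config) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithTransportCredentials(credentials.NewTLS(tlsCfg)),
		// The gRPC runtime re-dials automatically with exponential backoff
		// when the connection drops or the control plane is not up yet.
		grpc.WithConnectParams(grpc.ConnectParams{
			Backoff: backoff.Config{
				BaseDelay:  1 * time.Second,
				Multiplier: 1.6,
				MaxDelay:   30 * time.Second,
			},
		}),
		// Keepalives detect dead connections so streams can be reestablished.
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:    30 * time.Second,
			Timeout: 10 * time.Second,
		}),
	)
}
```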
The control plane will have to implement the `Commander` gRPC service defined below.
```proto
// Interface exported by the server.
service Commander {
  // A Bidirectional streaming RPC established by the data plane agent and is kept open
  rpc CommandChannel(stream Command) returns (stream Command) {}

  // A streaming RPC established by the data plane agent and is used to download resources associated with commands
  // The download stream will be kept open for the duration of the data transfer and will be closed when it's done.
  // The transfer is a stream of chunks as follows:
  //  - header
  //    - data chunk count to follow
  //    - resource identifier/metadata
  //  - data 1
  //  ...
  //  - data
  //
  // each data chunk is of a size smaller than the maximum gRPC payload
  rpc Download(DownloadRequest) returns (stream DataChunk) {}

  // A streaming RPC established by the data plane agent and is used to upload resources associated with commands
  rpc Upload(stream DataChunk) returns (UploadStatus) {}
}
```
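To make the control plane side concrete, here is a minimal skeleton of a `Commander` implementation. It assumes Go stubs generated from the proto above with the usual grpc-go naming (`RegisterCommanderServer`, `Commander_CommandChannelServer`, and so on) and an assumed SDK import path; treat it as a sketch rather than the final implementation.

```go
// Rough control plane skeleton for the Commander service.
package controlplane

import (
	"log"
	"net"

	"google.golang.org/grpc"

	pb "github.com/nginx/agent/sdk/v2/proto" // assumed import path
)

type commander struct{}

// CommandChannel is established by the agent and kept open; every received
// Command is handled and, where needed, a Command is sent back on the stream.
func (c *commander) CommandChannel(stream pb.Commander_CommandChannelServer) error {
	for {
		cmd, err := stream.Recv()
		if err != nil {
			return err // the agent will re-establish the stream
		}
		log.Printf("received command type=%v", cmd.GetType())
	}
}

// Download streams the chunked nginx configuration to the agent (see the
// Configuration Download section for the chunking sketch).
func (c *commander) Download(req *pb.DownloadRequest, stream pb.Commander_DownloadServer) error {
	return nil
}

// Upload is required by the service definition but unused in our design.
func (c *commander) Upload(stream pb.Commander_UploadServer) error {
	return nil
}

func Serve(lis net.Listener, opts ...grpc.ServerOption) error {
	srv := grpc.NewServer(opts...)
	pb.RegisterCommanderServer(srv, &commander{})
	return srv.Serve(lis)
}
```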
The `CommandChannel` RPC is a bidirectional stream RPC that is established by the agent and remains an open HTTP/2 stream over the course of the agent's lifetime. The purpose of this channel is for the control plane and the agent to send command requests and responses to one another. For more information on the command message, see the Command Message section.
Commands can contain payloads, but they are mostly small (a few KB at most) and should not transport large amounts of data. Message size matters because gRPC enforces a maximum message size. For larger payloads, like the nginx configuration, the `Download` and `Upload` RPCs will be used in conjunction with commands.
The control plane will use the `Download` channel to push the configuration to the agent. See the Configuration Download section for details.
The agent uses the `Upload` RPC to upload large payloads, such as the nginx configuration, to the control plane. For our use case, we only need the ability to send the nginx configuration from the control plane to the agent. We do not need the agent to upload its configuration to the control plane. Ideally, we would like to disable this functionality in the agent.
For the full commander service definition, see this file.
Command Messages have the following structure:
- Metadata - contains details about the sender and the message.
- Type - contains information about the type of data the message carries.
- Data - the message payload.
Command messages act as envelopes, but they make use of both the `type` field and the `oneof` feature of gRPC to embed different types of payloads.
```proto
// Command is the envelope sent between the management plane and the data plane, requesting some action or reporting a response
message Command {
  Metadata meta = 1 [(gogoproto.jsontag) = "meta"]; // add metadata later with fields like timestamp etc

  enum CommandType {
    NORMAL = 0;
    DOWNLOAD = 1;
    UPLOAD = 2;
  }

  // used as a dispatch flag to quickly send the command to the correct base processor that will then further sort
  // based on the actual data type
  CommandType type = 2 [(gogoproto.jsontag) = "type"];

  oneof data {
    // common command status response - used by most command responses
    CommandStatusResponse cmd_status = 3 [(gogoproto.jsontag) = "cmd_status"];
    // request action on nginx config when sent C → A - all action values (see NgxConfig)
    // notify config details when sent A → C - only RETURN action
    NginxConfig nginx_config = 4 [(gogoproto.jsontag) = "nginx_config"];
    // request action on nginx config when sent C → A - all action values (see NgxConfig)
    // notify config details when sent A → C - only RETURN action
    NginxConfigResponse nginx_config_response = 5 [(gogoproto.jsontag) = "nginx_config_response"];
    // request connection to a management plane, A → C
    AgentConnectRequest agent_connect_request = 6 [(gogoproto.jsontag) = "agent_connect_request"];
    // connection response to the data plane, C → A
    AgentConnectResponse agent_connect_response = 7 [(gogoproto.jsontag) = "agent_connect_response"];
    // request Configuration parameters for agent, A → C
    AgentConfigRequest agent_config_request = 8 [(gogoproto.jsontag) = "agent_config_request"];
    // configuration parameters for Agent C → A. This message can be sent asynchronously as well
    AgentConfig agent_config = 9 [(gogoproto.jsontag) = "agent_config"];
    // DataplaneStatus reports Dataplane metrics the Agent is aware of
    DataplaneStatus dataplane_status = 11 [(gogoproto.jsontag) = "dataplane_status"];
    // EventReport reports events the Agent is aware of, e.g. Start/Stop of Agent, Config Apply NGINX
    f5.nginx.agent.sdk.events.EventReport event_report = 12 [(gogoproto.jsontag) = "event_report"];
    // DataplaneSoftwareDetails contains details for additional software running on the dataplane that pertains
    // to NGINX Agent
    DataplaneSoftwareDetails dataplane_software_details = 13 [(gogoproto.jsontag) = "dataplane_software_details"];
    // DataplaneUpdate contains details for dataplane resources that have changed
    DataplaneUpdate dataplane_update = 14 [(gogoproto.jsontag) = "dataplane_update"];
  }
}
```
There are many command types defined above, and the control plane does not care about all of them. However, any commands that we do not care about can be easily ignored by the control plane by inspecting the message type.
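A sketch of that dispatch, assuming the standard protobuf-go oneof wrapper naming (`pb.Command_AgentConnectRequest`, etc.) for the generated types:

```go
// Sketch of ignoring unneeded command types by switching on the oneof payload.
package controlplane

import (
	"log"

	pb "github.com/nginx/agent/sdk/v2/proto" // assumed import path
)

func handleCommand(cmd *pb.Command) {
	switch data := cmd.GetData().(type) {
	case *pb.Command_AgentConnectRequest:
		// Registration request from the agent (see the Registration section).
		log.Printf("agent registering: %+v", data.AgentConnectRequest)
	case *pb.Command_NginxConfigResponse:
		// Status of the last configuration apply.
		log.Printf("config apply status: %+v", data.NginxConfigResponse)
	case *pb.Command_CmdStatus:
		// Generic command status response.
		log.Printf("command status: %+v", data.CmdStatus)
	default:
		// DataplaneStatus, EventReport, AgentConfigRequest, etc. are not
		// needed by NGF and are simply ignored.
	}
}
```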
For the full command message definition, see this file.
The control plane sends the `NginxConfig` message in chunks over the `Download` channel to the agent in order to configure nginx.
```proto
message NginxConfig {
  NginxConfigAction action = 1 [(gogoproto.jsontag) = "action"];
  ConfigDescriptor config_data = 2 [(gogoproto.jsontag) = "config_data"];
  ZippedFile zconfig = 3 [(gogoproto.jsontag) = "zconfig"];
  ZippedFile zaux = 4 [(gogoproto.jsontag) = "zaux"];
  AccessLogs access_logs = 5 [(gogoproto.jsontag) = "access_logs"];
  ErrorLogs error_logs = 6 [(gogoproto.jsontag) = "error_logs"];
  SslCertificates ssl = 7 [(gogoproto.jsontag) = "ssl"];
  DirectoryMap directory_map = 8 [(gogoproto.jsontag) = "directory_map"];
}
```
- `action`: what action the agent should take with the nginx config: apply, test, rollback, return, or force.
- `config_data`: contains metadata on the agent and nginx instance we are configuring.
- `zconfig`: a zipped file with all `nginx.conf` files.
- `zaux`: a zipped file with all nginx auxiliary files, such as njs modules, static html files, etc.
- `access_logs`: meta-information about the access logs.
- `error_logs`: meta-information about the error logs.
- `ssl`: meta-information about the SSL certificates stored on the data plane.
- `directory_map`: meta-information about the nginx configuration files. The agent uses this for synchronization (i.e., comparing configuration against previous deployments) and to interrogate the file system before applying the configuration.
I found that for NGF's use case, the `access_logs`, `error_logs`, and `ssl` fields are unnecessary. For more details on how the control plane pushes nginx config to the agent, see the Configuration Download section.
For the full `NginxConfig` message definition, see this file.
The agent and control plane communication channel will be encrypted. We will store the server certificate, key pair, and CA certificate in Kubernetes Secrets. The user will install the Secrets in the `nginx-gateway` namespace under the following names:

- `nginx-gateway-cert`: This Secret will contain the TLS certificate and private key that the control plane will use to serve gRPC traffic.
- `nginx-agent-cert`: This Secret will contain the CA bundle that validates the control plane's certificate.
The Secrets will be mounted to the control plane and agent containers, respectively. If desired, we can make the Secret names and mount paths configurable via flags. For production, we will direct users to provide their own certificates. For development and testing purposes, we will provide self-signed default certificates. To be secure by default, NGF should generate these default certificates and key pairs during installation using a Kubernetes Job.
Kubernetes automatically updates mounted Secrets when their content changes, but the control plane and agent must make sure they are using the latest certificates. We can achieve this by providing a callback in the `tls.Config` for the gRPC server and client. We will need to add this functionality to the agent as well.
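A minimal sketch of such a callback on the control plane side, assuming the certificate and key are read from the paths where the Secret is mounted (paths are supplied by the caller); the client side can do the same with `GetClientCertificate`:

```go
// Sketch of a tls.Config callback that re-reads the certificate from the
// mounted Secret on every handshake, so rotated certificates are picked up
// without a restart.
package tlsreload

import "crypto/tls"

func serverTLSConfig(certFile, keyFile string) *tls.Config {
	return &tls.Config{
		MinVersion: tls.VersionTLS12,
		// Called for every TLS handshake; loading from disk here means the
		// kubelet's Secret updates take effect on the next connection.
		GetCertificate: func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
			cert, err := tls.LoadX509KeyPair(certFile, keyFile)
			if err != nil {
				return nil, err
			}
			return &cert, nil
		},
	}
}
```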
The agent will use a Kubernetes ServiceAccount token to authenticate with the control plane. The control plane will authenticate the token by sending a request to the Kubernetes TokenReview API.
On start-up, the agent will create a gRPC `CommanderClient` and connect to the control plane's `CommanderServer` using the server address, server token, and TLS options specified in the agent's configuration file (see Agent Configuration). This connection is secured by TLS; see the Encryption section for more information. The control plane will validate the token with Kubernetes by sending a TokenReview API request. If the token is valid, the bidirectional streaming `CommandChannel` between the agent and the control plane is established and left open for the lifetime of the agent.
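The token validation step could look roughly like the following sketch, which uses client-go's TokenReview client; the function name and wiring are illustrative:

```go
// Sketch of validating the agent's ServiceAccount token with the TokenReview API.
package auth

import (
	"context"
	"fmt"

	authv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func validateToken(ctx context.Context, client kubernetes.Interface, token string) error {
	review, err := client.AuthenticationV1().TokenReviews().Create(ctx, &authv1.TokenReview{
		Spec: authv1.TokenReviewSpec{
			Token: token,
			// Audiences would be set here once we move to bound tokens.
		},
	}, metav1.CreateOptions{})
	if err != nil {
		return fmt.Errorf("token review request failed: %w", err)
	}
	if !review.Status.Authenticated {
		return fmt.Errorf("agent token rejected: %s", review.Status.Error)
	}
	return nil
}
```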
Upon successful connection, the agent will register by sending an `AgentConnectRequest` command from the agent to the control plane over the `CommandChannel`. This message is used by the control plane to associate the agent with internal resources correctly. See the Registration section for more information.
Long-lived tokens are JWT tokens for a ServiceAccount that are valid for the lifetime of the ServiceAccount. They are stored in Secrets and can be mounted to a Pod as a file or an environment variable. We can use the TokenReview API to verify the token. While long-lived tokens can still be created and used in Kubernetes, bound tokens are now the default and preferred option.
Bound ServiceAccount tokens are OpenID Connect (OIDC) identity tokens that are obtained directly from the TokenRequest API and are mounted into Pods using a projected volume. Bound tokens are more secure than long-lived tokens because they are time-bound, audience-bound, and object-bound.
- Time-bound: Bound tokens expire after a configurable amount of time. The default is 1 hour. The kubelet will periodically refresh the token before it expires.
- Audience-bound: Bound tokens are only valid for a specific audience. The audience is a string that identifies the intended recipient of the token.
- Object-bound: Bound tokens are bound to the Pod.
The TokenReview API only considers a bound token to be valid if the token is not expired, the audience of the token matches the audience specified in the TokenReview API request, and the Pod that the token is bound to is still present and running.
Bound tokens expire, and are written to the filesystem by the kubelet. While bound tokens are more secure than long-lived tokens, the agent needs to be modified to use them. The agent would need to be able to reload the tokens from the filesystem periodically. That would require the following changes in the agent code:
- Add a new configuration option to specify the path to the token file. Currently, the agent supports reading the token from an environment variable or the configuration file, not from a file.
- Modify the gRPC client to fetch the token from a file before connecting to the control plane. Currently, the token is loaded on start-up and never refreshed. If the agent reconnects to the control plane, it will use the same token provided on start-up.
I propose that we initially use a long-lived token for the agent. This will allow us to get the agent up and running quickly. We can then add support for bound tokens in a future release.
To create the long-lived token, we will provide the following manifest:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: nginx-agent-token-secret
  annotations:
    kubernetes.io/service-account.name: nginx-agent
type: kubernetes.io/service-account-token
```
And expose the token as an environment variable in the agent container:
```yaml
env:
  - name: NMS_SERVER_TOKEN
    valueFrom:
      secretKeyRef:
        name: nginx-agent-token-secret
        key: token
```
The agent will load the token from the `NMS_SERVER_TOKEN` environment variable and add it to the Authorization header of the gRPC request when connecting to the control plane.
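The mechanism is roughly equivalent to attaching per-RPC credentials to the gRPC connection, as in the sketch below. The type names and the `Bearer` prefix are assumptions for illustration; the agent's own implementation may format the header differently.

```go
// Illustration of attaching a bearer token from an environment variable to
// each gRPC request's metadata.
package auth

import (
	"context"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

type tokenCreds struct{ token string }

func (t tokenCreds) GetRequestMetadata(ctx context.Context, uri ...string) (map[string]string, error) {
	// Header format is an assumption; the agent sets its own Authorization value.
	return map[string]string{"authorization": "Bearer " + t.token}, nil
}

// RequireTransportSecurity ensures the token is only ever sent over TLS.
func (t tokenCreds) RequireTransportSecurity() bool { return true }

func dialWithToken(addr string, tc credentials.TransportCredentials) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithTransportCredentials(tc),
		grpc.WithPerRPCCredentials(tokenCreds{token: os.Getenv("NMS_SERVER_TOKEN")}),
	)
}
```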
For a good comparison of long-lived and bound tokens, see this blog post.
The agent will send an `AgentConnectRequest` command to the control plane over the `CommandChannel` to register with the control plane. This command includes the agent's system ID and nginx ID(s) which the control plane needs in order to push configuration. During registration, the control plane will store this information and associate it with the connection.
In response, the control plane will send an `AgentConnectResponse` over the `CommandChannel`. The `AgentConnectResponse` contains the connect request's status and the agent's runtime configuration. This includes the enabled features, extensions, tags, log configuration, and alias for the agent.
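A sketch of how the control plane might track registered agents and their streams so it can later push configuration to them (the field and type names here are illustrative, not the SDK's):

```go
// Sketch of associating a registered agent with its CommandChannel stream.
package controlplane

import (
	"sync"

	pb "github.com/nginx/agent/sdk/v2/proto" // assumed import path
)

type registeredAgent struct {
	SystemID string
	NginxIDs []string
	Stream   pb.Commander_CommandChannelServer // assumed generated stream type
}

type agentRegistry struct {
	mu     sync.RWMutex
	agents map[string]*registeredAgent // keyed by system ID
}

func (r *agentRegistry) register(a *registeredAgent) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.agents == nil {
		r.agents = map[string]*registeredAgent{}
	}
	r.agents[a.SystemID] = a
}

func (r *agentRegistry) deregister(systemID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.agents, systemID)
}
```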
Currently, NGF configures nginx by translating the Gateway API resources into an internal representation of the nginx config, executing a template with this data to generate the nginx configuration as bytes, and then writing the bytes to the filesystem.
The agent expects the configuration files to be compressed and packaged into a `ZippedFile` message. The `ZippedFile` message contains the compressed configuration files' contents, the files' root directory, and the contents' checksum.
To produce the `ZippedFile` message, the control plane will import the `zip.Writer` from the agent's SDK. The control plane will still execute the template to generate the nginx configuration as bytes, but instead of writing the files to the filesystem, the control plane will add the files to the `zip.Writer`.
In addition to the `ZippedFile` containing the nginx configuration files, we will also need to build the `DirectoryMap` message. The `DirectoryMap` message contains a mapping of the directory structure holding the nginx configuration files. The agent uses this map to interrogate the file system before applying the config and to determine which files to delete. Each file we add to the `zip.Writer` we will also add to the `DirectoryMap`.
The final pieces of the `NginxConfig` message that we will need to provide are the `Action` and `ConfigDescriptor` messages. We will set the action to `APPLY` and fill out the `ConfigDescriptor` according to the agent information received during registration.
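The sketch below approximates this assembly step. The real implementation will use the `zip.Writer` from the agent's SDK; the standard library `archive/zip` is used here only to show what ends up in the payload (compressed contents, root directory, checksum) alongside the `DirectoryMap` entries.

```go
// Approximation of assembling the config payload for the NginxConfig message.
package config

import (
	"archive/zip"
	"bytes"
	"crypto/sha256"
	"encoding/hex"
)

type generatedFile struct {
	Name     string // e.g. "nginx.conf", "conf.d/http.conf"
	Contents []byte
}

type zippedPayload struct {
	RootDir  string
	Checksum string
	Contents []byte
	Files    []string // becomes the DirectoryMap entries
}

func buildPayload(rootDir string, files []generatedFile) (*zippedPayload, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	names := make([]string, 0, len(files))
	for _, f := range files {
		w, err := zw.Create(f.Name)
		if err != nil {
			return nil, err
		}
		if _, err := w.Write(f.Contents); err != nil {
			return nil, err
		}
		names = append(names, f.Name)
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	sum := sha256.Sum256(buf.Bytes())
	return &zippedPayload{
		RootDir:  rootDir,
		Checksum: hex.EncodeToString(sum[:]),
		Contents: buf.Bytes(),
		Files:    names,
	}, nil
}
```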
Note that we must send the entire nginx configuration to the agent on each configuration update. We can leverage the checksum if we want to reduce the number of configuration updates sent to the agent. By storing the last checksum in the control plane, we can avoid sending the configuration to the agent if it hasn’t changed.
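A small sketch of that checksum cache, keyed by the agent's system ID (names are illustrative):

```go
// Sketch of skipping redundant pushes by remembering the checksum of the last
// configuration sent to each agent.
package config

import "sync"

type configCache struct {
	mu   sync.Mutex
	last map[string]string // agent system ID -> checksum of last pushed config
}

// shouldPush records the new checksum and reports whether it differs from the
// previously pushed one.
func (c *configCache) shouldPush(systemID, checksum string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.last == nil {
		c.last = map[string]string{}
	}
	if c.last[systemID] == checksum {
		return false
	}
	c.last[systemID] = checksum
	return true
}
```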
Static configuration files, such as njs modules, do not need to be sent to the agent on every configuration update. Instead, we will mount these files to the agent's container using a ConfigMap. This will also require us to specify the path to the module in the agent's configuration file and in the `DirectoryMap` of the `NginxConfig` message to prevent the agent from removing them from the filesystem.
The TLS certificates and keys specified in the `GatewayTLSConfig` field of Listeners are references to Kubernetes Secrets. In the future, we will support other forms of authentication data, such as JWT tokens and user/pass files. Users will also store this secret data in Kubernetes Secrets.
Currently, NGF watches all Secrets, stores them in memory, and then selectively writes them to the filesystem. We only write Secrets that are referenced by Listeners to the filesystem. As part of the separation work, we need to figure out a way for the data plane to have access to these TLS certificates and keys.
The challenge is that we do not want the data plane to have access to the Kubernetes apiserver. This is one of the driving reasons behind separating the control plane from the data plane. So, we need a way for the agent to retrieve the TLS certificates and keys without having to make a request to the Kubernetes apiserver.
Option 1: Send the secret data as part of the `NginxConfig` message
The `NginxConfig` message has two `ZippedFile` fields, one for all nginx conf files and one for all the auxiliary files. Auxiliary files are all files that are referenced by the nginx configuration but are not `.conf` files. For this option, the control plane will continue to watch Secrets and store them in memory. When building the `NginxConfig` message, we will add the content from all referenced Secrets to the auxiliary `ZippedFile` message as individual secret data files. The `DirectoryMap` will also be updated to include the paths to these files. The agent will extract the secret data files from the auxiliary `ZippedFile` and write them to the filesystem.
This option, or a variation, is used in several other Gateway/Ingress implementations. For example, the Contour Gateway sends TLS certificates and keys over gRPC to its Envoy data plane. This Envoy API is called the Secret Discovery Service. The ingress-nginx project is also working on separating their control and data planes, and in their draft implementation, the secret data is stored in the JSON config payload that is sent to the data plane over gRPC. Additionally, APISix sends secret data over HTTP to their data plane.
Benefits:
- The agent does not need access to the Kubernetes apiserver.
- The agent does not need to be modified to support this.
Drawbacks:
- The control plane sends secret data over gRPC on every configuration update.
Option 2: Run a sidecar container in the agent’s Pod that watches the Secrets and writes them to a shared volume
For this option, we will write a Kubernetes controller that watches Secrets and writes them to a shared volume. This controller will run as a sidecar in the agent's Pod. We only want to write Secrets that are referenced by the nginx configuration, so we need to figure out a way to tell the sidecar which Secrets to write. There's also an ordering issue to consider. The secret data must be written to the shared volume before the configuration that references the secret data is applied; otherwise, the configuration application will fail. One way to do this is to write a plugin for the agent that ranges over the `SslCertificates` field in the `NginxConfig` message and requests those Secrets over localhost from the sidecar. In this case, the control plane is responsible for writing the `SslCertificates` field in the `NginxConfig` message.
Benefits:
- Secret data is not sent over gRPC on every configuration update.
- The agent container does not need access to the Kubernetes apiserver.
Drawbacks:
- The agent needs to be modified to support this.
- We need to write and maintain an additional controller.
- We need to produce an additional container image.
- The Pod will still need to have access to the Kubernetes apiserver. We may be able to limit access to the sidecar container by disabling the automatic mounting of the ServiceAccount token and manually mounting the token to the sidecar container only. However, the agent does need a ServiceAccount token to connect to the control plane.
Option 3: Run a DaemonSet that watches the Secrets and sends them to the agent over a Unix socket
This option is similar to option 2, but instead of running the controller as a sidecar container, we will run it as a DaemonSet. The controller will watch Secrets and send them to the agent over an API served over a Unix domain socket. The agent will be responsible for writing the secret data to the filesystem.
This approach will require `hostPath` volumes on the agent and the Secret controller Pods, so they can access the Unix domain socket on the host. It will also require a new plugin for the agent that can fetch the secret data over the API and write it to the filesystem.
Benefits:
- Secret data is not sent over gRPC on every configuration update.
- The agent container does not need access to the Kubernetes apiserver.
Drawbacks:
- The agent needs to be modified to support this.
- We need to write and maintain an additional controller.
- We need to produce an additional container image.
- We need to use `hostPath` volumes, which pose security risks, and some Kubernetes distributions do not allow them.
My preference is option 1, as it is the simplest and requires the fewest changes.
The control plane will send a `Download` command over the `CommandChannel` to the agent. Then, the control plane will chunk out the `NginxConfig` message and send it over the `Download` channel. The agent will receive the `Download` command and then download the config chunks over the `Download` channel. It will then assemble the chunks into a full configuration and attempt to apply it. It will send the status of the configuration application over the `CommandChannel` in an `NginxConfigResponse` message.
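The chunking itself is straightforward. The sketch below splits the serialized configuration into pieces that stay under grpc-go's default 4 MB message limit and hands them to callbacks that would wrap each piece in a `DataChunk` and write it to the `Download` stream. Sizes and function names are illustrative.

```go
// Sketch of splitting a serialized NginxConfig into chunks that fit within the
// gRPC message size limit. The callbacks stand in for writing a header
// DataChunk followed by data DataChunks to the Download stream.
package config

// Keep chunks comfortably under grpc-go's default 4 MB max message size.
const maxChunkSize = 2 * 1024 * 1024

func sendChunked(payload []byte, sendHeader func(chunkCount int) error, sendData func(chunk []byte) error) error {
	count := (len(payload) + maxChunkSize - 1) / maxChunkSize
	// The header tells the agent how many data chunks to expect.
	if err := sendHeader(count); err != nil {
		return err
	}
	for offset := 0; offset < len(payload); offset += maxChunkSize {
		end := offset + maxChunkSize
		if end > len(payload) {
			end = len(payload)
		}
		if err := sendData(payload[offset:end]); err != nil {
			return err
		}
	}
	return nil
}
```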
We can configure the agent through a YAML file, command-line flags, or environment variables on start-up. The agent applies configuration sources in the following order of precedence (from highest to lowest):
- Command-line flags
- Environment variables
- Config files
We can either configure the agent with command-line flags or mount the config file as a volume to the agent’s container. Some configuration options we care about are:
- `server-host`: the DNS name of the control plane Kubernetes Service
- `server-grpcport`: the gRPC port that the control plane is listening on
- `tls-ca`: the path to the CA certificate file
- `tls-cert`: the path to the agent's certificate file
- `tls-key`: the path to the agent's private key file
- `tls-enable`: enable TLS for secure communication with the control plane
- `features`: list of enabled features
- `config-dirs`: the paths that you want to grant agent read/write access to
For a list of configuration options, see NGINX Agent CLI Flags & Usage.
The control plane can send runtime configuration using `AgentConfig` messages. I'm not sure if we have a use case for runtime configuration at the moment.
The following edge cases should be considered and tested during implementation:
- The data plane fails to establish a connection with the control plane.
- Existing connections between data plane and control plane are terminated during a download event.
In these cases, we expect the agent to be resilient. It should not crash or produce invalid config, and it should retry when possible.
Since the data plane is deployed in its own Pod, a user can horizontally scale the data plane independently of the control plane. When a new agent Pod spins up, the agent will register itself with the control plane. The control plane will have to keep track of all the agents.
We will need to figure out how many agent instances can be supported by NGF’s control plane. As an example data point, NGINX Controller can monitor and manage up to 100 N+ instances or up to 30 N+ instances with App Protect enabled. Since our use case for the agent is much simpler than NGINX Controller, we may be able to support more instances of the agent.
The agent exposes a REST `/metrics` endpoint that returns metrics in Prometheus format. The agent also sends metrics to the control plane. One feature we will need to add to the agent is the ability to enrich the metrics with additional information. For example, upstream metrics need to be tied to the Kubernetes Pod name and namespace for them to be useful.
The agent team has a suite of performance tests here that they run. If we need benchmarks on larger configurations, we can work with them to add bigger config files to the test suite.
Config Push Benchmarks:
```text
cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
BenchmarkNginxConfig/../testdata/configs/bigger/1k.conf-2    1000000000   0.005241 ns/op    0 B/op   0 allocs/op
BenchmarkNginxConfig/../testdata/configs/bigger/2k.conf-2    1000000000   0.01087 ns/op     0 B/op   0 allocs/op
BenchmarkNginxConfig/../testdata/configs/bigger/3k.conf-2    1000000000   0.01492 ns/op     0 B/op   0 allocs/op
BenchmarkNginxConfig/../testdata/configs/bigger/10k.conf-2   1000000000   0.05086 ns/op     0 B/op   0 allocs/op
BenchmarkGetConfigFiles/GetConfigFiles-2                     1000000000   0.0004904 ns/op   0 B/op   0 allocs/op
BenchmarkGetConfigFiles/GetConfigFiles#01-2                  1000000000   0.0007442 ns/op   0 B/op   0 allocs/op
BenchmarkGetConfigFiles/GetConfigFiles#02-2                  1000000000   0.0009297 ns/op   0 B/op   0 allocs/op
BenchmarkGetConfigFiles/GetConfigFiles#03-2                  1000000000   0.002607 ns/op    0 B/op   0 allocs/op
```
Config Zip and Unzip Benchmarks:
```text
BenchmarkZipConfig/Zip_config../testdata/configs/bigger/1k.conf-2    1000000000   0.0001852 ns/op   0 B/op   0 allocs/op
BenchmarkZipConfig/Zip_config../testdata/configs/bigger/2k.conf-2    1000000000   0.0000663 ns/op   0 B/op   0 allocs/op
BenchmarkZipConfig/Zip_config../testdata/configs/bigger/3k.conf-2    1000000000   0.0000674 ns/op   0 B/op   0 allocs/op
BenchmarkZipConfig/Zip_config../testdata/configs/bigger/10k.conf-2   1000000000   0.0000672 ns/op   0 B/op   0 allocs/op
BenchmarkUnZipConfig/#00-2                                           1000000000   0.0000193 ns/op   0 B/op   0 allocs/op
BenchmarkUnZipConfig/#01-2                                           1000000000   0.0000132 ns/op   0 B/op   0 allocs/op
BenchmarkUnZipConfig/#02-2                                           1000000000   0.0000118 ns/op   0 B/op   0 allocs/op
BenchmarkUnZipConfig/#03-2                                           1000000000   0.0000140 ns/op   0 B/op   0 allocs/op
PASS
ok command-line-arguments 17.727s
```
- NGF can handle frequent configuration changes (1 change per second)
- NGF can handle large configurations:
- 5000 server blocks
- 64 TLS certs/keys
- 50 JWT keys
- 50 TLS cert/keys for egress
- 50 CA certs
- 50 basic auth files
- 50 OIDC secrets
- NGF can scale to X number of data plane pods (we need to figure out what X is)