
Block a Pod's IP packets until its NetworkPolicies are realized #5698

Closed

Conversation

tnqn
Member

@tnqn tnqn commented Nov 13, 2023

In the previous implementation, traffic from/to a Pod may bypass NetworkPolicies applied to the Pod in a short time window when the agent restarts because realizing NetworkPolicies and enabling forwarding are asynchronous.

However, we can't wait for all NetworkPolicies to be realized before enabling forwarding in OVS because there are cases where the former depends on the latter, for example, when proxyAll is enabled, or when it's a Windows Node, in which cases control-plane communication relies on OVS forwarding.

This patch takes a more fine-grained approach: block a Pod's IP packets in the NetworkPolicy entry tables until its NetworkPolicies are realized. This granularity leaves the Node's and hostNetwork Pods' traffic untouched, and makes a realization issue with a Pod's NetworkPolicies affect only that Pod's IP packets.

The following changes are made to implement the approach:

  1. EgressSecurityClassifierTable is now always required (previously it was only required for ExternalNode, not K8sNode).
  2. One flow with low priority dropping traffic from local Pods is installed in EgressSecurityClassifierTable, and one flow with low priority dropping traffic to local Pods is installed in IngressSecurityClassifierTable.
  3. When a Pod's NetworkPolicies are fully realized for the first time, one flow with normal priority allowing traffic from this Pod is installed in EgressSecurityClassifierTable to override the above drop action, and one flow in IngressSecurityClassifierTable does the same for traffic to this Pod.

Depends on #5697
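
For illustration only, here is a minimal sketch of the gating idea in Go, using hypothetical table names, priorities, and types (not the actual Antrea OpenFlow code): the two low-priority drop flows are installed unconditionally, and the per-Pod allow flows are installed only once the Pod's NetworkPolicies have been realized.

```go
package main

import (
	"fmt"
	"net"
)

// Flow is a simplified stand-in for an OVS flow entry.
type Flow struct {
	Table    string
	Priority int
	Match    string
	Action   string
}

// defaultDropFlows are installed once per Node: traffic from/to local Pods that
// reaches the security classifier tables is dropped unless a higher-priority
// per-Pod allow flow exists.
func defaultDropFlows() []Flow {
	return []Flow{
		{Table: "EgressSecurityClassifierTable", Priority: 100, Match: "from local Pods", Action: "drop"},
		{Table: "IngressSecurityClassifierTable", Priority: 100, Match: "to local Pods", Action: "drop"},
	}
}

// podAllowFlows are installed the first time all of a Pod's NetworkPolicies have
// been realized, overriding the default drop for this Pod's IP only.
func podAllowFlows(podIP net.IP) []Flow {
	return []Flow{
		{Table: "EgressSecurityClassifierTable", Priority: 200, Match: fmt.Sprintf("nw_src=%s", podIP), Action: "goto next table"},
		{Table: "IngressSecurityClassifierTable", Priority: 200, Match: fmt.Sprintf("nw_dst=%s", podIP), Action: "goto next table"},
	}
}

func main() {
	fmt.Println(defaultDropFlows())
	fmt.Println(podAllowFlows(net.ParseIP("10.10.0.5")))
}
```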

@tnqn tnqn changed the title from "Block Pod traffic until its NetworkPolicies are realized" to "Block a Pod's IP packets until its NetworkPolicies are realized" on Nov 14, 2023
@tnqn tnqn force-pushed the ensure-flow-synced branch 3 times, most recently from 67b4cb8 to 200b422, on November 15, 2023 11:19
@tnqn tnqn marked this pull request as ready for review November 15, 2023 14:05
@tnqn tnqn added the action/backport (Indicates a PR that requires backports.) and action/release-note (Indicates a PR that should be included in release notes.) labels Nov 15, 2023
@antoninbas
Contributor

I haven't reviewed the code in detail yet, but I had one question.

However, we can't wait for all NetworkPolicies to be realized before enabling forwarding of OVS because there are some cases the former depends on the latter, for example, when proxyAll is enabled, or when it's a Windows Node, in which cases control-plane communication relies on the forwarding of OVS.

Historically, we have claimed that Pod network setup has no dependency on the Antrea Controller. In other words, if the Antrea Controller is down, the Antrea Agent can still connect the Pod to the Pod network, and the Pod will have IP connectivity. Our architecture doc even has the following statement:

At the moment, Antrea Controller mainly exists for NetworkPolicy implementation. If you only care about connectivity between Pods but not NetworkPolicy support, you may choose not to deploy Antrea Controller at all.

We still today present this as an advantage of the Antrea architecture, when we talk about the project.

So I wanted to double check that this proposed change doesn't invalidate that statement. If the core idea still holds (Antrea Controller isn't required to provide connectivity, and NP realization is more of a local concept here), then I have no objection. If this PR changes that, maybe we should have a discussion about whether this should be configuration-based.

@tnqn
Member Author

tnqn commented Nov 16, 2023

I haven't reviewed the code in detail yet, but I had one question.

However, we can't wait for all NetworkPolicies to be realized before enabling forwarding of OVS because there are some cases the former depends on the latter, for example, when proxyAll is enabled, or when it's a Windows Node, in which cases control-plane communication relies on the forwarding of OVS.

Historically, we have claimed that Pod network setup has no dependency on the Antrea Controller. In other words, if the Antrea Controller is down, the Antrea Agent can still connect the Pod to the Pod network, and the Pod will have IP connectivity. Our architecture doc even has the following statement:

At the moment, Antrea Controller mainly exists for NetworkPolicy implementation. If you only care about connectivity between Pods but not NetworkPolicy support, you may choose not to deploy Antrea Controller at all.

We still today present this as an advantage of the Antrea architecture, when we talk about the project.

So I wanted to double check that this proposed change doesn't invalidate that statement. If the core idea still holds (Antrea Controller isn't required to provide connectivity, and NP realization is more of a local concept here), then I have no objection. If this PR changes that, maybe we should have a discussion about whether this should be configuration-based.

The situation is that the statement is already stale, and the proposed change invalidates it further. But we can make some changes to make the statement hold again.

Yes, from the connectivity perspective, antrea-controller is technically not a must-have (except in AntreaIPAM mode). However, when NetworkPolicy is in use, I think it's wrong to treat it as best-effort and to declare it expected behavior that an already realized NetworkPolicy may be bypassed when the agent restarts. Transient packet loss is in the nature of networking and something applications can tolerate, but degraded security is not. If this principle is agreed on, then we have to make realization of NetworkPolicies a logical precondition of connectivity.

Even though the statement is there, I feel there is no real use case that doesn't deploy antrea-controller, because the connection with antrea-controller has become a check of antrea-agent's readiness since v1.0.0 (by #1946). The agents wouldn't appear as ready if there is no antrea-controller.

It's still possible to achieve what the statement describes, but first we need to make NetworkPolicy explicitly optional. Users could choose to disable the whole NetworkPolicy function and not deploy antrea-controller. Then antrea-agent wouldn't initialize NetworkPolicyController, wouldn't make it a readiness check, and the flows to block IP packets in the NetworkPolicy tables wouldn't be there.

@antoninbas
Contributor

Transient packet loss is in the nature of networking and something applications can tolerate, but degraded security is not. If this principle is agreed on, then we have to make realization of NetworkPolicies a logical precondition of connectivity.

Yes, I do think that overall this is an improvement to the implementation. Can you remind me how we define "NetworkPolicies are realized"? Which set of NetworkPolicies are we considering in this case? Is there any specific guarantee, or does it just mean that the Pod has been processed "once" by the Antrea Controller?

Even though the statement is there, I feel there is no real use case that doesn't deploy antrea-controller

I tend to agree. IMO, the design was really more about having the CNI implementation (implementation of CNI ADD / DEL) independent of a centralized control-plane, not about the ability to deploy Antrea without the controller. We still have this today (CNI ADD is handled locally, and doesn't require communication with the controller), unlike ovn-kubernetes for example. Even then, it's unclear if there is any tangible benefit, besides a simpler architecture.
With this change, we do add tighter coupling since connectivity won't be enabled if there is a transient issue with the connection to the controller.

It's still possible to achieve what the statement describes, but first we need to make NetworkPolicy optional explicitly.

I don't know if it's worth it. We haven't had any user request for this, and as you said the coupling exists since Antrea v1.0. We should update the design doc though (and any other possible reference to this in the rest of the documentation).
We could also provide an option to disable the new behavior (installation of flows to block IP forwarding), to preserve the current behavior. But I don't think there would be any scenario in which it would make sense for a user to disable this behavior upfront. If there is a transient issue with the controller, it should resolve fast. And if it doesn't resolve fast, there is an underlying issue that should be addressed. Also, if the controller is temporarily down and policy changes cannot be handled, this proposed change would not impact existing workloads AFAIK.

It's still a bit of a compromise between security / correctness, and robustness. This change could theoretically increase the blast radius of a bug in the controller. If a single-tenant K8s user doesn't have any NetworkPolicies, this change would still affect them in a way. A bug in the controller could mean the inability to schedule workloads.

@tnqn
Member Author

tnqn commented Nov 16, 2023

Transient packet loss is in the nature of networking and something applications can tolerate, but degraded security is not. If this principle is agreed on, then we have to make realization of NetworkPolicies a logical precondition of connectivity.

Yes, I do think that overall this is an improvement to the implementation. Can you remind me how we define "NetworkPolicies are realized"? Which set of NetworkPolicies are we considering in this case? Is there any specific guarantee or it just means that the Pod has been processed "once" by the Antrea Controller?

In this PR, a Pod's NetworkPolicies are considered realized when all rules applied to it have been reconciled successfully:

func (c *StatusController) GetPodRealization(pod v1beta2.PodReference) bool {
	rules := c.ruleCache.getAppliedNetworkPolicyRules(pod.Namespace, pod.Name)
	for key := range rules {
		if _, exists, _ := c.realizedRules.GetByKey(key); !exists {
			klog.V(2).InfoS("Pod's NetworkPolicy was not realized", "Pod", klog.KRef(pod.Namespace, pod.Name), "rule", key)
			return false
		}
	}
	return true
}

We could also provide an option to disable the new behavior (installation of flows to block IP forwarding), to preserve the current behavior. But I don't think there would be any scenario in which it would make sense for a user to disable this behavior upfront. If there is a transient issue with the controller, it should resolve fast. And if it doesn't resolve fast, there is an underlying issue that should be addressed. Also, if the controller is temporarily down and policy changes cannot be handled, this proposed change would not impact existing workloads AFAIK.

It's still a bit of a compromise between security / correctness, and robustness. This change could theoretically increase the blast radius of a bug in the controller. If a single-tenant K8s user doesn't have any NetworkPolicies, this change would still affect them in a way. A bug in the controller could mean the inability to schedule workloads.

I feel an option to disable NetworkPolicy entirely makes more sense than an option to preserve the "best-effort" NetworkPolicy behavior, as users will need to configure something anyway, and the rationale is clearer: there is no need to add a failure point when they don't use any function it provides.

I should mention that the implementation only blocks a Pod's IP packets when the agent doesn't know at all what NetworkPolicies apply to the Pod (i.e. before the agent has fully synced with the controller), or when known NetworkPolicies haven't been realized successfully (which hardly ever happens today). It doesn't require the desired NetworkPolicies the agent has retrieved to be strictly up-to-date (which is perhaps impossible). The logic is that we can realize NetworkPolicies based on a snapshot from a point in time, but not based on empty data. It essentially prevents a regression of security for Pods when the agent upgrades or restarts.
In practice, Pod traffic won't be blocked as long as the agent has connected to the controller at least once (and retrieved a snapshot of the desired NetworkPolicies), even if the controller crashed afterwards for some reason. The worst case only happens when antrea-agent restarts while antrea-controller is not running and cannot get running.

@tnqn
Member Author

tnqn commented Nov 16, 2023

I'm thinking we could also keep a local copy of the desired NetworkPolicies in the host filesystem, which would be used as the source of NetworkPolicies before connecting to the controller successfully. It may resolve the problem, and perhaps we wouldn't need to block IP packets in this case, so it wouldn't increase the blast radius of a bug in the controller.

@rajnkamr
Contributor

In the previous implementation, traffic from/to a Pod may bypass NetworkPolicies applied to the Pod in a short time window when the agent restarts because realizing NetworkPolicies and enabling forwarding are asynchronous.

However, we can't wait for all NetworkPolicies to be realized before enabling forwarding of OVS because there are some cases the former depends on the latter, for example, when proxyAll is enabled, or when it's a Windows Node, in which cases control-plane communication relies on the forwarding of OVS.

This patch takes a more fine-grained approach: block a Pod's IP packets in the NetworkPolicy entry tables until its NetworkPolicies are realized. This granularity leaves the Node's and hostNetwork Pods' traffic untouched, and makes a realization issue with a Pod's NetworkPolicies affect only that Pod's IP packets.

The following changes are made to implement the approach:

  1. EgressSecurityClassifierTable is now always required. (Previously it's only required for ExternalNode, not K8sNode).
  2. One flow with low priority dropping traffic from local Pods is installed in EgressSecurityClassifierTable, and one flow with low priority dropping traffic to local Pods is installed in IngressSecurityClassifierTable.
  3. When a Pod's NetworkPolicies are fully realized the first time, one flow with normal priority allowing traffic from this Pod is installed in EgressSecurityClassifierTable to override the above drop action, one flow in IngressSecurityClassifierTable did the same for traffic to this Pod.

Depends on #5697

IMO, the design depends on the choice of security policy strategy for Pods; this new strategy effectively applies default-deny to Pods with respect to NetworkPolicy realization and the NetworkPolicy feature. However, it has its own limitations: if the antrea-agent fails to restart before realizing the policies for the first time, what is the failover strategy for Pods on different Nodes? Pod traffic will remain blocked until the agent recovers and realizes the NetworkPolicies at least once. It could happen that Pods on one Node remain in a default-deny state because of step 2 (with an agent start failure), while Pods on another Node are in a default-allow state (with steps 2 and 3 succeeding).
So it might be difficult to tell whether the default security behaviour is default-deny or default-allow.

@tnqn
Member Author

tnqn commented Nov 16, 2023

However, it has its own limitations: if the antrea-agent fails to restart before realizing the policies for the first time, what is the failover strategy for Pods on different Nodes? Pod traffic will remain blocked until the agent recovers and realizes the NetworkPolicies at least once. It could happen that Pods on one Node remain in a default-deny state because of step 2 (with an agent start failure), while Pods on another Node are in a default-allow state (with steps 2 and 3 succeeding).
So it might be difficult to tell whether the default security behaviour is default-deny or default-allow.

I'm not sure I get the point. The PR's point isn't about a default-deny/default-allow strategy, but about preventing the security breach that happens in the described situation (when an agent doesn't know the policies yet). On a Node where the agent has received the policies, it enforces them; on a Node where it hasn't, it doesn't enable forwarding, to be safe, because doing otherwise could be a security breach.

@tnqn
Member Author

tnqn commented Nov 16, 2023

I'm thinking we could also keep a local copy of the desired NetworkPolicies in the host filesystem, which would be used as the source of NetworkPolicies before connecting to the controller successfully. It may resolve the problem, and perhaps we wouldn't need to block IP packets in this case, so it wouldn't increase the blast radius of a bug in the controller.

@antoninbas I had a PoC of storing policies to local files: tnqn@05b7d69 (it's a draft, please ignore the code redundancy). I think it's simpler than the current PR and much less invasive to the datapath. As seen in the patch, it mainly changes cache.go: storing data to local files and using the local files to restore data upon start. If we reach a consensus on this direction, I plan to do two more enhancements:

  1. Ensure fresh data received from antrea-controller (if available) takes priority over the local data.
  2. Install Pod flows (via podConfigurator.reconcile) after NetworkPolicies have been realized, regardless of the source of the NetworkPolicies.

This could guarantee that

  1. Even if antrea-controller keeps crashing or is never deployed, Pod traffic won't be blocked when no policy has ever asked for that.
  2. Policies could be realized with a delay if there is something wrong with antrea-controller, which is also the current situation, but a security regression would never happen: what has been realized will continue to take effect despite restarts of antrea-agent.
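
For illustration, a minimal sketch of the file-backed cache idea (hypothetical types, paths, and helper names; not the PoC code itself): each object is persisted when its event is handled, and everything is restored on start, before the controller connection is established.

```go
package main

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// NetworkPolicy is a hypothetical, minimal stand-in for the controlplane objects
// the agent caches; the PoC stores AddressGroups, AppliedToGroups, and internal
// NetworkPolicies.
type NetworkPolicy struct {
	UID  string `json:"uid"`
	Name string `json:"name"`
}

// fileStore persists objects under a directory (e.g. a tmpfs path) so that they
// survive an agent restart.
type fileStore struct {
	dir string
}

// save writes one object on every add/update event.
func (s *fileStore) save(obj *NetworkPolicy) error {
	data, err := json.Marshal(obj)
	if err != nil {
		return err
	}
	// A rename-based atomic write may be needed in real code.
	return os.WriteFile(filepath.Join(s.dir, obj.UID), data, 0600)
}

// loadAll restores all stored objects, to seed the ruleCache on agent start.
func (s *fileStore) loadAll() ([]*NetworkPolicy, error) {
	entries, err := os.ReadDir(s.dir)
	if err != nil {
		return nil, err
	}
	var objs []*NetworkPolicy
	for _, e := range entries {
		data, err := os.ReadFile(filepath.Join(s.dir, e.Name()))
		if err != nil {
			return nil, err
		}
		obj := &NetworkPolicy{}
		if err := json.Unmarshal(data, obj); err != nil {
			return nil, err
		}
		objs = append(objs, obj)
	}
	return objs, nil
}

func main() {
	dir, _ := os.MkdirTemp("", "np-cache")
	defer os.RemoveAll(dir)
	s := &fileStore{dir: dir}
	_ = s.save(&NetworkPolicy{UID: "uid-1", Name: "allow-dns"})
	restored, _ := s.loadAll()
	_ = restored // fresh data from the controller would later take priority over this
}
```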

@Dyanngg
Contributor

Dyanngg commented Nov 16, 2023

it mainly changes cache.go: store data to local files and use local files to restore data upon start.

I think the solution makes a lot of sense to me except for two concerns:

  1. If I'm understanding this correctly, it used to be that when the agent restarts, the ruleCache is populated from scratch based on the addressGroup/appliedToGroup/internalNP events received from the controller, so we always get the desired state. If, on the other hand, the state is first restored from the local filesystem, will stale resources be handled correctly? e.g., will a group that was deleted during agent downtime trigger a DELETE event once the agent reconnects? Or are we simply replacing the ruleCache once it reconnects? I guess that falls into the "ensure fresh data received from antrea-controller (if available) takes priority over the local data" enhancement you were talking about.

  2. A file I/O for each and every ag/atg/internalNP update seems pretty expensive to me for this specific purpose, considering that by default there are only 4 workers syncing ruleCache updates, and the write is blocking. I'm a bit worried that reconciling policies at 1k+ scale will show degraded performance.

@antoninbas
Contributor

@tnqn I think I understand the Agent restart case pretty well now, but I may still be confused about the Pod creation case.
Initially, I thought that your patch would block traffic to / from the Pod until we have a chance to get updated NP objects from the controller (e.g., updated / new AppliedToGroups). However, based on your recent comments (and based on me finally looking at the code...), it doesn't seem to be the case? The following situation is totally possible, with either the original approach or the new approach (file-based caching):

  1. Pod network is configured, forwarding is blocked
  2. NP controller checks which rules, if any, apply to the Pod; none are found
  3. forwarding is enabled by installing new higher priority flows
  4. NP controller receives new / updated ATGs from the antrea-controller, for which the new Pod is a member
  5. Rules are realized in the datapath

So to clarify: this PR is about preventing regressions in NP enforcement (if a NP was ever enforced and the user hasn't updated it, it will always be enforced, or forwarding will not be permitted), not about providing guarantees for newly created Pods?

@tnqn
Member Author

tnqn commented Nov 17, 2023

1. If I'm understanding this correctly, it used to be that when the agent restarts, the ruleCache is populated from scratch based on the addressGroup/appliedToGroup/internalNP events received from the controller, so we always get the desired state. If, on the other hand, the state is first restored from the local filesystem, will stale resources be handled correctly? e.g., will a group that was deleted during agent downtime trigger a DELETE event once the agent reconnects? Or are we simply replacing the ruleCache once it reconnects? I guess that falls into the "ensure fresh data received from antrea-controller (if available) takes priority over the local data" enhancement you were talking about.

Yes, if the connection is successful, the fresh data should be honored. We can think of it as the agent disconnecting from the controller (without a restart): it will use the last data it got from the controller until a successful reconnection. The PoC code doesn't do so yet, but it's something I will add when creating a PR. Then, when the controller is fine, it will behave like before, while it will use the local files as the policy data when the controller is down.

2. A file I/O for each and every ag/atg/internalNP update seems pretty expensive to me for this specific purpose, considering that by default there are only 4 workers syncing ruleCache updates, and the write is blocking. I'm a bit worried that reconciling policies at 1k+ scale will show degraded performance.

This is a good point and I was going to test it too. I was not too worried about it because the I/O happens in the event handling process, which runs in parallel with the rule workers. It won't affect the efficiency of the workers but may slow down how fast events are delivered to the workers. But I assume writing to a file is faster than realizing a rule via OVS (the producer is faster than the consumer), especially as the file directory is in tmpfs (/run/). But I could run a benchmark to get some figures to validate it.

@tnqn
Member Author

tnqn commented Nov 17, 2023

So to clarify: this PR is about preventing regressions in NP enforcement (if a NP was ever enforced and the user hasn't updated it, it will always be enforced, or forwarding will not be permitted), not about providing guarantees for newly created Pods?

Yes, it's to prevent regressions in NP enforcement. Providing guarantees for newly created Pods is more complicated: it would require the controller to explicitly indicate even that a Pod doesn't have any policies applied, and a healthy controller would become an essential dependency of Pod connectivity; otherwise the agent can't know whether the Pod has no policies applied or has some that the agent hasn't received yet.

However, I think it matches what we declare: without antrea-controller, antrea-agent will still enforce the policies it has received, but new changes to NetworkPolicies or Pods will not be reflected in the datapath. And there should be no real problem for newly created Pods in practice as long as antrea-controller is running fine, because we include a Pod in an AppliedToGroup right after it's created, before it gets a Pod IP, so the agent will see the Pod in the AppliedToGroup right after it gets scheduled to the Node, maybe even before its Pod interface gets created.

If we want to provide a strict guarantee for newly created Pods, I think we should have a separate PR, and perhaps make it controlled by a config option which is not enabled by default, due to the side effect that Pod connectivity will rely on antrea-controller explicitly indicating there are no policies applied to the Pod.

@tnqn
Member Author

tnqn commented Nov 17, 2023

  2. A file I/O for each and every ag/atg/internalNP update seems pretty expensive to me for this specific purpose, considering that by default there are only 4 workers syncing ruleCache updates, and the write is blocking. I'm a bit worried that reconciling policies at 1k+ scale will show degraded performance.

This is a good point and I was going to test it too. I was not too worried about it because the I/O happens in the event handling process, which runs in parallel with the rule workers. It won't affect the efficiency of the workers but may slow down how fast events are delivered to the workers. But I assume writing to a file is faster than realizing a rule via OVS (the producer is faster than the consumer), especially as the file directory is in tmpfs (/run/). But I could run a benchmark to get some figures to validate it.

The following numbers should confirm the extra cost is not going to be a problem:

  1. For a single NetworkPolicy event:
  • It took 5,545 ns/op to add the event to cache without writing to file
  • It took 23,651 ns/op to add the event to cache with writing to file

Although it adds an 18µs delay per event, that doesn't compare with the time spent realizing a rule, which is at the ms level. After the event handler adds 1k policies to the cache and writes them to the filesystem, which may take around 23ms, each worker may have realized just 1 or 2 rules.

networkpolicy_controller.go:619] "Finished syncing rule" ruleID="8e9c56acc8035b5d" duration="14.557197ms"
networkpolicy_controller.go:619] "Finished syncing rule" ruleID="1a0cc0f70b144dbe" duration="18.110148ms"
  2. For a bunch of policies, appliedToGroups and addressGroups at a scale of 1,000 policies (applied to the Node), along with 1,000 appliedToGroups, each of which contains 100 Pods, and 1,000 addressGroups, 10 of which contain 10,000 Pods, 100 of which contain 1,000 Pods, and 890 of which contain 100 Pods:
  • It took 74,574,402 ns/op without writing to file
  • It took 357,756,521 ns/op with writing to file

The extra 283ms should be fine as the workers can't process them in time anyway.

  3. After storing them as files, the space costs at the above scale are as below. I think this is not going to be a problem either, given that this is perhaps already the maximum scale a Kubernetes cluster can have; clusters in reality have far fewer policies applied to a Node's Pods, and the selectors don't usually select so many Pods.
# du -sh *
29M     address-groups
7.9M    applied-to-groups
4.0M    network-policies
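
For reference, the kind of Go micro-benchmark that could produce such ns/op figures might look roughly like the sketch below (hypothetical names and a simplified cache; not the actual benchmark behind these numbers).

```go
package networkpolicy

import (
	"encoding/json"
	"os"
	"path/filepath"
	"strconv"
	"testing"
)

// benchmarkAddPolicy measures handling one NetworkPolicy event, optionally
// persisting it to a file as the PoC does.
func benchmarkAddPolicy(b *testing.B, writeToFile bool) {
	dir := b.TempDir()
	cache := make(map[string][]byte)
	for i := 0; i < b.N; i++ {
		name := "np-" + strconv.Itoa(i)
		obj := map[string]string{"uid": name, "name": name}
		data, _ := json.Marshal(obj)
		cache[name] = data // in-memory cache update, as before
		if writeToFile {
			// The extra cost introduced by persisting the event.
			_ = os.WriteFile(filepath.Join(dir, name), data, 0600)
		}
	}
}

func BenchmarkAddPolicyNoFile(b *testing.B)   { benchmarkAddPolicy(b, false) }
func BenchmarkAddPolicyWithFile(b *testing.B) { benchmarkAddPolicy(b, true) }
```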

@antoninbas
Contributor

antoninbas commented Nov 17, 2023

@tnqn I think all these objects also support Protobuf? That may make a difference for file size and encoding / decoding performance. I know there is also encoding/gob, but last time I tried it, it was underwhelming for small-ish objects.

@tnqn
Member Author

tnqn commented Nov 20, 2023

@tnqn I think all these objects also support Protobuf? That may make a difference for file size and encoding / decoding performance. I know there is also encoding/gob, but last time I tried it, it was underwhelming for small-ish objects.

Good point. We can leverage the Scheme and Codec objects to encode and decode, which is handy for versioning as well. The results are as below:

| Format   | Execution Time | Size  |
| -------- | -------------- | ----- |
| Json     | 357ms          | 40.9M |
| Protobuf | 168ms          | 24M   |
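
As an illustration of that pattern, here is a sketch of encoding and decoding a controlplane object with the Protobuf serializer obtained from a CodecFactory (assuming the antrea.io/antrea/pkg/apis/controlplane/v1beta2 package path and that its types carry generated Protobuf support, as the agent-controller API already uses Protobuf; not the actual implementation).

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/serializer"

	cpv1beta2 "antrea.io/antrea/pkg/apis/controlplane/v1beta2"
)

func main() {
	// Build a Scheme that knows the controlplane v1beta2 types, then pick the
	// Protobuf serializer from the codec factory. Encoding through a versioned
	// encoder ties the stored bytes to an API version, which helps when a newer
	// agent has to decode data written by an older one.
	scheme := runtime.NewScheme()
	if err := cpv1beta2.AddToScheme(scheme); err != nil {
		panic(err)
	}
	codecs := serializer.NewCodecFactory(scheme)
	info, ok := runtime.SerializerInfoForMediaType(codecs.SupportedMediaTypes(), runtime.ContentTypeProtobuf)
	if !ok {
		panic("protobuf serializer not available")
	}
	encoder := codecs.EncoderForVersion(info.Serializer, cpv1beta2.SchemeGroupVersion)

	np := &cpv1beta2.NetworkPolicy{}
	np.Name = "example"
	data, err := runtime.Encode(encoder, np)
	if err != nil {
		panic(err)
	}
	fmt.Printf("encoded %d bytes\n", len(data))

	// The universal deserializer recognizes the Protobuf envelope and decodes
	// back into the registered type.
	obj, _, err := codecs.UniversalDeserializer().Decode(data, nil, nil)
	if err != nil {
		panic(err)
	}
	fmt.Printf("decoded %T\n", obj)
}
```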

@antoninbas
Contributor

@tnqn what are the next steps, will you be updating this PR with the new approach, or are you still working through some quirks?

@tnqn
Member Author

tnqn commented Nov 21, 2023

@tnqn what are the next steps, will you be updating this PR with the new approach, or are you still working through some quirks?

I'm polishing code and tests for the new approach, particularly how to deal with versioning in the future (e.g. when the API version upgrades to v1 while the stored data is in v1beta2). I plan to create another PR for the new approach and keep this one for reference.

@tnqn
Member Author

tnqn commented Nov 30, 2023

superseded by #5739

@tnqn tnqn closed this Nov 30, 2023