Track max pods, simplify warm IP pool management #2745
Conversation
return short > 0

func (c *IPAMContext) isDatastorePoolTooLow() (bool, *datastore.DataStoreStats) {
	stats := c.dataStore.GetIPStats(ipV4AddrFamily)
	// If max pods has been reached, pool is not too low
If max pods has been reached, pool is not too low
I think this comment is a bit confusing, as max pods is mentioned without any context; I got confused while reading it. If we write it as if c.maxPods <= stats.TotalIPs, then it seems alright. Here we are saying that the node's maxPods is less than or equal to the total IPs available in the datastore, and so isDatastorePoolTooLow is false.
Yeah, what this function checks is a bit confusing. isDatastorePoolTooLow looks at the number of currently available IPs and compares that number to the warm/minimum targets. If there are not enough available IPs, we return true. The maxPods check here is short-circuiting that logic by saying, "forget about warm targets, do not allocate any more IPs if we have already allocated more than maxPods' worth".
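To make the short-circuit concrete, here is a minimal, self-contained sketch, not the actual ipamd code: `DataStoreStats`, `AvailableAddresses`, `isPoolTooLow`, and `warmIPTarget` are simplified stand-ins for the real types, showing only how a max-pods check can override the warm-target comparison.

```go
package main

import "fmt"

// DataStoreStats is a simplified stand-in for the real datastore stats;
// only the fields needed for this sketch are included.
type DataStoreStats struct {
	TotalIPs    int // IPs attached to the node (assigned + cooling + free)
	AssignedIPs int // IPs currently assigned to pods
}

// AvailableAddresses returns the IPs that could still be handed to pods.
func (s DataStoreStats) AvailableAddresses() int {
	return s.TotalIPs - s.AssignedIPs
}

// isPoolTooLow sketches the short-circuit discussed above: if the node
// already holds at least maxPods worth of IPs, the warm targets are
// ignored and the pool is never reported as "too low", so no further
// allocation happens. Otherwise the usual warm-target comparison applies.
func isPoolTooLow(stats DataStoreStats, maxPods, warmIPTarget int) bool {
	if stats.TotalIPs >= maxPods {
		// Max pods reached: do not grow the pool, regardless of warm targets.
		return false
	}
	short := warmIPTarget - stats.AvailableAddresses()
	return short > 0
}

func main() {
	stats := DataStoreStats{TotalIPs: 14, AssignedIPs: 13}
	fmt.Println(isPoolTooLow(stats, 14, 5)) // false: at max pods, warm target ignored
	fmt.Println(isPoolTooLow(stats, 29, 5)) // true: below max pods and short of warm target
}
```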
I understood it better when thinking of it like this: c.maxPods <= stats.TotalIPs. Because we are passing stats as the argument, I think you went with the validation stats.TotalIPs >= c.maxPods, explaining that we have more than maxPods' worth.
Sounds good to me.
LGTM.
@@ -499,17 +500,26 @@ func (c *IPAMContext) nodeInit() error {
	}, 30*time.Second)
}

// Make a k8s client request for the current node so that max pods can be derived
node, err := k8sapi.GetNode(ctx, c.k8sClient)
We should ideally remove the need for k8sClient usage unless a feature like custom networking needs it. We can enhance CNINode to carry this information. /cc @haouc
Yeah, I thought about that, but we already have a non-optional k8s client call for the node when we attach an ENI - to check if the node is currently marked as schedulable. It would be nice to have this node state available in CNINode, but that would still require a k8s client call to retrieve.
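For reference, a minimal client-go sketch of how max pods could be read from the node object. The `k8sapi.GetNode` helper shown in the diff is the project's own wrapper; `getMaxPods` and the `MY_NODE_NAME` environment variable below are illustrative assumptions, not the PR's implementation.

```go
package main

import (
	"context"
	"fmt"
	"os"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// getMaxPods fetches the node and reads its pod capacity. The kubelet
// publishes max pods under the node's "pods" capacity/allocatable resource.
func getMaxPods(ctx context.Context, client kubernetes.Interface, nodeName string) (int64, error) {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return 0, err
	}
	pods, ok := node.Status.Capacity[v1.ResourcePods]
	if !ok {
		return 0, fmt.Errorf("node %s has no pods capacity", nodeName)
	}
	return pods.Value(), nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// MY_NODE_NAME is assumed to be injected into the pod spec via the downward API.
	maxPods, err := getMaxPods(context.Background(), client, os.Getenv("MY_NODE_NAME"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("max pods:", maxPods)
}
```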
Sure. With secondary IPs, we would never go over max pods, since max pods takes secondary IPs into account, and once max pods is reached, new IPs pretty much cannot be assigned since the ENIs will be full. This would be beneficial for prefix delegation, but it is still tricky since allocation happens in chunks of 16.
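A small sketch of that chunking issue: with IPv4 prefix delegation, addresses arrive in /28 prefixes of 16 IPs, so any cap has to be rounded up to whole prefixes and cannot be exact. `prefixesNeeded` is an illustrative helper, and 110 is just an example max-pods value.

```go
package main

import "fmt"

const prefixSize = 16 // a /28 prefix carries 16 addresses

// prefixesNeeded rounds a target IP count up to whole prefixes, which is why
// a max-pods cap cannot be exact in prefix delegation mode: the last prefix
// may carry more addresses than the cap allows.
func prefixesNeeded(targetIPs int) int {
	return (targetIPs + prefixSize - 1) / prefixSize
}

func main() {
	maxPods := 110
	prefixes := prefixesNeeded(maxPods)
	fmt.Printf("max pods %d -> %d prefixes -> %d IPs attached (%d over the cap)\n",
		maxPods, prefixes, prefixes*prefixSize, prefixes*prefixSize-maxPods)
}
```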
Right, that's why we don't enforce a hard check on increase (or decrease). We simply do not increase, even when the warm targets tell us to, once max pods has been reached.
Also, in PD mode: let's say we have WARM_PREFIX set to 1 and the node is already at max pods. With this change, we will not be honoring the warm prefix in this state, as the node is already at capacity w.r.t. maxPods. If the user now churns the pods on this node (say, a rollout restart of all deployments), IPs assigned to recently deleted pods can't be reused for the next 30s, so the new pods will not have any IPs and we would need to pull in a new prefix anyway? The current workflow would have assigned an additional prefix, the new pods would have consumed IPs from the additional prefix, and we would have released the old prefix after the cooldown period. Did I understand the new flow right?
Your understanding is correct. Over-provisioning could help in a case where the user restarts all pods at once. I do not think that is a case that we should design for, though. The downsides of over-provisioning (cost and resource utilization) seem much higher than slower recovery after restarting all pods, which should be an extremely rare event.
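For context on the churn scenario, a minimal sketch of the cooldown behavior described above, where a freed IP is not reassignable until roughly 30 seconds have passed. `addressInfo`, `assignable`, and `coolingPeriod` are illustrative names, not the real datastore types.

```go
package main

import (
	"fmt"
	"time"
)

// addressInfo tracks when an IP was last unassigned; the real datastore keeps
// richer state, this is only enough to show the cooldown idea.
type addressInfo struct {
	ip           string
	unassignedAt time.Time
}

// coolingPeriod mirrors the ~30s window mentioned above during which a freed
// IP is not handed to a new pod.
const coolingPeriod = 30 * time.Second

// assignable reports whether the address can be given to a new pod.
func assignable(a addressInfo, now time.Time) bool {
	return now.Sub(a.unassignedAt) >= coolingPeriod
}

func main() {
	freed := addressInfo{ip: "10.0.1.25", unassignedAt: time.Now().Add(-10 * time.Second)}
	fmt.Println(assignable(freed, time.Now()))                      // false: still cooling down
	fmt.Println(assignable(freed, time.Now().Add(25*time.Second))) // true: cooldown elapsed
}
```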
What type of PR is this?
enhancement
Which issue does this PR fix:
N/A
What does this PR do / Why do we need it:
I set out trying to do two things in this PR:
1. Track max pods and stop increasing the IP pool beyond max-pods, so as not to waste VPC IPs.
2. Simplify warm IP pool management.

On point 1, I partially succeeded, as we now do not increase the datastore pool when more than max-pods IPs are allocated to the node. This check cannot be exact, since IPs can be allocated in chunks (prefix mode or "full-ENI" mode). Note that this check only needs to be enforced on the allocation side, not the deallocation side.

On point 2, I had very limited success, as the warm IP pool management logic is a behemoth. I made some simplifications, primarily around limiting how many times we calculate DataStoreStats. Calls to GetIPStats acquire a datastore lock that can slow down pod IP acquisition (see the lock-contention sketch below). The conclusion I came to is that the warm pool logic has to be significantly refactored and broken up.

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
N/A
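As referenced above, a minimal sketch of why repeated stats calls matter: when one mutex guards both pod IP assignment and the stats scan, every extra stats call competes with the assignment path for the same lock. The `dataStore`, `AssignPodIP`, and `Stats` names below are simplified stand-ins, not the real implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// dataStore is a minimal stand-in: one mutex guards both pod IP assignment
// and the stats scan, so every extra stats call delays concurrent assignments.
type dataStore struct {
	mu       sync.Mutex
	total    int
	assigned int
}

// AssignPodIP hands out the next IP, if any remain, under the lock.
func (ds *dataStore) AssignPodIP() (int, bool) {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	if ds.assigned >= ds.total {
		return 0, false
	}
	ds.assigned++
	return ds.assigned, true
}

// Stats reads the counters under the same lock; calling it repeatedly in the
// pool-management loop is what this PR tries to limit.
func (ds *dataStore) Stats() (total, assigned int) {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	return ds.total, ds.assigned
}

func main() {
	ds := &dataStore{total: 3}
	if _, ok := ds.AssignPodIP(); ok {
		total, assigned := ds.Stats()
		fmt.Printf("assigned %d of %d IPs\n", assigned, total)
	}
}
```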
Testing done on this change:
Manually verified that all integration tests pass after this change. Also verified that when max pods is set to a low value, excess IPs are not allocated.
Will this PR introduce any new dependencies?:
No
Will this break upgrades or downgrades? Has updating a running cluster been tested?:
No, Yes
Does this change require updates to the CNI daemonset config files to work?:
No
Does this PR introduce any user-facing change?:
Yes
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.