
Use CRI to obtain pod sandbox IDs instead of Kubernetes API #714

Merged
merged 8 commits into aws:release-1.6 from drakedevel:cri-support
Dec 4, 2019

Conversation

drakedevel
Contributor

drakedevel commented Nov 12, 2019

Issue #, if available: #712

Description of changes:
Reverts #371, and then replaces the use of the Docker API with the CRI. The revert is in a separate commit, to make it easier to see my changes.

Work in progress: I've done basic testing to ensure the values returned by pkg/cri are reasonable, unit tests pass, etc. I've tested it briefly on a Docker staging cluster, and for several days on a CRI-O staging cluster, and all is smooth so far. There are still a couple of edge cases and TODOs to think through.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@drakedevel
Contributor Author

@mogren we've been running this on our staging cluster (on CRI-O) for a few days and it's been smooth sailing. I've manually verified that it works with Docker as well. There are a few more edge cases to think about before this is ready to review, but I did have a couple of questions for you:

  • I had originally thought it might be a good idea to expose the socket path as a parameter or an environment variable, but it occurred to me that that's kind of pointless, since the path is just a mountpoint. The current iteration of this PR assumes the socket is at /var/run/cri.sock, and the DaemonSet is responsible for mounting whichever socket is appropriate there (e.g. /var/run/dockershim.sock, /var/run/crio/crio.sock, etc.). The default deploy configs could mount in dockershim.sock for out-of-the-box compatibility with EKS. Do you think it's necessary to add more configurability on the ipamd side here?
  • Do you have any context on the UID collision checks in the Docker code? I'm trying to gauge whether those checks should get ported over to the CRI module or whether they were only relevant for direct Docker access. (I've ported them and marked them with a TODO in the current draft.)

@mogren
Contributor

mogren commented Nov 15, 2019

@drakedevel Hi, sorry for the slow reply, I'll go through this PR today.

I agree that it's fine to have the daemonset set that up. I'm also not sure about the Docker UID issue; @jaypipes, do you have some insight?

@mogren
Contributor

mogren commented Nov 15, 2019

@drakedevel I built and ran some tests using these changes, but I could only get pods to schedule on the first two ENIs that were already attached to the node. I triggered the issue by scheduling 150 pods on 3 m5.xlarge nodes that should be able to handle 56 pods each.

The logs had a lot of errors from the pods that failed to schedule, looking like:

2019-11-15T21:36:17.180Z [INFO] 	Send DelNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: unknown pod
2019-11-15T21:36:17.214Z [INFO] 	Received DelNetwork for IP <nil>, Pod hello-6c9c9df6cd-nc5lm, Namespace default, Container 450671e7175f31c4695c3658e1e2488db02123ac7b028cfc5cb6d29fb12efe88
2019-11-15T21:36:17.214Z [DEBUG] 	UnassignPodIPv4Address: IP address pool stats: total:14, assigned 0, pod(Name: hello-6c9c9df6cd-nc5lm, Namespace: default, Container 450671e7175f31c4695c3658e1e2488db02123ac7b028cfc5cb6d29fb12efe88)
2019-11-15T21:36:17.214Z [WARN] 	UnassignPodIPv4Address: Failed to find pod hello-6c9c9df6cd-nc5lm namespace default Container 450671e7175f31c4695c3658e1e2488db02123ac7b028cfc5cb6d29fb12efe88
2019-11-15T21:36:17.214Z [DEBUG] 	UnassignPodIPv4Address: IP address pool stats: total:14, assigned 0, pod(Name: hello-6c9c9df6cd-nc5lm, Namespace: default, Container )
2019-11-15T21:36:17.214Z [WARN] 	UnassignPodIPv4Address: Failed to find pod hello-6c9c9df6cd-nc5lm namespace default Container
2019-11-15T21:36:17.214Z [INFO] 	Send DelNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: unknown pod

I guess the issue is that I schedule 50 pods on each node but each node only has around 28 available IPs, so the first 28 pods schedule fine, the 29th gets an error back from ipamd because the data_store has no more free IPs, and then the kubelet tries to clean up 10 times. I switched back to v1.6.0-rc4, and then things worked just fine again. Will try to do some more tests.
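
(For context: an m5.xlarge supports up to 4 ENIs with 15 IPv4 addresses each, so a node with only 2 attached ENIs has 2 × (15 − 1) = 28 secondary IPs available for pods, which matches where scheduling stalled; with all 4 ENIs attached the ceiling would be 4 × (15 − 1) = 56, the per-node figure quoted above.)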

@drakedevel
Contributor Author

@mogren Thanks for trying it out! Did you cherry-pick onto release-1.6, or did you run this commit directly? (This branch is on top of master, specifically d73d118)

Our staging cluster is actually ENI-limited; every node in that cluster is using more than one ENI, and a handful have used every available IP. So I'm a little puzzled by that result :P

That rebase I did right before I pinged you was intended to pull in the integration test scripts with no functional changes (although I didn't end up getting those tests working), but I just realized d73d118 is itself a significant functional change that I missed while rebasing. I haven't tested that commit, and it isn't on the release-1.6 branch -- is it possibly related?

I'll spin up an EKS cluster to test with and see if I can repro either way.

@drakedevel
Contributor Author

@mogren Reproduced on an EKS cluster. Oddly, one of the nodes stalled at 14 pods, the other at 28, using 1 and 2 ENIs respectively (despite having all 3 in /enis). I confirmed it also reproduces with the exact image currently deployed on our staging cluster, so that rebase didn't change anything.

I tried deploying a build of 7eab704 (the original root of this branch on master) and actually got the same result! So I think master is just broken in default EKS, and this PR isn't involved (unless it somehow got the nodes into a bad state).

I don't have any theories why it works on our staging cluster 😛 The main differences are the Kubernetes version (1.16 on staging, 1.14 on EKS) and network config (our staging cluster uses ENIConfig CRDs).

I'll rebase this PR against release-1.6 for now to keep things easier to test.

drakedevel changed the base branch from master to release-1.6 November 16, 2019 00:43
@drakedevel
Contributor Author

Rebased, and the new image (v1.6.0-rc4-4-gf5daa9e8) doesn't show the issue when deployed to EKS. I had no trouble running 50 pods on each m5.xlarge node.

I'll roll that image out on our staging cluster to get it baking there.

There's not a lot of diff between release-1.6 and 7eab704; the only relevant functional changes appear to be c74841a and 1ee59a0.

@mogren
Contributor

mogren commented Nov 18, 2019

@drakedevel Thanks a lot for testing this. I'll do some more testing as well to try and figure out when this happens. I've been terminating the nodes completely between the tests. Next I'll try with just restarting the aws-node pods. Something is definitely going on here...

Btw, I tested by running a build of your commit.

@drakedevel
Contributor Author

drakedevel commented Nov 18, 2019

@mogren just to clarify in case my comments were confusing: I was able to reproduce the issue you described on this branch, but also on master unmodified. I've since rebased this branch onto release-1.6 and I can no longer reproduce that issue, so I think it's an issue with the state of master rather than this PR. (Either way the issue only reproduces on EKS for me -- both master and release-1.6 seem to work fine on our staging cluster.)

The build of this commit (f5daa9e) should be suitable for testing again now that it's been rebased.

@drakedevel
Contributor Author

I got some time today to dig into the open questions I had in this PR, and found some interesting results!

The main insight was discovering that the Docker GetRunningContainers code actually is a reasonably close match for the kubelet runtime GetPods function. That function uses the CRI itself, so it's an even closer analogy to what we're doing in this PR!

It seems like the main reason for the UID collision check (which is also present in GetPods) is that the sandbox can be re-created for a variety of reasons (e.g. if it dies, or possibly even in the case of static pod updates). The new sandbox will have the same Namespace, Name, and Uid metadata fields, and is only distinguished by the Attempt field, which is incremented by kubelet each time the sandbox is re-created. The old containers aren't cleaned up instantly, so if GetPods is called with its all parameter set to true, it's entirely reasonable for the pod to have multiple associated sandboxes, as long as all but one (or all) of them are dead.

We only care about living pods, though, since dead pods don't need IPs. It's still technically possible to get multiple ready sandboxes for the same pod from the CRI, but it appears to be very much an edge case. The easiest way to trigger it is to manually create a sandbox with the appropriate labels, but it may be possible under various sorts of CRI bugs (e.g. if it fails to actually kill a previous iteration of the sandbox). Kubelet has explicit handling for this case in the pod sync loop, and will respond by killing all of the ready sandboxes and making a fresh one.

I would say for now the best way to handle multiple ready sandboxes with the same UID on startup is to just crash -- it's an unusual case, and kubelet should be cleaning it up, so by the time we restart there's a good chance it'll be fixed. That's what the Docker code does (although the check for Name equality should really include a check on Namespace as well).

I haven't come up with any way besides a manually-crafted sandbox that would lead to two sandboxes having the same pod UID but different pod names/namespaces. I think the safest thing would still be to bail out in that case, since there's nothing sensible we can hope to make out of that state.
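
As a concrete illustration of that startup check, here is a minimal sketch (not the code in this PR): it assumes the k8s.io/cri-api v1alpha2 client types and a plain gRPC unix-socket dialer, and the function name listReadySandboxes is hypothetical.

package cri

import (
	"context"
	"fmt"
	"net"
	"time"

	"google.golang.org/grpc"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

// listReadySandboxes asks the CRI for pod sandboxes in the READY state and
// bails out if two ready sandboxes report the same pod UID.
func listReadySandboxes(socketPath string) ([]*runtimeapi.PodSandbox, error) {
	conn, err := grpc.Dial(socketPath, grpc.WithInsecure(),
		grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
			return net.DialTimeout("unix", addr, timeout)
		}))
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Only READY sandboxes matter here: dead sandboxes don't need IPs.
	resp, err := runtimeapi.NewRuntimeServiceClient(conn).ListPodSandbox(ctx,
		&runtimeapi.ListPodSandboxRequest{
			Filter: &runtimeapi.PodSandboxFilter{
				State: &runtimeapi.PodSandboxStateValue{
					State: runtimeapi.PodSandboxState_SANDBOX_READY,
				},
			},
		})
	if err != nil {
		return nil, err
	}

	// Multiple READY sandboxes sharing a pod UID is the edge case discussed
	// above; kubelet's sync loop should clean it up, so return an error and
	// let the next start see a consistent state.
	seen := make(map[string]string) // pod UID -> sandbox ID
	for _, sb := range resp.Items {
		if otherID, ok := seen[sb.Metadata.Uid]; ok {
			return nil, fmt.Errorf("pod UID %s has multiple ready sandboxes (%s, %s)",
				sb.Metadata.Uid, otherID, sb.Id)
		}
		seen[sb.Metadata.Uid] = sb.Id
	}
	return resp.Items, nil
}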

Future work

I think this provides more evidence in favor of the strategy to avoid the Kubernetes API entirely for pod details.

IPAMD gets told that a sandbox is being created/destroyed, and allocates/frees an IP in response. On startup, it checks for any changes that it missed notifications for and does the same thing. Everything needed to support this is available from the CRI, as I mentioned in the issue thread.

What's clearer now is that there's significant complexity we're exposed to because we try to think about the sandboxes in Pod terms. Sandboxes aren't 1:1 with pods, but if we were oblivious to pods as a concept, we wouldn't have to care. We could leave the job of reconciling the CRI with the Kubernetes API to kubelet 😛

Concretely, that would mean treating the sandbox ID as the single identity to associate with an IP in the datastore, and treating any information like pod name/namespace/uid as a best-effort logging and debugging aid. It should simplify things a lot. And, while I'm sure it's not a high priority goal (if it's a goal at all), a free side-effect of that refactor would be that this CNI would work in non-Kubernetes environments (minus custom network config).
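
To make that concrete, here's a rough sketch of what a sandbox-keyed datastore could look like (purely illustrative names, not the existing ipamd/datastore API):

package datastore

import (
	"fmt"
	"net"
	"sync"
)

// SandboxMeta is best-effort pod metadata kept only for logging and
// debugging; it plays no part in identity.
type SandboxMeta struct {
	PodName      string
	PodNamespace string
}

type assignment struct {
	IP   net.IP
	Meta SandboxMeta
}

// DataStore keys every assignment by sandbox ID alone, so a re-created
// sandbox simply looks like one sandbox going away and another arriving.
type DataStore struct {
	mu       sync.Mutex
	assigned map[string]assignment
}

func New() *DataStore {
	return &DataStore{assigned: make(map[string]assignment)}
}

// AssignIP records an IP for a sandbox (e.g. on CNI ADD).
func (d *DataStore) AssignIP(sandboxID string, ip net.IP, meta SandboxMeta) error {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, ok := d.assigned[sandboxID]; ok {
		return fmt.Errorf("sandbox %s already has an assigned IP", sandboxID)
	}
	d.assigned[sandboxID] = assignment{IP: ip, Meta: meta}
	return nil
}

// ReleaseIP frees the IP for a sandbox (e.g. on CNI DEL) and reports
// whether the sandbox was known.
func (d *DataStore) ReleaseIP(sandboxID string) (net.IP, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	a, ok := d.assigned[sandboxID]
	delete(d.assigned, sandboxID)
	return a.IP, ok
}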

@mogren
Contributor

mogren commented Nov 20, 2019

@drakedevel Thanks again for really researching this issue! We should definitely keep track of the namespace as well. I'll rebuild with these latest changes and run some more tests.

drakedevel marked this pull request as ready for review November 20, 2019 23:33
Contributor

jaypipes left a comment


Really good stuff, @drakedevel, appreciate this PR very much.

I have a couple minor nits and suggestions inline, but nothing major.

pkg/cri/cri.go Outdated
ipamd/ipamd.go Outdated
@@ -420,6 +423,27 @@ func (c *IPAMContext) getLocalPodsWithRetry() ([]*k8sapi.K8SPodInfo, error) {
return nil, nil
}

var containers map[string]*cri.ContainerInfo
Contributor


Might be nice to put a comment above this block that explains why you're doing this. Something like this might work...

// Ask CRI for the locally-running containers and attempt to set the
// Pod.Container field to the value of the *sandbox* container. This
// will ensure that we don't unassign IPs from Pods due to looking up
// Pods with the wrong container ID.

Contributor Author


Will do -- should hopefully be clearer once the sandbox rename is done, too.

Contributor Author


Worth noting that this code was largely pre-existing -- I restored it in the first commit with a revert, and modified it to use the cri package instead of docker, but didn't otherwise touch it. (e.g. the TODO was there before).

Added a comment and tidied up the other comment to be a bit clearer.

pkg/cri/cri.go Outdated
	return &Client{}
}

func (c *Client) GetRunningContainers() (map[string]*ContainerInfo, error) {
Contributor


nit: might be better to name this GetRunningSandboxContainers() to make it clearer at the calling point what this is actually doing. :)

Contributor Author


Good idea! For consistency with Kubernetes and the CRI (and thus less confusion hopefully), I think it's probably better to consistently refer to them as "pod sandboxes." The fact that a sandbox is a container under the hood is a common implementation choice but not universal: e.g. under Kata Containers a sandbox is a full-blown VM in which other containers run.

I'm renaming things now to use the pod sandbox terminology -- let me know if you'd prefer a different name.

@mogren
Contributor

mogren commented Nov 23, 2019

@drakedevel When I tried this change using docker, I saw that existing pods never got removed from the data_store when they were deleted. I haven't figured out why the container IDs don't match yet; I've been busy with some other issues. Will try to get back to testing ASAP.

@drakedevel
Contributor Author

Addressed CR comments, and included two small changes in separate commits as well:

  • I noticed that my revert of 14de538 was incomplete, and hadn't reverted the changes to pkg/k8sapi/discovery.go. This would have allowed the incorrect container ID to be added to the datastore if the real sandbox ID wasn't retrieved from the CRI. That could lead to an IP leak on restart if a pod is terminated while IPAMD is restarting.
  • I noticed that getLocalPodsWithRetry will give up on retrieving sandboxes from the CRI after the maximum attempts are exceeded, rather than propagating the error. Before the first change, that would have led to a datastore full of invalid sandbox IDs (leaking all IPs); after the first change, it will lead to an empty datastore (releasing all IPs). Neither of these is a good outcome. I changed it to propagate the error, which should also make misconfigurations (like an invalid CRI socket) much faster to detect (see the sketch after this list).
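
Here is a rough sketch of that retry-then-propagate pattern (a hypothetical helper, not the literal getLocalPodsWithRetry diff):

package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// retryOrFail runs op up to maxAttempts times and returns the last error
// instead of swallowing it, so something like a bad CRI socket path makes
// ipamd fail fast rather than proceeding with an empty datastore.
func retryOrFail(maxAttempts int, interval time.Duration, op func() error) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := op(); err != nil {
			lastErr = err
			log.Printf("attempt %d/%d failed: %v", attempt, maxAttempts, err)
			time.Sleep(interval)
			continue
		}
		return nil
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	err := retryOrFail(3, time.Second, func() error {
		// Stand-in for ListPodSandbox failing against an invalid socket.
		return errors.New("CRI socket not reachable")
	})
	fmt.Println(err) // the caller now sees the real failure
}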

I'm not sure what tests you were running, @mogren, but is it possible these are the cause of the issue you were seeing? The behavior you described is what I would expect to see if the tests ran without a valid cri.sock mount (and without these fixes). Notably, this PR doesn't include any changes to the deployment configs that would add that mount automatically (since there's no v1.6 directory yet).

Contributor

jaypipes left a comment


This is looking excellent, thanks very much @drakedevel. I left one question in there, but it's about a corner case and not something I would hold this PR up for.

The other question I had was around this:

- name: dockersock
  hostPath:
    path: /var/run/docker.sock

With use of CRI instead of Docker, would that still be necessary?


	log.Tracef("Update for pod %s: %+v, %+v", podName, pod.Status, pod.Spec)

	// Save pod info
	d.workerPods[key] = &K8SPodInfo{
		Name:      podName,
		Namespace: pod.GetNamespace(),
		Container: containerID,
Contributor


Can the Pod's K8S_POD_INFRA_CONTAINER_ID change ever? If so, we may need to update the K8SPodInfo.Sandbox attribute accordingly.

Contributor Author


I don't think we care about the Sandbox field here, specifically. The only use of this data in IPAMD is during nodeInit, where we extract a copy and then promptly overwrite the Sandbox field with data from the CRI. I'm pretty sure we never look at it again; I don't know why we even keep the watch active. (Per #714 (comment), there's really no reason for the K8S Pod API to be involved even as a one-shot.)

More generally, the sandbox ID for a Pod can change (more on that later), but the CNI/CRI side of things doesn't care. The ipamd/datastore PodKey store includes the sandbox ID, so we'll treat a sandbox change as a completely unrelated pod. It'll look exactly like one pod getting shut down and another pod getting started up. The fact that they were "the same pod" in some sense doesn't affect anything.

As to how a sandbox ID for a pod can change:

The key value here comes from MetaNamespaceKeyFunc, which just returns a namespace/name string. The sandbox ID associated with that can very easily change -- restarting a pod that's part of a StatefulSet will produce a new pod with the same namespace and name, but a different sandbox (and a different UID).
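
For reference, a minimal sketch of that keying behavior, assuming client-go's cache package (the pod values here are made up):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "hello-0", Namespace: "default"}}

	// MetaNamespaceKeyFunc only looks at namespace and name, so a restarted
	// StatefulSet pod (new UID, new sandbox) maps to the same key.
	key, err := cache.MetaNamespaceKeyFunc(pod)
	if err != nil {
		panic(err)
	}
	fmt.Println(key) // "default/hello-0"
}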

A less obvious way it can happen is if the sandbox gets re-created. This can happen if the sandbox process gets killed, as well as in some more obscure scenarios which I'm not sure how to trigger (e.g. pod spec changes, which are normally immutable but which there's code to handle anyway). In that case, you'll get the same name, namespace, and UID; the only change is the Attempt field in the CRI PodSandboxMetadata structure, which is not exposed in the Kubernetes API.

Because information about the sandbox is generally not exposed through the API, the only observable effects of a sandbox re-creation are that all of the normal containers restart. There's no way to distinguish that the cause of the restart was a sandbox change.

If we treat the CNI/CRI as the source of truth then the complexity around how and when Kubelet manages various obscure sync edge-cases just disappears completely. But per #712 (comment), saving that for a follow-up PR 😄

@mogren
Contributor

mogren commented Nov 26, 2019

@drakedevel Sorry for my late reply. Tested this last night, and scaling up worked with no issues this time. I did see some problems when changing the config and restarting the CNI in the middle of scaling up: the pod got stuck in a crash loop. I didn't get the logs since I was in the middle of something else, but I think the cause is the second change you mentioned earlier. I wonder why the sandbox ID is no longer matching...

@drakedevel
Contributor Author

@jaypipes

With use of CRI instead of Docker, would that still be necessary?

That socket is no longer necessary, but you'll need a CRI socket. We're using a manifest that looks like this:

...
          volumeMounts:
            ...
            - mountPath: /var/run/cri.sock
              name: crisock
...
      volumes:
        ...
        - name: crisock
          hostPath:
            path: /var/run/dockershim.sock
            type: Socket

That path on the host varies with the CRI in use; for CRI-O it's /var/run/crio/crio.sock, for containerd it's /var/run/containerd/containerd.sock, etc. (We use a custom Helm chart that makes a DaemonSet for each possible CRI with appropriate selectors so we can run a mixed system.) The type: Socket adds a safeguard that should prevent the pod from even starting if the socket path is invalid.

(The enhancement label was added to this pull request.)

I'm not sure of your PR label conventions, but based on the name I don't think enhancement applies here: this PR is a fix for #712, which is a priority/P0 bug. (Just mentioning in case that's used for prioritization anywhere.)

but I think the cause is the second change you mentioned earlier. I wonder why the sandbox ID is no longer matching...

@mogren The second change will cause crashes under three conditions (after several retries):

  • grpc.Dial on the CRI socket returns an error
  • gRPC call ListPodSandbox returns an error
  • ListPodSandbox returns two ready sandboxes with the same Metadata.UID field

All of these have different errors/logs so those logs would be very useful for debugging 🙂

The first two are most likely transient (if the CRI crashed or got restarted or something), so if either keeps happening it's probably caused by a misconfiguration of some sort, somewhere. The third would be very weird, and suggests a Kubelet bug if it persists, since I'm pretty sure the pod sync loop will correct that situation. It's also possible I misread or miswrote some code somewhere, so any logs/configs/etc. you can share would be very helpful in tracking the issue down 😛

This has been rock-solid on our CRI-O staging cluster, and I haven't been able to reproduce any issues on Docker while torture-testing on EKS, so I'm a little bit stuck on debugging without more to go on.

Contributor

mogren left a comment


@drakedevel Thanks a lot for all the work on this, looks great. So far I haven't seen any new issues, but I'll keep running tests before making the next v1.6.0 release candidate.

jaypipes added the bug and priority/P0 (Highest priority. Someone needs to actively work on this.) labels and removed the enhancement label Dec 4, 2019
Contributor

jaypipes left a comment


Awesome work on this, @drakedevel. Nudging this on its merry way (and changed from enhancement to priority/P0 bug fix)

jaypipes merged commit ad5ee62 into aws:release-1.6 Dec 4, 2019
mogren pushed a commit to mogren/amazon-vpc-cni-k8s that referenced this pull request Dec 4, 2019
* Revert "fix: get container ID from kube rather than docker (aws#371)"

This reverts commit 14de538.

* go mod tidy

* Add initial proof-of-concept implementation of CRI support.

* Update CRI socket path to /var/run/cri.sock

* Filter ready sandboxes and abort if there's a pod UID conflict.

* Revert "fix: get container ID from kube rather than docker (aws#371)"

This reverts commit 14de538.

* Address review comments, refactor to use pod sandbox nomenclature consistently.

* Bail if we can't retrieve local pod sandboxes on startup.
jaypipes pushed a commit that referenced this pull request Dec 5, 2019

* Revert "fix: get container ID from kube rather than docker (#371)"

This reverts commit 14de538.

* go mod tidy

* Add initial proof-of-concept implementation of CRI support.

* Update CRI socket path to /var/run/cri.sock

* Filter ready sandboxes and abort if there's a pod UID conflict.

* Revert "fix: get container ID from kube rather than docker (#371)"

This reverts commit 14de538.

* Address review comments, refactor to use pod sandbox nomenclature consistently.

* Bail if we can't retrieve local pod sandboxes on startup.
drakedevel deleted the cri-support branch December 6, 2019 03:50