Don't recreate resources as we're tearing a thing down. #2678

Closed
mattmoor opened this issue Dec 9, 2018 · 8 comments
Labels: area/API, area/test-and-release, kind/bug, kind/cleanup

mattmoor commented Dec 9, 2018

We recently started registering DeleteFunc handlers for our resources, which enables us to quickly recreate resources if they are deleted out from under us.

A byproduct of this (still needs confirmation) is that I noticed that sometimes when a Revision is being torn down, right at the moment the Deployment's Pod starts Terminating, another Pod (under a different ReplicaSet) appears. At first I thought this was us updating the Deployment (e.g. #2632), but our logs clearly show us creating two Deployments:

[screenshot: controller logs showing two Deployment creations]

I think what's happening is that we Reconcile the Revision when it has a DeletionTimestamp (mid delete) and see no Deployment, so we recreate it.

cc @lichuqiang @dgerd
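
If that theory holds, the fix would be to make Reconcile sensitive to the tombstone. A minimal sketch of such a guard, assuming upstream apimachinery types (the helper name is hypothetical, not the actual knative/serving code):

```go
package revision

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isBeingDeleted reports whether the object has been tombstoned for deletion.
// Foreground GC sets metadata.deletionTimestamp before dependents are reaped,
// so a reconciler that recreates children should early-return when this is true.
func isBeingDeleted(obj metav1.Object) bool {
	return obj.GetDeletionTimestamp() != nil
}
```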

@knative-prow-robot added the area/API, area/test-and-release, kind/bug, and kind/cleanup labels Dec 9, 2018

dgerd commented Dec 10, 2018

Your explanation makes the most sense to me. Is this reproducible all the time? If so, it should be easy to add some instrumentation to confirm. Happy to take a stab at doing this. For reference, this is the reconciler path that would do the recreation:

```go
ns := rev.Namespace
deploymentName := resourcenames.Deployment(rev)
logger := logging.FromContext(ctx).With(zap.String(logkey.Deployment, deploymentName))
deployment, err := c.deploymentLister.Deployments(ns).Get(deploymentName)
if apierrs.IsNotFound(err) {
	// Deployment does not exist. Create it.
	rev.Status.MarkDeploying("Deploying")
	deployment, err = c.createDeployment(ctx, rev)
	if err != nil {
		logger.Errorf("Error creating deployment %q: %v", deploymentName, err)
		return err
	}
	logger.Infof("Created deployment %q", deploymentName)
}
```
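
One low-risk way to confirm would be a warning right at the top of that IsNotFound branch (a sketch reusing logger, rev, and deploymentName from the snippet above):

```go
// Instrumentation sketch: record whether we are about to recreate the
// Deployment for a Revision that is already mid-delete.
if rev.GetDeletionTimestamp() != nil {
	logger.Warnf("Recreating deployment %q for a Revision tombstoned at %v",
		deploymentName, rev.GetDeletionTimestamp())
}
```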


dgerd commented Dec 20, 2018

/milestone Serving 0.4

@knative-prow-robot knative-prow-robot added this to the Serving 0.4 milestone Dec 20, 2018

dgerd commented Dec 21, 2018

/assign @dgerd


dgerd commented Dec 21, 2018

From @mattmoor: "I think the key is that we didn't make Reconcile() sensitive to metadata.deletionTimestamp when we added the OnDelete hooks. I believe deletionTimestamp gets set as a tombstone as Foreground GC runs"
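
For reference, foreground GC tombstones the owner by setting deletionTimestamp and adding the foregroundDeletion finalizer, both of which are visible from the informer cache. A small checker, assuming upstream apimachinery types:

```go
package revision

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isForegroundTombstone reports whether foreground GC has marked the object:
// deletionTimestamp is set, and the object carries the foregroundDeletion
// finalizer until all of its dependents have been reaped.
func isForegroundTombstone(obj metav1.Object) bool {
	if obj.GetDeletionTimestamp() == nil {
		return false
	}
	for _, f := range obj.GetFinalizers() {
		if f == metav1.FinalizerDeleteDependents { // "foregroundDeletion"
			return true
		}
	}
	return false
}
```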


dgerd commented Dec 21, 2018

I added some logging and attempted to reproduce this at HEAD by creating a runLatest Service and then deleting it. I tried multiple times, deleting immediately after creation, after a few seconds, and after a few minutes. I have yet to reproduce this.

I am using kubectl apply -f to create and kubectl delete -f to delete the service. I will try throwing traffic at it before deleting to see if that changes the behavior at all.

Let me know if you have any other reproduction advice.

mattmoor commented

I can see this in the scale test (at least with it cranked up to 150, as I have it right now), e.g.

```
scale-00150-015-zztulrto-00001-deployment-6977d4bcc-xd749    1/2     Terminating   0          3m
scale-00150-015-zztulrto-00001-deployment-97785665f-8t659    1/2     Terminating   0          1m
```

Judging by the name suffixes, these Pods came from distinct ReplicaSets, and the main way that happens is if we created a second Deployment while the Pods from the first were still tearing down.

mattmoor commented

I added some logic in the revision controller to log when we see Revisions with a DeletionTimestamp, and running N=100 a bunch, I'm not seeing my log statement. The only other thing that comes to mind would be adding a delay to the queuing of Deployment deletion events using this.
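
That delayed-enqueue idea could look roughly like the following with client-go's delaying workqueue (a sketch; the handler shape and the one-second delay are assumptions):

```go
package revision

import (
	"time"

	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// Sketch: delay Deployment delete events so the owning Revision's own
// tombstone has time to land in the informer cache before we reconcile.
func deleteHandler(queue workqueue.DelayingInterface) cache.ResourceEventHandlerFuncs {
	return cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
			if err != nil {
				return
			}
			queue.AddAfter(key, time.Second) // delay value is an arbitrary guess
		},
	}
}
```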

@mattmoor mattmoor modified the milestones: Serving 0.4, Ice Box Jan 29, 2019
mattmoor commented

I'd initially thought this was us doing something obviously wrong, but given that DeletionTimestamp is unset when this happens, this seems cosmetic and somewhat tough to fix. Moving out of v1.

mattmoor added a commit to mattmoor/serving that referenced this issue Jan 30, 2019
When we previously had finalizers, we would race with K8s GC to recreate our children as K8s reaped them.

The simplest way to test this is to enable "foreground" deletion in our e2e tests, which is implemented as a finalizer.

Fixes: knative#2678
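
Enabling foreground deletion programmatically amounts to setting a propagation policy on the delete call. A sketch against a plain Deployment, assuming the context-threading client-go signature (0.18+); the helper name is illustrative:

```go
package e2e

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteForeground deletes a Deployment with foreground propagation, so the
// API server tombstones it with a deletionTimestamp and only removes it once
// its dependents are gone; that is the window in which this bug reproduced.
func deleteForeground(ctx context.Context, kc kubernetes.Interface, ns, name string) error {
	fg := metav1.DeletePropagationForeground
	return kc.AppsV1().Deployments(ns).Delete(ctx, name, metav1.DeleteOptions{
		PropagationPolicy: &fg,
	})
}
```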
knative-prow-robot pushed a commit that referenced this issue Jan 30, 2019
When we previously had finalizers, we would race with K8s GC to recreate our children as K8s reaped them.

The simplest way to test this is to enable "foreground" deletion in our e2e tests, which is implemented as a finalizer.

Fixes: #2678
@dprotaso dprotaso removed this from the Ice Box milestone Oct 6, 2021