
Unify await logic for deletes #3133

Merged (15 commits) Aug 21, 2024
Conversation

blampe (Contributor) commented Jul 26, 2024

NB: This is a larger change that's easier to review as separate commits. The
first commit introduces some new types and interfaces; the second one hooks up
those types and removes a lot of existing code.

We already have "generic" await logic for deletion -- if we don't have an
explicit delete-awaiter defined for a particular GVK, then we always run some
logic that waits for the resource to 404.

As it turns out, all of our custom delete-awaiters do essentially the same 404
check, modulo differences in the messages we log. This makes the deletion code
flow a good starting place to introduce more generic/unified await logic.

If you look at the code deleted in the second commit you get a good sense of
the current issues with our awaiters: each custom awaiter is responsible for
establishing its own watchers; determining its default timeout; performing the
same 404 check; etc. There's a lot of duplication, and subtle differences in
behavior lead to issues like
#1232.

As part of this change we start decomposing our await logic into more
composable pieces. Note that I'm replacing our deletion code path with these
new pieces because I don't think we lose much by changing the logged messages,
but for the create/update code paths we'll need some glue to preserve existing
await behavior.

The relevant interfaces are:

// Observer acts on a watch.Event Source. Range is responsible for filtering
// events to only those relevant to the Observer, and Observe optionally
// updates the Observer's state.
type Observer interface {
	// Range iterates over all events visible to the Observer. The caller is
	// responsible for invoking Observe as part of the provided callback. Range
	// can be used to customize setup and teardown behavior if the Observer
	// wraps another Observer.
	Range(func(watch.Event) bool)

	// Observe handles events and can optionally update the Observer's state.
	// This should be invoked by the caller and not during Range.
	Observe(watch.Event) error
}

// Satisfier is an Observer which evaluates the observed object against some
// criteria.
type Satisfier interface {
	Observer

	// Satisfied returns true when the criteria is met.
	Satisfied() (bool, error)

	// Object returns the last-known state of the object being observed.
	Object() *unstructured.Unstructured
}

// Source encapsulates logic responsible for establishing
// watch.Event channels.
type Source interface {
	Start(context.Context, schema.GroupVersionKind) (<-chan watch.Event, error)
}

At a high level:

  1. We determine what condition (Satisfier) to wait for during deletion. There is
    always a condition even if it's a no-op. Deletion is simple because there
    are only two possibilities -- "skip" and "wait for 404" -- but with
    create/update and user-defined conditions it will get more interesting.
  2. We wait for the condition Satisfier and can combine it with arbitrary
    Observers. This lets us do things like log additional information while
    we're waiting, e.g. Emit event logs during await #3135.

The underlying machinery responsible for handling timeouts, informers, etc. is
all hidden behind the Source. Implementing new await logic is essentially
just a matter of defining a new Satisfier which understands how to evaluate
an unstructured resource.
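To make the contract between these pieces concrete, here is a minimal, self-contained analog of the Satisfier flow. It deliberately uses a plain `Event` type instead of `watch.Event` and an in-memory slice instead of an informer-backed channel, so all names below are illustrative, not the PR's actual types:

```go
package main

import "fmt"

// Event is a stand-in for watch.Event in this self-contained sketch.
type Event struct {
	Type string // "Added", "Modified", "Deleted"
	Name string
}

// deletedCondition is a toy Satisfier: it is satisfied once it observes
// a Deleted event for the object it is watching.
type deletedCondition struct {
	name    string
	events  []Event // stand-in event source; the real code ranges over a channel
	deleted bool
}

// Range iterates over events visible to the observer, filtering to the
// relevant object and invoking the caller-provided callback for each one.
func (c *deletedCondition) Range(yield func(Event) bool) {
	for _, e := range c.events {
		if e.Name != c.name {
			continue // not the object we care about
		}
		if !yield(e) {
			return
		}
	}
}

// Observe updates the condition's state from a single event.
func (c *deletedCondition) Observe(e Event) error {
	if e.Type == "Deleted" {
		c.deleted = true
	}
	return nil
}

// Satisfied reports whether the awaited condition has been met.
func (c *deletedCondition) Satisfied() (bool, error) { return c.deleted, nil }

// await drives the condition the way the description above outlines: the
// caller's callback invokes Observe, then checks Satisfied to decide
// whether to keep ranging.
func await(c *deletedCondition) (bool, error) {
	var err error
	c.Range(func(e Event) bool {
		if err = c.Observe(e); err != nil {
			return false
		}
		done, serr := c.Satisfied()
		if serr != nil {
			err = serr
			return false
		}
		return !done // stop ranging once satisfied
	})
	if err != nil {
		return false, err
	}
	return c.Satisfied()
}

func main() {
	c := &deletedCondition{
		name: "my-pod",
		events: []Event{
			{Type: "Modified", Name: "my-pod"},
			{Type: "Deleted", Name: "other-pod"}, // filtered out by Range
			{Type: "Deleted", Name: "my-pod"},
		},
	}
	ok, err := await(c)
	fmt.Println(ok, err) // true <nil>
}
```

The split mirrors the interfaces above: filtering lives in Range, state updates in Observe, and the success test in Satisfied, so a new await condition only has to supply the last two.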

A number of unit tests are included, as well as an E2E regression test to ensure we respect the skipAwait annotation. The existing delete-await tests are mostly unchanged except for tweaks to inject a Condition instead of an awaitSpec. Some watcher-specific tests were no longer relevant and were removed; however, the functionality is still implemented and tested as part of Awaiter.

Fixes #3157.
Fixes #1418.
Refs #2824.

blampe commented Jul 26, 2024

Does the PR have any schema changes?

Looking good! No breaking changes found.
No new resources/functions.

codecov bot commented Jul 26, 2024

Codecov Report

Attention: Patch coverage is 87.58621% with 36 lines in your changes missing coverage. Please review.

Project coverage is 37.93%. Comparing base (2ec7a1a) to head (229a385).
Report is 1 commit behind head on master.

Files Patch % Lines
provider/pkg/await/internal/awaiter.go 78.18% 9 Missing and 3 partials ⚠️
provider/pkg/await/condition/immediate.go 83.72% 7 Missing ⚠️
provider/pkg/await/condition/source.go 85.00% 3 Missing and 3 partials ⚠️
provider/pkg/await/condition/observer.go 91.66% 3 Missing and 2 partials ⚠️
provider/pkg/await/await.go 82.60% 2 Missing and 2 partials ⚠️
provider/pkg/metadata/overrides.go 83.33% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3133      +/-   ##
==========================================
+ Coverage   36.59%   37.93%   +1.33%     
==========================================
  Files          70       76       +6     
  Lines        9264     9335      +71     
==========================================
+ Hits         3390     3541     +151     
+ Misses       5541     5452      -89     
- Partials      333      342       +9     


This was referenced Jul 26, 2024
@blampe blampe force-pushed the blampe/2996-await-delete branch 2 times, most recently from 7affd4c to 5c06e56 Compare July 26, 2024 20:28
@blampe blampe requested a review from rquitales July 26, 2024 20:28
@blampe blampe marked this pull request as ready for review July 26, 2024 20:28
Comment on lines 114 to 125
r, _ := status.Compute(uns)
if r.Message != "" {
dc.logger.LogMessage(checkerlog.StatusMessage(r.Message))
}
blampe (author):
We will still emit messages related to the object's status, if possible.

return
}
// Make sure Observers are all done.
wg.Wait()
Reviewer (Member):
In the event of context deadline being exceeded, wouldn't this mean that the awaiter still continues waiting/observing? Shouldn't we just skip waiting for the wait group to be done?

blampe (author):
The context is plumbed all the way down and respected by the observers, so if it dies all of our informers will shut down and this will resolve. Namely these guys:

https://github.com/pulumi/pulumi-kubernetes/pull/3133/files#diff-c68cee828d9c5172eef833ba32b6185741c858ff507bd2f4d6df8c5a6fb275dbR55-R59

https://github.com/pulumi/pulumi-kubernetes/pull/3133/files#diff-b52d2594c41dd1ff41784e1c6101fcd3c86f51c128ea8b8de43ddc6607977a15R148

Full disclosure I'm very skeptical we need any of this "Hail Mary" logic. There's a comment in the code but I think it largely stems from issues we had handling watch errors. I kept it as-is since we've got tests around it, but I expect we could get rid of it without issue.

blampe (author):
I've simplified this a bit -- we don't need to worry about context cancellation here since the observer's already shut down when that happens.

return dc, nil
}

// Range confirms the object exists before establishing an Informer.
Reviewer (Contributor):
How is this not a race between checking the existence and starting the informer? Or does the informer throw an error if the object doesn't exist (in which case, why call Get at all?).

blampe (author):
The comment wasn't accurate -- there is a race, so we check if the object was deleted after establishing the informer. (Informers are fine if the object doesn't exist, so you can subscribe to creations.)
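The race-free ordering described here — establish the watch first, then check current state — can be sketched with a self-contained toy (a map-backed "cluster" and a channel instead of an informer; all names are illustrative). Any deletion then either shows up in the existence check or is delivered on the already-open channel:

```go
package main

import (
	"fmt"
	"sync"
)

// store is a toy stand-in for the cluster: it holds object state and
// notifies subscribers of deletions.
type store struct {
	mu      sync.Mutex
	objects map[string]bool
	subs    []chan string // each receives names of deleted objects
}

func (s *store) subscribe() chan string {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch := make(chan string, 8)
	s.subs = append(s.subs, ch)
	return ch
}

func (s *store) exists(name string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.objects[name]
}

func (s *store) delete(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.objects, name)
	for _, ch := range s.subs {
		ch <- name
	}
}

// awaitDeleted avoids the check-then-watch race by subscribing first and
// only then consulting current state: a deletion that lands between the
// two steps is either seen by the existence check or buffered on the channel.
func awaitDeleted(s *store, name string) {
	events := s.subscribe() // 1. establish the "informer" first
	if !s.exists(name) {    // 2. then check: already gone?
		return
	}
	for deleted := range events { // 3. otherwise wait for the event
		if deleted == name {
			return
		}
	}
}

func main() {
	s := &store{objects: map[string]bool{"cm": true}}
	go s.delete("cm") // deletion races with the await
	awaitDeleted(s, "cm")
	fmt.Println("observed deletion")
}
```

Reversing steps 1 and 2 reintroduces the race: a deletion landing between the check and the subscribe would be lost, and the awaiter would wait forever.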

Comment on lines 93 to 95
dc.logger.LogMessage(checkerlog.WarningMessage(
fmt.Sprintf("finalizers might be preventing deletion (%s)", strings.Join(finalizers, ", ")),
))
Reviewer (Contributor):
This is an ephemeral status message? The whole await procedure is ending at this point right? Wonder how anyone would see this.

blampe (author):
Ah, I've been looking at a lot of non-interactive output and didn't realize these aren't already persisted. I suggest we do what we do in go-provider/docker-build and treat any warnings/errors as non-status so they get shown in the final interactive output.

EronWright (Contributor) commented Aug 6, 2024

I notice that the cli-utils library has a watcher package that seems similar to the core logic in this PR. Thoughts on it?
https://github.com/kubernetes-sigs/cli-utils/blob/master/pkg/kstatus/watcher/doc.go

blampe (author) replied:
> I notice that the cli-utils library has a watcher package that seems similar to the core logic in this PR. Thoughts on it?
> https://github.com/kubernetes-sigs/cli-utils/blob/master/pkg/kstatus/watcher/doc.go

@EronWright yes, I saw that as well (more specifically the polling portion). We're already using Informers for our other awaiters and they work well enough, so the primary goal is to use them for deletes as well.

@blampe blampe changed the base branch from master to blampe/await-config August 9, 2024 22:13
Base automatically changed from blampe/await-config to master August 9, 2024 22:50
@mjeffryes mjeffryes added this to the 0.108 milestone Aug 16, 2024
Comment on lines +138 to +140
// Our context might be closed, but we still want to issue this request
// even if we're shutting down.
ctx := context.WithoutCancel(dc.ctx)
Reviewer (Contributor):
Nit: Never seen this before, seems a bit awkward. If the context is indeed canceled, does the result still matter? Maybe you could check err == context.Canceled?

dc.observer.Range(yield)
}()

dc.getClusterState()
Reviewer (Contributor):
This looks like a typo, but exists to cause a side-effect, and would make more sense if the function was called refreshClusterState.

Comment on lines +150 to +151
dc.logger.LogStatus(diag.Warning,
"unexpected error while checking cluster state: "+err.Error(),
Reviewer (Contributor):
Shouldn't an abnormal error cause the waiter to quit?

Comment on lines +90 to +92
// Attempt one last lookup if the object still exists. (This is legacy
// behavior that might be unnecessary since we're using Informers instead of
// Watches now.)
Reviewer (Contributor):
Seems unnecessary to me, like it is second-guessing the informer. But I suppose you need to fetch the object anyway to see whether any finalizers exist.

@@ -320,20 +320,26 @@ func TestAwaitDaemonSetDelete(t *testing.T) {
}

for _, tt := range tests {
tt := tt
Reviewer (Contributor):
BTW I learned that loop variables are now copied as of go 1.22 and you don't need this line anymore.

@mikhailshilkov mikhailshilkov modified the milestones: 0.108, 0.109 Aug 21, 2024
@blampe blampe merged commit adfff97 into master Aug 21, 2024
19 checks passed
@blampe blampe deleted the blampe/2996-await-delete branch August 21, 2024 21:29
Successfully merging this pull request may close these issues:

  * panic when interrupting deletion
  * Destroying a resource fails with "timed out waiting to be Ready"