Recover from panics in async external clients #428
Merged
Description of your changes
This PR implements a panic recovery mechanism for async external clients. Controller runtime can be configured to recover from panics, an option that Crossplane runtime adopted in crossplane/crossplane-runtime#493. Unfortunately, that recovery mechanism does not cover the goroutines started by async external clients (an example). As a result, a panic in async code crashes the pod, as reported in crossplane-contrib/provider-upjet-aws#1242.
With this PR, the panic is logged and reconciliation continues without the pod crashing:
2024-08-22T12:20:22+03:00 DEBUG provider-aws Async create ended. {"trackerUID": "b1a58d9b-2236-4740-9a90-c169b5d2e538", "resourceName": "test-async-panic-handler-securitygroupingressrule", "gvk": "ec2.aws.upbound.io/v1beta1, Kind=SecurityGroupIngressRule", "error": "async create failed: panic: runtime error: index out of range [0] with length 0 [recovered]"}
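The general technique can be sketched as follows. This is a minimal illustration, not the code in this PR: runAsync and runAndWait are hypothetical helper names, and the error format only imitates the log line above. The idea is that a deferred recover inside the goroutine turns the panic into an ordinary error, which is then handed to the completion callback instead of taking down the process.

```go
package main

import (
	"fmt"
	"sync"
)

// runAsync is a hypothetical helper (not upjet's API). It runs op in a new
// goroutine and reports the outcome to done, converting a panic in op into
// an ordinary error so that the process keeps running.
func runAsync(wg *sync.WaitGroup, op func() error, done func(error)) {
	wg.Add(1)
	go func() {
		var err error
		// Deferred calls run last-in-first-out: recover first, then report
		// the (possibly replaced) error, then release the waiter.
		defer wg.Done()
		defer func() { done(err) }()
		defer func() {
			if r := recover(); r != nil {
				err = fmt.Errorf("panic: %v [recovered]", r)
			}
		}()
		err = op()
	}()
}

// runAndWait runs op via runAsync and blocks until the callback has fired.
func runAndWait(op func() error) error {
	var wg sync.WaitGroup
	var result error
	runAsync(&wg, op, func(err error) { result = err })
	wg.Wait()
	return result
}

func main() {
	err := runAndWait(func() error {
		var rules []string
		_ = rules[0] // index out of range: panics inside the goroutine
		return nil
	})
	fmt.Println("async create failed:", err)
	fmt.Println("process still running")
}
```

The order of the deferred calls matters in this sketch: recover must run before the callback so that the callback sees the error derived from the panic rather than a nil error.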
There is one behavioral difference compared to panic handling in the sync external client. When a panic occurs during resource creation, the async external client keeps reconciling the resource and panics each time, whereas the sync external client stops reconciling because of the external-create-failed annotation on the resource. The behavior differs because Crossplane runtime is not async aware: async clients let Crossplane runtime set the external-create-succeeded annotation so that creation can proceed asynchronously, and the presence of external-create-succeeded keeps reconciliation going even when external-create-failed is also present.
I have:
Run make reviewable to ensure this PR is ready for review.
Added backport release-x.y labels to auto-backport this PR if necessary.
How has this code been tested
Thanks to @haarchri's discovery, I triggered a panic using the manifest below.
SecurityGroupIngressRule is a Terraform Plugin Framework resource, whereas SecurityGroup and VPC are Plugin SDKv2 resources.
The Terraform provider panics with the message:
E0822 12:19:21.286186 84654 runtime.go:79] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)