Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recover from panics in async external clients #428

Merged
merged 2 commits into from
Aug 22, 2024

Conversation

mergenci
Copy link
Member

@mergenci mergenci commented Aug 21, 2024

Description of your changes

This PR implements a panic recovery mechanism for async external clients. Controller runtime can be configured to recover from panics, which Crossplane runtime adopted in crossplane/crossplane-runtime#493. Unfortunately, that recovery mechanism doesn't apply to the Goroutines started by async external clients (an example). Therefore, panics in async code would crash pods as reported in crossplane-contrib/provider-upjet-aws#1242.

With this PR, panic is logged and reconciliation continues, without the pod crashing.

2024-08-22T12:20:22+03:00	DEBUG	provider-aws	Async create ended.	{"trackerUID": "b1a58d9b-2236-4740-9a90-c169b5d2e538", "resourceName": "test-async-panic-handler-securitygroupingressrule", "gvk": "ec2.aws.upbound.io/v1beta1, Kind=SecurityGroupIngressRule", "error": "async create failed: panic: runtime error: index out of range [0] with length 0 [recovered]"}

There is one behavioral difference compared to the sync external client panic handling: Upon panicking during resource creation, async external client continues reconciling the resource and panicking each time, whereas sync external client stops reconciliation because of the existence of external-create-failed annotation on the resource. The async behavior is different, because Crossplane runtime is not async aware. Async clients let Crossplane runtime set external-create-succeeded annotation in order to be able to do the creation asynchronously. Having external-create-succeeded annotation prevents reconciliation from stopping even in the existence of external-create-failed annotation.

I have:

  • Read and followed Upjet's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

Thanks to @haarchri's discovery, I triggered a panic using the manifest below. SecurityGroupIngressRule is a Terraform Plugin Framework resource, whereas SecurityGroup and VPC are Plugin SDKv2 resources.

apiVersion: ec2.aws.upbound.io/v1beta1
kind: SecurityGroupIngressRule
metadata:
  name: test-async-panic-handler-securitygroupingressrule
spec:
  forProvider:
    # Omitting `cidrIpv4` panics the provider.
    # cidrIpv4: 10.0.0.0/8
    fromPort: 8080
    ipProtocol: tcp
    region: us-west-1
    securityGroupIdRef:
      name: test-async-panic-handler-securitygroup
    tags:
      Name: test-async-panic-handler-securitygroupingressrule
    toPort: 8080

---
apiVersion: ec2.aws.upbound.io/v1beta1
kind: SecurityGroup
metadata:
  name: test-async-panic-handler-securitygroup
spec:
  forProvider:
    region: us-west-1
    tags:
      Name: test-async-panic-handler-securitygroup
    vpcIdRef:
      name: test-async-panic-handler-vpc

---
apiVersion: ec2.aws.upbound.io/v1beta1
kind: VPC
metadata:
  name: test-async-panic-handler-vpc
spec:
  forProvider:
    cidrBlock: 172.16.0.0/16
    region: us-west-1
    tags:
      Name: test-async-panic-handler-vpc

Terraform provider panics with message:

E0822 12:19:21.286186   84654 runtime.go:79] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)

@mergenci mergenci marked this pull request as ready for review August 21, 2024 23:23
Copy link
Collaborator

@ulucinar ulucinar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @mergenci for working on this ancient bug and thanks for your effort to come up with an idiomatic and safe-to-maintain implementation. Left some minor comments, lgtm.

pkg/controller/external_async_tfpluginfw.go Show resolved Hide resolved
pkg/controller/external_async_tfpluginfw.go Outdated Show resolved Hide resolved
if cErr := n.callback.Create(mg.GetName())(err, ctx); cErr != nil {
n.opTracker.logger.Info("Async create callback failed", "error", cErr.Error())
}
_, ph.err = n.terraformPluginFrameworkExternalClient.Create(ctx, mg)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We've already discussed this offline but leaving a note here for future reference. If we later add extra steps that may panic below this, the current implementation may loose a potential create error (chaining would then be a good idea to preserve the original create error). Not a real concern as of now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants