Upjet providers can't consume workqueue fast enough. Causes huge time-to-readiness delay #116

Kasama · 2022-10-21T00:19:36Z

Cross-posting about crossplane/terrajet#300, because the exact same behavior happens with Upjet, and as far as I understood, terrajet will be deprecated in favor of Upjet, so it makes sense to keep this issue tracked here.

This is specially relevant as this now seems to be the "official" backend for provider implementations.

What happened?

The expected behaviour is that upjet resources time-to-readiness wouldn't depend on the amount of resources that already exist in the cluster.

In reality, as calling terraform takes a while (around 1 second on my tests), the provider controller is unable to clear the work queue and because of that, any new events (such as creating a new resource) takes very long to complete when there are multiple other resources, since the controller adds those to the end of the queue.

There are more details and on the original bug report

How can we reproduce it?

The reproduction steps are basically the same as the original issue, just changing the terrajet provider for the upjet provider.

Create a new kubernetes cluster (with kind or in the cloud).
Install crossplane
Install an upjet provider (I'll use AWS because it can be easily compared with the native provider-aws)
Create a handful of resources to be managed by upjet (I'll use iam Policy because it is quickly created and dont incur costs)
Wait until all resources are created and ready
it will take some minutes, but a burst of resources is expected to take a bit.
Althought it does take much longer than provider-aws for the same resource.
Create one more resource to be managed by upjet

The last step will take a long time, which is the problem this bug report is about.

Open collapsible for reproducible commands

# Create kind cluster
kind create cluster --name upjet-load --image kindest/node:v1.23.10

# Install crossplane with helm
kubectl create namespace crossplane-system
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update
helm upgrade --install crossplane --namespace crossplane-system crossplane-stable/crossplane

# Install AWS upjet provider
kubectl apply -f - <<YAML
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: crossplane-provider-jet-aws
spec:
  controllerConfigRef:
    name: config
  package: xpkg.upbound.io/upbound/provider-aws:v0.18.0
---
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: config
spec:
  args:
  - --debug
YAML
## use your aws credentials in the secret, if you have a custom way to interact with AWS, change the credentials key of this secret.
kubectl create secret generic -n crossplane-system --from-file=credentials=$HOME/.aws/credentials aws-credentials
kubectl apply -f - <<YAML
apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  credentials:
    source: Secret
    secretRef:
      name: aws-credentials
      namespace: crossplane-system
      key: credentials
YAML

# Create a handful of resources managed by upjet. I chose policies because they are created near-instantly in the cloud and don't incur costs
## Note: seq on MacOS doesn't seem to support the -w flag, it can be removed safely below
for n in $(seq -w 200); do
  echo "---"
  sed "s/NUMBER/$n/" <<YAML
apiVersion: iam.aws.upbound.io/v1beta1
kind: Policy
metadata:
  name: upboundNUMBER
spec:
  providerConfigRef:
    name: default
  forProvider:
    policy: |
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "iam:GetPolicy",
            "Resource": "*"
          }
        ]
      }
YAML
done | kubectl apply -f -

# Wait until all are ready, it took about 15 minutes for me
kubectl get policies.iam.aws.upbound.io

# Create one more resource
kubectl apply -f - <<YAML
apiVersion: iam.aws.upbound.io/v1beta1
kind: Policy
metadata:
  name: upbound-slow
spec:
  providerConfigRef:
    name: default
  forProvider:
    policy: |
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "iam:GetPolicy",
            "Resource": "*"
          }
        ]
      }
YAML

# It takes a long time for the resource to become ready.
kubectl get policies.iam.aws.upbound.io upbound-slow

########

# Cleanup cloud resources
kubectl delete policies.iam.aws.upbound.io $(kubectl get policies.iam.aws.upbound.io | grep upbound-slow | awk '{print $1}')

# Delete kind cluster
kind destroy cluster --name upjet-load

The text was updated successfully, but these errors were encountered:

jeanduplessis · 2022-10-21T05:07:56Z

Thank you for the detailed report @Kasama, we'll be taking a look at this in our next sprint starting next week.

Kasama · 2022-10-21T14:03:12Z

Great to hear that! Feel free to reach out either here or on crossplane's slack thread (@roberto.alegro) if I can help with more details or reproduction steps

muvaf · 2022-11-01T01:49:36Z

Probably related to crossplane-contrib/provider-upjet-aws#86

turkenh · 2022-11-01T19:18:22Z

Thanks a lot for your detailed analysis here @Kasama.

I believe a low-hanging fruit here is to set some reasonable defaults for MaxConcurrentReconciles and PollInterval configurations to cover common cases and then to ensure they are exposed as configuration parameters so that they can be tuned further depending on specific deployment cases.

muvaf · 2022-11-07T10:56:06Z

FYI #99 is also another thing that may cause CPU saturation.

turkenh · 2022-11-08T14:45:05Z

On a GKE cluster with e2-standard-4 machine type, I did run the following 3 experiments:

Experiment 1: With maxConcurrentReconciles=1 and pollInterval=1m (Current defaults)
Experiment 2: With maxConcurrentReconciles=10 and pollInterval=1m (Community providers defaults)
Experiment 3: With maxConcurrentReconciles=10 and pollInterval=10m (Proposed defaults)

There are definitely some improvements between Exp#1 and Exp#2 but TBH, I am a bit surprised with Exp#3 not being much different than Exp#2. I am wondering if this could be related to CPU being throttled for both cases. I am planning to repeat the two Experiments with larger nodes not to get throttled.

Experiment 1: With maxConcurrentReconciles=1 and pollInterval=1m (Current defaults)

Provisioned 100 ECR Repositories and it took ~19 mins until all of them become Ready.
Wanted to delete 1 of them and took 9min until it got processed first time and ~18 min until deleted.

Experiment 2: With maxConcurrentReconciles=10 and pollInterval=1m (Community defaults)

Provisioned 100 ECR Repositories and it took ~12 Mins until all of them become Ready.
Wanted to delete 1 of them and took ~1 min until it got processed first time and ~5 mins until deleted.

Experiment 3: With maxConcurrentReconciles=10 and pollInterval=10m (Proposed defaults)

Provisioned 100 ECR Repositories and it took ~12 Mins until all of them become Ready.
Wanted to delete 1 of them and took ~1 min until it got processed first time and ~4 mins until deleted.

Kasama · 2022-11-08T23:51:33Z

Yeah, during my testing I've walked a similar path, I had changed pollInterval, but assumed it didn't have an impact. Maybe an even bigger interval, like 30m or 1h would yield different results, but that starts to become a bit unreasonable imo.

Indeed there are some improvements when bumping the concurrency, but sadly the problem still remains in that the time it takes for new resources to be ready greatly depends on the amount of already existing resources.

turkenh · 2022-11-09T15:32:23Z

I repeated the last experiment on a bigger node (e2-standard-32) to eliminate the effect of CPU throttling and this time it looks much better (except the resource consumption).

Experiment 4: With maxConcurrentReconciles=10 and pollInterval=10m (Proposed defaults) (on e2-standard-32)

Provisioned 100 ECR Repositories and it took ~2 Mins until all of them become Ready.
Wanted to delete 1 of them and took ~5 secs until it got processed first time and ~10 secs until deleted.

I believe improving resource usage is something orthogonal with the settings here and I feel good with the above defaults while still exposing them as configurable params.

I'll open PRs with proposed defaults.

Kasama · 2022-12-02T13:22:39Z

I was finally able to do some more tests using a bigger instance (an m6i.8xlarge on aws in my case, similar to the e2-standard-32 you used). And I can confirm that running in a bigger instance does make the queue clear faster with the new defaults.

But when trying with ~5000 concurrent resources, there was still a similar problem and the queue was with ~700 resources at all times. That can be mitigated again by increasing the reconciliation time, but It would be much better to have a way to scale these controllers horizontally, though.

Kasama added the bug Something isn't working label Oct 21, 2022

Kasama mentioned this issue Oct 21, 2022

Terrajet providers take a long time on cluster with many resources crossplane/terrajet#300

Open

jeanduplessis added the community label Oct 21, 2022

jeanduplessis assigned turkenh Oct 21, 2022

turkenh mentioned this issue Nov 3, 2022

Re-enable shared TF provider usage crossplane-contrib/provider-upjet-aws#86

Closed

turkenh closed this as completed in crossplane-contrib/provider-upjet-aws#140 Nov 14, 2022

This was referenced Feb 28, 2023

Load tests with provider-aws crossplane-contrib/provider-upjet-aws#576

Closed

Load tests with provider-azure crossplane-contrib/provider-upjet-azure#404

Closed

ulucinar mentioned this issue Mar 14, 2023

Load tests with provider-gcp crossplane-contrib/provider-upjet-gcp#255

Closed

patst mentioned this issue Jun 26, 2023

Super high CPU usage with many resources grafana/crossplane-provider-grafana#34

Closed

stevendborrelli mentioned this issue Jul 10, 2023

Providers should support K8s resource limits crossplane/crossplane#4308

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upjet providers can't consume workqueue fast enough. Causes huge time-to-readiness delay #116

Upjet providers can't consume workqueue fast enough. Causes huge time-to-readiness delay #116

Kasama commented Oct 21, 2022

jeanduplessis commented Oct 21, 2022

Kasama commented Oct 21, 2022 •

edited

Loading

muvaf commented Nov 1, 2022

turkenh commented Nov 1, 2022

muvaf commented Nov 7, 2022

turkenh commented Nov 8, 2022

Kasama commented Nov 8, 2022

turkenh commented Nov 9, 2022

Kasama commented Dec 2, 2022

Upjet providers can't consume workqueue fast enough. Causes huge time-to-readiness delay #116

Upjet providers can't consume workqueue fast enough. Causes huge time-to-readiness delay #116

Comments

Kasama commented Oct 21, 2022

What happened?

How can we reproduce it?

jeanduplessis commented Oct 21, 2022

Kasama commented Oct 21, 2022 • edited Loading

muvaf commented Nov 1, 2022

turkenh commented Nov 1, 2022

muvaf commented Nov 7, 2022

turkenh commented Nov 8, 2022

Experiment 1: With maxConcurrentReconciles=1 and pollInterval=1m (Current defaults)

Experiment 2: With maxConcurrentReconciles=10 and pollInterval=1m (Community defaults)

Experiment 3: With maxConcurrentReconciles=10 and pollInterval=10m (Proposed defaults)

Kasama commented Nov 8, 2022

turkenh commented Nov 9, 2022

Experiment 4: With maxConcurrentReconciles=10 and pollInterval=10m (Proposed defaults) (on e2-standard-32)

Kasama commented Dec 2, 2022

Kasama commented Oct 21, 2022 •

edited

Loading