Terrajet providers take a long time on cluster with many resources #300
Hello team. I want to update this issue with my latest findings. After spending a while debugging and getting a better understanding of how everything works, I believe the main issue is that terrajet providers, in general, don't work fast enough to consume the controller's workqueue as quickly as it is filled. This causes the delays we're seeing: for any change, the controller needs to work through the whole workqueue before it can get to the recent change.

I've set up two side-by-side AWS providers, one using terrajet and another using `provider-aws`. Then I've set up Prometheus and Grafana to collect the metrics exposed by the controller manager.
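As a rough mental model of why that matters (my own back-of-the-envelope reasoning, not a measurement): with c concurrent workers and an average reconcile time of t seconds, a controller can drain roughly c / t items per second. Since every managed resource is also requeued periodically for its regular sync, once requeues plus watch events arrive faster than c / t, the queue never empties and a newly created or changed resource has to wait behind everything already queued.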
Experiment environment

Metrics setup

The Grafana dashboard was created pointing to the Prometheus instance.

Configuration files

prometheus.yaml:
```yaml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  external_labels:
    monitor: 'monitor'

rule_files: []

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'jet-provider'
    static_configs:
      - targets: ['localhost:8080']
  - job_name: 'native-provider'
    static_configs:
      - targets: ['localhost:8081']
```

Grafana dashboard:

```json
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"target": {
"limit": 100,
"matchAny": false,
"tags": [],
"type": "dashboard"
},
"type": "dashboard"
},
{
"datasource": {
"type": "datasource",
"uid": "grafana"
},
"enable": true,
"iconColor": "blue",
"name": "Annotations",
"target": {
"limit": 1000,
"refId": "Anno",
"tags": [
"annotation"
],
"type": "tags"
}
}
]
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": 2,
"links": [],
"liveNow": false,
"panels": [
{
"collapsed": false,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 0
},
"id": 10,
"panels": [],
"title": "Controller",
"type": "row"
},
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "s"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 1
},
"id": 6,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"editorMode": "builder",
"expr": "rate(controller_runtime_reconcile_time_seconds_sum{controller=~\"$controllers\"}[$__rate_interval])",
"hide": true,
"legendFormat": "__auto",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"editorMode": "builder",
"expr": "controller_runtime_max_concurrent_reconciles{controller=~\"$controllers\"}",
"hide": true,
"legendFormat": "__auto",
"range": true,
"refId": "B"
},
{
"datasource": {
"name": "Expression",
"type": "__expr__",
"uid": "__expr__"
},
"expression": "$A/$B",
"hide": false,
"refId": "C",
"type": "math"
}
],
"title": "Controller Reconcile Time",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"max": 10,
"min": -1,
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 1
},
"id": 12,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"editorMode": "builder",
"expr": "controller_runtime_reconcile_errors_total{controller=~\"$controllers\"}",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
"title": "Reconcile Errors",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "s"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 9
},
"id": 16,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"editorMode": "builder",
"expr": "histogram_quantile(0.95, sum by(le, controller) (rate(controller_runtime_reconcile_time_seconds_bucket{controller=~\"$controllers\"}[$__rate_interval])))",
"hide": true,
"legendFormat": "__auto",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"editorMode": "builder",
"expr": "controller_runtime_max_concurrent_reconciles{controller=~\"$controllers\"}",
"hide": true,
"legendFormat": "__auto",
"range": true,
"refId": "B"
},
{
"datasource": {
"name": "Expression",
"type": "__expr__",
"uid": "__expr__"
},
"expression": "$A/$B",
"hide": false,
"refId": "C",
"type": "math"
}
],
"title": "95th percentile reconcile time",
"type": "timeseries"
},
{
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 17
},
"id": 8,
"title": "Queue",
"type": "row"
},
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "ms"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 18
},
"id": 4,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"editorMode": "builder",
"expr": "rate(workqueue_queue_duration_seconds_sum{name=~\"$controllers\"}[$__rate_interval])",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
"title": "Workqueue queue duration",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 18
},
"id": 2,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"editorMode": "builder",
"expr": "workqueue_depth{name=~\"$controllers\"}",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
"title": "Workqueue depth",
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "auto",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 26
},
"id": 14,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"editorMode": "builder",
"expr": "rate(workqueue_adds_total{name=~\"$controllers\"}[$__rate_interval])",
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
"title": "Rate of Queue Adds",
"type": "timeseries"
}
],
"refresh": false,
"schemaVersion": 37,
"style": "dark",
"tags": [],
"templating": {
"list": [
{
"allValue": "managed/.*",
"current": {
"selected": false,
"text": [
"managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy",
"managed/policy.iam.aws.crossplane.io"
],
"value": [
"managed/iam.aws.jet.crossplane.io/v1alpha2, kind=policy",
"managed/policy.iam.aws.crossplane.io"
]
},
"datasource": {
"type": "prometheus",
"uid": "cdbCIg4Vk"
},
"definition": "controller_runtime_active_workers",
"description": "Controllers to watch. Regex",
"hide": 0,
"includeAll": true,
"label": "controller",
"multi": true,
"name": "controllers",
"options": [],
"query": {
"query": "controller_runtime_active_workers",
"refId": "StandardVariableQuery"
},
"refresh": 1,
"regex": ".*controller=\"(managed/.*?)\".*",
"skipUrlSync": false,
"sort": 0,
"type": "query"
}
]
},
"time": {
"from": "2022-10-03T04:52:47.659Z",
"to": "2022-10-03T05:09:38.612Z"
},
"timepicker": {},
"timezone": "",
"title": "Crossplane Controllers AWS",
"uid": "-WIVl0VVk",
"version": 2,
"weekStart": ""
}
```

In the following graphs, the green lines represent the jet provider, while the yellow line represents the native one.

Initially I created 100 resources managed by each provider (first vertical blue line). There, we can see that the reconciliation time for the terrajet provider is much higher than for the native provider. After some time I changed the jet provider to have a …

On the second section (after the second vertical blue line), the reconciliation time is about … On the second section (after the second vertical blue line), the workqueue depth stabilizes at about …

This left the question of "What if we bump the number of …?" This, however, is not maintained when we create even more resources. When jumping to …

To make sure that the reconcile time was the main culprit here, I also locally changed the `Observe` method of the native provider to include a …

I'm curious to see which solutions we can brew up for this. For now it seems that terrajet-based providers are not suitable for environments that need to manage many resources of the same type.
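Regarding the `Observe` change mentioned above, this is roughly what I mean. It is a minimal sketch of my own, assuming the elided detail was an artificial delay (for example a `time.Sleep`) wrapped around the native provider's `ExternalClient`; the type and field names are illustrative:

```go
// Sketch only: wrap an existing crossplane-runtime ExternalClient so that
// every Observe call is artificially delayed, simulating a slow provider.
// The idea that the local change was a sleep is my assumption.
package loadtest

import (
	"context"
	"time"

	"github.com/crossplane/crossplane-runtime/pkg/reconciler/managed"
	"github.com/crossplane/crossplane-runtime/pkg/resource"
)

type slowClient struct {
	managed.ExternalClient // the real client being wrapped
	delay time.Duration    // artificial per-Observe delay
}

// Observe sleeps before delegating to the wrapped client, so every reconcile
// of this resource kind takes at least `delay` longer.
func (c slowClient) Observe(ctx context.Context, mg resource.Managed) (managed.ExternalObservation, error) {
	time.Sleep(c.delay)
	return c.ExternalClient.Observe(ctx, mg)
}
```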
I cross-posted this issue in upjet, as it will (as far as I understand) replace terrajet in the future.
Hello team!
My team and I are developing a solution that lets developers manage infrastructure resources through an API. This solution relies on Crossplane to manage cloud resources as Kubernetes CRs.
We started out using terrajet for AWS, but switched to the native provider because of this issue. Now we also need to manage other resources for which there are no native providers (namely Vault, Boundary and Datadog), and it would be great to be able to use terrajet.
Existing Context
initial slack thread with problem description
What happened?
The expected behaviour is that the time-to-readiness of terrajet resources wouldn't depend on the number of resources that already exist in the cluster.

In reality, the terrajet controller takes roughly `3 * (number of existing resources) * (time it takes for one resource to be reconciled)` seconds to consider a new resource ready.

The examples in this issue use `provider-jet-aws`, but from what I can see all terrajet providers behave the same way.

Below are the debug logs related to one recently created policy (`terrajetloadtestslow`). It was created on a provider that is already managing 200 other AWS policies. From these logs we can also see the time it took for the resource to become `Ready`: from 23:28:19 to 23:45:13, about 17 minutes (times are in GMT to make it easier). It should also be noted that the resource is created in the cloud provider at 23:28:25 (12 seconds in), when the logs say `Successfully requested creation of external resource`, but it takes a long time until the resource is marked as `Ready` in Kubernetes.
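As a rough sanity check of that formula against the numbers above (my own arithmetic, assuming the default single worker): the window from 23:28:19 to 23:45:13 is about 1014 seconds, and with 200 existing resources that gives 3 × 200 × t ≈ 1014 s, so t ≈ 1.7 s per reconcile, which seems plausible for a reconcile that shells out to the Terraform CLI.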
The reason it takes so long seems to be that resource change events are not given any priority over the normal reconciliation loop. The controller stays in an endless loop reconciling every resource it knows about, which is necessary to override external changes in the providers, but any resource that is added or changed in Kubernetes doesn't get picked up immediately. Instead it seems to be added to the queue and only seen after all existing reconcile loops are done; then the resource is picked up and a creation flow starts, but the result will only be seen on the next reconciliation loop. Once that comes, terrajet picks up the resource's status from the Terraform state and requeues once more, and only after this last requeue is the status updated to `Ready`.

It's also worth noting that raising the `MaxConcurrentReconciles` configuration to something bigger, like 5 or 10, reduces the time by that factor, but the underlying behaviour persists: all resources are reconciled before new resources or updates are considered `Ready`.
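For anyone less familiar with that knob, the sketch below shows where it is applied when a controller is built with controller-runtime. This is my own minimal example; the actual providers pass an equivalent option through their own setup code and flags, so the function and variable names here are illustrative:

```go
// Minimal sketch: raising MaxConcurrentReconciles when building a controller
// with controller-runtime. The default is 1 worker per controller.
package example

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func setupWithManager(mgr ctrl.Manager, r reconcile.Reconciler, obj client.Object) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(obj).
		WithOptions(controller.Options{
			// With a single worker, one slow reconcile blocks the whole
			// workqueue; more workers divide the backlog roughly by this
			// factor, but the queue is still fully drained before new items
			// are reached.
			MaxConcurrentReconciles: 10,
		}).
		Complete(r)
}
```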
What I'm having trouble understanding is how this terrajet workflow differs from the native `provider-aws` one, as both seem to use the same `Reconcile` function from crossplane-runtime.

Also, the `UseAsync` configuration for a resource doesn't seem to change anything. I've tried it both for `policies.iam` and for `repositories.ecr`, with the same behaviour.
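For reference, this is how I understand `UseAsync` gets toggled, from reading the provider configuration packages. A sketch under that assumption; the resource name below is just an example rather than the exact file in provider-jet-aws:

```go
// Sketch: marking a terrajet resource as async in the provider's
// configuration package. The resource name here is illustrative.
package iam

import "github.com/crossplane/terrajet/pkg/config"

func Configure(p *config.Provider) {
	p.AddResourceConfigurator("aws_iam_policy", func(r *config.Resource) {
		// Run the Terraform operation asynchronously instead of blocking the
		// reconcile; in my tests this did not change the queueing behaviour
		// described above.
		r.UseAsync = true
	})
}
```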
How can we reproduce it?

… (`provider-aws`) It will take some minutes, but a burst of resources is expected to take a bit, although it does take much longer than `provider-aws` for the same resource.

The last step will take a long time, which is the problem this bug report is about.
Possibly related discussions
Please let me know if I can help figure out what's happening.