fix: return an error when we cannot swap the replicaset hashes fixes #2050 #2187
Conversation
Force-pushed from 94533f3 to 408bc79.
Codecov Report: Base: 81.58% // Head: 81.60% // Increases project coverage by +0.02%.
Additional details and impacted files:
@@ Coverage Diff @@
## master #2187 +/- ##
==========================================
+ Coverage 81.58% 81.60% +0.02%
==========================================
Files 124 124
Lines 18959 18965 +6
==========================================
+ Hits 15467 15476 +9
+ Misses 2702 2700 -2
+ Partials 790 789 -1
@jandersen-plaid is the underlying bug actually here? I'm wondering if the bug is better fixed by choosing not to change the replica counts because the selectors are not yet swapped?
Sorry, I was reading too fast and misunderstood. I don't think this is necessarily an issue with replica counts, but with the percentage of traffic rolled out to a specific replicaset -- one moment, I will post a fuller explanation and example.
Consider the following rollout logs:
The issue is at 20:24:18 where we have changed the weights via the traffic routing controller. The replica counts have already maxed out with the last step, and the rollout has "finished".
I don't think this solves the issue because the replica counts are already set to 0 (for the stable) and the maximum (for the canary). That being said, I don't know these logs like you do, so feel free to correct me on the timeline.
Force-pushed from fddf5d6 to dcf70df.
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No coverage information.
Considering the number of end-to-end test failures, this is definitely the wrong approach though.
Some questions:
We had some discussion about this. The bug might be in reconcileTrafficRouting() where we are adjusting weights even though selectors are not yet swapped, and the fix may need to consider current label selectors before making the weight change of 0% canary. |
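For illustration only, here is a hedged sketch of the kind of guard that proposal implies (the types and function names below are hypothetical, not the real argo-rollouts controller API): verify that the canary and stable Service selectors already point at the desired pod-template hashes before applying any weight change, and bail out otherwise.

```go
// Hypothetical sketch: refuse to change traffic weights while the Service
// selectors still point at stale pod-template hashes.
package main

import (
	"errors"
	"fmt"
)

// serviceSelectors models only the piece under discussion: the pod-template
// hash currently selected by each Service.
type serviceSelectors struct {
	CanaryHash string // hash the canary Service currently selects
	StableHash string // hash the stable Service currently selects
}

var errSelectorsOutOfDate = errors.New("service selectors not yet swapped")

// safeToSetWeight returns an error instead of silently adjusting weights when
// the selectors do not yet match the hashes we intend to route traffic to.
func safeToSetWeight(sel serviceSelectors, desiredCanaryHash, desiredStableHash string) error {
	if sel.CanaryHash != desiredCanaryHash || sel.StableHash != desiredStableHash {
		return fmt.Errorf("canary=%q stable=%q do not match desired hashes: %w",
			sel.CanaryHash, sel.StableHash, errSelectorsOutOfDate)
	}
	return nil
}

func main() {
	// Selectors have not been swapped yet: both Services still point at the old hash.
	sel := serviceSelectors{CanaryHash: "abc123", StableHash: "abc123"}
	if err := safeToSetWeight(sel, "def456", "abc123"); err != nil {
		fmt.Println("skip weight change:", err) // requeue and retry instead of mis-routing traffic
	}
}
```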
Hey @jessesuen thanks for following up here!
Here it is:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
labels:
app.kubernetes.io/managed-by: "Helm"
app.kubernetes.io/name: "server"
app.kubernetes.io/component: "server"
k8s.plaid.io/team: "cs"
app.kubernetes.io/version: "3.20.0"
helm.sh/chart: "plaid-app-3.20.0"
name: server
namespace: cs-team
spec:
replicas: 450
strategy:
canary:
dynamicStableScale: true
canaryService: server-canary
canaryMetadata:
labels:
k8s.plaid.io/deployment-role: canary
stableService: server-stable
stableMetadata:
labels:
k8s.plaid.io/deployment-role: stable
trafficRouting:
smi:
trafficSplitName: server-traffic-split
rootService: server
analysis:
templates:
- templateName: server-analysis
args:
- name: service
valueFrom:
fieldRef:
fieldPath: "metadata.labels['app.kubernetes.io/name']"
- name: commit
valueFrom:
fieldRef:
fieldPath: "metadata.labels['k8s.plaid.io/commit-sha']"
- name: stable-hash
valueFrom:
podTemplateHashValue: Stable
- name: latest-hash
valueFrom:
podTemplateHashValue: Latest
steps:
- setWeight: 1
- pause:
duration: 10m0s
- setWeight: 10
- pause:
duration: 20m0s
- setWeight: 25
- pause:
duration: 30m0s
- setWeight: 50
- pause:
duration: 30m0s
- setWeight: 100
- pause:
duration: 30m0s
revisionHistoryLimit: 2
selector:
matchLabels:
app.kubernetes.io/managed-by: "Helm"
app.kubernetes.io/name: "server"
app.kubernetes.io/component: "server"
k8s.plaid.io/team: "cs"
template:
metadata:
labels:
app.kubernetes.io/managed-by: "Helm"
app.kubernetes.io/name: "server"
app.kubernetes.io/component: "server"
k8s.plaid.io/team: "cs"
app.kubernetes.io/version: "3.20.0"
helm.sh/chart: "plaid-app-3.20.0"
k8s.plaid.io/commit-sha: "05354ced8b0e3597227a5d6958aab2222dbe28d3"
annotations:
iam.amazonaws.com/role: k8s-server-production
cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
linkerd.io/inject: enabled
config.linkerd.io/skip-outbound-ports: "25,3306,5432,6379,11211,27017"
config.linkerd.io/close-wait-timeout: "3600s"
config.linkerd.io/proxy-cpu-request: "100m"
config.linkerd.io/proxy-memory-request: "100Mi"
config.linkerd.io/proxy-cpu-limit: ""
config.linkerd.io/proxy-memory-limit: "1Gi"
config.alpha.linkerd.io/proxy-wait-before-exit-seconds: "185"
config.linkerd.io/proxy-log-format: "json"
config.linkerd.io/proxy-log-level: warn,linkerd2_proxy=info
spec:
terminationGracePeriodSeconds: 185
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: k8s.plaid.io/spot
operator: DoesNotExist
values: []
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchLabels:
app.kubernetes.io/component: server
app.kubernetes.io/name: server
topologyKey: kubernetes.io/hostname
weight: 100
containers:
- name: app
image: "server:05354ced"
resources:
limits:
cpu: 8000m
memory: 16Gi
requests:
cpu: 4000m
memory: 12Gi
livenessProbe:
httpGet:
path: /health
port: 8025
initialDelaySeconds: 60
periodSeconds: 60
successThreshold: 1 # liveness must be 1
failureThreshold: 5
timeoutSeconds: 60
readinessProbe:
httpGet:
path: /health
port: 8025
initialDelaySeconds: 60
periodSeconds: 60
successThreshold: 1
failureThreshold: 5
timeoutSeconds: 60
command:
- "./start_server.sh"
env:
- name: PLAID_ENV
value: "production"
- name: K8S_METADATA_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: K8S_METADATA_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: K8S_LABEL_APP_KUBERNETES_IO_MANAGED_BY
valueFrom:
fieldRef:
fieldPath: metadata.labels['app.kubernetes.io/managed-by']
- name: K8S_LABEL_APP_KUBERNETES_IO_NAME
valueFrom:
fieldRef:
fieldPath: metadata.labels['app.kubernetes.io/name']
- name: K8S_LABEL_APP_KUBERNETES_IO_TEAM
valueFrom:
fieldRef:
fieldPath: metadata.labels['k8s.plaid.io/team']
- name: K8S_LABEL_APP_KUBERNETES_IO_VERSION
valueFrom:
fieldRef:
fieldPath: metadata.labels['app.kubernetes.io/version']
- name: K8S_LABEL_HELM_SH_CHART
valueFrom:
fieldRef:
fieldPath: metadata.labels['helm.sh/chart']
- name: K8S_LABEL_K8S_PLAID_IO_NETWORK_ACCESS_GROUP
valueFrom:
fieldRef:
fieldPath: metadata.labels['k8s.plaid.io/network-access-group']
- name: K8S_HOST_IP
valueFrom:
fieldRef:
fieldPath: status.hostIP
- name: K8S_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: K8S_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
ports:
- name: main
containerPort: 8024
- name: metrics
containerPort: 8025
dnsConfig:
options:
- name: ndots
value: "1" The most key thing here is that
This is a bit tougher. From what I know, to reproduce the issue:
At this point, I am not sure whether the bug applies just to SMI Rollouts, or to any Rollout whose traffic routing does not need knowledge of the replicaset hash at any given time.
I can see if it still exists there, for sure! I don't think this is easily reproducible in a kind or minikube cluster yet, but I think I can take away some instances on one of our testing clusters to simulate this.
I agree. I was looking into that when I last had to leave this work, as we found a workaround (just disable `dynamicStableScale`).

From what I could tell (starting from the point where we have delayed swapping the selectors and want to reconcile traffic routing at the final step):
The only way I could think of to fix this was to make the traffic reconciler aware of the hashes by adding a map to the reconciler, which is updated with the newest hashes during reconciliation. That being said, it could probably be simplified.
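A minimal sketch of that idea, with made-up names (this is not the actual argo-rollouts code): the reconciler keeps a small map of the latest canary/stable pod-template hashes, refreshes it on every reconcile, and refuses to apply a weight to a hash it has not been told about.

```go
// Illustrative only: a traffic reconciler that tracks the newest hashes and
// rejects weight changes for hashes it does not recognize.
package main

import "fmt"

type trafficReconciler struct {
	// latestHashes maps a role ("canary"/"stable") to the pod-template hash
	// that role currently corresponds to.
	latestHashes map[string]string
}

// updateHashes is called at the start of each reconcile with the hashes the
// rollout controller has just computed.
func (r *trafficReconciler) updateHashes(canaryHash, stableHash string) {
	r.latestHashes = map[string]string{
		"canary": canaryHash,
		"stable": stableHash,
	}
}

// setWeight refuses to route traffic to a hash it has never been told about,
// which is the failure mode described in this thread.
func (r *trafficReconciler) setWeight(role string, percent int, targetHash string) error {
	if r.latestHashes[role] != targetHash {
		return fmt.Errorf("refusing to send %d%% to %s hash %q: reconciler knows %q",
			percent, role, targetHash, r.latestHashes[role])
	}
	fmt.Printf("routing %d%% of traffic to %s (%s)\n", percent, role, targetHash)
	return nil
}

func main() {
	r := &trafficReconciler{}
	r.updateHashes("def456", "abc123")
	_ = r.setWeight("canary", 100, "def456") // ok: hash matches what the reconciler knows
	if err := r.setWeight("stable", 0, "zzz999"); err != nil {
		fmt.Println(err) // stale hash: bail out instead of mis-routing
	}
}
```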
oh no 😮💨 (really bad rebase)
Force-pushed from da5ccb2 to bbc818e.
Force-pushed from 99127a9 to 19862c9.
Force-pushed from 46cad81 to 0db6d11.
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No coverage information.
Going to close in favor of #2441.
This fixes #2050 by ensuring that, if a rollout needs to delay swapping the replicaset hashes, we do not continue reconciling.
If we were to continue reconciling then we would end up moving the `stable` (which should be the `canary`) to 100% and the `canary` (which should be the `stable`) to 0%. Importantly, if this is at the end of a rollout where the stable set has been spun down, then this will cause a traffic outage because the `stable` set will have 0 replicas available.

To illustrate the failure mode:
The troublesome bit here is step 3, which can be catastrophic if the canary replicaset continues to be unhealthy for an indeterminate period of time. Returning an error would only extend the period during which we are at the end of the rollout but have not yet completed.
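As a rough illustration of the approach described above (all names below are invented for the sketch, not the actual controller functions): when the selectors cannot be swapped yet, return an error so reconciliation stops before the traffic-routing step runs against stale Services.

```go
// Sketch, assuming a simplified reconcile ordering: swap Service selectors
// first, and only reconcile traffic routing if that succeeded.
package main

import (
	"errors"
	"fmt"
)

var errCannotSwapHashes = errors.New("cannot swap replicaset hashes yet")

// reconcile mimics the high-level ordering discussed in this PR.
func reconcile(canaryReady bool) error {
	if err := swapServiceSelectors(canaryReady); err != nil {
		// Propagate the error: the controller requeues and retries later,
		// instead of continuing and flipping traffic to an empty ReplicaSet.
		return err
	}
	return reconcileTrafficRouting()
}

func swapServiceSelectors(canaryReady bool) error {
	if !canaryReady {
		return errCannotSwapHashes
	}
	return nil
}

func reconcileTrafficRouting() error {
	fmt.Println("traffic weights updated against swapped selectors")
	return nil
}

func main() {
	if err := reconcile(false); err != nil {
		fmt.Println("requeue:", err) // no traffic change this cycle
	}
	_ = reconcile(true) // selectors swapped, safe to adjust weights
}
```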