Updates get stuck when nodes are imbalanced across zones #370
After recovering the data balance job, storaged-14 was still not restarting, with the error: … Checking the operator log found: … Error code -2016 is the error of … On this error, the user tried to delete storaged-12, storaged-13, and storaged-14 manually. Then storaged went into the following status: … This is strange, because with the operator a pod cannot simply be deleted: a deleted pod restarts automatically. Since we now have two storaged down for good, the operator must think the cluster has 12 storaged and that they are all running. That is reasonable, because the operator would get stuck at … Now we can just … After the operation, the cluster returns to normal.
Resolved in release v1.7.1.
The user was trying to update a static flag on a cluster that had been scaled down earlier (from 18 to 15 storaged). The earlier scale-down caused a zone imbalance (#351). The rolling restart got stuck and is now in a pending state.
Failure reason:
```
Normal  NotTriggerScaleUp  60s (x147 over 59m)  cluster-autoscaler  (combined from similar events): pod didn't trigger scale-up: 3 max node group size reached, 3 node(s) didn't match Pod's node affinity/selector, 1 node(s) didn't match pod topology spread constraints, 2 node(s) had volume node affinity conflict
```
Full logs of operator:
nebula-operator-controller-manager-deployment-c8995b98b-dq4sv.txt
The user's guess:
This is most likely because the PV is in a different zone, and the topology spread constraint is preventing the pod from being scheduled in the same zone as its PV.
The user took the following steps after the restart went into the pending state:
How can we recover from this?