
Updates get stuck when nodes are imbalanced across zones #370

Closed
wenhaocs opened this issue Oct 24, 2023 · 5 comments
Labels
affects/none, process/fixed, severity/none, type/bug

Comments

wenhaocs commented Oct 24, 2023

The user was trying to update a static flag on a cluster that had earlier been scaled down (from 18 to 15 storaged). The earlier scale-down caused an imbalance across zones (#351). The rolling restart got stuck and is now in a pending state.

Failure reason:
Normal NotTriggerScaleUp 60s (x147 over 59m) cluster-autoscaler (combined from similar events): pod didn't trigger scale-up: 3 max node group size reached, 3 node(s) didn't match Pod's node affinity/selector, 1 node(s) didn't match pod topology spread constraints, 2 node(s) had volume node affinity conflict
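
For context, a quick way to check how the existing storaged pods are spread across zones (a sketch; the namespace is a placeholder, and the label selector is taken from the pod selector shown further down):

    # Which node is each storaged pod running on?
    kubectl -n <namespace> get pods -l app.kubernetes.io/component=storaged -o wide

    # Which zone is each node in? (-L prints the label value as an extra column)
    kubectl get nodes -L topology.kubernetes.io/zone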

Full logs of operator:
nebula-operator-controller-manager-deployment-c8995b98b-dq4sv.txt

The user is guessing that this is most likely due to the PV being in a different zone, and the topology spread constraint preventing the pod from being scheduled in the same zone.
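
One way to check that guess (a sketch; the namespace and PV name are placeholders that depend on the deployment) is to look at the node affinity recorded on the storaged PVs:

    # Find the PV bound to the pending pod's PVC, then inspect its zone affinity
    kubectl -n <namespace> get pvc
    kubectl get pv <pv-name> -o jsonpath='{.spec.nodeAffinity.required}{"\n"}'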

The user took the following steps after the restart went into the pending state:

  1. Manually scale nodes, to see if that helps the scheduler schedule the storaged pods on new nodes
  2. Scale down the storaged pods so that the zonal imbalance can be fixed

How can we recover from this?

@wenhaocs added the type/bug label Oct 24, 2023
@github-actions bot added the affects/none and severity/none labels Oct 24, 2023
@wenhaocs

Here is the current distribution of storaged across zones.
[screenshot: storaged distribution across zones]


wenhaocs commented Oct 24, 2023

After a while, storaged-0 recovered by itself. But now storaged-14 is in a pending state.
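
To see why storaged-14 is stuck in Pending, the scheduler's reasoning shows up in the pod's events (namespace is a placeholder):

    kubectl -n <namespace> describe pod mau-comm-2-storaged-14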


wenhaocs commented Oct 24, 2023

Asked the user to (1) update the schedule policy from DoNotSchedule to ScheduleAnyway, and (2) restart the pending pod.

Editing the nc is successful, but it seems the pending pod is still using the old policy.

We see this in the pod's describe output:
Topology Spread Constraints: topology.kubernetes.io/zone:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/cluster=mau-comm-2,app.kubernetes.io/component=storaged,app.kubernetes.io/managed-by=nebula-operator,app.kubernetes.io/name=nebula-graph
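
This is expected insofar as a pod's spec is immutable once the pod is created, so a pod built before the nc edit keeps the old DoNotSchedule constraint until it is recreated. A quick way to confirm what the pending pod actually carries (namespace is a placeholder):

    kubectl -n <namespace> get pod mau-comm-2-storaged-14 \
      -o jsonpath='{.spec.topologySpreadConstraints}{"\n"}'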

From the user's log, we discovered that a scale-in job had failed, which was blocking other jobs.
From the operator log, the jobs are blocked at: E1025 01:22:55.063205 1 nebula_cluster_control.go:124] reconcile storaged cluster failed: Balance job still in progress, jobID 66, spaceID 6
Asked the user to check the jobs. There was a data balance job failure.

Then we asked the user to recover the job.
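
For reference, job inspection and recovery are done from a NebulaGraph console (nGQL shown as comments; the exact syntax can vary slightly across versions, and job ID 66 comes from the operator log above):

    # In a NebulaGraph console session connected to graphd:
    #   SHOW JOBS;        -- the data balance job (jobID 66) should show up as FAILED
    #   RECOVER JOB 66;   -- restart the failed job so the operator's reconcile can proceed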

@MegaByte875 MegaByte875 added this to the v1.8.0 milestone Oct 26, 2023

wenhaocs commented Oct 27, 2023

After recovering the data balance job, storaged-14 was still not restarting, with the error:
Topology Spread Constraints: topology.kubernetes.io/zone:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/cluster=mau-comm-2,app.kubernetes.io/component=storaged,app.kubernetes.io/managed-by=nebula-operator,app.kubernetes.io/name=nebula-graph

Checked the operator log and found:
E1025 04:30:26.190042 1 storaged_scaler.go:192] drop hosts [HostAddr({Host:mau-comm-2-storaged-13.mau-comm-2-storaged-headless.mau-comm-2.svc.cluster.local Port:9779}) HostAddr({Host:mau-comm-2-storaged-12.mau-comm-2-storaged-headless.mau-comm-2.svc.cluster.local Port:9779})] failed: metad client response code -2016 name <UNSET>
E1025 04:30:26.190211 1 storaged_cluster.go:164] scale storaged cluster [mau-comm-2/mau-comm-2-storaged] failed: metad client response code -2016 name <UNSET>
E1025 04:30:26.190249 1 nebula_cluster_control.go:124] reconcile storaged cluster failed: metad client response code -2016 name <UNSET>
I1025 04:30:26.224872 1 nebulacluster.go:119] NebulaCluster [mau-comm-2/mau-comm-2] updated successfully
I1025 04:30:26.224911 1 nebula_cluster_controller.go:173] NebulaCluster [mau-comm-2/mau-comm-2] reconcile details: waiting for nebulacluster ready
E1025 04:30:26.224923 1 nebula_cluster_controller.go:184] NebulaCluster [mau-comm-2/mau-comm-2] reconcile failed: metad client response code -2016 name <UNSET>
I1025 04:30:26.224933 1 nebula_cluster_controller.go:143] Finished reconciling NebulaCluster [mau-comm-2/mau-comm-2] (199.465552ms), result: {false 5s}

Error code -2016 corresponds to E_RELATED_SPACE_EXISTS = -2016 ("There are still some space on the host, cannot drop it"). This is actually a bug, fixed in #377. Originally, before removing a host, the operator checked whether there were leader partitions on the host to be removed; if there were none, it would not call balance data remove. That causes trouble, because balance data remove has to be called to move the partitions off the host whenever any partitions remain on it, regardless of leader status. In this case, since the update ran at the same time as the scale-in, the storaged host being scaled in had already had its leaders transferred away during the update, leaving no leaders on it. With the current code, partition removal is therefore never called, which is why dropping the host fails while partitions still remain on it. From then on, every reconcile retries dropHosts and can never succeed without human intervention.
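
A quick way to see this state from the NebulaGraph side is SHOW HOSTS, which lists both the leader and partition distribution per storaged host (nGQL shown as comments; column names can differ slightly across versions):

    # In a NebulaGraph console session:
    #   SHOW HOSTS;
    # A host can report a leader count of 0 while still having a non-empty partition
    # distribution; in that state DROP HOSTS keeps failing with E_RELATED_SPACE_EXISTS (-2016)
    # until the remaining partitions are moved off the host.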

On this error, the user tried to delete storaged-12, storaged-13, and storaged-14 manually. The storaged pods then went into a status like this:
[screenshot: storaged pod status after the manual deletions]

This is strange because, under the operator, a pod cannot simply be deleted: a deleted pod restarts automatically. Since we now have two storaged down for good, it means the operator thinks the cluster has 12 storaged and that they are all running. That makes sense, because the operator would otherwise get stuck at dropHosts on every reconcile. After the user manually deleted the 3 pods, the operator found that the number of running storaged replicas matched what is set in the cluster spec, so it no longer enters dropHosts and no longer gets stuck.
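
A sketch to cross-check that view, i.e. desired versus actually running storaged replicas (the .spec.storaged.replicas path is my reading of the NebulaCluster CRD; adjust if your CRD differs):

    # Desired replicas according to the NebulaCluster object
    kubectl -n <namespace> get nebulacluster mau-comm-2 -o jsonpath='{.spec.storaged.replicas}{"\n"}'

    # Storaged pods that actually exist
    kubectl -n <namespace> get pods -l app.kubernetes.io/component=storaged --no-headers | wc -l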

Now we can simply drop the hosts and run balance data remove for those two storaged, just as if there were no operator.
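
The manual cleanup is done from a NebulaGraph console, roughly as follows (a sketch: the BALANCE DATA REMOVE form depends on the NebulaGraph version and edition, and the two host:port values are the storaged endpoints from the operator log above):

    # In a NebulaGraph console session:
    #   Move the partitions off the two hosts first (older releases use BALANCE DATA REMOVE
    #   directly, without SUBMIT JOB; check the syntax for your version):
    #     SUBMIT JOB BALANCE DATA REMOVE <storaged-12-host>:9779, <storaged-13-host>:9779;
    #   SHOW JOBS;   -- wait for the balance job to finish
    #   Then drop the now-empty hosts:
    #     DROP HOSTS <storaged-12-host>:9779, <storaged-13-host>:9779;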

After the operation, the cluster returned to normal.

@MegaByte875

Resolved in release v1.7.1.
