
Updates get stuck when nodes are imbalanced across zones #370

Closed
wenhaocs opened this issue Oct 24, 2023 · 5 comments
Labels
affects/none, process/fixed, severity/none, type/bug

Comments

wenhaocs commented Oct 24, 2023

The user was trying to update a static flag on a cluster that had earlier been scaled down (from 18 to 15 storaged). The earlier scale-down caused an imbalance across zones (#351). The rolling restart got stuck and is now in a pending state.

Failure reason:
Normal NotTriggerScaleUp 60s (x147 over 59m) cluster-autoscaler (combined from similar events): pod didn't trigger scale-up: 3 max node group size reached, 3 node(s) didn't match Pod's node affinity/selector, 1 node(s) didn't match pod topology spread constraints, 2 node(s) had volume node affinity conflict
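
For context, a quick way to check how the existing storaged pods are spread across zones (a sketch; the namespace is a placeholder, and the label selector is taken from the pod selector shown further down):

    # Which node is each storaged pod running on?
    kubectl -n <namespace> get pods -l app.kubernetes.io/component=storaged -o wide

    # Which zone is each node in? (-L prints the label value as an extra column)
    kubectl get nodes -L topology.kubernetes.io/zone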

Full logs of operator:
nebula-operator-controller-manager-deployment-c8995b98b-dq4sv.txt

The user is guessing that this is most likely due to the PV being in a different zone, and the topology spread constraint preventing the pod from being scheduled in the same zone.
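
One way to check that guess (a sketch; the namespace and PV name are placeholders that depend on the deployment) is to look at the node affinity recorded on the storaged PVs:

    # Find the PV bound to the pending pod's PVC, then inspect its zone affinity
    kubectl -n <namespace> get pvc
    kubectl get pv <pv-name> -o jsonpath='{.spec.nodeAffinity.required}{"\n"}'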

The user took the following steps after the restart went into the pending state:

  1. Manually scale nodes, to see if that helps the scheduler schedule the storaged pods on new nodes
  2. Scale down the storaged pods so that the zonal imbalance can be fixed

How can we recover from this?

@wenhaocs added the type/bug label Oct 24, 2023
@github-actions bot added the affects/none and severity/none labels Oct 24, 2023
@wenhaocs

Here is the current distribution of storaged across zones.
[screenshot: storaged distribution across zones]


wenhaocs commented Oct 24, 2023

After a while, storaged-0 recovered by itself. But now storaged-14 is in a pending state.
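
To see why storaged-14 is stuck in Pending, the scheduler's reasoning shows up in the pod's events (namespace is a placeholder):

    kubectl -n <namespace> describe pod mau-comm-2-storaged-14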


wenhaocs commented Oct 24, 2023

Asked the user to (1) update the schedule policy from DoNotSchedule to ScheduleAnyway, and (2) restart the pending pod.

Editing the nc is successful, but it seems the pending pod is still using the old policy.

We see this in the pod's describe output:
Topology Spread Constraints: topology.kubernetes.io/zone:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/cluster=mau-comm-2,app.kubernetes.io/component=storaged,app.kubernetes.io/managed-by=nebula-operator,app.kubernetes.io/name=nebula-graph
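
This is expected insofar as a pod's spec is immutable once the pod is created, so a pod built before the nc edit keeps the old DoNotSchedule constraint until it is recreated. A quick way to confirm what the pending pod actually carries (namespace is a placeholder):

    kubectl -n <namespace> get pod mau-comm-2-storaged-14 \
      -o jsonpath='{.spec.topologySpreadConstraints}{"\n"}'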

From the user's log, we discovered that a scale-in job had failed, which was blocking other jobs.
From the operator log, the jobs are blocked at: E1025 01:22:55.063205 1 nebula_cluster_control.go:124] reconcile storaged cluster failed: Balance job still in progress, jobID 66, spaceID 6
Asked the user to check the jobs. There was a data balance job failure.

Then we asked the user to recover the job.
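
For reference, job inspection and recovery are done from a NebulaGraph console (nGQL shown as comments; the exact syntax can vary slightly across versions, and job ID 66 comes from the operator log above):

    # In a NebulaGraph console session connected to graphd:
    #   SHOW JOBS;        -- the data balance job (jobID 66) should show up as FAILED
    #   RECOVER JOB 66;   -- restart the failed job so the operator's reconcile can proceed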

@MegaByte875 MegaByte875 added this to the v1.8.0 milestone Oct 26, 2023

wenhaocs commented Oct 27, 2023

After recovering the data balance job, storaged-14 was still not restarting, with the error:
Topology Spread Constraints: topology.kubernetes.io/zone:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/cluster=mau-comm-2,app.kubernetes.io/component=storaged,app.kubernetes.io/managed-by=nebula-operator,app.kubernetes.io/name=nebula-graph

Checked the operator log and found:
E1025 04:30:26.190042 1 storaged_scaler.go:192] drop hosts [HostAddr({Host:mau-comm-2-storaged-13.mau-comm-2-storaged-headless.mau-comm-2.svc.cluster.local Port:9779}) HostAddr({Host:mau-comm-2-storaged-12.mau-comm-2-storaged-headless.mau-comm-2.svc.cluster.local Port:9779})] failed: metad client response code -2016 name <UNSET>
E1025 04:30:26.190211 1 storaged_cluster.go:164] scale storaged cluster [mau-comm-2/mau-comm-2-storaged] failed: metad client response code -2016 name <UNSET>
E1025 04:30:26.190249 1 nebula_cluster_control.go:124] reconcile storaged cluster failed: metad client response code -2016 name <UNSET>
I1025 04:30:26.224872 1 nebulacluster.go:119] NebulaCluster [mau-comm-2/mau-comm-2] updated successfully
I1025 04:30:26.224911 1 nebula_cluster_controller.go:173] NebulaCluster [mau-comm-2/mau-comm-2] reconcile details: waiting for nebulacluster ready
E1025 04:30:26.224923 1 nebula_cluster_controller.go:184] NebulaCluster [mau-comm-2/mau-comm-2] reconcile failed: metad client response code -2016 name <UNSET>
I1025 04:30:26.224933 1 nebula_cluster_controller.go:143] Finished reconciling NebulaCluster [mau-comm-2/mau-comm-2] (199.465552ms), result: {false 5s}

Error code -2016 corresponds to E_RELATED_SPACE_EXISTS = -2016 ("There are still some space on the host, cannot drop it"). This is actually a bug, fixed in #377. Originally, before removing a host, the operator checked whether there were leader partitions on the host to be removed; if there were none, it would not call balance data remove. That causes trouble, because balance data remove has to be called to move the partitions off the host whenever any partitions remain on it, regardless of leader status. In this case, since the update ran at the same time as the scale-in, the storaged host being scaled in had already had its leaders transferred away during the update, leaving no leaders on it. With the current code, partition removal is therefore never called, which is why dropping the host fails while partitions still remain on it. From then on, every reconcile retries dropHosts and can never succeed without human intervention.
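
A quick way to see this state from the NebulaGraph side is SHOW HOSTS, which lists both the leader and partition distribution per storaged host (nGQL shown as comments; column names can differ slightly across versions):

    # In a NebulaGraph console session:
    #   SHOW HOSTS;
    # A host can report a leader count of 0 while still having a non-empty partition
    # distribution; in that state DROP HOSTS keeps failing with E_RELATED_SPACE_EXISTS (-2016)
    # until the remaining partitions are moved off the host.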

On this error, the user tried to delete storaged-12, storaged-13, and storaged-14 manually. The storaged pods then went into a status like this:
[screenshot: storaged pod status after the manual deletions]

This is strange because, under the operator, a pod cannot simply be deleted: a deleted pod restarts automatically. Since we now have two storaged down for good, it means the operator thinks the cluster has 12 storaged and that they are all running. That makes sense, because the operator would otherwise get stuck at dropHosts on every reconcile. After the user manually deleted the 3 pods, the operator found that the number of running storaged replicas matched what is set in the cluster spec, so it no longer enters dropHosts and no longer gets stuck.
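
A sketch to cross-check that view, i.e. desired versus actually running storaged replicas (the .spec.storaged.replicas path is my reading of the NebulaCluster CRD; adjust if your CRD differs):

    # Desired replicas according to the NebulaCluster object
    kubectl -n <namespace> get nebulacluster mau-comm-2 -o jsonpath='{.spec.storaged.replicas}{"\n"}'

    # Storaged pods that actually exist
    kubectl -n <namespace> get pods -l app.kubernetes.io/component=storaged --no-headers | wc -l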

Now we can simply drop the hosts and run balance data remove for those two storaged, just as if there were no operator.
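
The manual cleanup is done from a NebulaGraph console, roughly as follows (a sketch: the BALANCE DATA REMOVE form depends on the NebulaGraph version and edition, and the two host:port values are the storaged endpoints from the operator log above):

    # In a NebulaGraph console session:
    #   Move the partitions off the two hosts first (older releases use BALANCE DATA REMOVE
    #   directly, without SUBMIT JOB; check the syntax for your version):
    #     SUBMIT JOB BALANCE DATA REMOVE <storaged-12-host>:9779, <storaged-13-host>:9779;
    #   SHOW JOBS;   -- wait for the balance job to finish
    #   Then drop the now-empty hosts:
    #     DROP HOSTS <storaged-12-host>:9779, <storaged-13-host>:9779;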

After the operation, the cluster returned to normal.

@MegaByte875

Resolved in release v1.7.1.
