graphd not spread evenly across zones #425

Closed
jinyingsunny opened this issue Jan 25, 2024 · 2 comments

Labels: affects/master (this bug affects the master version) · process/done · severity/major · type/bug (something is unexpected)

Comments


jinyingsunny commented Jan 25, 2024

As the title says: graphd pods are not spread evenly across zones.
Cluster config before scaling out: [3 metad, 3 storaged, 9 graphd]; graphd was then scaled out from 9 to 28 replicas.
[screenshot attached]
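For reference, the scale-out was presumably applied by bumping spec.graphd.replicas on the NebulaCluster object; a minimal sketch (the object name nebula2 and the nebula namespace are taken from the operator log below):

# kubectl -n nebula patch nebulacluster nebula2 --type merge \
    -p '{"spec": {"graphd": {"replicas": 28}}}'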

Scale-out log from the operator:

I0125 02:58:59.593453       1 workload.go:122] workload StatefulSet nebula/nebula2-graphd updated successfully
E0125 02:58:59.597458       1 nebula_cluster_control.go:171] reconcile console failed: waiting for graphd cluster [nebula/nebula2-graphd] ready
I0125 02:58:59.614309       1 nebulacluster.go:129] NebulaCluster [nebula/nebula2] status updated successfully
I0125 02:58:59.614331       1 nebula_cluster_controller.go:184] NebulaCluster [nebula/nebula2] reconcile details: waiting for graphd cluster [nebula/nebula2-graphd] ready
I0125 02:58:59.614336       1 nebula_cluster_controller.go:184] NebulaCluster [nebula/nebula2] reconcile details: waiting for nebulacluster ready
I0125 02:58:59.614340       1 nebula_cluster_controller.go:157] Finished reconciling NebulaCluster [nebula/nebula2] (149.240646ms), result: {false 10s}
I0125 02:58:59.614438       1 nebula_cluster_controller.go:174] Start to reconcile NebulaCluster
I0125 02:58:59.806120       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-9] scheduled on node sunny in zone us-east-2b
I0125 02:58:59.809390       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-10] scheduled on node liuxue in zone us-east-2a
I0125 02:58:59.812622       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-11] scheduled on node k8s-node2 in zone us-east-2c
I0125 02:58:59.815446       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-12] scheduled on node k8s-node1 in zone us-east-2b
I0125 02:58:59.818606       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-13] scheduled on node liuxue in zone us-east-2a
I0125 02:58:59.821397       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-14] scheduled on node k8s-node2 in zone us-east-2c
I0125 02:58:59.824473       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-15] scheduled on node sunny in zone us-east-2b
I0125 02:58:59.827788       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-16] scheduled on node liuxue in zone us-east-2a
I0125 02:58:59.831215       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-17] scheduled on node k8s-node2 in zone us-east-2c
I0125 02:58:59.834867       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-18] scheduled on node k8s-node1 in zone us-east-2b
I0125 02:58:59.837658       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-19] scheduled on node k8s-master in zone us-east-2a
I0125 02:58:59.840796       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-20] scheduled on node sunny in zone us-east-2b
I0125 02:58:59.843901       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-21] scheduled on node liuxue in zone us-east-2a
I0125 02:58:59.846766       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-22] scheduled on node k8s-node2 in zone us-east-2c
I0125 02:58:59.851170       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-23] scheduled on node k8s-node1 in zone us-east-2b
I0125 02:58:59.854348       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-24] scheduled on node k8s-node2 in zone us-east-2c
I0125 02:58:59.866979       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-25] scheduled on node k8s-master in zone us-east-2a
I0125 02:58:59.915542       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-26] scheduled on node sunny in zone us-east-2b
I0125 02:58:59.979432       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-27] scheduled on node k8s-master in zone us-east-2a
I0125 02:59:00.045983       1 cm.go:98] configMap [nebula/nebula2-graphd-zone] updated successfully
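(Tallying the lines above, the 19 newly scheduled pods landed 7 in us-east-2a, 7 in us-east-2b, and only 5 in us-east-2c.)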

Cluster status after scaling out:

# kubectl -n nebula get pod
NAME                                READY   STATUS    RESTARTS   AGE
nebula2-console                     1/1     Running   0          19h
nebula2-exporter-5d5d6f5455-7r842   1/1     Running   0          20h
nebula2-graphd-0                    1/1     Running   0          14h
nebula2-graphd-1                    1/1     Running   0          14h
nebula2-graphd-10                   1/1     Running   0          25m
nebula2-graphd-11                   1/1     Running   0          25m
nebula2-graphd-12                   1/1     Running   0          25m
nebula2-graphd-13                   1/1     Running   0          25m
nebula2-graphd-14                   1/1     Running   0          25m
nebula2-graphd-15                   1/1     Running   0          25m
nebula2-graphd-16                   1/1     Running   0          25m
nebula2-graphd-17                   1/1     Running   0          25m
nebula2-graphd-18                   1/1     Running   0          25m
nebula2-graphd-19                   1/1     Running   0          25m
nebula2-graphd-2                    1/1     Running   0          14h
nebula2-graphd-20                   1/1     Running   0          25m
nebula2-graphd-21                   1/1     Running   0          25m
nebula2-graphd-22                   1/1     Running   0          25m
nebula2-graphd-23                   1/1     Running   0          25m
nebula2-graphd-24                   1/1     Running   0          25m
nebula2-graphd-25                   1/1     Running   0          25m
nebula2-graphd-26                   1/1     Running   0          25m
nebula2-graphd-27                   1/1     Running   0          25m
nebula2-graphd-3                    1/1     Running   0          14h
nebula2-graphd-4                    1/1     Running   0          14h
nebula2-graphd-5                    1/1     Running   0          14h
nebula2-graphd-6                    1/1     Running   0          14h
nebula2-graphd-7                    1/1     Running   0          14h
nebula2-graphd-8                    1/1     Running   0          14h
nebula2-graphd-9                    1/1     Running   0          25m
nebula2-metad-0                     1/1     Running   0          14h
nebula2-metad-1                     1/1     Running   0          14h
nebula2-metad-2                     1/1     Running   0          14h
nebula2-storaged-0                  1/1     Running   0          14h
nebula2-storaged-1                  1/1     Running   0          14h
nebula2-storaged-2                  1/1     Running   0          14h
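To double-check how the graphd pods ended up distributed across zones, one way (a sketch using only standard kubectl output) is to map each pod to its node and then look up each node's zone label:

# kubectl -n nebula get pod -o wide | grep graphd
# kubectl get nodes -L topology.kubernetes.io/zone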

My YAML file:

apiVersion: apps.nebula-graph.io/v1alpha1
kind: NebulaCluster
metadata:
  name: nebula
  namespace: nebula
spec:
  console:
    image: vesoft/nebula-console
    version: v3.6.0
  agent:
    image: reg.vesoft-inc.com/cloud-dev/nebula-agent
    resources: {}
    version: latest
  enablePVReclaim: true
  exporter:
    httpPort: 9100
    image: vesoft/nebula-stats-exporter
    maxRequests: 20
    replicas: 1
    version: latest
  failoverPeriod: 5m0s
  graphd:
    config:
      stderrthreshold: "0"
    image: reg.vesoft-inc.com/rc/nebula-graphd-ent
    replicas: 9
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
      requests:
        cpu: 200m
        memory: 500Mi
    version: v3.5-snap-ent
  imagePullPolicy: Always
  imagePullSecrets:
  - name: image-pull-secret
  metad:
    config:
      stderrthreshold: "1"
      zone_list: us-east-2a,us-east-2b,us-east-2c
      timestamp_in_logfile_name: "false"
      #validate_session_timestamp: "false"
      v: "3"
      license_manager_url: nebula-license-manager.nebula-license-manager.svc.cluster.local:9119
    dataVolumeClaim:
      resources:
        requests:
          storage: 2Gi
      storageClassName: local-path
    image: reg.vesoft-inc.com/rc/nebula-metad-ent
    replicas: 1
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
      requests:
        cpu: 300m
        memory: 500Mi
    version: v3.5-snap-ent
  reference:
    name: statefulsets.apps
    version: v1
  schedulerName: default-scheduler
  sslCerts:
    caCert: root.crt
    caSecret: ca-cert
    clientCACert: ca.crt
    clientCert: tls.crt
    clientKey: tls.key
    clientSecret: client-cert
    insecureSkipVerify: true
    serverCert: tls.crt
    serverKey: tls.key
    serverSecret: server-cert
  storaged:
    config:
      stderrthreshold: "2"
    dataVolumeClaims:
    - resources:
        requests:
          storage: 2Gi
      storageClassName: local-path
    enableAutoBalance: true
    image: reg.vesoft-inc.com/vesoft-ent/nebula-storaged-ent
    #image: reg.vesoft-inc.com/rc/nebula-storaged-ent
    replicas: 3
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
      requests:
        cpu: 300m
        memory: 500Mi
    version: v3.5-snap-ent
  topologySpreadConstraints:
  - topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
  imagePullSecrets:
  - name: image-nebula-ent-sc-secret
  nodeSelector:
    nebula: cloud

The 5 nodes are labeled across the 3 different zones, which you can see by searching for topology.kubernetes.io/zone=us-east-2 in the output below:

# kubectl get nodes --show-labels
NAME         STATUS   ROLES           AGE    VERSION   LABELS
k8s-master   Ready    control-plane   193d   v1.27.3   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-master,kubernetes.io/os=linux,kubernetes.io/zone=us-east-2a,nebula=cloud,node-role.kubernetes.io/control-plane=,node.kubernetes.io/exclude-from-external-load-balancers=,topology.kubernetes.io/zone=us-east-2a
k8s-node1    Ready    <none>          191d   v1.27.3   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node1,kubernetes.io/os=linux,kubernetes.io/zone=us-east-2b,nebula=cloud,topology.kubernetes.io/zone=us-east-2b
k8s-node2    Ready    <none>          193d   v1.27.3   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node2,kubernetes.io/os=linux,kubernetes.io/zone=us-east-2c,nebula=cloud,topology.kubernetes.io/zone=us-east-2c
liuxue       Ready    <none>          69d    v1.27.3   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=liuxue,kubernetes.io/os=linux,kubernetes.io/zone=us-east-2a,nebula=cloud,topology.kubernetes.io/zone=us-east-2a
sunny        Ready    <none>          68d    v1.27.3   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=sunny,kubernetes.io/os=linux,kubernetes.io/zone=us-east-2b,nebula=cloud,topology.kubernetes.io/zone=us-east-2b
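For example, all nodes in a single zone can be filtered by that label (repeat per zone):

# kubectl get nodes -l topology.kubernetes.io/zone=us-east-2a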

Your Environments (required)

operator: snap-1.30
kubectl version: v1.27.3

Expected behavior

After scaling out, graphd pods should be spread evenly across the different zones (e.g. 28 replicas over 3 zones should land roughly 10/9/9).

jinyingsunny added the severity/major, type/bug, and affects/master labels on Jan 25, 2024
jinyingsunny (Author) commented:

Retried 3 times; the uneven distribution occurred all 3 times. graphd was scaled out 9->28, 9->29, and 9->21. From graphd-18 to graphd-20, the pods were always placed in us-east-2b, us-east-2a, us-east-2b:

I0125 05:37:31.977733       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-18] scheduled on node k8s-node1 in zone us-east-2b
I0125 05:37:31.980248       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-19] scheduled on node k8s-master in zone us-east-2a
I0125 05:37:32.041398       1 graphd_cluster.go:310] graphd pod [nebula/nebula2-graphd-20] scheduled on node sunny in zone us-east-2b


jinyingsunny commented Feb 19, 2024

Not reproduced on operator snap-1.35.
The problem found during the earlier verification was caused by an incorrect NebulaCluster configuration: when the topology constraint cannot be satisfied, whenUnsatisfiable should be DoNotSchedule rather than the ScheduleAnyway setting below, which schedules the pods anyway. In addition, the developers reported a bug in the operator code as well, which has since been fixed.

  topologySpreadConstraints:
  - topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
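Based on that, the corrected constraint in the NebulaCluster spec keeps the same topology key with whenUnsatisfiable switched to DoNotSchedule:

  topologySpreadConstraints:
  - topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule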

Separately, when resources are insufficient after scaling out, the scale-out completes once more resources are added; this scenario can also be covered by scaling down other services to free up resources.

github-actions bot added the process/fixed label on Feb 19, 2024
jinyingsunny added the process/done label and removed the process/fixed label on Feb 19, 2024