After Rebooting all my Control Plane Nodes at the same time: ovn-central details":"inconsistent data","error":"ovsdb error" #3919

Closed
Smithx10 opened this issue Apr 16, 2024 · 12 comments
Labels
bug Something isn't working

@Smithx10

How do we recover from the following condition?

PROBE_INTERVAL is set to 180000
OVN_LEADER_PROBE_INTERVAL is set to 5
OVN_NORTHD_N_THREADS is set to 1
ENABLE_COMPACT is set to false
172.16.0.1

  • ovn-northd is not running
  • ovnnb_db is not running
  • ovnsb_db is not running
    ovsdb-tool: /etc/ovn/ovnnb_db.db: record 175 attempts to truncate log from 1873 to 1839 entries, but commit index is already 1872
    backup /etc/ovn/ovnnb_db.db to /etc/ovn/ovnnb_db.db.backup-1713236899-d612f5
    detected database corruption for file /etc/ovn/ovnnb_db.db, try to fix it.
    ovsdb-tool: /etc/ovn/ovnnb_db.db: record 175 attempts to truncate log from 1873 to 1839 entries, but commit index is already 1872
    [{"uuid":["uuid","41001f66-a798-438c-b29f-d308c8b4f853"]},{"uuid":["uuid","d6d7e922-1438-40f8-ab60-6b96b09bd538"]}]
    [{"uuid":["uuid","58250fbc-e394-43b0-8d66-be020227f6ce"]},{"uuid":["uuid","7775de3b-d03c-45ae-b9da-74a39b50ea0b"]}]
    2024-04-16T03:08:19Z|00001|stream_ssl|ERR|Private key must be configured to use SSL
    2024-04-16T03:08:19Z|00002|stream_ssl|ERR|Certificate must be configured to use SSL
    2024-04-16T03:08:19Z|00003|stream_ssl|ERR|CA certificate must be configured to use SSL
    ovsdb-client: failed to connect to "ssl:[172.16.0.1]:6641" (Protocol not available)
    2024-04-16T03:08:19Z|00001|stream_ssl|ERR|Private key must be configured to use SSL
    2024-04-16T03:08:19Z|00002|stream_ssl|ERR|Certificate must be configured to use SSL
    2024-04-16T03:08:19Z|00003|stream_ssl|ERR|CA certificate must be configured to use SSL
    ovsdb-client: failed to connect to "ssl:[172.16.0.2]:6641" (Protocol not available)
    2024-04-16T03:08:19Z|00001|stream_ssl|ERR|Private key must be configured to use SSL
    2024-04-16T03:08:19Z|00002|stream_ssl|ERR|Certificate must be configured to use SSL
    2024-04-16T03:08:19Z|00003|stream_ssl|ERR|CA certificate must be configured to use SSL
    ovsdb-client: failed to connect to "ssl:[172.16.0.3]:6641" (Protocol not available)
  • Starting ovsdb-nb
    ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed ()
    2024-04-16T03:08:19Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
    2024-04-16T03:08:19Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
  • Waiting for OVN_Northbound to come up
  • Starting ovsdb-sb
    ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed ()
    2024-04-16T03:08:20Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
    2024-04-16T03:08:20Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
  • Waiting for OVN_Southbound to come up
  • Starting ovn-northd
    2024-04-16T03:08:21Z|00002|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
    ovn-nbctl: transaction error: {"details":"inconsistent data","error":"ovsdb error"}
  • Exiting ovn-northd (673)
  • Exiting ovnnb_db (280)
  • Exiting ovnsb_db (483)
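The truncation error above is ovsdb's raft log consistency check rejecting record 175. Before restoring anything, the damaged file can be inspected with stock ovsdb-tool subcommands; a minimal sketch, assuming the database path from the log above:

```shell
# Back up first; the startup script already wrote one, but be explicit.
cp /etc/ovn/ovnnb_db.db /etc/ovn/ovnnb_db.db.manual-backup

# Dump the raft log records (record 175 is the one rejected above).
ovsdb-tool show-log /etc/ovn/ovnnb_db.db

# Validate the clustered database file for internal consistency.
ovsdb-tool check-cluster /etc/ovn/ovnnb_db.db
```

This only diagnoses the corruption; the actual recovery is the kick/rejoin procedure discussed below.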
@zhangzujian zhangzujian added the bug Something isn't working label Apr 16, 2024
@Smithx10
Author

It appears this is only happening on one pod, which is on headnode-01.

[ use1 ] root@headnode-01:~$ kubectl ko nb status
faf1
Name: OVN_Northbound
Cluster ID: 277c (277c6d46-33b1-42b7-83be-4457951d8c54)
Server ID: faf1 (faf1ae0a-1103-48b4-993e-b06aab97f168)
Address: ssl:[172.16.0.3]:6643
Status: cluster member
Role: leader
Term: 20
Leader: self
Vote: self

Last Election started 44557 ms ago, reason: timeout
Last Election won: 44555 ms ago
Election timer: 5000
Log: [1802, 3219]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: (->ebb1) ->48ff <-48ff
Disconnections: 3
Servers:
    ebb1 (ebb1 at ssl:[172.16.0.1]:6643) next_index=3219 match_index=3218 last msg 13984 ms ago
    48ff (48ff at ssl:[172.16.0.2]:6643) next_index=3219 match_index=3218 last msg 650 ms ago
    faf1 (faf1 at ssl:[172.16.0.3]:6643) (self) next_index=3216 match_index=3218
status: ok


[ use1 ] root@headnode-01:~$ kubectl ko sb status
9e2c
Name: OVN_Southbound
Cluster ID: 7da9 (7da9f927-2c1d-4160-94c2-5be40d54cb81)
Server ID: 9e2c (9e2cd89b-5a30-4ae4-b3b8-7dfbd659ca53)
Address: ssl:[172.16.0.3]:6644
Status: cluster member
Role: leader
Term: 20
Leader: self
Vote: self

Last Election started 133221 ms ago, reason: timeout
Last Election won: 133218 ms ago
Election timer: 5000
Log: [1869, 3584]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: (->6919) ->81a3 <-81a3
Disconnections: 5
Servers:
    6919 (6919 at ssl:[172.16.0.1]:6644) next_index=1918 match_index=3583 last msg 21212 ms ago
    81a3 (81a3 at ssl:[172.16.0.2]:6644) next_index=3584 match_index=3583 last msg 1600 ms ago
    9e2c (9e2c at ssl:[172.16.0.3]:6644) (self) next_index=3580 match_index=3583
status: ok


│ kube-system     ovn-central-868c6dc8c7-f9sr5      ●      0/1       CrashLoopBackOff                 3       0       0          0          0          0          0 172.16.0.1      headnode-01      116s       │
│ kube-system     ovn-central-868c6dc8c7-jfkzp      ●      1/1       Running                          0      46      70         15          1         35          1 172.16.0.2      headnode-02      2m6s       │
│ kube-system     ovn-central-868c6dc8c7-kl45v      ●      1/1       Running                          0      88      74         29          2         37          1 172.16.0.3      headnode-03      2m2s       │

@bobz965
Collaborator

bobz965 commented Apr 16, 2024

please see the doc: https://kubeovn.github.io/docs/v1.13.x/ops/recover-db/
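For reference, the "kick from cluster" recovery in that doc boils down to running cluster/kick against each database from a healthy member. A sketch using the pod names and server IDs from this thread (substitute your own from kubectl ko nb status / kubectl ko sb status; the doc remains the authoritative procedure):

```shell
# Kick the faulty NB member (server ID ebb1 in this thread's status output)
# from a healthy ovn-central pod.
kubectl -n kube-system exec ovn-central-868c6dc8c7-jfkzp -- \
  ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/kick OVN_Northbound ebb1

# Same for the SB cluster (server ID 6919 here).
kubectl -n kube-system exec ovn-central-868c6dc8c7-jfkzp -- \
  ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/kick OVN_Southbound 6919
```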

@Smithx10
Author

@bobz965 Great! Kick From cluster worked perfectly. Thank You :)

@Smithx10 Smithx10 reopened this Apr 16, 2024
@Smithx10
Author

After kicking headnode-01 out of the cluster:

nbctl show is empty. I think the database is gone?

[ use1 ] root@headnode-02:$ k ko nbctl show
[ use1 ] root@headnode-02:$

@Smithx10
Author

Alright, something might be strange with the plugin.

It seems k ko nbctl isn't returning results, but running the ovn-nbctl show command inside one of the ovn-central pods works as expected.
root@headnode-03:/kube-ovn# ovn-nbctl show
switch 55df0ec4-af42-4035-9252-6c06b7b19a9a (storage)
port localnet.storage
type: localnet
addresses: ["unknown"]
port t2.default.storage.default.ovn
addresses: ["00:00:00:EE:7E:6A 172.16.17.105"]
switch a4cfaf08-da84-4735-9b3e-08f5cbf78644 (ovn-default)
port virt-operator-66b7f94b6d-tf9h9.kubevirt
addresses: ["00:00:00:08:2E:01 172.16.128.15"]
port linstor-csi-node-z2p7t.piraeus-datastore
addresses: ["00:00:00:D8:D9:EA 172.16.128.54"]
port ha-controller-7q7hf.piraeus-datastore
addresses: ["00:00:00:DA:4D:6D 172.16.128.132"]
port kube-ovn-pinger-sgdg6.kube-system
addresses: ["00:00:00:40:31:FC 172.16.129.45"]
port rke2-coredns-rke2-coredns-5b8c65d87f-nl9f2.kube-system
addresses: ["00:00:00:2D:47:1D 172.16.128.17"]
port kube-ovn-pinger-h6g65.kube-system
addresses: ["00:00:00:74:BA:A1 172.16.128.144"]
port ha-controller-rmp9q.piraeus-datastore
addresses: ["00:00:00:FB:74:57 172.16.128.159"]
port rke2-ingress-nginx-controller-8cvgx.kube-system
addresses: ["00:00:00:AA:DA:D3 172.16.128.129"]
port rke2-coredns-rke2-coredns-5b8c65d87f-g8kwq.kube-system
addresses: ["00:00:00:23:47:3D 172.16.128.18"]
port rke2-ingress-nginx-controller-h7z4m.kube-system
addresses: ["00:00:00:2C:C6:E2 172.16.128.98"]
port virt-handler-7hw2b.kubevirt
addresses: ["00:00:00:45:17:13 172.16.128.120"]
port ha-controller-zhspc.piraeus-datastore
addresses: ["00:00:00:AB:2E:61 172.16.128.122"]
port redisoperator-fb5478dbb-nz8bg.harbor-operator-ns
addresses: ["00:00:00:63:BC:B1 172.16.128.97"]
port virt-api-5686f97bc-vfzhb.kubevirt
addresses: ["00:00:00:8E:19:A1 172.16.128.2"]
port rke2-ingress-nginx-controller-5vgt8.kube-system
addresses: ["00:00:00:38:E8:40 172.16.128.125"]
port piraeus-operator-controller-manager-65c7fbbb5b-2ghg8.piraeus-datastore
addresses: ["00:00:00:FF:94:06 172.16.128.6"]
port ha-controller-445kw.piraeus-datastore
addresses: ["00:00:00:F4:C2:9C 172.16.128.143"]
port rke2-coredns-rke2-coredns-5b8c65d87f-xlwsz.kube-system
addresses: ["00:00:00:B4:D8:6F 172.16.128.13"]
port kube-ovn-pinger-5hdvw.kube-system
addresses: ["00:00:00:99:E9:27 172.16.129.43"]
port virt-controller-678998f868-8c4bg.kubevirt
addresses: ["00:00:00:33:69:26 172.16.128.16"]
port harbor-operator-7b48c67445-xvh8d.harbor-operator-ns
addresses: ["00:00:00:91:12:7E 172.16.128.14"]
port linstor-csi-node-n8tsq.piraeus-datastore
addresses: ["00:00:00:00:4D:85 172.16.128.50"]
port virt-handler-55gpf.kubevirt
addresses: ["00:00:00:81:EE:65 172.16.128.127"]
port ovn-default-ovn-cluster
type: router
router-port: ovn-cluster-ovn-default
port ha-controller-t76f4.piraeus-datastore
addresses: ["00:00:00:93:68:28 172.16.128.124"]
port virt-handler-55nl4.kubevirt
addresses: ["00:00:00:2B:7D:5A 172.16.129.36"]
port rke2-coredns-rke2-coredns-5b8c65d87f-lm744.kube-system
addresses: ["00:00:00:7C:5F:89 172.16.128.12"]
port kube-ovn-pinger-d6hbn.kube-system
addresses: ["00:00:00:C6:36:8C 172.16.128.48"]
port postgres-operator-95754cbfd-q48qg.harbor-operator-ns
addresses: ["00:00:00:8F:A2:13 172.16.128.203"]
port linstor-csi-node-rj4rq.piraeus-datastore
addresses: ["00:00:00:9E:D5:D9 172.16.128.32"]
port virt-handler-pmn4j.kubevirt
addresses: ["00:00:00:4E:12:DA 172.16.128.165"]
port linstor-csi-node-x6kt2.piraeus-datastore
addresses: ["00:00:00:84:08:D6 172.16.128.87"]
port ha-controller-7hwxc.piraeus-datastore
addresses: ["00:00:00:9F:62:58 172.16.129.48"]
port kube-ovn-pinger-49zch.kube-system
addresses: ["00:00:00:3B:AD:C6 172.16.129.35"]
port linstor-csi-node-4sv9g.piraeus-datastore
addresses: ["00:00:00:7A:0C:0A 172.16.128.61"]
port cert-manager-85cfbd86f5-gzb5q.cert-manager
addresses: ["00:00:00:58:AB:67 172.16.128.7"]
port linstor-csi-controller-557f665789-9xpfq.piraeus-datastore
addresses: ["00:00:00:01:69:C2 172.16.128.11"]
port kube-ovn-pinger-jcttq.kube-system
addresses: ["00:00:00:83:A1:8C 172.16.128.93"]
port virt-handler-lmzcj.kubevirt
addresses: ["00:00:00:63:E1:D3 172.16.129.32"]
port ha-controller-rkjld.piraeus-datastore
addresses: ["00:00:00:C5:31:F5 172.16.129.54"]
port kube-ovn-pinger-4q448.kube-system
addresses: ["00:00:00:32:9E:FF 172.16.129.33"]
port openebs-zfs-localpv-controller-0.openebs
addresses: ["00:00:00:44:ED:99 172.16.128.113"]
port rke2-coredns-rke2-coredns-autoscaler-945fbd459-f898l.kube-system
addresses: ["00:00:00:E7:63:1D 172.16.128.5"]
port ha-controller-zq7xm.piraeus-datastore
addresses: ["00:00:00:3D:3B:0E 172.16.128.168"]
port linstor-csi-node-snwlf.piraeus-datastore
addresses: ["00:00:00:96:D9:AC 172.16.128.91"]
port rke2-ingress-nginx-controller-m82ff.kube-system
addresses: ["00:00:00:5A:00:94 172.16.128.138"]
port kube-ovn-pinger-xgc27.kube-system
addresses: ["00:00:00:EA:1C:48 172.16.128.30"]
port my-nginx-684dd4dcd4-25t4z.default
addresses: ["00:00:00:D1:B6:39 172.16.128.188"]
port linstor-controller-7df855f57-94hd6.piraeus-datastore
addresses: ["00:00:00:BD:F7:D6 172.16.128.214"]
port rke2-coredns-rke2-coredns-5b8c65d87f-gb6b7.kube-system
addresses: ["00:00:00:B1:96:DD 172.16.128.102"]
port rke2-coredns-rke2-coredns-5b8c65d87f-ht7xw.kube-system
addresses: ["00:00:00:AF:2C:1A 172.16.128.103"]
port rke2-ingress-nginx-controller-tvq9r.kube-system
addresses: ["00:00:00:56:33:54 172.16.128.156"]
port rke2-coredns-rke2-coredns-5b8c65d87f-ck8cx.kube-system
addresses: ["00:00:00:1C:D3:DD 172.16.128.105"]
port virt-handler-4vv4x.kubevirt
addresses: ["00:00:00:5D:30:0C 172.16.129.57"]
port kube-ovn-pinger-dft8j.kube-system
addresses: ["00:00:00:E6:F7:33 172.16.129.44"]
port linstor-csi-node-58p86.piraeus-datastore
addresses: ["00:00:00:F3:66:53 172.16.128.85"]
port t3.default
addresses: ["00:00:00:10:DB:23 172.16.128.106"]
port cert-manager-webhook-847d7676c9-fs9ld.cert-manager
addresses: ["00:00:00:1C:85:CA 172.16.128.212"]
port kube-ovn-pinger-wlkf7.kube-system
addresses: ["00:00:00:7E:C2:4E 172.16.128.28"]
port cdi-apiserver-78d5585c5d-wpv24.cdi
addresses: ["00:00:00:54:3C:B6 172.16.128.209"]
port kube-ovn-pinger-hvxvq.kube-system
addresses: ["00:00:00:5F:53:3C 172.16.128.44"]
port linstor-csi-node-w8xnt.piraeus-datastore
addresses: ["00:00:00:A9:D5:62 172.16.128.76"]
port virt-handler-hk6dn.kubevirt
addresses: ["00:00:00:2C:12:AF 172.16.128.151"]
port kube-ovn-pinger-mdz4d.kube-system
addresses: ["00:00:00:F7:27:00 172.16.128.46"]
port virt-handler-959xv.kubevirt
addresses: ["00:00:00:37:DF:63 172.16.128.162"]
port linstor-csi-node-j5wfv.piraeus-datastore
addresses: ["00:00:00:80:3D:E5 172.16.128.86"]
port linstor-csi-node-dgzsz.piraeus-datastore
addresses: ["00:00:00:CF:18:C2 172.16.128.100"]
port linstor-csi-node-mgj7t.piraeus-datastore
addresses: ["00:00:00:CD:B4:65 172.16.128.35"]
port virt-handler-2hx96.kubevirt
addresses: ["00:00:00:45:5F:DF 172.16.128.152"]
port virt-handler-dx9qx.kubevirt
addresses: ["00:00:00:CA:D0:A0 172.16.128.142"]
port virt-handler-cmql9.kubevirt
addresses: ["00:00:00:26:77:9D 172.16.128.135"]
port virt-handler-6dfqn.kubevirt
addresses: ["00:00:00:43:4B:AC 172.16.128.158"]
port kube-ovn-pinger-wg9hh.kube-system
addresses: ["00:00:00:86:79:4C 172.16.128.145"]
port kube-ovn-pinger-cvgzs.kube-system
addresses: ["00:00:00:9B:E3:6F 172.16.128.47"]
port t4.default
addresses: ["00:00:00:34:07:AE 172.16.128.107"]
port rke2-snapshot-validation-webhook-54c5989b65-nzp5l.kube-system
addresses: ["00:00:00:9F:C5:0A 172.16.128.211"]
port virt-handler-nhnwd.kubevirt
addresses: ["00:00:00:99:75:FF 172.16.128.141"]
port ha-controller-rbnm6.piraeus-datastore
addresses: ["00:00:00:B5:6D:10 172.16.128.147"]
port my-nginx-684dd4dcd4-p9fvk.default
addresses: ["00:00:00:9F:C3:89 172.16.128.31"]
port piraeus-operator-gencert-7c5d64d5fc-ddmz8.piraeus-datastore
addresses: ["00:00:00:82:9E:49 172.16.128.99"]
port rke2-coredns-rke2-coredns-5b8c65d87f-vg6g6.kube-system
addresses: ["00:00:00:22:AA:BE 172.16.128.26"]
port ha-controller-pg6vk.piraeus-datastore
addresses: ["00:00:00:D7:5F:88 172.16.128.128"]
port virt-api-5686f97bc-564nw.kubevirt
addresses: ["00:00:00:EA:F9:E2 172.16.128.204"]
port rke2-ingress-nginx-controller-sd4ql.kube-system
addresses: ["00:00:00:A9:E0:3C 172.16.128.133"]
port rke2-ingress-nginx-controller-722zg.kube-system
addresses: ["00:00:00:14:CF:D3 172.16.128.121"]
port linstor-csi-node-2hnwh.piraeus-datastore
addresses: ["00:00:00:CB:18:17 172.16.128.43"]
port ha-controller-2r2ww.piraeus-datastore
addresses: ["00:00:00:08:E9:56 172.16.128.153"]
port cert-manager-cainjector-c7d4dbdd9-bv6ln.cert-manager
addresses: ["00:00:00:0D:1C:19 172.16.128.207"]
port rke2-ingress-nginx-controller-v5bcd.kube-system
addresses: ["00:00:00:B6:C5:B2 172.16.128.166"]
port rke2-snapshot-controller-59cc9cd8f4-4sfxv.kube-system
addresses: ["00:00:00:A3:1B:1B 172.16.128.213"]
port cdi-deployment-74b786dcc6-twr5m.cdi
addresses: ["00:00:00:C3:3D:EC 172.16.128.8"]
port ha-controller-xsbh5.piraeus-datastore
addresses: ["00:00:00:04:58:8A 172.16.128.163"]
port virt-handler-lvclp.kubevirt
addresses: ["00:00:00:A8:09:4B 172.16.129.51"]
port kube-ovn-pinger-tggf5.kube-system
addresses: ["00:00:00:44:CC:01 172.16.128.146"]
port cdi-operator-75d5789946-mgr5h.cdi
addresses: ["00:00:00:30:08:69 172.16.128.3"]
port virt-handler-v5qfj.kubevirt
addresses: ["00:00:00:95:E7:07 172.16.128.131"]
port linstor-csi-node-c7z97.piraeus-datastore
addresses: ["00:00:00:82:1A:37 172.16.128.62"]
port rke2-ingress-nginx-controller-47c4f.kube-system
addresses: ["00:00:00:DF:5A:01 172.16.128.148"]
port kube-ovn-pinger-t2vfk.kube-system
addresses: ["00:00:00:E4:42:09 172.16.128.45"]
port virt-operator-66b7f94b6d-mptpm.kubevirt
addresses: ["00:00:00:4A:AC:DE 172.16.128.9"]
port ha-controller-4fhcc.piraeus-datastore
addresses: ["00:00:00:6B:A9:1D 172.16.128.104"]
port rke2-ingress-nginx-controller-ftvnb.kube-system
addresses: ["00:00:00:4D:95:3F 172.16.128.154"]
port virt-handler-48pvp.kubevirt
addresses: ["00:00:00:80:C2:C5 172.16.129.34"]
port virt-handler-z6ddd.kubevirt
addresses: ["00:00:00:B3:8E:87 172.16.129.47"]
port rke2-metrics-server-544c8c66fc-s4cpr.kube-system
addresses: ["00:00:00:3C:26:A5 172.16.128.4"]
port kube-ovn-pinger-rqx5w.kube-system
addresses: ["00:00:00:EA:8F:67 172.16.129.31"]
port virt-controller-678998f868-jw4wp.kubevirt
addresses: ["00:00:00:B8:CF:6E 172.16.128.206"]
port virt-handler-45zv9.kubevirt
addresses: ["00:00:00:46:3B:8E 172.16.128.167"]
port kube-ovn-pinger-npk7p.kube-system
addresses: ["00:00:00:DC:A2:06 172.16.128.96"]
port ha-controller-8zvg6.piraeus-datastore
addresses: ["00:00:00:4E:81:D7 172.16.128.155"]
port rke2-ingress-nginx-controller-24r7b.kube-system
addresses: ["00:00:00:62:3A:F7 172.16.129.50"]
port rke2-ingress-nginx-controller-7f7dp.kube-system
addresses: ["00:00:00:43:73:F2 172.16.129.46"]
port minio-operator-855cd887f4-hfqwf.harbor-operator-ns
addresses: ["00:00:00:00:5D:B4 172.16.128.201"]
port linstor-csi-node-92r8k.piraeus-datastore
addresses: ["00:00:00:72:D9:4B 172.16.128.42"]
port rke2-ingress-nginx-controller-lbr8p.kube-system
addresses: ["00:00:00:50:E9:95 172.16.128.137"]
port linstor-csi-node-zfdx5.piraeus-datastore
addresses: ["00:00:00:F2:E8:43 172.16.128.81"]
port rke2-ingress-nginx-controller-2ml82.kube-system
addresses: ["00:00:00:50:18:89 172.16.129.55"]
port rke2-ingress-nginx-controller-8q8vk.kube-system
addresses: ["00:00:00:CA:E0:20 172.16.128.160"]
port rke2-ingress-nginx-controller-hw7hp.kube-system
addresses: ["00:00:00:E9:3A:0A 172.16.128.170"]
port ha-controller-p6l9t.piraeus-datastore
addresses: ["00:00:00:4C:2B:AA 172.16.128.136"]
port cdi-uploadproxy-5c4d65444d-nczdp.cdi
addresses: ["00:00:00:23:5D:5A 172.16.128.208"]
port linstor-csi-node-lpxcv.piraeus-datastore
addresses: ["00:00:00:E2:68:65 172.16.128.84"]
port virt-handler-rcdng.kubevirt
addresses: ["00:00:00:56:06:CB 172.16.128.119"]
port ha-controller-js7wn.piraeus-datastore
addresses: ["00:00:00:C6:79:7A 172.16.129.52"]
port kube-ovn-pinger-kqmnn.kube-system
addresses: ["00:00:00:48:91:40 172.16.128.95"]
port minio-operator-855cd887f4-z6xqr.harbor-operator-ns
addresses: ["00:00:00:4A:79:48 172.16.128.40"]
switch 22cb15be-d2fe-4a3f-a9d5-2a0552d86da7 (join)
port node-nsc-08
addresses: ["00:00:00:B3:43:B2 100.64.0.5"]
port node-nsc-04
addresses: ["00:00:00:63:AC:EB 100.64.0.7"]
port node-spinning-02
addresses: ["00:00:00:39:EA:28 100.64.0.16"]
port node-nvme-02
addresses: ["00:00:00:E2:53:21 100.64.0.2"]
port node-nvme-01
addresses: ["00:00:00:16:B4:94 100.64.0.3"]
port node-headnode-02
addresses: ["00:00:00:19:C3:4B 100.64.0.22"]
port node-nsc-07
addresses: ["00:00:00:A4:B5:DE 100.64.0.8"]
port node-nsc-03
addresses: ["00:00:00:EF:D0:E9 100.64.0.13"]
port join-ovn-cluster
type: router
router-port: ovn-cluster-join
port node-headnode-03
addresses: ["00:00:00:34:EF:6B 100.64.0.21"]
port node-nvme-03
addresses: ["00:00:00:EB:D6:F6 100.64.0.4"]
port node-headnode-01
addresses: ["00:00:00:63:89:BF 100.64.0.23"]
port node-spinning-03
addresses: ["00:00:00:D4:4C:5A 100.64.0.15"]
port node-spinning-01
addresses: ["00:00:00:1F:CC:1B 100.64.0.17"]
port node-nsc-05
addresses: ["00:00:00:32:AE:F8 100.64.0.10"]
port node-nsc-01
addresses: ["00:00:00:B6:61:59 100.64.0.12"]
port node-nsc-10
addresses: ["00:00:00:4A:73:0A 100.64.0.6"]
port node-nsc-06
addresses: ["00:00:00:05:C9:1A 100.64.0.14"]
port node-nsc-09
addresses: ["00:00:00:2C:2F:7D 100.64.0.9"]
port node-nsc-02
addresses: ["00:00:00:BE:FA:EE 100.64.0.11"]
switch 88a16dca-c621-4801-b754-8393b7e78e16 (external2080)
port t2.default
addresses: ["00:00:00:AE:C3:21 10.91.237.1"]
port localnet.external2080
type: localnet
tag: 2080
addresses: ["unknown"]
switch 9a9d40e6-934b-487a-bbd7-98de0fa3cfce (external)
port localnet.external
type: localnet
tag: 1998
addresses: ["unknown"]
port t1.default
addresses: ["00:00:00:65:0A:B1 10.91.64.3"]
port external-ovn-cluster
type: router
router-port: ovn-cluster-external
router 79a8ca90-715b-47b0-bb21-3c43ed2be277 (ovn-cluster)
port ovn-cluster-external
mac: "00:00:00:9E:F1:47"
networks: ["10.91.64.1/19"]
gateway chassis: [66585c3d-7730-4548-8339-74334208a934 fc35cde7-fdca-4280-aa68-31da9aa25654 4e94822c-a871-4aef-8adb-37558537534a af148592-4f16-432e-972b-a9a934502a41 e98adbf1-4fb8-41fa-b5d9-3975f18cc9a2 d6af920a-a9d0-4a32-92b8-4be203b28f4f bea6ae2e-6c61-4506-a2af-725f6164a31c 594ac44a-e6b6-430c-8032-aeaf67cc7eaa a0226e57-92f2-48e3-943e-3cfb78c9388d 21167e15-e730-49e5-8298-5beccf59120a 0b64effb-6818-4cd0-9a84-224f5fb6f8e5 0e11f83a-85bf-4e53-b088-251b8dbf55fb 3a697fd6-fa3b-4403-b781-ab23c844b0fd 42645368-650c-4ad6-91a9-87b58631ad14 2871172c-4d32-4c33-b9c9-2fa27137428b 0d6e24cd-7ab9-4e8a-8625-f25f692404ec e0d7b65f-28d1-4c73-a96e-0fe443940110 57eacb42-4981-4a1b-a3ae-95565e393726 624f5878-23b6-40ac-8593-2064be0fc543]
port ovn-cluster-join
mac: "00:00:00:F0:B4:C8"
networks: ["100.64.0.1/16"]
port ovn-cluster-ovn-default
mac: "00:00:00:63:7A:29"
networks: ["172.16.128.1/17"]
nat 310923ad-0955-4070-86d5-89cddd246f4b
external ip: "10.91.64.220"
logical ip: "172.16.128.106"
type: "dnat_and_snat"

@Smithx10
Author

After coming back, I went to test EIP. Tried deleting an eip and a fip; both came back "not found".

E0416 14:08:14.390556       1 ovn-nb-nat.go:302] not found logical router ovn-cluster nat 'type dnat_and_snat external ip 10.91.64.5 logical ip 172.16.128.134'
E0416 14:08:14.390587       1 ovn-nb-nat.go:214] not found logical router ovn-cluster nat 'type dnat_and_snat external ip 10.91.64.5 logical ip 172.16.128.134'
E0416 14:08:14.390600       1 ovn_fip.go:464] failed to delete fip eip-static, not found logical router ovn-cluster nat 'type dnat_and_snat external ip 10.91.64.5 logical ip 172.16.128.134'
E0416 14:08:14.390645       1 ovn_fip.go:172] error syncing 'eip-static': not found logical router ovn-cluster nat 'type dnat_and_snat external ip 10.91.64.5 logical ip 172.16.128.134', requeuing
I0416 14:08:25.356517       1 ovn_eip.go:324] handle del ovn eip eip-static
E0416 14:08:25.356582       1 ovn_eip.go:637] ovn eip 'eip-static' is still in use, finalizer will not be removed
E0416 14:08:25.356596       1 ovn_eip.go:348] failed to handle remove ovn eip finalizer , ovn eip 'eip-static' is still in use, finalizer will not be removed
E0416 14:08:25.356635       1 ovn_eip.go:204] error syncing 'eip-static': ovn eip 'eip-static' is still in use, finalizer will not be removed, requeuing

@Smithx10
Author

Restarting all 3 of the ovn-central pods allowed the kubectl ko command to function as expected. I assume it was attempting to execute this command on an old leader?

@Smithx10
Author

After bouncing them, the kube-ovn-controller logs are flooded with:

E0416 14:15:10.352430       1 pod.go:433] error syncing 'kubevirt/virt-handler-nhnwd': generate operations for creating logical switch port virt-handler-nhnwd.kubevirt: get logical switch ovn-default when generate mutate operations: not found logical switch "ovn-default", requeuing
I0416 14:15:10.352608       1 event.go:364] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kubevirt", Name:"virt-handler-nhnwd", UID:"690eb317-6d9b-4bd1-affb-64eb906d1f1c", APIVersion:"v1", ResourceVersion:"29244282", FieldPath:""}): type: 'Warning' reason: 'CreateOVNPortFailed' generate operations for creating logical switch port virt-handler-nhnwd.kubevirt: get logical switch ovn-default when generate mutate operations: not found logical switch "ovn-default"
I0416 14:15:10.451737       1 pod.go:578] handle add/update pod harbor-operator-ns/postgres-operator-95754cbfd-q48qg
I0416 14:15:10.451832       1 pod.go:635] sync pod harbor-operator-ns/postgres-operator-95754cbfd-q48qg allocated
I0416 14:15:10.451855       1 ipam.go:72] allocating static ip 172.16.128.203 from subnet ovn-default
I0416 14:15:10.451876       1 ipam.go:108] allocate v4 172.16.128.203, mac 00:00:00:8F:A2:13 for harbor-operator-ns/postgres-operator-95754cbfd-q48qg from subnet ovn-default
E0416 14:15:10.452595       1 ovn-nb-logical_switch.go:379] not found logical switch "ovn-default"
E0416 14:15:10.452614       1 ovn-nb-logical_switch_port.go:730] get logical switch ovn-default when generate mutate operations: not found logical switch "ovn-default"
E0416 14:15:10.452624       1 ovn-nb-logical_switch_port.go:111] get logical switch ovn-default when generate mutate operations: not found logical switch "ovn-default"
E0416 14:15:10.452654       1 pod.go:737] generate operations for creating logical switch port postgres-operator-95754cbfd-q48qg.harbor-operator-ns: get logical switch ovn-default when generate mutate operations: not found logical switch "ovn-default"
E0416 14:15:10.452666       1 pod.go:617] generate operations for creating logical switch port postgres-operator-95754cbfd-q48qg.harbor-operator-ns: get logical switch ovn-default when generate mutate operations: not found logical switch "ovn-default"
E0416 14:15:10.452686       1 pod.go:433] error syncing 'harbor-operator-ns/postgres-operator-95754cbfd-q48qg': generate operations for creating logical switch port postgres-operator-95754cbfd-q48qg.harbor-operator-ns: get logical switch ovn-default when generate mutate operations: not found logical switch "ovn-default", requeuing
I0416 14:15:10.452708       1 event.go:364] Event(v1.ObjectReference{Kind:"Pod", Namespace:"harbor-operator-ns", Name:"postgres-operator-95754cbfd-q48qg", UID:"5284fc0b-0064-42d1-bee3-fb5c64c2c11f", APIVersion:"v1", ResourceVersion:"29244285", FieldPath:""}): type: 'Warning' reason: 'CreateOVNPortFailed' generate operations for creating logical switch port postgres-operator-95754cbfd-q48qg.harbor-operator-ns: get logical switch ovn-default when generate mutate operations: not found logical switch "ovn-default"

@Smithx10
Author

Smithx10 commented Apr 16, 2024

Bounced the kube-ovn-controllers and ran into:

I0416 14:19:31.176288       1 init.go:451] take 0.01 seconds to initialize IPAM
I0416 14:19:31.901405       1 vpc.go:721] vpc ovn-cluster add static route: &{Policy:policyDst CIDR:0.0.0.0/0 NextHopIP:100.64.0.1 ECMPMode: BfdID: RouteTable:}
I0416 14:19:31.902370       1 ovn-nb-logical_router_route.go:103] logical router ovn-cluster del static routes: []
I0416 14:19:31.902471       1 init.go:619] start to sync subnets
E0416 14:19:31.903057       1 subnet.go:2223] ipam subnet external2080 has no ip in using, but some ip cr left: ip 1, vip 0, iptable eip 0, ovn eip 0
E0416 14:19:31.903080       1 init.go:636] failed to calculate subnet external2080 used ip: ipam subnet external2080 has no ip in using, but some ip cr left: ip 1, vip 0, iptable eip 0, ovn eip 0
E0416 14:19:31.903139       1 klog.go:10] "failed to sync crd subnets" err="ipam subnet external2080 has no ip in using, but some ip cr left: ip 1, vip 0, iptable eip 0, ovn eip 0"
Stream closed EOF for kube-system/kube-ovn-controller-6d4cc9b96b-5s6pl (kube-ovn-controller)

[ use1 ] root@headnode-01:~/yamls/eip$ k get subnet
NAME           PROVIDER              VPC           PROTOCOL   CIDR              PRIVATE   NAT     DEFAULT   GATEWAYTYPE   V4USED   V4AVAILABLE   V6USED   V6AVAILABLE   EXCLUDEIPS                       U2OINTERCONNECTIONIP
external       ovn                   ovn-cluster   IPv4       10.91.64.0/19     false             false     distributed   3        8186          0        0             ["10.91.95.254"]
external2080   ovn                   ovn-cluster   IPv4       10.91.237.0/24    false     false   false     distributed   1        252           0        0             ["10.91.237.254"]
join           ovn                   ovn-cluster   IPv4       100.64.0.0/16     false     false   false     distributed   19       65514         0        0             ["100.64.0.1"]
ovn-default    ovn                   ovn-cluster   IPv4       172.16.128.0/17   false     true    true      distributed   176      32589         0        0             ["172.16.128.1"]
storage        storage.default.ovn   ovn-cluster   IPv4       172.16.16.0/21    false     false   false     distributed   1        1791          0        0             ["172.16.16.1..172.16.16.254"]

[ use1 ] root@headnode-01:~/yamls/eip$ k get ip  | grep external
t1.default                                                               10.91.64.3              00:00:00:65:0A:B1   nsc-06        external
t2.default                                                               10.91.237.1             00:00:00:AE:C3:21   nsc-05        external2080
I0416 14:41:01.712318       1 init.go:619] start to sync subnets
E0416 14:41:01.712830       1 subnet.go:2223] ipam subnet storage has no ip in using, but some ip cr left: ip 1, vip 0, iptable eip 0, ovn eip 0
E0416 14:41:01.712872       1 init.go:636] failed to calculate subnet storage used ip: ipam subnet storage has no ip in using, but some ip cr left: ip 1, vip 0, iptable eip 0, ovn eip 0
E0416 14:41:01.712933       1 klog.go:10] "failed to sync crd subnets" err="ipam subnet storage has no ip in using, but some ip cr left: ip 1, vip 0, iptable eip 0, ovn eip 0"

@Smithx10
Author

I was able to get the controller past init by kubectl-deleting all the pods and IPs on those subnets, which required me to patch the finalizers to null.

These aren't in production, but I imagine that subnet init function may have a bug in it.
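For anyone hitting the same stuck IP CRs: a sketch of the finalizer-clearing workaround described above. The resource name t1.default is illustrative; list your actual leftovers first.

```shell
# Find the leftover IP CRs on the affected subnets.
kubectl get ip

# Clear the finalizer so deletion can proceed, then delete the CR.
# (t1.default is a stand-in; repeat for each stuck resource.)
kubectl patch ip t1.default --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl delete ip t1.default
```

Note the usual caveat: force-clearing finalizers skips whatever cleanup the controller would have done, so this is a last resort on non-production clusters.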

@bobz965
Collaborator

bobz965 commented Apr 17, 2024

After kicking headnode-01 out of the cluster:

nbctl show is empty. I think the database is gone?

[ use1 ] root@headnode-02:$ k ko nbctl show
[ use1 ] root@headnode-02:$

I think you should kick the bad one, then clean its NB and SB database data, and add it back.
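A sketch of that clean-and-rejoin sequence, assuming the file paths and pod name seen earlier in this thread (adjust to your hostPath layout):

```shell
# On headnode-01, after the member has been kicked from both clusters:
# remove its stale raft database files so it rejoins with a fresh
# snapshot from the current leader.
rm -f /etc/ovn/ovnnb_db.db /etc/ovn/ovnsb_db.db

# Delete the crashing pod so the deployment recreates it; on startup it
# should join the existing cluster as a new member.
kubectl -n kube-system delete pod ovn-central-868c6dc8c7-f9sr5
```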

@oilbeater
Collaborator

Should be fixed by #3928.
