One of our production 3-member etcd clusters had two leaders at the same time. During this event, the droplet (the physical host on which the EC2 instances run) behind the old leader (A) was unavailable, and the associated EBS volume failed to serve WAL fsync.
Ideally, the CheckQuorum raft message should be raised once the election timeout elapses, and RecentActive should be false for each peer from which no recent heartbeat response or MsgAppResp has reached the leader.
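To make the expected behavior concrete, here is a minimal sketch of a check-quorum step. It is a toy model and not the code in etcd's raft package: the `peer` type and the `quorumActive` and `checkQuorum` functions are hypothetical names chosen for illustration, and the leader is assumed to always count itself as active.

```go
package main

import "fmt"

// peer is a hypothetical stand-in for the leader's per-follower record.
type peer struct {
	id           string
	recentActive bool // set when a heartbeat response or MsgAppResp arrives
}

// quorumActive reports whether the leader (which counts itself) plus the
// recently active peers form a majority of the cluster.
func quorumActive(peers []peer) bool {
	active := 1 // the leader itself
	for _, p := range peers {
		if p.recentActive {
			active++
		}
	}
	clusterSize := len(peers) + 1
	return active >= clusterSize/2+1
}

// checkQuorum models one election-timeout boundary on the leader: it returns
// the resulting role, then clears the activity flags so that the next window
// starts fresh.
func checkQuorum(peers []peer) string {
	role := "leader"
	if !quorumActive(peers) {
		role = "follower"
	}
	for i := range peers {
		peers[i].recentActive = false
	}
	return role
}

func main() {
	peers := []peer{{id: "B", recentActive: true}, {id: "C"}}
	fmt.Println(checkQuorum(peers)) // B responded during this window
	fmt.Println(checkQuorum(peers)) // nobody has responded since the last check
}
```

In this model an isolated leader of a 3-member cluster steps down at the second check, which is the outcome the report expects from A. The check only helps if something actually invokes it when the election timeout elapses.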
Isolated old Leader A is still the leader from its point of view
# run A with the failpoint enabled
[root@ip-10-0-61-148 bin]# etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | a320313566418492 | 3.4.18 | 20 kB | true | false | 35 | 1741005 | 1741005 | |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@ip-10-0-61-148 bin]# curl http://127.0.0.1:1234/go.etcd.io/etcd/etcdserver/raftAfterSave -XPUT -d'sleep(600000)'
[root@ip-10-0-61-148 bin]# iptables -A INPUT -s 10.0.123.82 -j DROP && iptables -A INPUT -s 10.0.171.218 -j DROP
[root@ip-10-0-61-148 bin]# curl -sL http://localhost:2379/metrics | grep "is_leader"
# HELP etcd_server_is_leader Whether or not this member is a leader. 1 if is, 0 otherwise.
# TYPE etcd_server_is_leader gauge
etcd_server_is_leader 1
[root@ip-10-0-61-148 bin]# etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | a320313566418492 | 3.4.18 | 20 kB | true | false | 35 | 1741050 | 1741050 | |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Follower B (after the network partition is injected, it becomes the leader)
[root@ip-10-0-171-218 bin]# etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 6e05b88806758f58 | 3.4.17 | 20 kB | false | false | 35 | 1741005 | 1741005 | |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@ip-10-0-171-218 bin]# iptables -A INPUT -s 10.0.61.148 -j DROP
[root@ip-10-0-171-218 bin]# etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 6e05b88806758f58 | 3.4.17 | 20 kB | true | false | 36 | 1741062 | 1741062 | |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Follower C
etcdctl -w table endpoint status
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 70ddedef0fd6218 | 3.4.17 | 20 kB | false | false | 35 | 1741005 | 1741005 | |
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@ip-10-0-123-82 bin]# iptables -A INPUT -s 10.0.61.148 -j DROP
[root@ip-10-0-123-82 bin]# etcdctl -w table endpoint status
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 70ddedef0fd6218 | 3.4.17 | 20 kB | false | false | 36 | 1741058 | 1741058 | |
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
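The final status tables of the three members can be compared mechanically to spot the split: a member that reports itself as leader at a term lower than the highest term in the cluster is a stale leader. The sketch below is only an illustration of that comparison. The `memberStatus` type and `staleLeaders` function are hypothetical names, and the values are copied by hand from the tables above rather than fetched from etcd.

```go
package main

import "fmt"

// memberStatus holds the columns of `etcdctl endpoint status` that matter here.
type memberStatus struct {
	name     string
	isLeader bool
	raftTerm uint64
}

// staleLeaders returns the members that report themselves as leader at a term
// lower than the highest term seen across the cluster.
func staleLeaders(members []memberStatus) []string {
	var maxTerm uint64
	for _, m := range members {
		if m.raftTerm > maxTerm {
			maxTerm = m.raftTerm
		}
	}
	var stale []string
	for _, m := range members {
		if m.isLeader && m.raftTerm < maxTerm {
			stale = append(stale, m.name)
		}
	}
	return stale
}

func main() {
	// Values taken from the last status table shown for each member.
	members := []memberStatus{
		{name: "A", isLeader: true, raftTerm: 35},
		{name: "B", isLeader: true, raftTerm: 36},
		{name: "C", isLeader: false, raftTerm: 36},
	}
	fmt.Println(staleLeaders(members))
}
```

With these inputs only A is flagged: it still claims leadership at term 35 while B and C have moved on to term 36.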
Questions:
Is it required that r.tick() and rd := <-r.Ready() be executed mutually exclusively? If not, we could move rd := <-r.Ready() into a separate infinite loop running as a background goroutine.
Is the above behavior expected from a raft design perspective?
chaochn47 changed the title from "network partitioned leader was not able to step down to follower" to "disk write failed and network partitioned leader was not able to step down to follower" on Dec 8, 2021
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Hi etcd community:
One of our production 3-member etcd clusters had two leaders at the same time. During this event, the droplet (the physical host on which the EC2 instances run) behind the old leader (A) was unavailable, and the associated EBS volume failed to serve WAL fsync.
Ideally, the CheckQuorum raft message should be raised once the election timeout elapses, and RecentActive should be false for each peer from which no recent heartbeat response or MsgAppResp has reached the leader.
etcd/raft/raft.go
Lines 1000 to 1020 in 161bf7e
However, the ticker never fires, because the raft Ready handling logic is stalled on the disk write.
etcd/etcdserver/raft.go
Lines 170 to 173 in 161bf7e
This can be easily reproduced as follows:
Isolated old Leader A is still the leader from its point of view
Follower B (after the network partition is injected, it becomes the leader)
Follower C
Questions:
Is it required that r.tick() and rd := <-r.Ready() be executed mutually exclusively? If not, we could move rd := <-r.Ready() into a separate infinite loop running as a background goroutine.
Is the above behavior expected from a raft design perspective?
@gyuho @ptabor @serathius @hexfusion @wilsonwang371 PTAL, thx!