etcd leader changed
#18438
Replies: 2 comments
-
Frequent leader changes are usually caused by an environment issue, i.e., network or disk I/O. So the first step is to figure out the root cause. For example, ping between each pair of control-plane VMs to check network latency, and use fio to check disk I/O latency.
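For the disk check specifically, etcd cares most about fsync latency rather than raw random-write IOPS, so a fio run along these lines is usually more telling (a sketch only; /var/lib/etcd and the 22m/2300-byte sizes are assumptions borrowed from common etcd benchmarking guidance, not from this thread):
$ sudo fio --name=etcd-fsync --rw=write --ioengine=sync --fdatasync=1 \
      --directory=/var/lib/etcd --size=22m --bs=2300
# Check the fsync/fdatasync latency percentiles in the output; for etcd the
# 99th percentile should stay well under 10ms.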
-
I have 3 control-plane nodes and HDD disks.
$ ping -c 5 10.10.40.36
PING 10.10.40.36 (10.10.40.36) 56(84) bytes of data.
64 bytes from 10.10.40.36: icmp_seq=1 ttl=64 time=0.053 ms
64 bytes from 10.10.40.36: icmp_seq=2 ttl=64 time=0.035 ms
64 bytes from 10.10.40.36: icmp_seq=3 ttl=64 time=0.037 ms
64 bytes from 10.10.40.36: icmp_seq=4 ttl=64 time=0.049 ms
64 bytes from 10.10.40.36: icmp_seq=5 ttl=64 time=0.090 ms
--- 10.10.40.36 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4075ms
rtt min/avg/max/mdev = 0.035/0.052/0.090/0.019 ms
$ ping -c 5 10.10.40.37
PING 10.10.40.37 (10.10.40.37) 56(84) bytes of data.
64 bytes from 10.10.40.37: icmp_seq=1 ttl=64 time=0.473 ms
64 bytes from 10.10.40.37: icmp_seq=2 ttl=64 time=0.324 ms
64 bytes from 10.10.40.37: icmp_seq=3 ttl=64 time=0.152 ms
64 bytes from 10.10.40.37: icmp_seq=4 ttl=64 time=0.268 ms
64 bytes from 10.10.40.37: icmp_seq=5 ttl=64 time=0.296 ms
--- 10.10.40.37 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4076ms
rtt min/avg/max/mdev = 0.152/0.302/0.473/0.103 ms
$ ping -c 5 10.10.40.38
PING 10.10.40.38 (10.10.40.38) 56(84) bytes of data.
64 bytes from 10.10.40.38: icmp_seq=1 ttl=64 time=0.508 ms
64 bytes from 10.10.40.38: icmp_seq=2 ttl=64 time=0.163 ms
64 bytes from 10.10.40.38: icmp_seq=3 ttl=64 time=0.249 ms
64 bytes from 10.10.40.38: icmp_seq=4 ttl=64 time=0.331 ms
64 bytes from 10.10.40.38: icmp_seq=5 ttl=64 time=0.212 ms
--- 10.10.40.38 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4099ms
rtt min/avg/max/mdev = 0.163/0.292/0.508/0.120 ms
$ sudo fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k --direct=1 --size=1G --numjobs=1 --runtime=60 --group_reporting --filename=/var/lib/etcd/fio_test_file
randwrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.16
Starting 1 process
randwrite: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [w(1)][100.0%][w=40KiB/s][w=10 IOPS][eta 00m:00s]
randwrite: (groupid=0, jobs=1): err= 0: pid=3731317: Tue Aug 20 09:23:15 2024
write: IOPS=10, BW=41.5KiB/s (42.5kB/s)(2488KiB/60016msec); 0 zone resets
slat (usec): min=30, max=36941, avg=110.31, stdev=1479.29
clat (msec): min=4, max=1887, avg=96.37, stdev=174.12
lat (msec): min=4, max=1887, avg=96.48, stdev=174.12
clat percentiles (msec):
| 1.00th=[ 9], 5.00th=[ 15], 10.00th=[ 15], 20.00th=[ 20],
| 30.00th=[ 26], 40.00th=[ 32], 50.00th=[ 40], 60.00th=[ 50],
| 70.00th=[ 66], 80.00th=[ 103], 90.00th=[ 230], 95.00th=[ 405],
| 99.00th=[ 852], 99.50th=[ 1053], 99.90th=[ 1888], 99.95th=[ 1888],
| 99.99th=[ 1888]
bw ( KiB/s): min= 8, max= 120, per=100.00%, avg=45.97, stdev=34.68, samples=108
iops : min= 2, max= 30, avg=11.47, stdev= 8.68, samples=108
lat (msec) : 10=2.09%, 20=22.67%, 50=35.53%, 100=19.45%, 250=10.77%
lat (msec) : 500=5.79%, 750=1.93%, 1000=1.13%, 2000=0.64%
cpu : usr=0.04%, sys=0.06%, ctx=626, majf=0, minf=13
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,622,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=41.5KiB/s (42.5kB/s), 41.5KiB/s-41.5KiB/s (42.5kB/s-42.5kB/s), io=2488KiB (2548kB), run=60016-60016msec
Disk stats (read/write):
sda: ios=158/2448, merge=0/640, ticks=5688/282670, in_queue=282712, util=82.50%
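With an average completion latency near 100 ms in that run, it may also be worth checking what etcd itself reports for disk sync times. A sketch, assuming the kubeadm defaults (client endpoint on 127.0.0.1:2379, certs under /etc/kubernetes/pki/etcd; adjust if yours differ):
$ sudo curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
      --cert /etc/kubernetes/pki/etcd/server.crt \
      --key /etc/kubernetes/pki/etcd/server.key \
      https://127.0.0.1:2379/metrics \
  | grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds'
# Histograms of WAL fsync and backend commit latency; sustained high values
# here point at the disk rather than the network.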
-
I need help with an issue in my Kubernetes cluster, which is set up with 3 control-plane nodes and 3 worker nodes using kubeadm. The etcd pods are located on the control-plane nodes. Initially, everything was working fine, but after a few months of usage, I've started encountering errors in etcd that are causing other pods to crash.
The error message I see is:
After researching online, I added the following options to my etcd manifests:
While the frequency of errors has decreased, they are still persisting, and it feels like I've only moved the problem rather than fully resolving it.
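The options themselves are not quoted above. Purely for illustration (an assumption, not necessarily what was changed here), the etcd flags usually raised for this symptom are the heartbeat interval and election timeout in /etc/kubernetes/manifests/etcd.yaml on each control-plane node:
# Hypothetical values, in milliseconds; the etcd defaults are 100 and 1000.
    - --heartbeat-interval=500
    - --election-timeout=5000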
Currently, the /var/lib/etcd directory is mounted on the system partition. I want to mount a new partition on sdb and then move etcd to it. Do you think this will resolve the performance issues with etcd? Could you please help me with this issue?
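For reference, a rough sketch of the move itself, done on one control-plane node at a time (assuming the new partition is /dev/sdb1 with ext4 and a kubeadm static-pod setup; the device name and filesystem are assumptions, adjust to your layout):
# /dev/sdb1 and ext4 are hypothetical; use your actual device/filesystem.
$ sudo mkfs.ext4 /dev/sdb1
# Stop the local etcd static pod by moving its manifest out of the way,
# and wait for the etcd container to exit before copying.
$ sudo mv /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/etcd.yaml.bak
# Copy the existing data onto the new partition, then mount it in place.
$ sudo mount /dev/sdb1 /mnt
$ sudo cp -a /var/lib/etcd/. /mnt/
$ sudo umount /mnt
$ sudo mount /dev/sdb1 /var/lib/etcd   # add an /etc/fstab entry to persist
# Restore the manifest so kubelet restarts etcd on the new disk.
$ sudo mv /etc/kubernetes/etcd.yaml.bak /etc/kubernetes/manifests/etcd.yaml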