Memory leak in etcd 3.4.3 followers when using leases. #11495

Closed
hpdvanwyk opened this issue Jan 6, 2020 · 5 comments

@hpdvanwyk

Putting values with leases on an etcd 3.4.3 cluster causes the cluster followers to slowly leak memory.
Order of events (a minimal sketch in Go follows the list):

  • Grant lease
  • Put kv pair with lease
  • Get kv pair
  • Expire lease
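
A minimal sketch of that loop with clientv3 (illustrative only; the endpoint, key, value, and TTL are placeholders, and the actual client used for the repro is in the gist linked below):

package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Endpoint is illustrative; point it at any member of the test cluster.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"192.168.128.2:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()
	for {
		// Grant a short-lived lease.
		lease, err := cli.Grant(ctx, 5)
		if err != nil {
			log.Fatal(err)
		}
		// Put a kv pair attached to the lease.
		if _, err := cli.Put(ctx, "leak-test", "value", clientv3.WithLease(lease.ID)); err != nil {
			log.Fatal(err)
		}
		// Get the kv pair back.
		if _, err := cli.Get(ctx, "leak-test"); err != nil {
			log.Fatal(err)
		}
		// No KeepAlive is sent, so the lease expires server-side after its TTL.
	}
}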

This was first seen on a RancherOS 1.5.4 Kubernetes cluster and has since been reproduced with Docker on Ubuntu 18.04.

The memory leak only seems to happen on followers and not on the leader.

To reproduce, use the docker-compose file at https://gist.github.com/hpdvanwyk/b5ace96e3c8a37dcea1a70c690e0562e with:

mkdir /dev/shm/etcd.tmp1 /dev/shm/etcd.tmp2 /dev/shm/etcd.tmp3
docker-compose up

The RPC rate seems to affect the leak rate, so putting the data directories on a tmpfs like /dev/shm speeds up the leak.

For the client, use https://gist.github.com/hpdvanwyk/a94f9cdf2fdad31056cda6ce08402645 and run:

go get github.com/google/uuid
go get go.etcd.io/etcd/clientv3
go run main.go

Let this run for an hour or so.

Sample pprof heap output:

hendrik@buildboxhvw:~/pprof$ go tool pprof http://192.168.128.2:2379/debug/pprof/heap
Fetching profile over HTTP from http://192.168.128.2:2379/debug/pprof/heap
Saved profile in /home/hendrik/pprof/pprof.etcd.alloc_objects.alloc_space.inuse_objects.inuse_space.030.pb.gz
File: etcd
Type: inuse_space
Time: Jan 6, 2020 at 2:07pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 380.98MB, 93.63% of 406.92MB total
Dropped 107 nodes (cum <= 2.03MB)
Showing top 10 nodes out of 88
      flat  flat%   sum%        cum   cum%
     153MB 37.60% 37.60%   186.11MB 45.74%  go.etcd.io/etcd/lease.(*LeaseExpiredNotifier).RegisterOrUpdate
     123MB 30.23% 67.83%   313.21MB 76.97%  go.etcd.io/etcd/lease.(*lessor).Grant
   33.11MB  8.14% 75.96%    33.11MB  8.14%  go.etcd.io/etcd/lease.(*LeaseQueue).Push
   29.17MB  7.17% 83.13%    29.17MB  7.17%  go.etcd.io/etcd/mvcc/backend.(*bucketBuffer).Copy
    9.72MB  2.39% 85.52%     9.72MB  2.39%  go.etcd.io/etcd/mvcc/backend.newBucketBuffer
    9.50MB  2.33% 87.86%    12.50MB  3.07%  go.etcd.io/etcd/mvcc.(*keyIndex).tombstone
    6.50MB  1.60% 89.46%     6.50MB  1.60%  go.etcd.io/etcd/etcdserver/etcdserverpb.(*PutRequest).Unmarshal
    5.98MB  1.47% 90.93%    12.48MB  3.07%  go.etcd.io/etcd/etcdserver/api/rafthttp.startPeer
    5.98MB  1.47% 92.40%     5.98MB  1.47%  go.etcd.io/etcd/etcdserver/api/rafthttp.startStreamWriter
       5MB  1.23% 93.63%     5.50MB  1.35%  go.etcd.io/etcd/mvcc.(*treeIndex).Put
(pprof) quit
hendrik@buildboxhvw:~/pprof$ go tool pprof http://192.168.128.2:2479/debug/pprof/heap
Fetching profile over HTTP from http://192.168.128.2:2479/debug/pprof/heap
Saved profile in /home/hendrik/pprof/pprof.etcd.alloc_objects.alloc_space.inuse_objects.inuse_space.031.pb.gz
File: etcd
Type: inuse_space
Time: Jan 6, 2020 at 2:07pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 49.61MB, 82.91% of 59.83MB total
Showing top 10 nodes out of 142
      flat  flat%   sum%        cum   cum%
      10MB 16.72% 16.72%    17.50MB 29.25%  go.etcd.io/etcd/mvcc.(*keyIndex).tombstone
    8.50MB 14.21% 30.92%     8.50MB 14.21%  go.etcd.io/etcd/mvcc.(*treeIndex).Put
    7.50MB 12.54% 43.46%     7.50MB 12.54%  go.etcd.io/etcd/mvcc.(*keyIndex).put
       7MB 11.70% 55.16%        7MB 11.70%  go.etcd.io/etcd/etcdserver/etcdserverpb.(*PutRequest).Unmarshal
    4.49MB  7.50% 62.66%     7.48MB 12.50%  go.etcd.io/etcd/etcdserver/api/rafthttp.startPeer
    2.99MB  5.00% 67.66%     2.99MB  5.00%  go.etcd.io/etcd/etcdserver/api/rafthttp.(*streamWriter).closeUnlocked
    2.99MB  5.00% 72.66%     2.99MB  5.00%  go.etcd.io/etcd/etcdserver/api/rafthttp.startStreamWriter
    2.50MB  4.18% 76.84%        3MB  5.02%  github.com/google/btree.(*node).mutableFor
    2.31MB  3.87% 80.71%     2.31MB  3.87%  go.etcd.io/etcd/etcdserver/api/rafthttp.newMsgAppV2Encoder
    1.32MB  2.20% 82.91%     1.32MB  2.20%  go.etcd.io/etcd/lease.(*LeaseExpiredNotifier).RegisterOrUpdate
(pprof) quit
hendrik@buildboxhvw:~/pprof$ go tool pprof http://192.168.128.2:2579/debug/pprof/heap
Fetching profile over HTTP from http://192.168.128.2:2579/debug/pprof/heap
Saved profile in /home/hendrik/pprof/pprof.etcd.alloc_objects.alloc_space.inuse_objects.inuse_space.032.pb.gz
File: etcd
Type: inuse_space
Time: Jan 6, 2020 at 2:07pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 370.45MB, 93.75% of 395.14MB total
Dropped 52 nodes (cum <= 1.98MB)
Showing top 10 nodes out of 81
      flat  flat%   sum%        cum   cum%
     153MB 38.72% 38.72%   186.11MB 47.10%  go.etcd.io/etcd/lease.(*LeaseExpiredNotifier).RegisterOrUpdate
     122MB 30.88% 69.60%   311.70MB 78.88%  go.etcd.io/etcd/lease.(*lessor).Grant
   33.11MB  8.38% 77.98%    33.11MB  8.38%  go.etcd.io/etcd/lease.(*LeaseQueue).Push
   20.15MB  5.10% 83.08%    20.15MB  5.10%  go.etcd.io/etcd/mvcc/backend.(*bucketBuffer).Copy
      11MB  2.78% 85.86%       14MB  3.54%  go.etcd.io/etcd/mvcc.(*keyIndex).tombstone
    8.70MB  2.20% 88.06%     8.70MB  2.20%  go.etcd.io/etcd/mvcc/backend.newBucketBuffer
    8.50MB  2.15% 90.21%       10MB  2.53%  go.etcd.io/etcd/mvcc.(*treeIndex).Put
    5.98MB  1.51% 91.73%     8.98MB  2.27%  go.etcd.io/etcd/etcdserver/api/rafthttp.startPeer
       5MB  1.27% 92.99%        5MB  1.27%  go.etcd.io/etcd/etcdserver/etcdserverpb.(*PutRequest).Unmarshal
       3MB  0.76% 93.75%        3MB  0.76%  go.etcd.io/etcd/mvcc.(*keyIndex).put
(pprof) quit

Sample after waiting even longer:

Fetching profile over HTTP from http://192.168.128.2:2379/debug/pprof/heap
Saved profile in /home/hendrik/pprof/pprof.etcd.alloc_objects.alloc_space.inuse_objects.inuse_space.022.pb.gz
File: etcd
Type: inuse_space
Time: Jan 3, 2020 at 1:20pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 2030.24MB, 96.76% of 2098.18MB total
Dropped 132 nodes (cum <= 10.49MB)
Showing top 10 nodes out of 55
      flat  flat%   sum%        cum   cum%
  956.51MB 45.59% 45.59%  1082.85MB 51.61%  go.etcd.io/etcd/lease.(*LeaseExpiredNotifier).RegisterOrUpdate
  423.51MB 20.18% 65.77%  1514.04MB 72.16%  go.etcd.io/etcd/lease.(*lessor).Grant
  372.19MB 17.74% 83.51%   372.19MB 17.74%  go.etcd.io/etcd/mvcc/backend.(*bucketBuffer).Copy
  126.34MB  6.02% 89.53%   126.34MB  6.02%  go.etcd.io/etcd/lease.(*LeaseQueue).Push
      47MB  2.24% 91.77%       62MB  2.96%  go.etcd.io/etcd/mvcc.(*keyIndex).tombstone
      37MB  1.76% 93.54%    42.01MB  2.00%  go.etcd.io/etcd/mvcc.(*treeIndex).Put
   29.17MB  1.39% 94.93%    29.17MB  1.39%  go.etcd.io/etcd/mvcc/backend.newBucketBuffer
      23MB  1.10% 96.02%       23MB  1.10%  go.etcd.io/etcd/etcdserver/etcdserverpb.(*PutRequest).Unmarshal
      15MB  0.71% 96.74%       15MB  0.71%  go.etcd.io/etcd/mvcc.(*keyIndex).put
    0.50MB 0.024% 96.76%   372.69MB 17.76%  go.etcd.io/etcd/mvcc/backend.(*txReadBuffer).unsafeCopy
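
Reading the profiles: the two followers (:2379 and :2579 here) hold most of their heap in lease.(*LeaseExpiredNotifier).RegisterOrUpdate and lease.(*LeaseQueue).Push, while the member on :2479 (presumably the leader) shows no such growth. A toy model of how that pattern can arise (this is an illustration, not etcd's actual code): an expiry min-heap that every member pushes into on each Grant, but that only the leader ever drains.

package main

import (
	"container/heap"
	"fmt"
	"time"
)

// item models a registered lease expiry.
type item struct {
	id     int64
	expiry time.Time
}

// expiryHeap is a min-heap ordered by expiry time.
type expiryHeap []item

func (h expiryHeap) Len() int            { return len(h) }
func (h expiryHeap) Less(i, j int) bool  { return h[i].expiry.Before(h[j].expiry) }
func (h expiryHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *expiryHeap) Push(x interface{}) { *h = append(*h, x.(item)) }
func (h *expiryHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

func main() {
	var q expiryHeap
	isLeader := false // a follower never runs the expiry loop

	for id := int64(0); id < 1000000; id++ {
		// Every Grant registers an expiry entry on every member...
		heap.Push(&q, item{id: id, expiry: time.Now().Add(5 * time.Second)})

		// ...but only the leader pops entries as they expire.
		if isLeader {
			for q.Len() > 0 && q[0].expiry.Before(time.Now()) {
				heap.Pop(&q)
			}
		}
	}
	fmt.Printf("entries still queued on the follower: %d\n", q.Len())
}

If that is the mechanism, a follower's queue grows by one entry per Grant and is never trimmed, which would match the follower-only growth above.
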
@hpdvanwyk
Author

I forgot to add that this does not happen in etcd 3.3.18.

@rearden-steel

We also experience this problem. Stolon (PostgreSQL HA cluster software) uses etcd in this way, and we see slow memory growth on etcd followers.

@rearden-steel

Can someone find time to check this issue?
Etcd memory graph:
[screenshot: etcd memory usage graph, 2020-01-23]

@tangcong
Contributor

tangcong commented Mar 29, 2020

@rearden-steel @hpdvanwyk We hit the same issue; I have submitted PR #11731 to fix it. Thanks for the information.

@gyuho
Contributor

gyuho commented Mar 29, 2020

Cutting a new release with @tangcong's fix.

@gyuho gyuho closed this as completed Mar 29, 2020