Memory leak in ZeroMQPubServerChannel #40507

Closed
nir-stratoscale opened this issue Apr 3, 2017 · 24 comments
Labels: Bug (broken, incorrect, or confusing behavior), Core (relates to code central or existential to Salt), info-needed (waiting for more info), severity-high (2nd top severity, seen by most users, causes major problems), ZD (the issue is related to a Zendesk customer support ticket)

@nir-stratoscale

Description of Issue/Question

I have a salt master connected to 50 minions, and I see a slow memory leak in the ZeroMQPubServerChannel process.
It started at about 40 MB, and memory usage keeps growing by roughly 40 MB per hour.
I'm also having minion disconnection issues, but I'm not sure they are related.

 ps -aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          1  0.0  0.0 297428 44700 ?        Ss   17:20   0:01 /usr/bin/python2 /usr/bin/salt-master --log-file-level=debug ProcessManager
root         20  0.0  0.0 215080 24008 ?        S    17:20   0:00 /usr/bin/python2 /usr/bin/salt-master --log-file-level=debug MultiprocessingLoggingQueue
root         25  0.0  0.3 575160 207484 ?       Sl   17:20   0:01 /usr/bin/python2 /usr/bin/salt-master --log-file-level=debug ZeroMQPubServerChannel
root         28  0.0  0.0 306412 46996 ?        S    17:20   0:06 /usr/bin/python2 /usr/bin/salt-master --log-file-level=debug EventPublisher

Setup

(Please provide relevant configs and/or SLS files (Be sure to remove sensitive info).)
The minions connect to the master through an haproxy.

Steps to Reproduce Issue

(Include debug logs if possible and relevant.)

Versions Report

(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)

Master:
salt --versions-report
Salt Version:
Salt: 2016.11.1

Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.6.0
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.8
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: 1.0.6
msgpack-pure: Not Installed
msgpack-python: 0.4.8
mysql-python: 1.2.5
pycparser: Not Installed
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.5 (default, Nov 6 2016, 00:28:07)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 15.3.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.1.4

System Versions:
dist: centos 7.2.1511 Core
machine: x86_64
release: 3.10.0-327.10.1.el7.strato0004.46fa8ebfb529.x86_64
system: Linux
version: CentOS Linux 7.2.1511 Core

Minions:
salt --versions-report
Salt Version:
Salt: 2016.11.1

Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.8
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.8
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.12 (default, Sep 29 2016, 13:30:34)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 14.7.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.4.2
ZMQ: 4.1.4

System Versions:
dist: fedora 24 Twenty Four
machine: x86_64
release: 4.5.5-300.fc24.x86_64
system: Linux
version: Fedora 24 Twenty Four

@eranco74

eranco74 commented Apr 4, 2017

More info:
The minions send beacons every 15 seconds, and a new ZMQ REQ channel is initialized for each beacon:
[salt.transport.zeromq][DEBUG ][951] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'cbb6750d-a816-4502-942f-5ab7b50d3e17', 'tcp://10.16.145.239:4506', 'aes')

Might be related to #30411

@gtmanfred
Contributor

Can you provide me with the minion and master configurations so that I can attempt to replicate this?

Thanks,
Daniel

@gtmanfred gtmanfred added the info-needed waiting for more info label Apr 4, 2017
@gtmanfred gtmanfred added this to the Blocked milestone Apr 4, 2017
@eranco74

eranco74 commented Apr 4, 2017

master config:

ping_on_rotate: True 
presence_events: False
worker_threads: 3
publish_port: 4507
ret_port: 4508

minion config:

master: haproxy-ip
master_tries: -1
auth_tries: -1
master_alive_interval: 30
startup_states: highstate
log_level: debug
beacons:
  status:
    interval: 15
    time:
      - all

log_level_logfile: debug
acceptance_wait_time: 30
zmq_monitor: True

haproxy.cfg:

#---------------------------------------------------------------------
# Example configuration for a possible web application.  See the
# full configuration options online.
#
#   http://haproxy.1wt.eu/download/1.4/doc/configuration.txt
#
#---------------------------------------------------------------------
#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
    # to have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events.  This is done
    #    by adding the '-r' option to the SYSLOGD_OPTIONS in
    #    /etc/sysconfig/syslog
    #
    # 2) configure local2 events to go to the /var/log/haproxy.log
    #   file. A line like the following can be added to
    #   /etc/sysconfig/syslog
    #
    #    local2.*                       /var/log/haproxy.log
    #
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    # user        haproxy
    # group       haproxy
    daemon
    debug
    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats
#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    log                     global
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 3000
#---------------------------------------------------------------------
# main frontend which proxys to the backends
#---------------------------------------------------------------------
#---------------------------------------------------------------------
# static backend for serving up images, stylesheets and such
#---------------------------------------------------------------------
#---------------------------------------------------------------------
# round robin balancing between the various backends
#---------------------------------------------------------------------
resolvers mydns
  nameserver dns1 127.0.0.1:53
  resolve_retries       3
  timeout retry         1s
  hold other           30s
  hold refused         30s
  hold nx              30s
  hold timeout         30s
  hold valid           10s
frontend nm_front_1
    mode tcp
    option tcplog
    bind 0.0.0.0:4505
    default_backend nm_backend_1
frontend nm_front_2
        mode tcp
        option tcplog
        bind 0.0.0.0:4506
        default_backend nm_backend_2
backend nm_backend_1
    mode tcp
    option tcplog
    option tcp-check
    server server1 127.0.0.1:4507
backend nm_backend_2
        mode tcp
        option tcplog
        option tcp-check
        server server1 127.0.0.1:4508

@eranco74

eranco74 commented Apr 4, 2017

I'm sure the memory leak is related to the fact that we have an haproxy in the middle.
Without it, the memory consumption stays low.

@gtmanfred
Contributor

We are aware of a leak in the master publisher.

The original issue for this was #31454.

It sounds like @DmitryKuzmenko is still investigating the issue.

Thanks,
Daniel

@gtmanfred gtmanfred added Bug broken, incorrect, or confusing behavior and removed info-needed waiting for more info labels Apr 4, 2017
@gtmanfred gtmanfred modified the milestones: Approved, Blocked Apr 4, 2017
@gtmanfred gtmanfred added Core relates to code central or existential to Salt severity-high 2nd top severity, seen by most users, causes major problems P1 Priority 1 labels Apr 4, 2017
@romfreiman

romfreiman commented Apr 4, 2017

@gtmanfred @DmitryKuzmenko the 'strange' behavior we see between the ha-proxy and the salt master is that, once every few seconds, there is a SYN (ha-proxy -> salt-master) -> SYN/ACK (salt-master -> ha-proxy) -> RST (ha-proxy -> salt-master).
We do not see this behavior with a direct connection between minions and master, hence we assume the RST might be causing leaks of objects in the master.

Thoughts?

@DmitryKuzmenko
Contributor

The first thing I'd try here is to write a simple 0MQ pub-sub test and run it through the ha-proxy.
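
A minimal test of that kind might look like the sketch below (my own illustration, not code from Salt; the addresses and ports are placeholders, with the subscriber connecting to whatever port the ha-proxy frontend exposes):

import threading
import time

import zmq

PUB_BIND = "tcp://*:4507"              # placeholder: bind where the proxy backend points
SUB_CONNECT = "tcp://127.0.0.1:4505"   # placeholder: connect through the proxy frontend

def publisher():
    ctx = zmq.Context.instance()
    pub = ctx.socket(zmq.PUB)
    pub.bind(PUB_BIND)
    i = 0
    while True:
        pub.send_multipart([b"topic", ("message %d" % i).encode()])
        i += 1
        time.sleep(1)

def subscriber():
    ctx = zmq.Context.instance()
    sub = ctx.socket(zmq.SUB)
    sub.setsockopt(zmq.SUBSCRIBE, b"")
    sub.connect(SUB_CONNECT)
    while True:
        print(sub.recv_multipart())

if __name__ == "__main__":
    threading.Thread(target=publisher, daemon=True).start()
    subscriber()

Watching the RSS of the publisher process while the ha-proxy health checks hit the bound port would show whether plain pub-sub through the proxy leaks on its own.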

@romfreiman

@DmitryKuzmenko another suggestion I have is to write a simple Python script that triggers the syn/syn-ack/rst sequence in a loop against the salt-master (maybe with Scapy?)
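
A rough sketch of what that could look like (my own illustration, untested against this setup; the address and port are placeholders, and sending raw packets requires root):

import random
import time

from scapy.all import IP, TCP, send, sr1

MASTER_IP = "192.0.2.10"   # placeholder: salt master (or proxy backend) address
PORT = 4505                # placeholder: the port being probed

def syn_then_rst():
    sport = random.randint(1025, 65535)
    syn = IP(dst=MASTER_IP) / TCP(sport=sport, dport=PORT, flags="S", seq=1000)
    synack = sr1(syn, timeout=2, verbose=False)   # wait for the SYN/ACK
    if synack is None or TCP not in synack:
        return
    # reset instead of completing the handshake, mimicking the proxy health check
    rst = IP(dst=MASTER_IP) / TCP(sport=sport, dport=PORT, flags="R",
                                  seq=synack[TCP].ack)
    send(rst, verbose=False)

if __name__ == "__main__":
    while True:
        syn_then_rst()
        time.sleep(0.1)

Note that since the local kernel never opened this connection itself, it will usually send its own RST as soon as the SYN/ACK arrives, which already produces the pattern described above.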

@eranco74

eranco74 commented Apr 5, 2017

It seems that the sequence
SYN (ha-proxy -> salt-master),
SYN/ACK (salt-master -> ha-proxy),
RST/ACK (ha-proxy -> salt-master),
RST (ha-proxy -> salt-master)
is the haproxy health check, so it looks like it is not the cause of the leak.
The health check is performed regardless of minion connections, and when I don't have minions connected the memory isn't leaking.

@eranco74

eranco74 commented Apr 6, 2017

I managed to significantly reduce the leak by adding this configuration to the haproxy defaults section:
timeout tunnel 1h
timeout client-fin 30s

The leak still exists, though.
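
For reference, assuming those two directives go into the defaults section of the haproxy.cfg posted above (my reading of "default conf", not something spelled out in this thread), that section would only gain these two lines:

defaults
    # ... existing settings from the haproxy.cfg above ...
    timeout tunnel          1h
    timeout client-fin      30s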

@romfreiman

@DmitryKuzmenko any news? Thanks

@stale

stale bot commented Oct 6, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

@stale stale bot added the stale label Oct 6, 2018
@stale stale bot closed this as completed Oct 13, 2018
@DmitryKuzmenko
Contributor

@rom-stratoscale sorry, I missed your comment here. Currently I have no time to work on this issue, but I want to confirm: have you tried the latest Salt releases? Is the issue still there?

@gtmanfred gtmanfred reopened this Oct 15, 2018
@stale

stale bot commented Oct 15, 2018

Thank you for updating this issue. It is no longer marked as stale.

@stale stale bot removed the stale label Oct 15, 2018
@goodwillcoding

@DmitryKuzmenko ... there might be a similar issue here.

Complete out-of-the-box install from the official salt repos, stock configuration, with only 2 minions connected. After several days of idle running (no actual usage of either the salt master or the salt minions) the memory usage grows to 5 GB and keeps growing.

Unfortunately I cycled the master, but in the next few days I'll try to determine where it's leaking.

Versions report

# salt-master --versions-report
Salt Version:
           Salt: 2018.3.2
 
Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.4.2
      docker-py: Not Installed
          gitdb: 0.6.4
      gitpython: 1.0.1
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.5.2 (default, Nov 23 2017, 16:37:01)
   python-gnupg: 0.3.8
         PyYAML: 3.11
          PyZMQ: 15.2.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4
 
System Versions:
           dist: Ubuntu 16.04 xenial
         locale: UTF-8
        machine: x86_64
        release: 4.4.0-135-generic
         system: Linux
        version: Ubuntu 16.04 xenial

@DmitryKuzmenko
Contributor

@goodwillcoding thank you for the report. Please keep me posted.

@goodwillcoding

@DmitryKuzmenko ... another week, same problem, nothing changes. At this point we are considering installing a cron task that restarts the service just to avoid this, which is frankly an insane solution. I also commented on #50313 ... maybe it's the same issue.

@DmitryKuzmenko can you please advise whether this is even being looked at? It won't be possible to use salt at all if this continues; it becomes completely unusable once all the memory is consumed.

@simmel
Contributor

simmel commented Jun 28, 2019

We're also having this issue on:

$ salt --versions
Salt Version:
           Salt: 2019.2.0

Dependency Versions:
           cffi: 1.12.3
       cherrypy: Not Installed
       dateutil: 2.5.3
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.9.4
        libgit2: 0.24.5
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.8
   mysql-python: Not Installed
      pycparser: 2.19
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: 0.24.2
         Python: 3.5.3 (default, Sep 27 2018, 17:25:39)
   python-gnupg: Not Installed
         PyYAML: 3.12
          PyZMQ: 16.0.2
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.4.3
            ZMQ: 4.2.1

System Versions:
           dist: debian 9.9
         locale: UTF-8
        machine: x86_64
        release: 4.9.0-9-amd64
         system: Linux
        version: debian 9.9

It sure looks a lot like zeromq/libzmq#954 and zeromq/libzmq#1565: when I looked at the memory usage of ZeroMQPubServerChannel while connecting to it with netcat in a tight loop, I could reproduce this 100% of the time.

@sagetherage sagetherage added the ZRelease-Sodium retired label label Mar 23, 2020
@doesitblend doesitblend added the ZD The issue is related to a Zendesk customer support ticket. label Apr 2, 2020
@DmitryKuzmenko
Contributor

@simmel can you please provide a detailed description of your scenario?

@simmel
Contributor

simmel commented May 9, 2020

We have a hardware load balancer fronting our salt master's ZMQ port. This HW LB can't easily be modified to do protocol-specific checks, in this case to "talk Salt MQ" or connect with ZMQ, so it just does a very simple TCP connect and then closes the connection.
ZeroMQ, on the other hand, leaks memory on each connection when you open a plain TCP connection and then close it without speaking the ZeroMQ protocol properly. The memory is supposedly reclaimed when a "real and proper" ZeroMQ connection is made, but I could not reproduce that.

So to reproduce it I just did the following (I can't confirm the exact command because I'm not at work):

while true; do
  nc -z localhost 4505;
done

and then watch the memory with top -oRSS.

Note that this might only be triggered on an idle master, which supports the "memory reclaimed on real connections" theory.
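
A slightly more self-contained version of the same reproduction, as a rough sketch of my own (psutil and the process-title match are assumptions, not anything from this thread): open and close plain TCP connections in a loop and periodically log the RSS of the process whose title contains ZeroMQPubServerChannel, as in the ps output at the top of this issue.

import socket

import psutil  # assumption: psutil is installed on the master

PUB_ADDR = ("127.0.0.1", 4505)   # placeholder: the publish port being hammered

def pub_server_rss():
    # find the salt-master child whose process title mentions the pub server
    for proc in psutil.process_iter(["cmdline", "memory_info"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "ZeroMQPubServerChannel" in cmdline:
            return proc.info["memory_info"].rss
    return None

if __name__ == "__main__":
    count = 0
    while True:
        # connect and close without ever speaking the ZMTP handshake
        sock = socket.create_connection(PUB_ADDR, timeout=2)
        sock.close()
        count += 1
        if count % 1000 == 0:
            print("connections=%d pub_rss=%s" % (count, pub_server_rss()))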

@DmitryKuzmenko
Contributor

@simmel thank you for the details. I'll check it and share my results.

@sagetherage sagetherage added Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged and removed ZRelease-Sodium retired label labels May 21, 2020
@sagetherage sagetherage modified the milestones: Approved, Blocked May 21, 2020
@sagetherage sagetherage added Magnesium Mg release after Na prior to Al and removed P1 Priority 1 labels May 27, 2020
@sagetherage sagetherage removed the Magnesium Mg release after Na prior to Al label Jul 16, 2020
@sagetherage sagetherage assigned sagetherage and unassigned cmcmarrow May 7, 2021
@sagetherage sagetherage removed the Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged label May 18, 2021
@sagetherage sagetherage assigned dwoz and unassigned sagetherage May 18, 2021
@sagetherage
Contributor

@nir-stratoscale @simmel are you seeing this on a supported version of Salt >v3000.9?

@sagetherage sagetherage added the info-needed waiting for more info label May 24, 2021
@simmel
Contributor

simmel commented May 27, 2021

@sagetherage I can confirm that it was solved when we upgraded to 3002.2, as you can see in this graph = )
[screenshot: memory usage graph of the master, 2021-05-27]

@sagetherage
Contributor

@simmel that is great to hear!
