Memory leak in ZeroMQPubServerChannel #40507

Closed
nir-stratoscale opened this issue Apr 3, 2017 · 24 comments
Labels: Bug (broken, incorrect, or confusing behavior), Core (relates to code central or existential to Salt), info-needed (waiting for more info), severity-high (2nd top severity, seen by most users, causes major problems), ZD (the issue is related to a Zendesk customer support ticket)

@nir-stratoscale

Description of Issue/Question

I have a salt master connected to 50 minions, and I see a slow memory leak in the ZeroMQPubServerChannel process.
It started at about 40 MB, and memory usage keeps growing by roughly 40 MB per hour.
I'm also having minion disconnection issues, but I'm not sure they are related.

 ps -aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          1  0.0  0.0 297428 44700 ?        Ss   17:20   0:01 /usr/bin/python2 /usr/bin/salt-master --log-file-level=debug ProcessManager
root         20  0.0  0.0 215080 24008 ?        S    17:20   0:00 /usr/bin/python2 /usr/bin/salt-master --log-file-level=debug MultiprocessingLoggingQueue
root         25  0.0  0.3 575160 207484 ?       Sl   17:20   0:01 /usr/bin/python2 /usr/bin/salt-master --log-file-level=debug ZeroMQPubServerChannel
root         28  0.0  0.0 306412 46996 ?        S    17:20   0:06 /usr/bin/python2 /usr/bin/salt-master --log-file-level=debug EventPublisher

Setup

(Please provide relevant configs and/or SLS files (Be sure to remove sensitive info).)
The minions connect to the master through an haproxy.

Steps to Reproduce Issue

(Include debug logs if possible and relevant.)

Versions Report

(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)

Master:
salt --versions-report
Salt Version:
Salt: 2016.11.1

Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.6.0
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.8
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: 1.0.6
msgpack-pure: Not Installed
msgpack-python: 0.4.8
mysql-python: 1.2.5
pycparser: Not Installed
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.5 (default, Nov 6 2016, 00:28:07)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 15.3.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.1.4

System Versions:
dist: centos 7.2.1511 Core
machine: x86_64
release: 3.10.0-327.10.1.el7.strato0004.46fa8ebfb529.x86_64
system: Linux
version: CentOS Linux 7.2.1511 Core

Minions:
salt --versions-report
Salt Version:
Salt: 2016.11.1

Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.8
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.8
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.12 (default, Sep 29 2016, 13:30:34)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 14.7.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.4.2
ZMQ: 4.1.4

System Versions:
dist: fedora 24 Twenty Four
machine: x86_64
release: 4.5.5-300.fc24.x86_64
system: Linux
version: Fedora 24 Twenty Four

@eranco74

eranco74 commented Apr 4, 2017

More info:
The minions send beacons every 15 seconds, and a new ZMQ REQ channel is initialized for each beacon:
[salt.transport.zeromq][DEBUG ][951] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'cbb6750d-a816-4502-942f-5ab7b50d3e17', 'tcp://10.16.145.239:4506', 'aes')

Might be related to #30411

@gtmanfred
Contributor

Can you provide me with the minion and master configurations so that I can attempt to replicate this?

Thanks,
Daniel

@gtmanfred gtmanfred added the info-needed waiting for more info label Apr 4, 2017
@gtmanfred gtmanfred added this to the Blocked milestone Apr 4, 2017
@eranco74

eranco74 commented Apr 4, 2017

master config:

ping_on_rotate: True 
presence_events: False
worker_threads: 3
publish_port: 4507
ret_port: 4508

minion config:

master: haproxy-ip
master_tries: -1
auth_tries: -1
master_alive_interval: 30
startup_states: highstate
log_level: debug
beacons:
  status:
    interval: 15
    time:
      - all

log_level_logfile: debug
acceptance_wait_time: 30
zmq_monitor: True

haproxy.cfg:

#---------------------------------------------------------------------
# Example configuration for a possible web application.  See the
# full configuration options online.
#
#   http://haproxy.1wt.eu/download/1.4/doc/configuration.txt
#
#---------------------------------------------------------------------
#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
    # to have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events.  This is done
    #    by adding the '-r' option to the SYSLOGD_OPTIONS in
    #    /etc/sysconfig/syslog
    #
    # 2) configure local2 events to go to the /var/log/haproxy.log
    #   file. A line like the following can be added to
    #   /etc/sysconfig/syslog
    #
    #    local2.*                       /var/log/haproxy.log
    #
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    # user        haproxy
    # group       haproxy
    daemon
    debug
    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats
#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    log                     global
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 3000
#---------------------------------------------------------------------
# main frontend which proxys to the backends
#---------------------------------------------------------------------
#---------------------------------------------------------------------
# static backend for serving up images, stylesheets and such
#---------------------------------------------------------------------
#---------------------------------------------------------------------
# round robin balancing between the various backends
#---------------------------------------------------------------------
resolvers mydns
  nameserver dns1 127.0.0.1:53
  resolve_retries       3
  timeout retry         1s
  hold other           30s
  hold refused         30s
  hold nx              30s
  hold timeout         30s
  hold valid           10s
frontend nm_front_1
    mode tcp
    option tcplog
    bind 0.0.0.0:4505
    default_backend nm_backend_1
frontend nm_front_2
        mode tcp
        option tcplog
        bind 0.0.0.0:4506
        default_backend nm_backend_2
backend nm_backend_1
    mode tcp
    option tcplog
    option tcp-check
    server server1 127.0.0.1:4507
backend nm_backend_2
        mode tcp
        option tcplog
        option tcp-check
        server server1 127.0.0.1:4508

@eranco74

eranco74 commented Apr 4, 2017

I'm sure the memory leak is related to the fact that we have an haproxy in the middle.
Without it, the memory consumption stays low.

@gtmanfred
Contributor

We are aware of a leak in the master publisher.

The original issue for this was #31454.

It sounds like @DmitryKuzmenko is still investigating the issue.

Thanks,
Daniel

@gtmanfred gtmanfred added Bug broken, incorrect, or confusing behavior and removed info-needed waiting for more info labels Apr 4, 2017
@gtmanfred gtmanfred modified the milestones: Approved, Blocked Apr 4, 2017
@gtmanfred gtmanfred added Core relates to code central or existential to Salt severity-high 2nd top severity, seen by most users, causes major problems P1 Priority 1 labels Apr 4, 2017
@romfreiman

romfreiman commented Apr 4, 2017

@gtmanfred @DmitryKuzmenko the 'strange' behavior we see between the ha-proxy and the salt master is that, once every few seconds, there is a SYN (ha-proxy -> salt-master) -> SYN/ACK (salt-master -> ha-proxy) -> RST (ha-proxy -> salt-master).
We do not see this behavior with a direct connection between minions and master, hence we assume the RST might be causing leaks of objects in the master.

Thoughts?

@DmitryKuzmenko
Contributor

The first thing I'd try here is to write a simple 0MQ pub-sub test and run it through the ha-proxy.
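
A minimal test of that kind might look like the sketch below (my own illustration, not code from Salt; the addresses and ports are placeholders, with the subscriber connecting to whatever port the ha-proxy frontend exposes):

import threading
import time

import zmq

PUB_BIND = "tcp://*:4507"              # placeholder: bind where the proxy backend points
SUB_CONNECT = "tcp://127.0.0.1:4505"   # placeholder: connect through the proxy frontend

def publisher():
    ctx = zmq.Context.instance()
    pub = ctx.socket(zmq.PUB)
    pub.bind(PUB_BIND)
    i = 0
    while True:
        pub.send_multipart([b"topic", ("message %d" % i).encode()])
        i += 1
        time.sleep(1)

def subscriber():
    ctx = zmq.Context.instance()
    sub = ctx.socket(zmq.SUB)
    sub.setsockopt(zmq.SUBSCRIBE, b"")
    sub.connect(SUB_CONNECT)
    while True:
        print(sub.recv_multipart())

if __name__ == "__main__":
    threading.Thread(target=publisher, daemon=True).start()
    subscriber()

Watching the RSS of the publisher process while the ha-proxy health checks hit the bound port would show whether plain pub-sub through the proxy leaks on its own.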

@romfreiman

@DmitryKuzmenko another suggestion I have is to write a simple Python script that triggers the syn/syn-ack/rst sequence in a loop against the salt-master (maybe with Scapy?)
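
A rough sketch of what that could look like (my own illustration, untested against this setup; the address and port are placeholders, and sending raw packets requires root):

import random
import time

from scapy.all import IP, TCP, send, sr1

MASTER_IP = "192.0.2.10"   # placeholder: salt master (or proxy backend) address
PORT = 4505                # placeholder: the port being probed

def syn_then_rst():
    sport = random.randint(1025, 65535)
    syn = IP(dst=MASTER_IP) / TCP(sport=sport, dport=PORT, flags="S", seq=1000)
    synack = sr1(syn, timeout=2, verbose=False)   # wait for the SYN/ACK
    if synack is None or TCP not in synack:
        return
    # reset instead of completing the handshake, mimicking the proxy health check
    rst = IP(dst=MASTER_IP) / TCP(sport=sport, dport=PORT, flags="R",
                                  seq=synack[TCP].ack)
    send(rst, verbose=False)

if __name__ == "__main__":
    while True:
        syn_then_rst()
        time.sleep(0.1)

Note that since the local kernel never opened this connection itself, it will usually send its own RST as soon as the SYN/ACK arrives, which already produces the pattern described above.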

@eranco74

eranco74 commented Apr 5, 2017

It seems that the sequence
SYN (ha-proxy -> salt-master),
SYN/ACK (salt-master -> ha-proxy),
RST/ACK (ha-proxy -> salt-master),
RST (ha-proxy -> salt-master)
is the haproxy health check, so it looks like it is not the cause of the leak.
The health check is performed regardless of minion connections, and when I don't have minions connected the memory isn't leaking.

@eranco74

eranco74 commented Apr 6, 2017

I managed to significantly reduce the leak by adding this configuration to the haproxy defaults section:
timeout tunnel 1h
timeout client-fin 30s

The leak still exists, though.
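
For reference, assuming those two directives go into the defaults section of the haproxy.cfg posted above (my reading of "default conf", not something spelled out in this thread), that section would only gain these two lines:

defaults
    # ... existing settings from the haproxy.cfg above ...
    timeout tunnel          1h
    timeout client-fin      30s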

@romfreiman

@DmitryKuzmenko any news? Thanks

@stale

stale bot commented Oct 6, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

@stale stale bot added the stale label Oct 6, 2018
@stale stale bot closed this as completed Oct 13, 2018
@DmitryKuzmenko
Contributor

@rom-stratoscale sorry, I missed your comment here. Currently I have no time to work on this issue, but I want to confirm: have you tried the latest Salt releases? Is the issue still there?

@gtmanfred gtmanfred reopened this Oct 15, 2018
@stale

stale bot commented Oct 15, 2018

Thank you for updating this issue. It is no longer marked as stale.

@stale stale bot removed the stale label Oct 15, 2018
@goodwillcoding

@DmitryKuzmenko ... there might be a similar issue here.

Complete out-of-the-box install from the official salt repos, stock configuration, with only 2 minions connected. After several days of idle running (no actual usage of either the salt master or the salt minions) the memory usage grows to 5 GB and keeps growing.

Unfortunately I cycled the master, but in the next few days I'll try to determine where it's leaking.

Versions report

# salt-master --versions-report
Salt Version:
           Salt: 2018.3.2
 
Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.4.2
      docker-py: Not Installed
          gitdb: 0.6.4
      gitpython: 1.0.1
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.5.2 (default, Nov 23 2017, 16:37:01)
   python-gnupg: 0.3.8
         PyYAML: 3.11
          PyZMQ: 15.2.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4
 
System Versions:
           dist: Ubuntu 16.04 xenial
         locale: UTF-8
        machine: x86_64
        release: 4.4.0-135-generic
         system: Linux
        version: Ubuntu 16.04 xenial

@DmitryKuzmenko
Contributor

@goodwillcoding thank you for the report. Please keep me posted.

@goodwillcoding

@DmitryKuzmenko ... another week, same problem, nothing changes. At this point we are considering installing a cron task that restarts the service just to avoid this, which is frankly an insane solution. I also commented on #50313 ... maybe it's the same issue.

@DmitryKuzmenko can you please advise whether this is even being looked at? It won't be possible to use salt at all if this continues; it becomes completely unusable once all the memory is consumed.

@simmel
Contributor

simmel commented Jun 28, 2019

We're also having this issue on:

$ salt --versions
Salt Version:
           Salt: 2019.2.0

Dependency Versions:
           cffi: 1.12.3
       cherrypy: Not Installed
       dateutil: 2.5.3
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.9.4
        libgit2: 0.24.5
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.8
   mysql-python: Not Installed
      pycparser: 2.19
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: 0.24.2
         Python: 3.5.3 (default, Sep 27 2018, 17:25:39)
   python-gnupg: Not Installed
         PyYAML: 3.12
          PyZMQ: 16.0.2
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.4.3
            ZMQ: 4.2.1

System Versions:
           dist: debian 9.9
         locale: UTF-8
        machine: x86_64
        release: 4.9.0-9-amd64
         system: Linux
        version: debian 9.9

It sure looks a lot like zeromq/libzmq#954 and zeromq/libzmq#1565: when I looked at the memory usage of ZeroMQPubServerChannel while connecting to it with netcat in a tight loop, I could reproduce this 100% of the time.

@sagetherage sagetherage added the ZRelease-Sodium retired label label Mar 23, 2020
@doesitblend doesitblend added the ZD The issue is related to a Zendesk customer support ticket. label Apr 2, 2020
@DmitryKuzmenko
Contributor

@simmel can you please provide a detailed description of your scenario?

@simmel
Contributor

simmel commented May 9, 2020

We have a hardware load balancer fronting our salt master's ZMQ port. This HW LB can't easily be modified to do protocol-specific checks, in this case to "talk Salt MQ" or connect with ZMQ, so it just does a very simple TCP connect and then closes the connection.
ZeroMQ, on the other hand, leaks memory on each connection when you open a plain TCP connection and then close it without speaking the ZeroMQ protocol properly. The memory is supposedly reclaimed when a "real and proper" ZeroMQ connection is made, but I could not reproduce that.

So to reproduce it I just did the following (I can't confirm the exact command because I'm not at work):

while true; do
  nc -z localhost 4505;
done

and then watch the memory with top -oRSS.

Note that this might only be triggered on an idle master, which supports the "memory reclaimed on real connections" theory.
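
A slightly more self-contained version of the same reproduction, as a rough sketch of my own (psutil and the process-title match are assumptions, not anything from this thread): open and close plain TCP connections in a loop and periodically log the RSS of the process whose title contains ZeroMQPubServerChannel, as in the ps output at the top of this issue.

import socket

import psutil  # assumption: psutil is installed on the master

PUB_ADDR = ("127.0.0.1", 4505)   # placeholder: the publish port being hammered

def pub_server_rss():
    # find the salt-master child whose process title mentions the pub server
    for proc in psutil.process_iter(["cmdline", "memory_info"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "ZeroMQPubServerChannel" in cmdline:
            return proc.info["memory_info"].rss
    return None

if __name__ == "__main__":
    count = 0
    while True:
        # connect and close without ever speaking the ZMTP handshake
        sock = socket.create_connection(PUB_ADDR, timeout=2)
        sock.close()
        count += 1
        if count % 1000 == 0:
            print("connections=%d pub_rss=%s" % (count, pub_server_rss()))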

@DmitryKuzmenko
Contributor

@simmel thank you for the details. I'll check it and share my results.

@sagetherage sagetherage added Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged and removed ZRelease-Sodium retired label labels May 21, 2020
@sagetherage sagetherage modified the milestones: Approved, Blocked May 21, 2020
@sagetherage sagetherage added Magnesium Mg release after Na prior to Al and removed P1 Priority 1 labels May 27, 2020
@sagetherage sagetherage removed the Magnesium Mg release after Na prior to Al label Jul 16, 2020
@sagetherage sagetherage assigned sagetherage and unassigned cmcmarrow May 7, 2021
@sagetherage sagetherage removed the Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged label May 18, 2021
@sagetherage sagetherage assigned dwoz and unassigned sagetherage May 18, 2021
@sagetherage
Contributor

@nir-stratoscale @simmel are you seeing this on a supported version of Salt >v3000.9?

@sagetherage sagetherage added the info-needed waiting for more info label May 24, 2021
@simmel
Contributor

simmel commented May 27, 2021

@sagetherage I can confirm that it was solved when we upgraded to 3002.2, as you can see in this graph = )
[screenshot: memory usage graph of the master, 2021-05-27]

@sagetherage
Contributor

@simmel that is great to hear!
