[BUG] Memory Leak in EventPublisher process #61565

Open
dwoz opened this issue Feb 1, 2022 · 25 comments
Labels: Bug (broken, incorrect, or confusing behavior), memory-leak, VMware

@dwoz
Contributor

dwoz commented Feb 1, 2022

Description

We've observed a memory leak in the EventPublisher process on the master. It can be seen by adding a simple engine to a salt-minion that sends lots of auth requests, but the leak doesn't appear to be specific to auth events; a large number of any events should trigger it.

Setup

import salt.crypt
import salt.ext.tornado.ioloop
import salt.ext.tornado.gen
import logging

log = logging.getLogger(__name__)


@salt.ext.tornado.gen.coroutine
def do_auth(opts, io_loop):
    while True:
        auth = salt.crypt.AsyncAuth(opts, io_loop=io_loop)
        log.info("ENGINE DO AUTH")
        yield auth.sign_in()


def start():
    io_loop = salt.ext.tornado.ioloop.IOLoop()
    __opts__['master_uri'] = 'tcp://127.0.0.1:4506'
    io_loop.spawn_callback(do_auth, __opts__, io_loop)
    io_loop.start()
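
For anyone trying to reproduce this, a rough way to watch the EventPublisher's resident memory over time is with psutil. This is only a sketch; matching the process by the "EventPublisher" string in its command line is an assumption and may need adjusting for how your master names its processes.

import time

import psutil


def watch_event_publisher(interval=60):
    # Look for a salt-master child process whose command line mentions EventPublisher.
    for proc in psutil.process_iter(["pid", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "EventPublisher" in cmdline:
            while True:
                rss_mib = proc.memory_info().rss / (1024 * 1024)
                print(f"EventPublisher pid={proc.pid} rss={rss_mib:.1f} MiB")
                time.sleep(interval)


if __name__ == "__main__":
    watch_event_publisher()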

Versions

v3004

@dwoz added the Bug (broken, incorrect, or confusing behavior) label, added and then removed the needs-triage label on Feb 1, 2022
@dwoz self-assigned this on Feb 1, 2022
@frebib
Contributor

frebib commented Feb 2, 2022

Does the number of open file descriptors appear to increase too? I'm wondering if this is related to #61521

@dwoz
Contributor Author

dwoz commented Feb 2, 2022

Here is a patch that addresses the issue for 3004. I've also tested against the master branch and I don't see the memory leak there; it looks like the recent transport refactor inadvertently addressed the issue.

0001-Slow-memory-leak-fix.patch.txt

@frebib I don't believe this is the same issue as #61521. I believe the root of #61521 is caused by this commit. We're probably creating multiple instances of events and transports, which needs to be addressed. #61468 should be a step in the right direction.

@Zpell82

Zpell82 commented Nov 27, 2023

Is there any progress on this?
Salt Version:
Salt: 3006.4

Python Version:
Python: 3.10.4 (main, Apr 20 2022, 01:21:48) [GCC 10.3.1 20210424]

Dependency Versions:
cffi: 1.14.6
cherrypy: unknown
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.2
libgit2: Not Installed
looseversion: 1.0.2
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 22.0
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.9.8
pygit2: Not Installed
python-gnupg: 0.4.8
PyYAML: 6.0.1
PyZMQ: 23.2.0
relenv: Not Installed
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.4

System Versions:
dist: alpine 3.14.6
locale: utf-8
machine: x86_64
release: 5.15.0-87-generic
system: Linux
version: Alpine Linux 3.14.6

[screenshots attached]

@Zpell82

Zpell82 commented Dec 22, 2023

It looks a lot better in 3006.5 when using the af12352 fix in minion.py. Will be observing over the Christmas holidays.

@dwoz
Contributor Author

dwoz commented Dec 28, 2023

It looks a lot better in 3006.5 when using the af12352 fix in minion.py. Will be observing over the Christmas holidays.

Fantastic news, thanks for the update.

@max-arnold
Contributor

How to debug a memory leak like this? tracemalloc or some other tools?
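
For context, the tracemalloc snapshot-diff approach I have in mind would look roughly like this (a sketch, assuming it can be run inside the process under suspicion, e.g. from a test harness or a temporary debug hook):

import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation

baseline = tracemalloc.take_snapshot()

# ... let the suspected workload run for a while ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.compare_to(baseline, "lineno")[:10]:
    # Print the ten allocation sites that grew the most since the baseline.
    print(stat)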

@dwoz
Contributor Author

dwoz commented Dec 29, 2023

@max-arnold

In the past we've relied on __del__ to clean things like this up. That is generally considered an anti-pattern in Python, and we've been working towards cleaning that practice up. Most recently I've started adding warnings when an object gets garbage collected without being properly closed (#65559). This has revealed several places (now fixed) in our test suite where transport clients were not being closed. With this code in place it should be easier for users to identify these kinds of issues and report them with useful debug info.
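
A minimal sketch of that warning pattern (illustrative only, not Salt's actual implementation):

import warnings


class Client:
    def __init__(self):
        self._closed = False

    def close(self):
        self._closed = True

    def __del__(self):
        # Relying on __del__ for cleanup is the anti-pattern; here it only
        # warns so an unclosed client is surfaced instead of silently leaked.
        if not self._closed:
            warnings.warn(
                f"{type(self).__name__} was garbage collected without close() being called",
                ResourceWarning,
            )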

We've also been working towards better tooling for debugging running salt processes. We've added debug symbol packages in 3006.x, and there is a newer tool, relenv-gdb-dbg, to help debug these kinds of issues.

@Zpell82

Zpell82 commented Jan 3, 2024

Looks a lot better when I review the EventPublisher process.
[screenshot attached]

@Zpell82

Zpell82 commented Jan 5, 2024

@dwoz I needed to flip this check on this line in the minion.py code:
[screenshot attached]
If I don't do this I won't get any answers or scheduled-event data from the deltaproxy minions to the master. It's just a quick fix, probably not the best one; the issue is that the deltaproxy answer is a list, if I remember correctly.

@network-shark
Contributor

@Zpell82 sounds like a bug that should be its own issue?

@dwoz
Contributor Author

dwoz commented Feb 13, 2024

@dwoz I needed to flip this check on this line in the minion.py code; if I don't do this I won't get any answers or scheduled-event data from the deltaproxy minions to the master. It's just a quick fix, probably not the best one; the issue is that the deltaproxy answer is a list, if I remember correctly.

@Zpell82 Can you open a separate issue for this?

@jheiselman

We are running salt 3006.6 and are seeing this memory leak. I examined salt/minion.py and it already has the fix suggested in #61565 (comment).

Further, after the upgrade from 3006.5 to 3006.6, we are now seeing this problem present much faster than before. It used to take about 20 days; now we've noticed it after just 7 days.

@aalewis

aalewis commented Feb 20, 2024

Yeah, seeing this in 3006.6 in testing as well.

@dwoz
Contributor Author

dwoz commented Mar 2, 2024

I believe this is resolved with 3006.7. Can folks confirm?

@jheiselman

I attempted to install 3006.7, but I got an error while trying to salt-pip install pygit2 to use with a git backend.

ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: 'readelf'

OS is Debian Bullseye

Maybe a new dependency needs to be added to the deb?

@lomeroe
Contributor

lomeroe commented Mar 10, 2024

I still see 3006.7 leaking in our environment... the EventPublisher process is over 8GB of memory in ~48 hours since upgrading (edited from originally saying "less than 24 hours"; I lost track of what day it was /facepalm).

A master still on 3006.4 has its EventPublisher up to 24GB in the roughly 10 days since its last restart.

These masters are both using an external job cache. Masters not using an external job cache don't seem to leak noticeably - are others seeing the leak in 3006.6/etc. with an external job cache? Maybe the external job cache is a red herring; I haven't dug into how that is all glued together and whether it is handled by the EventPublisher process...

@jheiselman

@lomeroe as a matter of fact, ours is using an external job cache (redis).

@lomeroe
Contributor

lomeroe commented Mar 11, 2024

Interesting... maybe there is something there with the external job cache and the EventPublisher process leaking (we are not using redis as our external job cache, so at least it doesn't seem to be specific to a single external job cache type).

@dwoz - thoughts?

@johje349

This is still a problem in 3006.7 for us; we have a cronjob to restart the Salt master once a day.
We are not using an external job cache.

@lomeroe
Contributor

lomeroe commented Mar 22, 2024

@johje349 @jheiselman - do either of you run orchestrations or other background jobs that utilize batch mode? After restarting a master with the patch I mention in #66249, memory for the EventPublisher process hasn't gone over 400MB in almost 24 hours, which seems considerably better than what we have been seeing - typically it is several GB in a day. Obviously we need more monitoring time to really tell if it has something to do with it, so it could just be coincidental...

We've got quite a few orchestration/API-initiated jobs that use batch mode (all initiated on the masters exhibiting the issue), so if they were hanging up or never fully returning, I suppose it's possible that could somehow cause memory issues in the EventPublisher process. I wouldn't have ever guessed that, but...

@jheiselman

We do not have very many, but yes, we do have a few scheduled jobs (started via the salt-api) that utilize batch mode. None of our orchestrations use batch mode.

Both of our salt-masters were last restarted four days ago. One of them is currently using 3.5 GB of memory and the other less than 1 GB. They share the same workloads/minions; there's no difference between the two. Their salt-apis are behind a load balancer, so one of them may be getting heavier jobs than the other purely by chance, but the load balancer is configured for simple round-robin so they should be getting the same number of jobs.

@johje349

Yes, we use orchestration states with batch set to 20%.

@lomeroe
Contributor

lomeroe commented Mar 27, 2024

After 5 days, EventPublisher memory usage is still hovering around 400MB. The only change made was applying the patch mentioned in #66249. It seems fairly likely to me that that issue causes memory usage to grow in the EventPublisher process, because jobs run in batch mode never actually end.

@Zpell82

Zpell82 commented Apr 5, 2024

@dwoz I needed to flip this check on this line in the minion.py code; if I don't do this I won't get any answers or scheduled-event data from the deltaproxy minions to the master. It's just a quick fix, probably not the best one; the issue is that the deltaproxy answer is a list, if I remember correctly.

@Zpell82 Can you open a separate issue for this?

Found the issue: I had master: ["mastername"] in my minion config. I removed the brackets and it started working just fine.

@thinkst-marco

@lomeroe's patch in #66249 worked for us in eliminating the memory leak we were observing in EventPublisher. We don't rely on orchestration, external job caches, or salt-api but do use batches. Thanks for the patch!
