[BUG] Memory Leak in EventPublisher process #61565

Open
dwoz opened this issue Feb 1, 2022 · 25 comments
Labels: Bug (broken, incorrect, or confusing behavior), memory-leak, VMware

@dwoz
Contributor

dwoz commented Feb 1, 2022

Description

We've observed a memory leak in the EventPublisher process on the master. It can be seen by adding a simple engine to a salt-minion that sends lots of auth requests, but the leak doesn't appear to be specific to auth events; a large number of any events should trigger it.

Setup

import salt.crypt
import salt.ext.tornado.ioloop
import salt.ext.tornado.gen
import logging

log = logging.getLogger(__name__)


@salt.ext.tornado.gen.coroutine
def do_auth(opts, io_loop):
    while True:
        auth = salt.crypt.AsyncAuth(opts, io_loop=io_loop)
        log.info("ENGINE DO AUTH")
        yield auth.sign_in()


def start():
    io_loop = salt.ext.tornado.ioloop.IOLoop()
    __opts__['master_uri'] = 'tcp://127.0.0.1:4506'
    io_loop.spawn_callback(do_auth, __opts__, io_loop)
    io_loop.start()
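
For anyone trying to reproduce this, a rough way to watch the EventPublisher's resident memory over time is with psutil. This is only a sketch; matching the process by the "EventPublisher" string in its command line is an assumption and may need adjusting for how your master names its processes.

import time

import psutil


def watch_event_publisher(interval=60):
    # Look for a salt-master child process whose command line mentions EventPublisher.
    for proc in psutil.process_iter(["pid", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "EventPublisher" in cmdline:
            while True:
                rss_mib = proc.memory_info().rss / (1024 * 1024)
                print(f"EventPublisher pid={proc.pid} rss={rss_mib:.1f} MiB")
                time.sleep(interval)


if __name__ == "__main__":
    watch_event_publisher()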

Versions

v3004

@dwoz added the Bug (broken, incorrect, or confusing behavior) label, added and then removed the needs-triage label on Feb 1, 2022
@dwoz self-assigned this on Feb 1, 2022
@frebib
Contributor

frebib commented Feb 2, 2022

Does the number of open file descriptors appear to increase too? I'm wondering if this is related to #61521

@dwoz
Contributor Author

dwoz commented Feb 2, 2022

Here is a patch that addresses the issue for 3004. I've also tested against the master branch and I don't see the memory leak there; it looks like the recent transport refactor inadvertently addressed the issue.

0001-Slow-memory-leak-fix.patch.txt

@frebib I don't believe this is the same issue as #61521. I believe the root of #61521 is caused by this commit. We're probably creating multiple instances of events and transports, which needs to be addressed. #61468 should be a step in the right direction.

@Zpell82

Zpell82 commented Nov 27, 2023

Is there any progress on this?
Salt Version:
Salt: 3006.4

Python Version:
Python: 3.10.4 (main, Apr 20 2022, 01:21:48) [GCC 10.3.1 20210424]

Dependency Versions:
cffi: 1.14.6
cherrypy: unknown
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.2
libgit2: Not Installed
looseversion: 1.0.2
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 22.0
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.9.8
pygit2: Not Installed
python-gnupg: 0.4.8
PyYAML: 6.0.1
PyZMQ: 23.2.0
relenv: Not Installed
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.4

System Versions:
dist: alpine 3.14.6
locale: utf-8
machine: x86_64
release: 5.15.0-87-generic
system: Linux
version: Alpine Linux 3.14.6

[screenshots attached]

@Zpell82

Zpell82 commented Dec 22, 2023

It looks a lot better in 3006.5 when using the af12352 fix in minion.py. Will be observing over the Christmas holidays.

@dwoz
Contributor Author

dwoz commented Dec 28, 2023

It looks a lot better in 3006.5 when using the af12352 fix in minion.py. Will be observing over the Christmas holidays.

Fantastic news, thanks for the update.

@max-arnold
Contributor

How to debug a memory leak like this? tracemalloc or some other tools?
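
For context, the tracemalloc snapshot-diff approach I have in mind would look roughly like this (a sketch, assuming it can be run inside the process under suspicion, e.g. from a test harness or a temporary debug hook):

import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation

baseline = tracemalloc.take_snapshot()

# ... let the suspected workload run for a while ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.compare_to(baseline, "lineno")[:10]:
    # Print the ten allocation sites that grew the most since the baseline.
    print(stat)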

@dwoz
Contributor Author

dwoz commented Dec 29, 2023

@max-arnold

In the past we've relied on __del__ to clean things like this up. That is generally considered an anti-pattern in Python, and we've been working towards cleaning that practice up. Most recently I've started adding warnings when an object gets garbage collected without being properly closed (#65559). This has revealed several places (now fixed) in our test suite where transport clients were not being closed. With this code in place it should be easier for users to identify these kinds of issues and report them with useful debug info.
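
A minimal sketch of that warning pattern (illustrative only, not Salt's actual implementation):

import warnings


class Client:
    def __init__(self):
        self._closed = False

    def close(self):
        self._closed = True

    def __del__(self):
        # Relying on __del__ for cleanup is the anti-pattern; here it only
        # warns so an unclosed client is surfaced instead of silently leaked.
        if not self._closed:
            warnings.warn(
                f"{type(self).__name__} was garbage collected without close() being called",
                ResourceWarning,
            )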

We've also been working towards better tooling for debugging running salt processes. We've added debug symbol packages in 3006.x, and there is a newer tool, relenv-gdb-dbg, to help debug these kinds of issues.

@Zpell82

Zpell82 commented Jan 3, 2024

Looks a lot better when I review the EventPublisher process.
[screenshot attached]

@Zpell82

Zpell82 commented Jan 5, 2024

@dwoz I needed to flip this check on this line in the minion.py code:
[screenshot attached]
If I don't do this I won't get any answers or scheduled-event data from the deltaproxy minions to the master. It's just a quick fix, probably not the best one; the issue is that the deltaproxy answer is a list, if I remember correctly.

@network-shark
Contributor

@Zpell82 sounds like a bug that should be its own issue?

@dwoz
Contributor Author

dwoz commented Feb 13, 2024

@dwoz I needed to flip this check on this line in the minion.py code; if I don't do this I won't get any answers or scheduled-event data from the deltaproxy minions to the master. It's just a quick fix, probably not the best one; the issue is that the deltaproxy answer is a list, if I remember correctly.

@Zpell82 Can you open a separate issue for this?

@jheiselman

We are running salt 3006.6 and are seeing this memory leak. I examined salt/minion.py and it already has the fix suggested in #61565 (comment).

Further, after the upgrade from 3006.5 to 3006.6, we are now seeing this problem present much faster than before. It used to take about 20 days; now we've noticed it after just 7 days.

@aalewis

aalewis commented Feb 20, 2024

Yeah, seeing this in 3006.6 in testing as well.

@dwoz
Contributor Author

dwoz commented Mar 2, 2024

I believe this is resolved with 3006.7. Can folks confirm?

@jheiselman

I attempted to install 3006.7, but I got an error while trying to salt-pip install pygit2 to use with a git backend.

ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: 'readelf'

OS is Debian Bullseye

Maybe a new dependency needs to be added to the deb?

@lomeroe
Contributor

lomeroe commented Mar 10, 2024

I still see 3006.7 leaking in our environment... the EventPublisher process is over 8GB of memory in ~48 hours since upgrading (edited from originally saying "less than 24 hours"; I lost track of what day it was /facepalm).

A master still on 3006.4 has its EventPublisher up to 24GB in the roughly 10 days since its last restart.

These masters are both using an external job cache. Masters not using an external job cache don't seem to leak noticeably - are others seeing the leak in 3006.6/etc. with an external job cache? Maybe the external job cache is a red herring; I haven't dug into how that is all glued together and whether it is handled by the EventPublisher process...

@jheiselman

@lomeroe as a matter of fact, ours is using an external job cache (redis).

@lomeroe
Contributor

lomeroe commented Mar 11, 2024

Interesting... maybe there is something there with the external job cache and the EventPublisher process leaking (we are not using redis as our external job cache, so at least it doesn't seem to be specific to a single external job cache type).

@dwoz - thoughts?

@johje349

This is still a problem in 3006.7 for us; we have a cronjob to restart the Salt master once a day.
We are not using an external job cache.

@lomeroe
Contributor

lomeroe commented Mar 22, 2024

@johje349 @jheiselman - do either of you run orchestrations or other background jobs that utilize batch mode? After restarting a master with the patch I mention in #66249, memory for the EventPublisher process hasn't gone over 400MB in almost 24 hours, which seems considerably better than what we have been seeing - typically it is several GB in a day. Obviously we need more monitoring time to really tell if it has something to do with it, so it could just be coincidental...

We've got quite a few orchestration/API-initiated jobs that use batch mode (all initiated on the masters exhibiting the issue), so if they were hanging up or never fully returning, I suppose it's possible that could somehow cause memory issues in the EventPublisher process. I wouldn't have ever guessed that, but...

@jheiselman

We do not have very many, but yes, we do have a few scheduled jobs (started via the salt-api) that utilize batch mode. None of our orchestrations use batch mode.

Both of our salt-masters were last restarted four days ago. One of them is currently using 3.5 GB of memory and the other less than 1 GB. They share the same workloads/minions; there's no difference between the two. Their salt-apis are behind a load balancer, so one of them may be getting heavier jobs than the other purely by chance, but the load balancer is configured for simple round-robin so they should be getting the same number of jobs.

@johje349

Yes, we use orchestration states with batch set to 20%.

@lomeroe
Contributor

lomeroe commented Mar 27, 2024

After 5 days, EventPublisher memory usage is still hovering around 400MB. The only change made was applying the patch mentioned in #66249. It seems fairly likely to me that that issue causes memory usage to grow in the EventPublisher process, because jobs run in batch mode never actually end.

@Zpell82

Zpell82 commented Apr 5, 2024

@dwoz I needed to flip this check on this line in the minion.py code; if I don't do this I won't get any answers or scheduled-event data from the deltaproxy minions to the master. It's just a quick fix, probably not the best one; the issue is that the deltaproxy answer is a list, if I remember correctly.

@Zpell82 Can you open a separate issue for this?

Found the issue: I had master: ["mastername"] in my minion config. I removed the brackets and it started working just fine.

@thinkst-marco

@lomeroe's patch in #66249 worked for us in eliminating the memory leak we were observing in EventPublisher. We don't rely on orchestration, external job caches, or salt-api but do use batches. Thanks for the patch!
