[BUG] 3005 broke syndication #62577

Closed

ebusto opened this issue Aug 30, 2022 · 30 comments
Labels: Bug (broken, incorrect, or confusing behavior) · Phosphorus v3005.0 (release code name and version) · Regression (breaks functionality known to work in previous releases) · Salt-Syndic · VMware

Comments

@ebusto

ebusto commented Aug 30, 2022

Description
3005 broke syndication.

Setup
Upgrade an existing 3004.2 environment to 3005. Watch syndicated masters no longer connect to the syndication master.

Expected behavior
Syndication should actually be tested before releases.

@ebusto added the Bug and needs-triage labels Aug 30, 2022
@OrangeDog added the Regression, Salt-Syndic, and Phosphorus v3005.0 labels Aug 30, 2022
@Ch3LL added this to the Phosphorus v3005.0 milestone Aug 30, 2022
@dwoz
Contributor

dwoz commented Aug 30, 2022

@ebusto

Can you please include your configs and some logs from the master and syndics?

@ymasson
Contributor

ymasson commented Aug 30, 2022

Servers:

  • 2 MoM (mom1,mom2)
  • 2 Syndic (syndic1,syndic2)

Upgraded from 3004.2 to 3005 (Debian 11 packages from Salt's repositories).

Each MoM runs a master and a minion, sharing /etc/salt/pki/master:
master.conf: order_masters: True
minion.conf: master: ['mom1','mom2']

Each syndic node runs a master, a minion, and a syndic, sharing /etc/salt/pki/master:
master.conf: syndic_master: ['mom1','mom2'] and syndic_forward_all_events: True
minion.conf: master: ['syndic1','syndic2']
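Spelled out as YAML, the same settings look like this (hostnames as above; which conf file each block lives in is illustrative, not prescriptive):

    # MoM master config
    order_masters: True

    # MoM minion config
    master:
      - mom1
      - mom2

    # syndic-node master config
    syndic_master:
      - mom1
      - mom2
    syndic_forward_all_events: True

    # syndic-node minion config
    master:
      - syndic1
      - syndic2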

The syndics connect to the MoM, but also to the local master, and flood it with start events like:

 salt/syndic/syndic2/start  {
    "_stamp": "2022-08-30T21:52:39.188723",
    "cmd": "_minion_event",
    "data": "Syndic syndic2 started at Tue Aug 30 23:52:39 2022",
    "id": "syndic2",
    "pretag": null,
    "tag": "salt/syndic/syndic2/start"
}

[..]

syndic/syndic2/salt/syndic/syndic2/start      {
    "_stamp": "2022-08-30T21:52:39.384410",
    "cmd": "_minion_event",
    "data": "Syndic syndic2 started at Tue Aug 30 23:52:39 2022",
    "id": "syndic2",
    "pretag": null,
    "tag": "salt/syndic/syndic2/start"
}

[..]

syndic/syndic2/syndic/syndic2/syndic/syndic2/syndic/syndic2/syndic/syndic2/salt/syndic/syndic2/start      {
    "_stamp": "2022-08-30T21:52:41.948653",
    "cmd": "_minion_event",
    "data": "Syndic syndic2 started at Tue Aug 30 23:52:39 2022",
    "id": "syndic2",
    "pretag": null,
    "tag": "salt/syndic/syndic2/start"
}
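(For reference: the local master's event bus can be watched live with the state.event runner.)

    salt-run state.event pretty=True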

@ymasson
Contributor

ymasson commented Aug 30, 2022

On the MoM event bus, at syndic startup, I see:

salt/auth	{
    "_stamp": "2022-08-30T21:52:38.550539",
    "act": "accept",
    "id": "syndic2",
    "pub": "-----BEGIN PUBLIC KEY-----XXXXXXXXXXXX\n-----END PUBLIC KEY-----\n",
    "result": true
}
minion/refresh/syndic2	{
    "Minion data cache refresh": "syndic2",
    "_stamp": "2022-08-30T21:52:38.870116"
}

Nothing else.

@tkaehn

tkaehn commented Aug 31, 2022

I am not sure if it's related to the syndic setup. However, after upgrading, the "syndic" server stops working correctly after a couple of minutes, with the errors seen below in the log.
When the error occurs, it is not possible to contact minions from the master using the "salt" command. Running "salt-call" on the minions works fine.
In contrast, the "mom" server is running fine and can contact its own directly connected minions.

What might be the reason for this problem?

I am running Debian 10 with packages from the official Salt repository.

2022-08-30 17:04:44,564 [tornado.general  :435 ][ERROR   ][7840] Uncaught exception in ZMQStream callback
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 673, in monitor_callback
    evt["description"] = self.event_map[evt["event"]]
KeyError: 4096
2022-08-30 17:04:44,565 [tornado.general  :462 ][ERROR   ][7840] Uncaught exception in zmqstream callback
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 673, in monitor_callback
    evt["description"] = self.event_map[evt["event"]]
KeyError: 4096
2022-08-30 17:04:44,565 [tornado.application:640 ][ERROR   ][7840] Exception in callback (<zmq.sugar.socket.Socket object at 0x7f7fe0183b40>, <function wrap.<locals>.null_wrapper at 0x7f7fdb323b70>)
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/ioloop.py", line 890, in start
    handler_func(fd_obj, events)
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 673, in monitor_callback
    evt["description"] = self.event_map[evt["event"]]
KeyError: 4096
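For what it's worth, the crash itself is just an unguarded dict lookup in monitor_callback (salt/transport/zeromq.py line 673 above): the installed pyzmq's event map has no entry for event code 4096. A defensive variant along these lines (a hypothetical sketch, not Salt's actual fix) would tolerate unknown event codes:

    import zmq.utils.monitor

    def describe_monitor_event(msg, event_map):
        # Parse a socket-monitor message, falling back to a generic
        # description when libzmq reports an event code (e.g. 4096)
        # that an older pyzmq's event map does not contain.
        evt = zmq.utils.monitor.parse_monitor_message(msg)
        evt["description"] = event_map.get(
            evt["event"], "UNKNOWN_EVENT_{}".format(evt["event"])
        )
        return evt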

@d--j
Contributor

d--j commented Sep 1, 2022

Since Debian 10 has an old version of python3-zmq this might be similar to #62550 and #62550 (comment) might be a workaround.
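A quick way to check which pyzmq and libzmq versions a given interpreter actually loads:

    python3 -c "import zmq; print(zmq.pyzmq_version(), zmq.zmq_version())"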

@dmurphy18
Contributor

@d--j I tested pyzmq for the hang issue on Debian 10 and saw no problem with the OS-provided python3-zmq; see #62550 (comment)

@d--j
Contributor

d--j commented Sep 1, 2022

> @d--j I tested pyzmq for the hang issue on Debian 10 and saw no problem with the OS-provided python3-zmq; see #62550 (comment)

At least @tkaehn is running Debian 10, not Debian 11. Debian 10 has python3-zmq version 17.1.2. See https://packages.debian.org/buster/python3-zmq

@tkaehn

tkaehn commented Sep 2, 2022

Downgrading the Salt packages to 3004.2+ds-1 on the "syndic" server solved the issue. The "mom" system is still running on 3005+ds-1. Both systems on Debian 10.

@tkaehn

tkaehn commented Sep 5, 2022

After upgrading both servers from Debian 10 to 11, and both to Salt 3005, everything is running stably again, so the problem is solved for me. However, it seems that 3005 introduced an incompatibility with Debian 10 that should be fixed.

@morsik
Contributor

morsik commented Sep 10, 2022

I have a problem with the syndic on Salt 3005 and Rocky Linux 8.6. It goes into an infinite loop of "minion start" events, which causes the Salt master to stop behaving correctly within a few seconds of starting…

The event bus gets spammed all over with these weird events, and the master stops responding right after syndic startup once it floods the bus with all of this:

syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/salt/minion/my-salt-master/start	{
    "_stamp": "2022-09-10T19:58:12.007623",
    "cmd": "_minion_event",
    "data": "Minion my-salt-master started at Sat Sep 10 19:57:45 2022",
    "id": "my-salt-master",
    "pretag": null,
    "tag": "salt/minion/my-salt-master/start"
}
syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/salt/minion/my-salt-master/start	{
    "_stamp": "2022-09-10T19:58:12.013383",
    "cmd": "_minion_event",
    "data": "Minion my-salt-master started at Sat Sep 10 19:57:45 2022",
    "id": "my-salt-master",
    "pretag": null,
    "tag": "salt/minion/my-salt-master/start"
}
syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/salt/minion/my-salt-master/start	{
    "_stamp": "2022-09-10T19:58:12.014113",
    "cmd": "_minion_event",
    "data": "Minion my-salt-master started at Sat Sep 10 19:57:45 2022",
    "id": "my-salt-master",
    "pretag": null,
    "tag": "salt/minion/my-salt-master/start"
}
syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/salt/minion/my-salt-master/start	{
    "_stamp": "2022-09-10T19:58:12.014525",
    "cmd": "_minion_event",
    "data": "Minion my-salt-master started at Sat Sep 10 19:57:45 2022",
    "id": "my-salt-master",
    "pretag": null,
    "tag": "salt/minion/my-salt-master/start"
}

@dmurphy18
Contributor

@ebusto Would it be possible to upgrade pyzmq on the Debian 10 system that is exhibiting the syndic problems to a version >= 20.0.0? The following example command should work:

python3 -m pip install -U "pyzmq>=20.0.0"

Be sure to restart salt-master and salt-minion after installing the updated pyzmq, just to ensure the update is picked up.
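On a systemd distribution that would be, for example:

    systemctl restart salt-master salt-minion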

On the Debian 10 system where I tested the command, pyzmq v23.2.1 was installed.
We are checking against a fix in pyzmq's thread handling, introduced in pyzmq v20.0.0, which has since resolved some other issues with the pyzmq used on Ubuntu 20.04.

@Ch3LL
Contributor

Ch3LL commented Sep 16, 2022

Bump ^ Can anyone confirm whether the issue is related to the version of pyzmq, as @dmurphy18 outlined?

@Ch3LL
Contributor

Ch3LL commented Sep 22, 2022

Just checking in again. Can anyone confirm the pyzmq upgrade @dmurphy18 outlined solves this issue?

@satellite-no

This did not help me, as I'm on PyZMQ 24.0.1 and still seeing issues with comms from the syndic node to the master.

@Ch3LL
Contributor

Ch3LL commented Sep 29, 2022

@satellite-no Can you post your entire versions report (salt --versions-report)?
And @tkaehn, can you post your versions report for your working instance?

I'm guessing this is a dependency issue, since it's working on an OS that has different dependency versions.

@tkaehn

tkaehn commented Sep 30, 2022

Please see the version reports below. As I've said before, everything is working again (on the surface).
However, in the log files I get tons of messages like the following, which didn't happen before 3005:

2022-09-29 10:43:30,061 [salt.loaded.int.returner.local_cache:147 ][ERROR   ][1567775] An extra return was detected from minion [...]

When targeting a lot of minions (~100) at the same time, it occasionally happens that some of them do not return. This happened before 3005 as well. Is it possible that increasing worker_threads (currently 32) could solve this problem?
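(For reference, worker_threads is set in the master config, /etc/salt/master, and needs a salt-master restart to take effect; the value below is purely illustrative.)

    worker_threads: 48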

mom# salt --versions-report
Salt Version:
          Salt: 3005
 
Dependency Versions:
          cffi: Not Installed
      cherrypy: Not Installed
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: 4.0.5
     gitpython: 3.1.14
        Jinja2: 2.11.3
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.0
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: Not Installed
      pycrypto: Not Installed
  pycryptodome: 3.9.7
        pygit2: Not Installed
        Python: 3.9.2 (default, Feb 28 2021, 17:03:44)
  python-gnupg: Not Installed
        PyYAML: 5.3.1
         PyZMQ: 20.0.0
         smmap: 4.0.0
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.4
 
System Versions:
          dist: debian 11 bullseye
        locale: utf-8
       machine: x86_64
       release: 5.10.0-18-amd64
        system: Linux
       version: Debian GNU/Linux 11 bullseye
syndic# salt --versions-report
Salt Version:
          Salt: 3005
 
Dependency Versions:
          cffi: Not Installed
      cherrypy: Not Installed
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: 4.0.5
     gitpython: 3.1.14
        Jinja2: 2.11.3
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.0
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: Not Installed
      pycrypto: Not Installed
  pycryptodome: 3.9.7
        pygit2: Not Installed
        Python: 3.9.2 (default, Feb 28 2021, 17:03:44)
  python-gnupg: Not Installed
        PyYAML: 5.3.1
         PyZMQ: 20.0.0
         smmap: 4.0.0
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.4
 
System Versions:
          dist: debian 11 bullseye
        locale: utf-8
       machine: x86_64
       release: 5.10.0-18-amd64
        system: Linux
       version: Debian GNU/Linux 11 bullseye

@satellite-no

Sure thing!

Master Versions Report
[root@salt deepwatch]# salt --versions-report
Salt Version:
          Salt: 3005

Dependency Versions:
          cffi: 1.15.1
      cherrypy: Not Installed
      dateutil: Not Installed
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.10
       libgit2: 1.5.0
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 0.5.6
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.6.1
        pygit2: 1.10.0
        Python: 3.7.10 (default, Jun  3 2021, 00:02:01)
  python-gnupg: Not Installed
        PyYAML: 4.2
         PyZMQ: 24.0.1
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: amzn 2
        locale: UTF-8
       machine: x86_64
       release: 5.10.135-122.509.amzn2.x86_64
        system: Linux
       version: Amazon Linux 2
Syndic Versions Report
[root@syndic ec2-user]# salt --versions-report
Salt Version:
          Salt: 3005

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.0
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: Not Installed
        Python: 3.9.13 (main, Aug 23 2022, 18:33:16)
  python-gnupg: 0.4.8
        PyYAML: 5.4.1
         PyZMQ: 23.2.0
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: amzn 2
        locale: utf-8
       machine: x86_64
       release: 5.10.135-122.509.amzn2.x86_64
        system: Linux
       version: Amazon Linux 2

This is the error I see on the master:

syndic:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20220930141808737696

@Ch3LL
Contributor

Ch3LL commented Sep 30, 2022

Thanks. The only thing that stands out is that there are different pyzmq versions on your syndic and master. @satellite-no, would you mind upgrading the pyzmq version on your syndic, restarting the services, and seeing if that makes a difference?

@satellite-no

@Ch3LL Silly question: I'm not sure how to do that. According to my versions report, Salt is using Python 3.9, but I'm not sure where Salt installed it. The OS's Python is 3.7.10 and its pyzmq is 20.0.0.

This is a yum install of Salt, so if you know where Salt installs its Python, let me know (I attempted a find, but to no avail).

@Ch3LL
Contributor

Ch3LL commented Oct 3, 2022

Did you install the onedir packages or classic packages?

@satellite-no

I used the onedir package.

@Ch3LL
Contributor

Ch3LL commented Oct 6, 2022

Okay, so your master is classic packages and your syndic is onedir?

If you want to upgrade a package in a onedir install, you can use the salt-pip CLI tool to upgrade it.
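For example, to apply the pyzmq upgrade suggested earlier in this thread (salt-pip takes standard pip arguments):

    salt-pip install -U "pyzmq>=20.0.0"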

@xamindar

This is not good. Why is syndic still broken in 3005.1? Is there no movement on fixing this bug?

@tkaehn

tkaehn commented Oct 24, 2022

I've tried upgrading to the onedir packages to see if this makes any difference for me.
However, the log files are still flooded with messages like this, potentially hiding real errors:

salt-master[275557]: [ERROR   ] An extra return was detected from minion ...

And also like this:

salt-minion[2539]: [ERROR   ] Future <salt.ext.tornado.concurrent.Future object at 0x7f561c07f370> exception was never retrieved: Traceback (most recent call last):
[...]
salt-minion[2539]: salt.exceptions.SaltReqTimeoutError: Message timed out

I am very disappointed that a new release breaks so much and that there seems to be no quick solution.

What is your advice? Downgrading to 3004?

From the documentation:

Running commands on remote systems is the core function of Salt.
Salt can execute multiple commands across thousands of systems in
seconds with a single execution.

How can this work if targeting around 100 minions out of a few hundred results in occasional timeouts? When run again, all minions do respond. This happened with 3004 as well.

@waynew waynew self-assigned this Nov 9, 2022
@Jookadin

I'm unsure if this is the same issue we're facing, but after upgrading to 3005, our systems' loadavg goes haywire if the master's local minion connects to itself while the syndic connects to a different master.

If the minion connects to the same master as the syndic, there are no issues...
This happens on CentOS 7.

Salt Version:
          Salt: 3005.1

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.0
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: Not Installed
        Python: 3.9.15 (main, Nov  8 2022, 03:44:36)
  python-gnupg: 0.4.8
        PyYAML: 5.4.1
         PyZMQ: 23.2.0
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: centos 7 Core
        locale: utf-8
       machine: x86_64
       release: 3.10.0-1160.76.1.el7.x86_64
        system: Linux
       version: CentOS Linux 7 Core

[Screenshot, 2022-11-09 12:37: load average graph]

@jlrcontegix

jlrcontegix commented Nov 14, 2022

We are experiencing all of the issues reported here, and on a hunch from @Jookadin's comment we moved the minion on the syndic to connect to the MoM, and the problems have gone away.

I don't know if this helps, but on our syndic, when salt-minion is connected to the syndic's own master, it goes into a loop with the following error:

  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3.6/site-packages/salt/minion.py", line 2700, in handle_event
    tries=_minion.opts["return_retry_tries"],
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3.6/site-packages/salt/channel/client.py", line 295, in send
    ret = yield self._crypted_transfer(load, timeout=timeout, raw=raw)
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3.6/site-packages/salt/channel/client.py", line 252, in _crypted_transfer
    ret = yield _do_transfer()
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3.6/site-packages/salt/channel/client.py", line 235, in _do_transfer
    timeout=timeout,
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 914, in send
    ret = yield self.message_client.send(load, timeout=timeout)
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 624, in send
    recv = yield future
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out

During this period, the salt-syndic process constantly throws the following error:

[ERROR   ] Unable to connect pusher: Stream is closed

Worth noting, we are also seeing the behavior reported above, with the logs littered with messages about an extra return from minions:

[ERROR   ] An extra return was detected from minion

@waynew
Contributor

waynew commented Nov 18, 2022

For those interested/able: in salt/minion.py, in _send_req_async, try changing

        with salt.utils.event.get_event(
            "minion", opts=self.opts, listen=False
        ) as event:
            ret = yield event.fire_event_async(
                load, "__master_req_channel_payload", timeout=timeout
            )

to

        with salt.transport.client.AsyncReqChannel.factory(self.opts) as channel:
            log.warning('Sending this load %r', load)
            ret = yield channel.send(
                load, timeout=timeout, tries=self.opts["return_retry_tries"]
            )

It's probably not the right fix, but at least in my limited testing it appears to work, at least when the problem is just the syndic failing to return to the MoM.

You can verify that's your problem by checking on one of your minions off the syndic master:

ls /tmp/test.txt

Assuming there's nothing there, run the following from your MoM, given a minion named bob:

salt bob cmd.run 'touch /tmp/test.txt'

Then, back on bob, run ls /tmp/test.txt again. If the file is there, your syndic simply isn't returning results to the MoM. If it's not, then your MoM might not be connecting to the syndic -- we've seen some wonkiness there; it looks like it may have been the 4096 issue above. We haven't fully tracked it down, but our current theory is that at some point the connection to the minion(s) crashes and never gets brought back up 🤔. Simply restarting your master should fix that, though.

If you do try my above change, let me know how it works!

@nickhilliard

> I don't know if this helps, but on our syndic, when salt-minion is connected to the syndic's own master, it goes into a loop with the following error:

Seeing exactly this problem on Ubuntu 22.04 running 3005.1 installed from repo.saltproject.io, with master, syndic, and minion running on a single device. It worked fine on 3004.2 on Ubuntu 20.04.

Salt version report

Salt Version:
          Salt: 3005.1

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.0
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: Not Installed
        Python: 3.9.15 (main, Nov  8 2022, 03:42:58)
  python-gnupg: 0.4.8
        PyYAML: 5.4.1
         PyZMQ: 23.2.0
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: ubuntu 22.04 jammy
        locale: utf-8
       machine: x86_64
       release: 5.15.0-53-generic
        system: Linux
       version: Ubuntu 22.04 jammy

@waynew
Contributor

waynew commented Dec 16, 2022

#63257 should sort this out 👍

@waynew
Contributor

waynew commented Jan 13, 2023

Fixed by #63382
