A possible implementation for managing non-python agents #320

cnweaver · 2023-02-24T22:29:19Z

This is a tentative set of changes to support managing agents (or indeed any program) which are not written in python (and without using Docker).

Description

This change allows agents to be specified with an executable, agent-exe, as an alternative to agent-class. If this is done, the executable is run directly, rather than running the python interpreter on a script or module.

Arguments are passed somewhat differently to agent executables: The agent_arguments are passed directly, and the WAMP (router) URL and realm are passed, instead of the OCS site file and host. Additionally, the instance ID passed to the child is fully qualified, with the OCS address root. The logic for this is to avoid every agent having to reimplement the OCS config parsing logic, when the HostManager has already done the parsing, and has all necessary information available to it to pass on. The disadvantage is that handling of python and generic agents is not symmetric.

A capture_logs boolean option is added for agents. When this is enabled, the AgentProcessHelper writes data captured from stdout/stderr to a file alongside existing logs. This is not fancy, but at least ensures that log data from programs which write normally to stdout/stderr can be checked.

These changes are orthogonal to, and should not interfere with, the existing Docker agent management mechanism/encoding, but it is somewhat dissatisfying that python, docker, and generic code paths are not more parallel. The hope is that this avoids breaking any of the Docker handling, which I cannot easily test, but it is not very elegant.

Motivation and Context

I would like to manage agents which are not written in python due to their requirements for multi-threading and low-latency operation on high data volumes, e.g. the CMB-S4 DAQ collector prototype. It is already possible for such agents to communicate over WAMP to publish and subscribe to data feeds and have operations controlled with an OCSClient without changes to OCS, but it is also desirable to be able to manage their lifetimes (via ocsbow, and the HostManager, etc.) as well.

How Has This Been Tested?

All tests in the test suite pass, and I am successfully able to manage (bring up, keep running, shut down, and view logs) a C++ agent alongside standard python agents on an Ubuntu host.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.

Documentation of these features is still missing; I will be happy to write it if/when we converge on the implementation details. I am also not entirely sure that there are not gaps in my implementation; I have tested by running my own use-case (with both python and non-python agents), and the test suite, but I don't have a good feel for the completeness of the coverage that provides.

for more information, see https://pre-commit.ci

updates: - [github.com/pre-commit/mirrors-autopep8: v2.0.1 → v2.0.2](pre-commit/mirrors-autopep8@v2.0.1...v2.0.2)

…config [pre-commit.ci] pre-commit autoupdate

* Separate blocks for each registered operation * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests: Remove check for exception in changed agent ops test * Switch dict representation --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Brian Koopman <[email protected]>

…adata-dep Add `importlib_metadata` dependency for ocs-agent-cli

Instead of showing up in top as "ocs-agent-cli", it shows up as "ocs-agent:{instance_id}".

ocs-agent-cli renames itself in process list

* Update os+python version on readthedocs * Add rtd config to ignore paths for workflows

* Add extra dependencies for development * Qualify image names with repository * Fix test failures from calling `os.path.join` on mock objects * Fix mistakes in dev dependency list * Revert "Qualify image names with repository" This change was unsafe, as it would run tests over previously released code, instead of possibly modified, local code. This reverts commit f357ad3. --------- Co-authored-by: C. Weaver <[email protected]>

* ocs-local-support: warn if crossbar/scf disagree on realm/address_root * registry subscribes to {address_root}.* not observatory.* Also changed OCSAgent to fail if any startup subscriptions fail. * aggregator subscribes to {address_root}.* not observatory.* * influxdb_publisher subscribes to {address_root}.* not observatory.* * InfluxDBAgent: exit cleanly on ctrl-c ... even when looping on failed connections to influxdb. * Tweak a few references to "observatory" in docstrings * docs: clarify "observatory" as default address_root * Remove all references to "registry_address" This SCF param actually wasn't used anywhere, except optionally in the ocs-agent-cli (which now simply defaults to {address_root}.registry and can still be overridden). The command-line argument remains, but is doc'ed as deprecated and has no effect on anything. * Fix registry tests (needs args.address_root) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ocs-local-support: better error message if crossbar not managed ... in response to requests for "generate_crossbar_config" and "start crossbar". Also update erroneous docs for the former. * ocs-crossbar image can be configured more, with env Now the crossbar config can be bind-mounted in, or else some envvars are used to generate one from a template before crossbar is started. * ocs_agent: more informative error message on realm mismatch Also the SCF docs warn about updating the crossbar config. * Update documentation related to crossbar configuration --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Brian Koopman <[email protected]>

Fixes simonsobs#274.

Cast tuple param before type check

…sses (simonsobs#333) * Log exceptions caught within start() * Create failing test for start task with None params * Handle ParamHandler getting None params * Simplify None params handling * Fix log formatting syntax

…t-docker-compose Replace pytest-docker-compose with pytest-docker plugin

updates: - [github.com/pycqa/flake8: 6.0.0 → 6.1.0](PyCQA/flake8@6.0.0...6.1.0)

…config [pre-commit.ci] pre-commit autoupdate

HostManager can crash and restart really quickly, so this rate-limiting is necessary on some systems to prevent systemd from giving up on the service.

systemd service file: add RestartSec=10s

updates: - [github.com/pre-commit/mirrors-autopep8: v2.0.2 → v2.0.4](pre-commit/mirrors-autopep8@v2.0.2...v2.0.4)

…config [pre-commit.ci] pre-commit autoupdate

mhasself

Thanks for this contribution, and sorry it took so long to review it.

My comments below are not intended as the final word -- happy to discuss with you and other OCS devs.

I am on board with adding this capability, but I would like to suggest a modification to the implementation. I would like for the agent-class to still be a mandatory setting in every agent block of the SCF. The alternate operation path would then be entirely triggered by whether agent-exe is set to non-None. (As a by-product, this does simplify a lot of arg-checking logic I think.)

The AgentClass plays an important role in ocs-web, for example to help instantiate the correct control panel for an Agent. (ocs-web uses the class reported by the Agent instance itself, through get_api, rather than whatever HostManager has read from the SCF ... but still, it is useful to think of agent instances as belonging to a particular named class.)

As a parallel with the docker case, where AgentClass is reported in tables with a [d] suffix, I think executable case should have an AgentClass reported, with the [exe] suffix. The path to the binary, which is currently jammed into the [agent] column (in the ocsbow status output for example), could instead be added as a separate column. (For the docker-based agents that column would contain the docker-compose file + service name I guess; beyond the scope of this work but just thinking ahead.)

See a few other comments inline. I am happy to do some of the work on this, as I am in the process of addressing a bunch of other HM issues and it would be great for this to all evolve in harmony.

mhasself · 2023-09-06T00:52:41Z

ocs/agents/host_manager/drivers.py

+        if self.log_file is not None:
+            self.log_file.flush()


In my brief tests, this flush happens too quickly -- before the INT is sent and processed. You should probably do a flush in processExited in addition to / instead of here.

Agreed; moving the flush to processExited seems to work much better.

mhasself · 2023-09-06T00:59:24Z

ocs/site_config.py

+            The Agent executable.  This
+            may be matched against the agent_exe name provided by
+            the Agent instance, as a way of finding the right
+            InstanceConfig.
+


Is this feature (matching against agent_exe) useful? I have been thinking that the class-matching feature (that it mimics) never really turned out to be that useful, and was considering deprecating it. I would not miss it!

I don't think I had any particular use in mind; I was just trying to mimic the class-matching feature in case it was somehow important. On that basis, I don't think I would object to removing it, now or later.

mhasself · 2023-09-06T14:15:09Z

ocs/agents/host_manager/agent.py

+                cmd = [instance['agent_exe'], '--instance-id', self.address_root + '.' + iid,
+                       '--wamp-url', self.wamp_url, '--wamp-realm', self.wamp_realm]


I understand wanting to join address_root and iid, but can you call it --agent-address or --address? I don't like redefining --instance-id.

I think I called it --instance-id solely out of an attempt to better fit in with the surrounding code; my agents have always taken --address and I later added this as an alias. So, if you prefer --address, so do I, and I'm happy to switch it. :)

BrianJKoopman

I finally took the time to test this out today, it worked great to startup an external executable. I don't have much to add to @mhasself's review. The one thing I will comment on is the logging.

I'd like to suggest logging stdout/stderr by default, with the option to disable it. We currently run into situations where logging isn't on (because log_dir isn't set) then an error occurs and we'd like to have logs to debug, but don't. Logging by default avoids that, and allows one to disable the feature if logging is handled some other way, perhaps in the non-python exe.

BrianJKoopman · 2023-09-13T20:51:16Z

ocs/agents/host_manager/agent.py

+                if "agent_arguments" in instance:
+                    cmd.extend(instance["agent_arguments"])
+            if instance['write_logs']:
+                log_file_path = self.log_dir + '/' + self.address_root + '.' + iid + ".log"


If log_dir isn't defined in the SCF this'll throw:

2023-09-13T16:50:49-0400 Unhandled Error Traceback (most recent call last): File "/home/koopman/git/ocs/ocs/agents/host_manager/agent.py", line 672, in main runner.run(agent, auto_reconnect=True) File "/home/koopman/.venv/user/lib/python3.11/site-packages/autobahn/twisted/wamp.py", line 439, in run reactor.run() File "/home/koopman/.venv/user/lib/python3.11/site-packages/twisted/internet/base.py", line 1318, in run self.mainLoop() File "/home/koopman/.venv/user/lib/python3.11/site-packages/twisted/internet/base.py", line 1328, in mainLoop reactorBaseSelf.runUntilCurrent() --- <exception caught here> --- File "/home/koopman/.venv/user/lib/python3.11/site-packages/twisted/internet/base.py", line 967, in runUntilCurrent f(*a, **kw) File "/home/koopman/git/ocs/ocs/agents/host_manager/agent.py", line 353, in _launch_instance log_file_path = self.log_dir + '/' + self.address_root + '.' + iid + ".log" builtins.TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

If this isn't set, consistent behavior with OCS would be to just not log.

Thanks for spotting that; it was definitely a bug.

for more information, see https://pre-commit.ci

…nto non-python-agents

cnweaver · 2023-09-15T21:46:49Z

Thanks for the review comments! I do have a few questions on how exactly to implement some of the bigger picture ones:

I would like for the agent-class to still be a mandatory setting in every agent block of the SCF.

I didn't do this in large part because it wasn't obvious to me what value one should use for the agent class of something which is not in fact a python class. It could be treated as just a label, but the instance ID already has that role? I suppose for display purposes the basename of the executable path could be used, as that's fairly analogous to a python class name, but it would still seem odd to me to make the user specify this, when doing so is both redundant with the executable path (which is necessary), and introduces a point of possible confusion if the user doesn't keep the two items properly synchronized when writing or editing the configuration.

The AgentClass plays an important role in ocs-web, for example to help instantiate the correct control panel for an Agent. (ocs-web uses the class reported by the Agent instance itself, through get_api, rather than whatever HostManager has read from the SCF ... but still, it is useful to think of agent instances as belonging to a particular named class.)

I've never used the web interface, so I missed this. However, this use-case still leaves unclear what the agent class should be for an executable. It's not clear to me what control panels there are, or where they come from; is there something sensible for ocs-web to select for an arbitrary agent? I guess I would have expected it, at least in the generic case, to use the reflection provided by WAMP to discover what commands, etc. each agent supports.

As a parallel with the docker case, where AgentClass is reported in tables with a [d] suffix, I think executable case should have an AgentClass reported, with the [exe] suffix. The path to the binary, which is currently jammed into the [agent] column (in the ocsbow status output for example), could instead be added as a separate column. (For the docker-based agents that column would contain the docker-compose file + service name I guess; beyond the scope of this work but just thinking ahead.)

This is certainly doable, but it would take up more space, and depending on what value was used for the agent class it might be no more informative? I do see the utility in being able to display more information about the docker case; if that is the direction in which to go, though, what label would make sense for the column which might contain an executable path, might contain docker info, or might be empty (for python agents, unless there's also something relevant to display for them)?

I'd like to suggest logging stdout/stderr by default, with the option to disable it. We currently run into situations where logging isn't on (because log_dir isn't set) then an error occurs and we'd like to have logs to debug, but don't. Logging by default avoids that, and allows one to disable the feature if logging is handled some other way, perhaps in the non-python exe.

I'm happy to make logging the default; in fact I'm not sure why I didn't to begin with. That being said, I'd like to better understand the situation around log_dir not being set: I've fixed my bug of crashing when it's unset (by just not logging, as suggested), but that just leads to exactly the situation that there are no logs if the user forgets to set it. That could be changed in general, e.g. by having a default log directory, but that seems like a more general change outside the scope of what I'm adding?

BrianJKoopman · 2023-09-18T14:58:22Z

I'm happy to make logging the default; in fact I'm not sure why I didn't to begin with. That being said, I'd like to better understand the situation around log_dir not being set: I've fixed my bug of crashing when it's unset (by just not logging, as suggested), but that just leads to exactly the situation that there are no logs if the user forgets to set it. That could be changed in general, e.g. by having a default log directory, but that seems like a more general change outside the scope of what I'm adding?

I don't really have much to add, only that log_dir isn't a required setting in the SCF at the moment. In practice on SO, essentially all agents are running in Docker containers and log aggregation is happening via Loki.

I like the idea of a default log directory, but agree it might be out of scope here. I'd need to consider how we would handle log rotation. Some Agents' logs can be kind of verbose. In the Docker context, this'll make container's disk usage grow (and Docker is also storing the logs, so we're doubling this log output at that point.)

C. Weaver and others added 3 commits February 24, 2023 17:20

A possible implementation for managing non-python agents

64fe132

[pre-commit.ci] auto fixes from pre-commit.com hooks

3bb7842

for more information, see https://pre-commit.ci

Fix reference to removed variable

feab97f

BrianJKoopman requested a review from mhasself March 1, 2023 19:27

pre-commit-ci bot and others added 19 commits March 6, 2023 19:25

[pre-commit.ci] pre-commit autoupdate

e1fbd11

updates: - [github.com/pre-commit/mirrors-autopep8: v2.0.1 → v2.0.2](pre-commit/mirrors-autopep8@v2.0.1...v2.0.2)

Merge pull request simonsobs#321 from simonsobs/pre-commit-ci-update-…

455df18

…config [pre-commit.ci] pre-commit autoupdate

Add importlib_metadata dependency for ocs-agent-cli

6bfd154

Merge pull request simonsobs#323 from simonsobs/koopman/importlib-met…

91c51d5

…adata-dep Add `importlib_metadata` dependency for ocs-agent-cli

Use procsettitle to have ocs-agent-cli rename itself

919b966

Instead of showing up in top as "ocs-agent-cli", it shows up as "ocs-agent:{instance_id}".

Merge pull request simonsobs#325 from simonsobs/set-title

d6e95ec

ocs-agent-cli renames itself in process list

Fix readthedocs builds (simonsobs#328)

ff41bb8

* Update os+python version on readthedocs * Add rtd config to ignore paths for workflows

Cast tuple param before type check

5edbde1

Fixes simonsobs#274.

Merge pull request simonsobs#332 from simonsobs/koopman/issue-274

4b9fd3f

Cast tuple param before type check

Replace pytest-docker-compose with pytest-docker plugin

003260a

Merge pull request simonsobs#342 from simonsobs/koopman/replace-pytes…

befdda6

…t-docker-compose Replace pytest-docker-compose with pytest-docker plugin

[pre-commit.ci] pre-commit autoupdate

4222f53

updates: - [github.com/pycqa/flake8: 6.0.0 → 6.1.0](PyCQA/flake8@6.0.0...6.1.0)

Merge pull request simonsobs#340 from simonsobs/pre-commit-ci-update-…

c8d1fc5

…config [pre-commit.ci] pre-commit autoupdate

systemd service file: add RestartSec=10s

cf39d9c

HostManager can crash and restart really quickly, so this rate-limiting is necessary on some systems to prevent systemd from giving up on the service.

Merge pull request simonsobs#344 from simonsobs/hm-service-robustness

4a981c5

systemd service file: add RestartSec=10s

BrianJKoopman self-requested a review August 25, 2023 19:30

pre-commit-ci bot and others added 2 commits August 28, 2023 19:46

[pre-commit.ci] pre-commit autoupdate

28bc23d

updates: - [github.com/pre-commit/mirrors-autopep8: v2.0.2 → v2.0.4](pre-commit/mirrors-autopep8@v2.0.2...v2.0.4)

Merge pull request simonsobs#346 from simonsobs/pre-commit-ci-update-…

c805bab

…config [pre-commit.ci] pre-commit autoupdate

mhasself requested changes Sep 6, 2023

View reviewed changes

mhasself mentioned this pull request Sep 13, 2023

HostManager overhaul #353

Merged

6 tasks

BrianJKoopman reviewed Sep 13, 2023

View reviewed changes

A possible implementation for managing non-python agents

e1cea68

pre-commit-ci bot and others added 7 commits September 14, 2023 16:27

[pre-commit.ci] auto fixes from pre-commit.com hooks

80bb45e

for more information, see https://pre-commit.ci

Fix reference to removed variable

87b6dce

Merge branch 'non-python-agents' of https://github.com/cnweaver/ocs i…

f1a29a3

…nto non-python-agents

Call the combined address root and instance ID the address

e11331b

Turn off logging instead of crashing if log_dir is not set

0f71bb5

Flush log file later to avoid confusing log delays

4c7549f

Log executable agent output by default

9d31362

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A possible implementation for managing non-python agents #320

A possible implementation for managing non-python agents #320

cnweaver commented Feb 24, 2023

mhasself left a comment

mhasself Sep 6, 2023

cnweaver Sep 15, 2023

mhasself Sep 6, 2023

cnweaver Sep 14, 2023

mhasself Sep 6, 2023

cnweaver Sep 15, 2023

BrianJKoopman left a comment

BrianJKoopman Sep 13, 2023

cnweaver Sep 15, 2023

cnweaver commented Sep 15, 2023

BrianJKoopman commented Sep 18, 2023

		cmd = [instance['agent_exe'], '--instance-id', self.address_root + '.' + iid,
		'--wamp-url', self.wamp_url, '--wamp-realm', self.wamp_realm]

A possible implementation for managing non-python agents #320

Are you sure you want to change the base?

A possible implementation for managing non-python agents #320

Conversation

cnweaver commented Feb 24, 2023

Description

Motivation and Context

How Has This Been Tested?

Types of changes

Checklist:

mhasself left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BrianJKoopman left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cnweaver commented Sep 15, 2023

BrianJKoopman commented Sep 18, 2023