Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A possible implementation for managing non-python agents #320

Open
wants to merge 32 commits into
base: main
Choose a base branch
from

Conversation

cnweaver
Copy link
Contributor

This is a tentative set of changes to support managing agents (or indeed any program) which are not written in python (and without using Docker).

Description

This change allows agents to be specified with an executable, agent-exe, as an alternative to agent-class. If this is done, the executable is run directly, rather than running the python interpreter on a script or module.

Arguments are passed somewhat differently to agent executables: The agent_arguments are passed directly, and the WAMP (router) URL and realm are passed, instead of the OCS site file and host. Additionally, the instance ID passed to the child is fully qualified, with the OCS address root. The logic for this is to avoid every agent having to reimplement the OCS config parsing logic, when the HostManager has already done the parsing, and has all necessary information available to it to pass on. The disadvantage is that handling of python and generic agents is not symmetric.

A capture_logs boolean option is added for agents. When this is enabled, the AgentProcessHelper writes data captured from stdout/stderr to a file alongside existing logs. This is not fancy, but at least ensures that log data from programs which write normally to stdout/stderr can be checked.

These changes are orthogonal to, and should not interfere with, the existing Docker agent management mechanism/encoding, but it is somewhat dissatisfying that python, docker, and generic code paths are not more parallel. The hope is that this avoids breaking any of the Docker handling, which I cannot easily test, but it is not very elegant.

Motivation and Context

I would like to manage agents which are not written in python due to their requirements for multi-threading and low-latency operation on high data volumes, e.g. the CMB-S4 DAQ collector prototype. It is already possible for such agents to communicate over WAMP to publish and subscribe to data feeds and have operations controlled with an OCSClient without changes to OCS, but it is also desirable to be able to manage their lifetimes (via ocsbow, and the HostManager, etc.) as well.

How Has This Been Tested?

All tests in the test suite pass, and I am successfully able to manage (bring up, keep running, shut down, and view logs) a C++ agent alongside standard python agents on an Ubuntu host.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

Documentation of these features is still missing; I will be happy to write it if/when we converge on the implementation details. I am also not entirely sure that there are not gaps in my implementation; I have tested by running my own use-case (with both python and non-python agents), and the test suite, but I don't have a good feel for the completeness of the coverage that provides.

pre-commit-ci bot and others added 19 commits March 6, 2023 19:25
updates:
- [github.com/pre-commit/mirrors-autopep8: v2.0.1 → v2.0.2](pre-commit/mirrors-autopep8@v2.0.1...v2.0.2)
…config

[pre-commit.ci] pre-commit autoupdate
* Separate blocks for each registered operation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tests: Remove check for exception in changed agent ops test

* Switch dict representation

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Brian Koopman <[email protected]>
…adata-dep

Add `importlib_metadata` dependency for ocs-agent-cli
Instead of showing up in top as "ocs-agent-cli", it shows up as
"ocs-agent:{instance_id}".
ocs-agent-cli renames itself in process list
* Update os+python version on readthedocs

* Add rtd config to ignore paths for workflows
* Add extra dependencies for development

* Qualify image names with repository

* Fix test failures from calling `os.path.join` on mock objects

* Fix mistakes in dev dependency list

* Revert "Qualify image names with repository"

This change was unsafe, as it would run tests over previously released
code, instead of possibly modified, local code.

This reverts commit f357ad3.

---------

Co-authored-by: C. Weaver <[email protected]>
* ocs-local-support: warn if crossbar/scf disagree on realm/address_root

* registry subscribes to {address_root}.* not observatory.*

Also changed OCSAgent to fail if any startup subscriptions fail.

* aggregator subscribes to {address_root}.* not observatory.*

* influxdb_publisher subscribes to {address_root}.* not observatory.*

* InfluxDBAgent: exit cleanly on ctrl-c

... even when looping on failed connections to influxdb.

* Tweak a few references to "observatory" in docstrings

* docs: clarify "observatory" as default address_root

* Remove all references to "registry_address"

This SCF param actually wasn't used anywhere, except optionally in the
ocs-agent-cli (which now simply defaults to {address_root}.registry
and can still be overridden).

The command-line argument remains, but is doc'ed as deprecated and has
no effect on anything.

* Fix registry tests (needs args.address_root)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ocs-local-support: better error message if crossbar not managed

... in response to requests for "generate_crossbar_config" and "start
crossbar".  Also update erroneous docs for the former.

* ocs-crossbar image can be configured more, with env

Now the crossbar config can be bind-mounted in, or else some envvars
are used to generate one from a template before crossbar is started.

* ocs_agent: more informative error message on realm mismatch

Also the SCF docs warn about updating the crossbar config.

* Update documentation related to crossbar configuration

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Brian Koopman <[email protected]>
…sses (simonsobs#333)

* Log exceptions caught within start()

* Create failing test for start task with None params

* Handle ParamHandler getting None params

* Simplify None params handling

* Fix log formatting syntax
…t-docker-compose

Replace pytest-docker-compose with pytest-docker plugin
updates:
- [github.com/pycqa/flake8: 6.0.0 → 6.1.0](PyCQA/flake8@6.0.0...6.1.0)
…config

[pre-commit.ci] pre-commit autoupdate
HostManager can crash and restart really quickly, so this
rate-limiting is necessary on some systems to prevent systemd from
giving up on the service.
@BrianJKoopman BrianJKoopman self-requested a review August 25, 2023 19:30
pre-commit-ci bot and others added 2 commits August 28, 2023 19:46
updates:
- [github.com/pre-commit/mirrors-autopep8: v2.0.2 → v2.0.4](pre-commit/mirrors-autopep8@v2.0.2...v2.0.4)
…config

[pre-commit.ci] pre-commit autoupdate
Copy link
Member

@mhasself mhasself left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution, and sorry it took so long to review it.

My comments below are not intended as the final word -- happy to discuss with you and other OCS devs.

I am on board with adding this capability, but I would like to suggest a modification to the implementation. I would like for the agent-class to still be a mandatory setting in every agent block of the SCF. The alternate operation path would then be entirely triggered by whether agent-exe is set to non-None. (As a by-product, this does simplify a lot of arg-checking logic I think.)

The AgentClass plays an important role in ocs-web, for example to help instantiate the correct control panel for an Agent. (ocs-web uses the class reported by the Agent instance itself, through get_api, rather than whatever HostManager has read from the SCF ... but still, it is useful to think of agent instances as belonging to a particular named class.)

As a parallel with the docker case, where AgentClass is reported in tables with a [d] suffix, I think executable case should have an AgentClass reported, with the [exe] suffix. The path to the binary, which is currently jammed into the [agent] column (in the ocsbow status output for example), could instead be added as a separate column. (For the docker-based agents that column would contain the docker-compose file + service name I guess; beyond the scope of this work but just thinking ahead.)

See a few other comments inline. I am happy to do some of the work on this, as I am in the process of addressing a bunch of other HM issues and it would be great for this to all evolve in harmony.

Comment on lines 237 to 238
if self.log_file is not None:
self.log_file.flush()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my brief tests, this flush happens too quickly -- before the INT is sent and processed. You should probably do a flush in processExited in addition to / instead of here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed; moving the flush to processExited seems to work much better.

Comment on lines +233 to +237
The Agent executable. This
may be matched against the agent_exe name provided by
the Agent instance, as a way of finding the right
InstanceConfig.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this feature (matching against agent_exe) useful? I have been thinking that the class-matching feature (that it mimics) never really turned out to be that useful, and was considering deprecating it. I would not miss it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I had any particular use in mind; I was just trying to mimic the class-matching feature in case it was somehow important. On that basis, I don't think I would object to removing it, now or later.

Comment on lines 348 to 349
cmd = [instance['agent_exe'], '--instance-id', self.address_root + '.' + iid,
'--wamp-url', self.wamp_url, '--wamp-realm', self.wamp_realm]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand wanting to join address_root and iid, but can you call it --agent-address or --address? I don't like redefining --instance-id.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I called it --instance-id solely out of an attempt to better fit in with the surrounding code; my agents have always taken --address and I later added this as an alias. So, if you prefer --address, so do I, and I'm happy to switch it. :)

@mhasself mhasself mentioned this pull request Sep 13, 2023
6 tasks
Copy link
Member

@BrianJKoopman BrianJKoopman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finally took the time to test this out today, it worked great to startup an external executable. I don't have much to add to @mhasself's review. The one thing I will comment on is the logging.

I'd like to suggest logging stdout/stderr by default, with the option to disable it. We currently run into situations where logging isn't on (because log_dir isn't set) then an error occurs and we'd like to have logs to debug, but don't. Logging by default avoids that, and allows one to disable the feature if logging is handled some other way, perhaps in the non-python exe.

if "agent_arguments" in instance:
cmd.extend(instance["agent_arguments"])
if instance['write_logs']:
log_file_path = self.log_dir + '/' + self.address_root + '.' + iid + ".log"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If log_dir isn't defined in the SCF this'll throw:

2023-09-13T16:50:49-0400 Unhandled Error
Traceback (most recent call last):
  File "/home/koopman/git/ocs/ocs/agents/host_manager/agent.py", line 672, in main
    runner.run(agent, auto_reconnect=True)
  File "/home/koopman/.venv/user/lib/python3.11/site-packages/autobahn/twisted/wamp.py", line 439, in run
    reactor.run()
  File "/home/koopman/.venv/user/lib/python3.11/site-packages/twisted/internet/base.py", line 1318, in run
    self.mainLoop()
  File "/home/koopman/.venv/user/lib/python3.11/site-packages/twisted/internet/base.py", line 1328, in mainLoop
    reactorBaseSelf.runUntilCurrent()
--- <exception caught here> ---
  File "/home/koopman/.venv/user/lib/python3.11/site-packages/twisted/internet/base.py", line 967, in runUntilCurrent
    f(*a, **kw)
  File "/home/koopman/git/ocs/ocs/agents/host_manager/agent.py", line 353, in _launch_instance
    log_file_path = self.log_dir + '/' + self.address_root + '.' + iid + ".log"
builtins.TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

If this isn't set, consistent behavior with OCS would be to just not log.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for spotting that; it was definitely a bug.

@cnweaver
Copy link
Contributor Author

Thanks for the review comments! I do have a few questions on how exactly to implement some of the bigger picture ones:

I would like for the agent-class to still be a mandatory setting in every agent block of the SCF.

I didn't do this in large part because it wasn't obvious to me what value one should use for the agent class of something which is not in fact a python class. It could be treated as just a label, but the instance ID already has that role? I suppose for display purposes the basename of the executable path could be used, as that's fairly analogous to a python class name, but it would still seem odd to me to make the user specify this, when doing so is both redundant with the executable path (which is necessary), and introduces a point of possible confusion if the user doesn't keep the two items properly synchronized when writing or editing the configuration.

The AgentClass plays an important role in ocs-web, for example to help instantiate the correct control panel for an Agent. (ocs-web uses the class reported by the Agent instance itself, through get_api, rather than whatever HostManager has read from the SCF ... but still, it is useful to think of agent instances as belonging to a particular named class.)

I've never used the web interface, so I missed this. However, this use-case still leaves unclear what the agent class should be for an executable. It's not clear to me what control panels there are, or where they come from; is there something sensible for ocs-web to select for an arbitrary agent? I guess I would have expected it, at least in the generic case, to use the reflection provided by WAMP to discover what commands, etc. each agent supports.

As a parallel with the docker case, where AgentClass is reported in tables with a [d] suffix, I think executable case should have an AgentClass reported, with the [exe] suffix. The path to the binary, which is currently jammed into the [agent] column (in the ocsbow status output for example), could instead be added as a separate column. (For the docker-based agents that column would contain the docker-compose file + service name I guess; beyond the scope of this work but just thinking ahead.)

This is certainly doable, but it would take up more space, and depending on what value was used for the agent class it might be no more informative? I do see the utility in being able to display more information about the docker case; if that is the direction in which to go, though, what label would make sense for the column which might contain an executable path, might contain docker info, or might be empty (for python agents, unless there's also something relevant to display for them)?

I'd like to suggest logging stdout/stderr by default, with the option to disable it. We currently run into situations where logging isn't on (because log_dir isn't set) then an error occurs and we'd like to have logs to debug, but don't. Logging by default avoids that, and allows one to disable the feature if logging is handled some other way, perhaps in the non-python exe.

I'm happy to make logging the default; in fact I'm not sure why I didn't to begin with. That being said, I'd like to better understand the situation around log_dir not being set: I've fixed my bug of crashing when it's unset (by just not logging, as suggested), but that just leads to exactly the situation that there are no logs if the user forgets to set it. That could be changed in general, e.g. by having a default log directory, but that seems like a more general change outside the scope of what I'm adding?

@BrianJKoopman
Copy link
Member

I'm happy to make logging the default; in fact I'm not sure why I didn't to begin with. That being said, I'd like to better understand the situation around log_dir not being set: I've fixed my bug of crashing when it's unset (by just not logging, as suggested), but that just leads to exactly the situation that there are no logs if the user forgets to set it. That could be changed in general, e.g. by having a default log directory, but that seems like a more general change outside the scope of what I'm adding?

I don't really have much to add, only that log_dir isn't a required setting in the SCF at the moment. In practice on SO, essentially all agents are running in Docker containers and log aggregation is happening via Loki.

I like the idea of a default log directory, but agree it might be out of scope here. I'd need to consider how we would handle log rotation. Some Agents' logs can be kind of verbose. In the Docker context, this'll make container's disk usage grow (and Docker is also storing the logs, so we're doubling this log output at that point.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants