Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HostManager overhaul #353

Merged
merged 14 commits into from
Sep 20, 2023
Merged

HostManager overhaul #353

merged 14 commits into from
Sep 20, 2023

Conversation

mhasself
Copy link
Member

@mhasself mhasself commented Sep 13, 2023

Description

Updates to HostManager to address a number of recent concerns.

Biggest changes:

  • New options for instance "manage" setting, such that instances are available for HostManager control, but won't be brought up by default.
  • The "manager" process will not crash due to docker-compose.yaml or SCF syntax errors.
  • The manager process handles updates to compose files and SCF more gracefully. If the SCF+compose files are eventually brought into mutual consistency, and "update(reload_config=True)" is run, then the manager process should figure it all out and present the desired view of the system.

One interface change is the meaning of "manage: yes", and "manage: docker". Previously those would include the instance for HostManager management, but with a default target state of "down". I think it makes more sense for the default target state to be "up". This only matters if people are running HostManager, and if they are running HostManager they probably want to take advantage of the auto-startup. (Most installations probably have --initial-state=up added to the HM args -- that should be discouraged, now, as it would defeat the use of "host/down" and "docker/down" options.)

Motivation and Context

This is intended to:

This does not include the features proposed in #320. I can rebase on that work if it converges soon, or integrate those functions directly in this branch if that is easier for everyone.

How Has This Been Tested?

Manually.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

The agent-class column will show if there are discrepancies between
what a hostmanager reports and the local scf that ocsbow sees.  Also
tolerates broader range of mismatch.
This just factors out each docker inspect call so failures can be
trapped more easily.  Behavior should be altered.
Instead, the 'prot' elements for docker-managed things are long-lived
and will be immediately updated when we learn new service info.
The logic handling the agent list and reconciling it against docker
inspection results is rewritten (and probably improved).
Docstring in site_config explains them all.
Copy link
Member

@BrianJKoopman BrianJKoopman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks for all the updates/bug fixes!

I tested several of the bug fixes, particularly those I reported. I, however, didn't figure out how to reproduce #348, maybe we'll just see how it performs on site.

I found one documentation/behavior inconsistency, noted below.

I still find the need for two separate host managers, one for Docker agents and one for Host agents, a bit awkward. More so now that manage options include "docker/up", "docker/down", "host/up", and "host/down", since one could potentially misconfigure a system by saying "host/up" on a Docker host manager, and vice versa.

I did test mixing these states a bit, and it puts things into a strange enough state (InfluxDBAgent[d?] ?/up, for instance) that one would notice and fix the issue. So I'd say this is alright, but maybe take the opportunity to bring up the idea of merging this so that a single host manager is needed to control both host and docker agents. Not in this PR of course, but down the road.

ocs/agents/host_manager/agent.py Outdated Show resolved Hide resolved
@mhasself
Copy link
Member Author

I tested several of the bug fixes, particularly those I reported. I, however, didn't figure out how to reproduce #348, maybe we'll just see how it performs on site.

Ah, let me take another look at this ... not sure I tested network / crossbar dropouts carefully.

I still find the need for two separate host managers, one for Docker agents and one for Host agents, a bit awkward. More so now that manage options include "docker/up", "docker/down", "host/up", and "host/down", since one could potentially misconfigure a system by saying "host/up" on a Docker host manager, and vice versa.

One doesn't need two separate HMs, right? It's just a bit more config constraint / work to get them all under single management?

For host-based and docker-based agents to run with a single HM, you need the instance descriptions to all be in a single host block (call it host1). Supposing all the defaults are set up to make life easy for host-based agents, the docker ones need:

  • to be told to look for their instance config in the host1 block (rather than, say host1-docker). This is achieved by simply setting the service hostnames to host1, or by passing --site-host=host1 as part of "command".
  • a way to get the right crossbar address. This is done with --site-hub/--site-http args, or the SITE_HUB/HTTP envvars. (Though I think, if network mode is host, you might not need to do that?)

This might be a touch easier if we'd also added a SITE_HOST envvar ... then you could use yaml anchors and state all the needed overrides (as environment) once at the top of the docker-compose.yaml.

@mhasself
Copy link
Member Author

See #354, which should make host+docker-based on single HM quite manageable.

I tested several of the bug fixes, particularly those I reported. I, however, didn't figure out how to reproduce #348, maybe we'll just see how it performs on site.

Ah, let me take another look at this ... not sure I tested network / crossbar dropouts carefully.

Ok, so HostManager crashes, just like everybody else, when crossbar/network goes down. I propose we revisit that after #337.

@BrianJKoopman
Copy link
Member

See #354, which should make host+docker-based on single HM quite manageable.

Great, I was just writing a reply to your first response -- I hadn't tried to run with just a single HM. The mode we're in now at site is 2 HM's, one host, one docker (though after moving the ACU over, we don't really need the host one at the moment.)

I fully support moving to a single HM though, so this is great.

I tested several of the bug fixes, particularly those I reported. I, however, didn't figure out how to reproduce #348, maybe we'll just see how it performs on site.

Ah, let me take another look at this ... not sure I tested network / crossbar dropouts carefully.

Ok, so HostManager crashes, just like everybody else, when crossbar/network goes down. I propose we revisit that after #337.

Alright, I'm going to remove #348 from the resolved list for now then.

@BrianJKoopman BrianJKoopman merged commit d75c972 into main Sep 20, 2023
5 checks passed
@BrianJKoopman BrianJKoopman deleted the mhasse/hostman-overhaul branch September 20, 2023 14:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants