Allow user configurable agent timeout for crossbar disconnection #337

Merged
BrianJKoopman merged 12 commits into main from koopman/disable-crossbar-timeout
Sep 26, 2023

BrianJKoopman
Member

Description

This PR introduces a per-Agent configurable crossbar timeout. It lets the user set how long the Agent waits, after losing its connection to the crossbar server, before cleaning up and shutting down.

A summary of changes:

  • Added crossbar-timeout argument globally to the site arguments and to ocs-agent-cli.
  • Implemented the same exponential backoff policy that twisted uses to reconnect to the crossbar server, so wait times roughly line up (up to some random jitter).
  • Implemented signal handling so the Agent can be interrupted when crossbar-timeout is 0.
  • Fixed some docstring formatting in the argparse "help" strings.
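
For reference, the backoff item above can be sketched in a few lines. This is a standalone illustration, not the code added in this PR: the function name `backoff_delay` is hypothetical, and the defaults (initial delay 1.0 s, growth factor 1.5, cap of 60 s, jitter drawn uniformly from [0, 1)) are assumptions based on the defaults of `twisted.application.internet.backoffPolicy`.

```python
import random


def backoff_delay(attempt, initial_delay=1.0, max_delay=60.0, factor=1.5,
                  jitter=random.random):
    """Return the delay in seconds before reconnection attempt `attempt`.

    Hypothetical helper. The defaults are assumed to mirror twisted's
    backoffPolicy; they are not read from the OCS source.
    """
    # Exponential growth, capped at max_delay. The exponent is clamped so
    # very large attempt numbers cannot overflow.
    base = min(initial_delay * factor ** min(attempt, 100), max_delay)
    # Jitter is added on top of the capped delay.
    return base + jitter()


if __name__ == "__main__":
    for attempt in range(1, 6):
        print(attempt, round(backoff_delay(attempt), 3))
```

Under these assumed defaults the jitter-free delays for attempts 1 through 5 are 1.5, 2.25, 3.375, 5.0625, and about 7.59 seconds.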

Usage

This timeout is available as the command-line argument --crossbar-timeout, which can also be set in the site config file (SCF) in an individual instance's 'arguments' list, for example:

      {'agent-class': 'FakeDataAgent',
       'instance-id': 'fake-data1',
       'arguments': [['--crossbar-timeout', 0],
                     ['--mode', 'acq'],
                     ['--num-channels', '2'],
                     ['--sample-rate', '1']]},

An example on the command line:

ocs-agent-cli --instance-id fake-data1 --crossbar-timeout=40

Or in a docker-compose file:

  fake-data1:
      image: ocs:test
      hostname: ocs-docker
      environment:
        - LOGLEVEL=info
        - INSTANCE_ID=fake-data1
        - CROSSBAR_TIMEOUT=21
      volumes:
        - $OCS_CONFIG_DIR:/config:ro
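
One way the command-line flag and the CROSSBAR_TIMEOUT environment variable could feed the same setting is to use the environment variable as the argparse default. The sketch below is an assumption about the general pattern, not a copy of ocs-agent-cli: `build_parser` is a hypothetical name, and only the flag names, the environment variable names, and the 10 second default come from this PR.

```python
import argparse
import os


def build_parser(environ=None):
    """Build a parser whose defaults come from environment variables.

    Hypothetical helper for illustration; the real ocs-agent-cli may
    wire this up differently.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(description="Agent launcher sketch")
    parser.add_argument("--instance-id",
                        default=environ.get("INSTANCE_ID"))
    parser.add_argument("--crossbar-timeout", type=int,
                        # An explicit flag overrides the environment variable,
                        # which in turn overrides the 10 second default.
                        default=int(environ.get("CROSSBAR_TIMEOUT", 10)),
                        help="Seconds to wait for reconnection; 0 disables.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.instance_id, args.crossbar_timeout)
```

With this arrangement, a value given on the command line takes precedence over the one set in the container environment.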

A crossbar-timeout of 0 is special and will disable the timeout, allowing the Agent to run forever through a crossbar outage.

Logging

Upon disconnecting, the Agent logs will display the following (note that some debug statements are active here, indicating the top of the acquisition loop in the fake data agent):

2023-06-05T19:43:54+0000 session left: CloseDetails(reason=<wamp.close.transport_lost>, message='WAMP transport was lost without closing the session 6165510818002603 before')
2023-06-05T19:43:54+0000 transport disconnected
2023-06-05T19:43:54+0000 waiting for reconnection
2023-06-05T19:43:54+0000 waiting at least 20.999996662139893 more seconds before giving up
2023-06-05T19:43:54+0000 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7fe4e2a3b550> in 2.371031904796556 seconds.
2023-06-05T19:43:54+0000 Top of the acq while loop
2023-06-05T19:43:55+0000 Top of the acq while loop
2023-06-05T19:43:55+0000 Could not publish to Feed. TransportLost. crossbar server likely unreachable.
2023-06-05T19:43:56+0000 Top of the acq while loop
2023-06-05T19:43:56+0000 Scheduling retry 2 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7fe4e2a3b550> in 2.9836467052303157 seconds.
2023-06-05T19:43:56+0000 waiting at least 18.56661033630371 more seconds before giving up
2023-06-05T19:43:57+0000 Top of the acq while loop
2023-06-05T19:43:58+0000 Top of the acq while loop
2023-06-05T19:43:58+0000 Could not publish to Feed. TransportLost. crossbar server likely unreachable.
2023-06-05T19:43:59+0000 Top of the acq while loop
2023-06-05T19:43:59+0000 Scheduling retry 3 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7fe4e2a3b550> in 4.081454896115918 seconds.
2023-06-05T19:43:59+0000 waiting at least 15.355729341506958 more seconds before giving up
2023-06-05T19:44:00+0000 Top of the acq while loop
2023-06-05T19:44:01+0000 Top of the acq while loop
2023-06-05T19:44:01+0000 Could not publish to Feed. TransportLost. crossbar server likely unreachable.
2023-06-05T19:44:02+0000 Top of the acq while loop
2023-06-05T19:44:03+0000 Top of the acq while loop
2023-06-05T19:44:03+0000 waiting at least 11.623907089233398 more seconds before giving up
2023-06-05T19:44:03+0000 Scheduling retry 4 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7fe4e2a3b550> in 5.853152524023113 seconds.
2023-06-05T19:44:04+0000 Top of the acq while loop
2023-06-05T19:44:04+0000 Could not publish to Feed. TransportLost. crossbar server likely unreachable.
2023-06-05T19:44:05+0000 Top of the acq while loop
2023-06-05T19:44:06+0000 Top of the acq while loop
2023-06-05T19:44:06+0000 Could not publish to Feed. TransportLost. crossbar server likely unreachable.
2023-06-05T19:44:07+0000 Top of the acq while loop
2023-06-05T19:44:08+0000 Top of the acq while loop
2023-06-05T19:44:09+0000 waiting at least 5.961486339569092 more seconds before giving up
2023-06-05T19:44:09+0000 Top of the acq while loop
2023-06-05T19:44:09+0000 Could not publish to Feed. TransportLost. crossbar server likely unreachable.
2023-06-05T19:44:09+0000 Scheduling retry 5 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7fe4e2a3b550> in 8.363641698118796 seconds.
2023-06-05T19:44:10+0000 Top of the acq while loop
2023-06-05T19:44:11+0000 Top of the acq while loop
2023-06-05T19:44:11+0000 Could not publish to Feed. TransportLost. crossbar server likely unreachable.
2023-06-05T19:44:12+0000 Top of the acq while loop
2023-06-05T19:44:13+0000 Top of the acq while loop
2023-06-05T19:44:14+0000 Top of the acq while loop
2023-06-05T19:44:14+0000 Could not publish to Feed. TransportLost. crossbar server likely unreachable.
2023-06-05T19:44:15+0000 Top of the acq while loop
2023-06-05T19:44:16+0000 Top of the acq while loop
2023-06-05T19:44:16+0000 Could not publish to Feed. TransportLost. crossbar server likely unreachable.
2023-06-05T19:44:17+0000 Top of the acq while loop
2023-06-05T19:44:17+0000 Stopping all running sessions
2023-06-05T19:44:17+0000 Stopping session acq
2023-06-05T19:44:17+0000 stop called for acq
2023-06-05T19:44:17+0000 Stopping session count
2023-06-05T19:44:17+0000 stop called for count
2023-06-05T19:44:17+0000 Unable to publish status. TransportLost. crossbar server likely unreachable.
2023-06-05T19:44:17+0000 count:1 Status is now "stopping".
2023-06-05T19:44:17+0000 Stopper for "count" terminated with ok=True and message (None,)
2023-06-05T19:44:17+0000 stopping reactor
2023-06-05T19:44:18+0000 Top of the acq while loop
2023-06-05T19:44:18+0000 Main loop terminated.
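
The "waiting at least N more seconds before giving up" lines in the log above (captured with a timeout of 21 seconds) are consistent with a simple remaining-time calculation. The following is a minimal sketch of that calculation, including the special case where 0 disables the timeout. The name `seconds_remaining` is hypothetical and this is not the code in this PR.

```python
import time


def seconds_remaining(disconnected_at, crossbar_timeout, now=None):
    """Return seconds left before giving up, or None if never giving up.

    Hypothetical helper: `disconnected_at` and `now` are timestamps in
    seconds, and a `crossbar_timeout` of 0 means the timeout is disabled.
    """
    if crossbar_timeout == 0:
        return None  # timeout disabled, keep running through the outage
    if now is None:
        now = time.time()
    elapsed = now - disconnected_at
    # Never report a negative remaining time.
    return max(0.0, crossbar_timeout - elapsed)


if __name__ == "__main__":
    start = time.time()
    print(seconds_remaining(start, 21))  # close to 21 right after disconnect
    print(seconds_remaining(start, 0))   # None: wait indefinitely
```

A caller would shut the Agent down once this returns 0.0, and keep waiting for reconnection whenever it returns None.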

Questions/Self Comments

I'm not sure of the best place to document this. It's in the argparse help for the site config arguments and in ocs-agent-cli; maybe that's sufficient? Feedback welcome here.

I've kept the default at 10 seconds, since that's what we've had. Do we want to keep this or change it?

Motivation and Context

Resolves #331.

Ever since #180 this timeout has been 10 seconds. But we need the ability to have Agents continue to run indefinitely while the crossbar server might be down. This allows agents like the HWP agents in socs to continue to command the hardware, even if the Agent can't communicate with the rest of the network.

How Has This Been Tested?

I've tested this locally with a collection of fake data agents.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

* Delay reconnection check attempts in same manner as twisted reconnection

* Define signal handlers on disconnect. This allows interrupting agents that
  are running with a crossbar-timeout of 0.
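
The idea behind the signal handling commit can be sketched with the standard library alone. This is an illustration under stated assumptions, not the handlers added in this PR: `ShutdownFlag` and `install_handlers` are hypothetical names, and the real Agent runs inside the twisted reactor, which this sketch does not model.

```python
import signal


class ShutdownFlag:
    """Records that an interrupt was requested (hypothetical helper)."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        # Signal handlers receive the signal number and current stack frame.
        self.requested = True


def install_handlers(flag, signals=(signal.SIGINT, signal.SIGTERM)):
    """Install flag.handle for each signal and return the previous handlers.

    Must be called from the main thread, as required by signal.signal.
    """
    previous = {}
    for sig in signals:
        previous[sig] = signal.signal(sig, flag.handle)
    return previous


if __name__ == "__main__":
    flag = ShutdownFlag()
    install_handlers(flag)
    print("Waiting for Ctrl-C...")
    while not flag.requested:
        signal.pause()  # POSIX only: sleep until a signal arrives
    print("Interrupt received, shutting down.")
```

With a timeout of 0 there is no deadline to end the wait loop, so an explicit handler like this is what allows the process to be stopped by hand.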
@BrianJKoopman BrianJKoopman added the enhancement New feature or request label Jun 5, 2023
@BrianJKoopman
Member Author

I can't reproduce this failing test locally. Will need to investigate a bit, but might not get to it immediately.

Member

@mhasself mhasself left a comment

This worked well for me, in basic testing (incl with Hostmanager)... couple of discussion points.

@BrianJKoopman
Member Author

I think I addressed everything. Also merged in the latest main, I want to see if that failing test persists, then need to work on that if it does before asking for re-review.

Now that we're properly waiting for Agent shutdown it takes a bit longer. These
times were sufficient when running tests locally.
@mhasself mhasself mentioned this pull request Sep 20, 2023
6 tasks
Member

@mhasself mhasself left a comment

Great! Minor inline comment.

@BrianJKoopman BrianJKoopman merged commit 6117ef8 into main Sep 26, 2023
5 checks passed
@BrianJKoopman BrianJKoopman deleted the koopman/disable-crossbar-timeout branch September 26, 2023 19:10
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Allow configured agents to continue running without crossbar connection
2 participants