HostManager network connection error #348

Open
BrianJKoopman opened this issue Aug 30, 2023 · 3 comments
Labels: agent: hostmanager, bug (Something isn't working)

Comments

@BrianJKoopman
Member

This error occurred in a HostManager running the lakeshore240 agents on the LAT at the site during a network outage, causing the manager process to crash. The HostManager agent itself stayed online, and was recovered by restarting the manager process, at which point all managed agents were in the "down" state and had "down" as their target as well.

2023-08-30T20-06-19.033016 startup-op: launching manager
2023-08-30T20-06-19.033420 start called for manager
2023-08-30T20-06-19.034188 manager:0 Status is now "starting".
2023-08-30T20-06-19.035269 manager:0 Status is now "running".
2023-08-30T20-07-19.987573 manager:0 CRASH: [Failure instance: Traceback: <class 'RuntimeError'>: Could not run docker-compose or could not parse docker-compose file; exit code 1, error text: b'An HTTP request took too long to complete. Retry with --verbose to obtain debug information.\nIf you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).\n'
/usr/lib/python3/dist-packages/twisted/internet/defer.py:501:errback
/usr/lib/python3/dist-packages/twisted/internet/defer.py:568:_startRunCallbacks
/usr/lib/python3/dist-packages/twisted/internet/defer.py:654:_runCallbacks
/usr/lib/python3/dist-packages/twisted/internet/defer.py:1475:gotResult
--- <exception caught here> ---
/usr/lib/python3/dist-packages/twisted/internet/defer.py:1416:_inlineCallbacks
/usr/lib/python3/dist-packages/twisted/python/failure.py:491:throwExceptionIntoGenerator
/home/ocs/.local/lib/python3.8/site-packages/ocs/agents/host_manager/agent.py:453:manager
/usr/lib/python3/dist-packages/twisted/internet/defer.py:1416:_inlineCallbacks
/usr/lib/python3/dist-packages/twisted/python/failure.py:491:throwExceptionIntoGenerator
/home/ocs/.local/lib/python3.8/site-packages/ocs/agents/host_manager/agent.py:134:_reload_config
/usr/lib/python3/dist-packages/twisted/internet/defer.py:1416:_inlineCallbacks
/usr/lib/python3/dist-packages/twisted/python/failure.py:491:throwExceptionIntoGenerator
/home/ocs/.local/lib/python3.8/site-packages/ocs/agents/host_manager/agent.py:100:_update_docker_services
/usr/lib/python3/dist-packages/twisted/internet/defer.py:1418:_inlineCallbacks
/home/ocs/.local/lib/python3.8/site-packages/ocs/agents/host_manager/drivers.py:369:parse_docker_state
]
2023-08-30T20-07-19.988809 manager:0 Status is now "done".
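
Side note: the timeout the error text refers to is docker-compose's COMPOSE_HTTP_TIMEOUT environment variable. A minimal sketch of raising it in the environment the compose command runs with, purely for illustration (the invocation below is not the agent's actual call, just the shape of it):

    import os
    import subprocess

    # Copy the current environment and raise docker-compose's HTTP timeout
    # (in seconds) so a slow network doesn't trip the 60 s default.
    env = dict(os.environ, COMPOSE_HTTP_TIMEOUT="300")

    # Illustrative docker-compose query, similar in spirit to what the
    # HostManager does when it inspects container state.
    result = subprocess.run(
        ["docker-compose", "ps", "--quiet"],
        capture_output=True, env=env, check=False,
    )
    print(result.returncode, result.stdout)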
@BrianJKoopman added the bug label on Aug 30, 2023
@mhasself
Member

I guess the desired behavior is to not crash ... but to go into some uncertain state and post (to session.data e.g.) the fact that it can't run docker-compose successfully?
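
Something along these lines, maybe? The traceback shows the RuntimeError coming out of parse_docker_state (in ocs.agents.host_manager.drivers), so the manager could catch it and report via session.data instead of letting it propagate. This is only a sketch; the function name, the injected parse_docker_state callable, and the session.data keys are illustrative, not the agent's actual code:

    from twisted.internet import defer

    @defer.inlineCallbacks
    def update_docker_state(session, parse_docker_state, compose_file):
        """Sketch: report docker-compose failures instead of crashing.

        parse_docker_state is the call that raised in the traceback above;
        its real signature isn't shown here, so it's passed in as an
        argument purely for illustration.
        """
        try:
            state = yield parse_docker_state(compose_file)
        except RuntimeError as err:
            # Could not run / parse docker-compose (e.g. during a network
            # outage). Record the failure and keep the manager alive.
            session.data['docker_compose_ok'] = False
            session.data['docker_compose_error'] = str(err)
            return None
        session.data['docker_compose_ok'] = True
        return state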

@BrianJKoopman
Member Author

It looks like, prior to the above error, the "boot looping" behavior I was seeing was a ping timeout to crossbar. These messages get repeated over and over until the agent is stopped and restarted:

2023-08-31T14-38-52.569042 transport connected
2023-08-31T14-38-52.573440 session joined:
SessionDetails(realm="test_realm",
               session=1695252694886510,
               authid="RSAV-4VT7-FTWL-3XTY-4SCX-7R7H",
               authrole="iocs_agent",
               authmethod="anonymous",
               authprovider="static",
               authextra={'x_cb_node': 'a97a4583e920-7', 'x_cb_worker': 'worker001', 'x_cb_peer': 'tcp4:192.168.20.12:46984', 'x_cb_pid': 12},
               serializer="cbor.batched",
               transport="websocket",
               resumed=None,
               resumable=None,
               resume_token=None)
2023-08-31T14-38-53.025218 realm rejoined
2023-08-31T14-39-07.041986 session left: CloseDetails(reason=<wamp.close.transport_lost>, message='WAMP transport was lost without closing the session 1695252694886510 before')
2023-08-31T14-39-07.043018 transport disconnected
2023-08-31T14-39-07.043468 waiting for reconnect for 9.99999451637268 more seconds
2023-08-31T14-39-07.044755 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7fdb68aeb520> in 1.5993566024648573 seconds.
2023-08-31T14-39-08.069148 waiting for reconnect for 8.974653959274292 more seconds
2023-08-31T14-39-08.656320 transport connected
2023-08-31T14-39-09.071558 waiting for reconnect for 7.972212314605713 more seconds
2023-08-31T14-39-10.073429 waiting for reconnect for 6.970094919204712 more seconds
2023-08-31T14-39-10.332740 session joined:
SessionDetails(realm="test_realm",
               session=6971677156440914,
               authid="UNGY-6W3C-T7NU-7AHC-GN9X-KP5U",
               authrole="iocs_agent",
               authmethod="anonymous",
               authprovider="static",
               authextra={'x_cb_node': 'a97a4583e920-7', 'x_cb_worker': 'worker001', 'x_cb_peer': 'tcp4:192.168.20.12:50234', 'x_cb_pid': 12},
               serializer="cbor.batched",
               transport="websocket",
               resumed=None,
               resumable=None,
               resume_token=None)
2023-08-31T14-39-11.074450 realm rejoined
2023-08-31T14-39-33.001933 dropping connection to peer tcp4:192.168.0.230:8006 with abort=True: WebSocket ping timeout (peer did not respond with pong in time)
2023-08-31T14-39-33.003882 session left: CloseDetails(reason=<wamp.close.transport_lost>, message='WAMP transport was lost without closing the session 6971677156440914 before')
2023-08-31T14-39-33.004815 transport disconnected
2023-08-31T14-39-33.005309 waiting for reconnect for 9.999994277954102 more seconds
2023-08-31T14-39-33.006199 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7fdb68aeb520> in 1.5993534504293518 seconds.
2023-08-31T14-39-34.006612 waiting for reconnect for 8.9988272190094 more seconds

@BrianJKoopman
Member Author

> I guess the desired behavior is to not crash ... but to go into some uncertain state and post (to session.data e.g.) the fact that it can't run docker-compose successfully?

If it could automatically recover once the network is restored, I think that'd be preferred, since at the moment we're relying on the HostManager to restart the other agents.
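
Roughly, "recover automatically" could look like the manager loop continuing to poll docker-compose at its usual cadence and clearing the degraded flag once a query succeeds again. All names below are hypothetical; this is just the shape of the idea, not a proposed patch:

    from twisted.internet import defer, reactor, task

    @defer.inlineCallbacks
    def manager_loop(session, query_docker, interval=30.0):
        # Hypothetical recovery loop: keep polling docker-compose and clear
        # the degraded flag once a query succeeds again after an outage.
        while session.status in ('starting', 'running'):
            try:
                state = yield query_docker()
            except RuntimeError as err:
                # Network or docker unavailable: note it and retry on the
                # next pass rather than letting the exception crash the
                # manager process.
                session.data['docker_compose_ok'] = False
                session.data['docker_compose_error'] = str(err)
            else:
                session.data['docker_compose_ok'] = True
                session.data['docker_state'] = state
            yield task.deferLater(reactor, interval, lambda: None)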
