Description of the bug
This happens rarely, but becomes more likely when network traffic increases (more tasks ongoing, hence more task updates over the websocket)
Observations
dashboard becomes unusable and returns a 404 when refreshed; all REST calls show as pending in the network tab of the browser's Inspect view
fleet adapter logs: the broadcast client is unable to connect to the server URI, and starts disconnecting and reconnecting continuously
server logs: show spurious connections on the internal websocket route (from the fleet adapter's broadcast client) without proper disconnections, so the count of internal websockets keeps going up
server logs: token expiries start appearing
server performance: FastAPI seems to be the component stalling, consistent with the pending REST calls from the dashboard
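The growing internal-websocket count suggests the server keeps registering new connections without cleaning up dead ones. As a hedged illustration only (hypothetical code, not the actual api-server implementation), a minimal connection tracker shows why removal has to happen in a `finally` block: if cleanup only runs on a graceful disconnect, every abrupt disconnect leaks an entry and the count only ever grows.

```python
# Hypothetical sketch (not the real api-server code) of tracking internal
# websocket connections. The key point: removal must be in a `finally`
# block, or an abrupt disconnect leaks the entry forever.

class ConnectionTracker:
    def __init__(self):
        self._active = set()

    def handle(self, conn_id, session):
        """Run one websocket session, guaranteeing cleanup afterwards."""
        self._active.add(conn_id)
        try:
            session()  # receive loop; may raise on abrupt disconnect
        finally:
            self._active.discard(conn_id)  # always runs, even on error

    @property
    def count(self):
        return len(self._active)


if __name__ == "__main__":
    tracker = ConnectionTracker()

    def crashing_session():
        raise ConnectionError("client vanished mid-stream")

    try:
        tracker.handle("fleet_adapter_1", crashing_session)
    except ConnectionError:
        pass
    print(tracker.count)  # 0: the entry was cleaned up despite the crash
```

If the real endpoint's cleanup path can be skipped (for example, an exception raised before the disconnect handler runs), that would be consistent with the ever-increasing connection count observed above.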
Current solution
restarting the api-server resets everything and connections become healthy again
Steps to reproduce the bug
I personally have not been able to reproduce it, but below are the steps from @koonpeng.
Begin quote from @koonpeng
Change host in packages/api-server/api_server/default_config.py to 192.168.25.1
On term1, run: sudo ip addr add 192.168.25.1/24 dev lo
On term2, cd to packages/dashboard and start the api-server: pnpm run start:rmf-server
On term3, start rmf demos with limited cpu: ros2 launch rmf_demos_gz office.launch.xml headless:=true server_uri:=ws://192.168.25.1:8000/_internal
On term1, send a patrol task: ros2 launch rmf_demos office_patrol.launch.xml, and wait a few secs
Then remove the ip to simulate the network going down: sudo ip addr del 192.168.25.1/24 dev lo, and wait a few secs
Add the ip back to simulate the network recovering: sudo ip addr add 192.168.25.1/24 dev lo
After the last step, the logs got spammed with a lot of:

[fleet_adapter-15] [ERROR] [1707276604.567910352] [tinyRobot_fleet_adapter]: BroadcastClient unable to publish message: invalid state

which was finally followed by:

[fleet_adapter-15] [WARN] [1707276605.301806431] [tinyRobot_fleet_adapter]: BroadcastClient unable to connect to [ws://192.168.25.1:8000/_internal]. Please make sure server is running. Error msg: invalid state
[fleet_adapter-15] [INFO] [1707276605.304396331] [tinyRobot_fleet_adapter]: BroadcastClient successfully connected to uri: [ws://192.168.25.1:8000/_internal]

Sometimes it stops after one reconnect, but sometimes it gets stuck in a loop like we see in prod.
I think we can say that the broadcast client cannot recover from a disconnect, but the question remains: what caused the initial disconnect and the token expiry?
End quote
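While working through the quoted repro steps, it can help to watch whether the api-server port is actually reachable at each stage of the interface flap. A small hypothetical helper (assuming the host 192.168.25.1 and port 8000 from the steps above; not part of any Open-RMF package) is enough:

```python
# Hypothetical probe for the repro above: report whether a TCP connection
# to the api-server endpoint can be opened right now.
import socket

def probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    import time
    # Poll the endpoint used in the repro steps; Ctrl-C to stop.
    while True:
        state = "reachable" if probe("192.168.25.1", 8000) else "unreachable"
        print(state)
        time.sleep(1.0)
```

Running this in a spare terminal makes it easy to correlate the BroadcastClient reconnect spam with the exact moments the endpoint goes away and comes back.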
Expected behavior
server continues to serve REST requests without stalling
internal websocket connections remain at the expected number (1 or 2, depending on whether server_uri was provided to the task_dispatcher)
Actual behavior
no known way to reproduce at the moment (currently investigating bad network as a cause)
spurious connections from BroadcastClient on the internal websocket route without proper disconnections, causing the websocket count to increase
fastapi stalls; all REST calls from the dashboard show as pending (from the network tab in browser Inspect)
dashboard becomes unusable
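The reconnect loop described above suggests the client may keep retrying through a connection object that is already in an invalid state. One mitigation sketch (hypothetical Python for illustration; the actual BroadcastClient lives in the C++ fleet adapter) is to discard the failed connection entirely and rebuild a fresh one on each attempt, with capped exponential backoff so a flapping network does not turn into a tight retry loop:

```python
# Hypothetical reconnect loop illustrating one way a broadcast-style client
# could recover from an "invalid state" connection: never reuse the failed
# connection object; build a fresh one each attempt, with capped backoff.
import time

def reconnect(connect, max_attempts=5, base_delay=0.01, max_delay=1.0):
    """Call connect() until it returns a connection, backing off between
    failures. Returns the new connection, or re-raises after max_attempts."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()  # fresh connection object every attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise
            time.sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped

if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky_connect():
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionError("invalid state")
        return "connected"

    print(reconnect(flaky_connect))  # succeeds after two failed attempts
```

This does not explain the initial disconnect or the token expiry, but it would at least keep the client from getting stuck in the observed connect/disconnect loop.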
Additional information or screenshots
No response
Before proceeding, is there an existing issue or discussion for this?
OS and version
Ubuntu 22.04
Open-RMF installation type
Source build
Other Open-RMF installation methods
No response
Open-RMF version or commit hash
main, deploy/hammer
ROS distribution
Humble
ROS installation type
Docker
Other ROS installation methods
No response
Package or library, if applicable
No response