[Bug]: server fastapi stalls in production environment #897

aaronchongth · 2024-02-13T02:57:39Z

Before proceeding, is there an existing issue or discussion for this?

I have done a search for similar issues and discussions.

OS and version

Ubuntu 22.04

Open-RMF installation type

Source build

Other Open-RMF installation methods

No response

Open-RMF version or commit hash

main, deploy/hammer

ROS distribution

Humble

ROS installation type

Docker

Other ROS installation methods

No response

Package or library, if applicable

No response

Description of the bug

This happens rarely, but becomes more likely as network traffic increases (more ongoing tasks, hence more task updates over websocket).

Observations

dashboard: becomes unusable and gets a 404 when refreshed, as all the REST calls are pending (as monitored on the network tab of browser Inspect)
fleet adapter logs: broadcast client is unable to connect to the server URI and starts disconnecting and reconnecting continuously
server logs: show spurious connections from the internal websocket (from the fleet adapter broadcast client) without proper disconnections, hence the count of internal websockets keeps going up
server logs: token expiries start appearing
server performance: fastapi seems to be the one stalling, consistent with the pending REST calls from the dashboard

Current solution

restarting the api-server resets everything and connections become healthy again

Steps to reproduce the bug

I have not been able to reproduce it myself, but @koonpeng provided the following steps.

Begin quote from @koonpeng

Change the host in packages/api-server/api_server/default_config.py to 192.168.25.1
On term1, run: sudo ip addr add 192.168.25.1/24 dev lo
On term2, cd to packages/dashboard and start the api-server: pnpm run start:rmf-server
On term3, start rmf demos with limited cpu: ros2 launch rmf_demos_gz office.launch.xml headless:=true server_uri:=ws://192.168.25.1:8000/_internal
On term1, send a patrol task: ros2 launch rmf_demos office_patrol.launch.xml, then wait a few secs
Then remove the ip, simulating network down: sudo ip addr del 192.168.25.1/24 dev lo, then wait a few secs
Add the ip back, simulating network recovered: sudo ip addr add 192.168.25.1/24 dev lo

After the last step, the logs started getting spammed with a lot of

[fleet_adapter-15] [ERROR] [1707276604.567910352] [tinyRobot_fleet_adapter]: BroadcastClient unable to publish message: invalid state

which was finally followed by

[fleet_adapter-15] [WARN] [1707276605.301806431] [tinyRobot_fleet_adapter]: BroadcastClient unable to connect to [ws://192.168.25.1:8000/_internal]. Please make sure server is running. Error msg: invalid state
[fleet_adapter-15] [INFO] [1707276605.304396331] [tinyRobot_fleet_adapter]: BroadcastClient successfully connected to uri: [ws://192.168.25.1:8000/_internal]

Sometimes it stops after one reconnect, but sometimes it gets stuck in a loop like we see in prod.

I think we can say that the broadcast client cannot recover from a disconnect, but the question still remains what caused the initial disconnect and the token expiry.

End quote
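The network down/up part of the quoted steps could be scripted so that the outage can be repeated several times in a row. The following is only a sketch: the address and device come from the steps above, while the helper names (`ip_addr_command`, `flap_commands`, `run_flap`), the cycle count and the 5 second pause are assumptions made for illustration ("wait a few secs" is not more specific than that). It defaults to a dry run that only prints the commands.

```python
import subprocess
import time

ADDR = "192.168.25.1/24"  # address used in the quoted steps
DEV = "lo"                # loopback device used in the quoted steps


def ip_addr_command(action: str, addr: str = ADDR, dev: str = DEV) -> list[str]:
    """Build the `ip addr add|del` command from the quoted steps."""
    if action not in ("add", "del"):
        raise ValueError(f"unsupported action: {action}")
    return ["sudo", "ip", "addr", action, addr, "dev", dev]


def flap_commands(cycles: int = 1) -> list[list[str]]:
    """Return the del/add command pairs for `cycles` simulated outages."""
    commands = []
    for _ in range(cycles):
        commands.append(ip_addr_command("del"))  # simulate network down
        commands.append(ip_addr_command("add"))  # simulate network recovered
    return commands


def run_flap(cycles: int = 1, pause_s: float = 5.0, dry_run: bool = True) -> None:
    """Print each command (dry run) or execute it and pause afterwards."""
    for cmd in flap_commands(cycles):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
            time.sleep(pause_s)  # assumed stand-in for "wait a few secs"


if __name__ == "__main__":
    # Hypothetical usage: preview three outage cycles without touching the network.
    run_flap(cycles=3, dry_run=True)
```

Running with `dry_run=False` requires sudo rights and assumes the address was already added as in the second step.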

Expected behavior

server continues to serve REST requests without stalling
internal websocket connections remain at the expected number (1 or 2, depending on whether server_uri was provided to the task_dispatcher)
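The second expectation could be monitored by counting open internal websocket connections against a configured maximum, similar in spirit to the title of the linked PR #899. This is a minimal sketch in plain Python and not the api-server's actual implementation: the class name `InternalConnectionGuard`, its methods and the `on_exceeded` callback are all hypothetical.

```python
from typing import Callable, Optional, Set


class InternalConnectionGuard:
    """Tracks open internal websocket connections against a configured maximum."""

    def __init__(
        self,
        max_connections: int,
        on_exceeded: Optional[Callable[[int], None]] = None,
    ) -> None:
        if max_connections < 1:
            raise ValueError("max_connections must be at least 1")
        self.max_connections = max_connections
        self.on_exceeded = on_exceeded  # hypothetical hook, e.g. log or shut down
        self._open: Set[str] = set()

    @property
    def count(self) -> int:
        """Number of connections currently recorded as open."""
        return len(self._open)

    def connected(self, client_id: str) -> bool:
        """Record a new connection; return False if the limit is now exceeded."""
        self._open.add(client_id)
        if self.count > self.max_connections:
            if self.on_exceeded is not None:
                self.on_exceeded(self.count)
            return False
        return True

    def disconnected(self, client_id: str) -> None:
        """Record a proper disconnection; unknown ids are ignored."""
        self._open.discard(client_id)
```

A websocket handler would call `connected` when a client is accepted and `disconnected` in its cleanup path. With a maximum of 2, a third connection that arrives without any prior disconnection triggers the callback, which is the symptom described in this issue.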

Actual behavior

no known way to reproduce at the moment (currently investigating bad network as a cause)
spurious connections from BroadcastClient on the internal websocket route without proper disconnections, causing the websocket count to increase
fastapi stalls, all REST calls from the dashboard display pending (from network tab in browser Inspect)
dashboard becomes unusable

Additional information or screenshots

No response


aaronchongth · 2024-03-04T13:39:29Z

Preliminary observation is that this is due to appending events/logs to the task phases when a task alert is acknowledged.

Removal of this feature seems to make the server much more stable.

Will keep observing the performance before closing this

aaronchongth added the bug label Feb 13, 2024

aaronchongth mentioned this issue Feb 13, 2024

Shuts down server if internal websocket connections exceed max allowed defined in config #899

Merged

5 tasks

aaronchongth closed this as completed Apr 4, 2024
