[Bug]: server fastapi stalls in production environment #897

Closed
1 task done
aaronchongth opened this issue Feb 13, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@aaronchongth
Member

Before proceeding, is there an existing issue or discussion for this?

OS and version

Ubuntu 22.04

Open-RMF installation type

Source build

Other Open-RMF installation methods

No response

Open-RMF version or commit hash

main, deploy/hammer

ROS distribution

Humble

ROS installation type

Docker

Other ROS installation methods

No response

Package or library, if applicable

No response

Description of the bug

This happens rarely, but becomes more likely as network traffic increases (more ongoing tasks, hence more task updates over websocket).

Observations

  • The dashboard becomes unusable and returns a 404 when refreshed, as all the REST calls are pending (as seen in the network tab of the browser's Inspect tool).
  • Fleet adapter logs: the broadcast client is unable to connect to the server URI and starts disconnecting and reconnecting continuously.
  • Server logs: spurious connections from the internal websocket (from the fleet adapter broadcast client) appear without proper disconnections, so the count of internal websockets keeps going up.
  • Server logs: token expiries start to appear.
  • Server performance: fastapi seems to be the component that stalls, consistent with the pending REST calls from the dashboard.
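To tell a stalled event loop apart from a slow network, one option is to measure how late a periodic timer fires inside the server process. The sketch below is not taken from the api-server; it uses only asyncio, and `blocking_handler` is a hypothetical stand-in for any handler that does synchronous work on the loop. Whether that is what happens in this bug is an assumption, not something confirmed here.

```python
import asyncio
import time


async def monitor_loop_lag(interval: float, samples: int) -> list:
    """Sleep `interval` seconds `samples` times and record how late each wake-up was."""
    lags = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lags.append(time.perf_counter() - start - interval)
    return lags


async def blocking_handler(seconds: float) -> None:
    """Hypothetical handler that does synchronous work on the event loop."""
    time.sleep(seconds)  # blocks every other coroutine, including the monitor


async def demo() -> float:
    """Run the monitor next to a blocking handler and return the worst lag seen."""
    monitor = asyncio.create_task(monitor_loop_lag(interval=0.05, samples=4))
    await asyncio.sleep(0.01)    # let the monitor start its first sleep
    await blocking_handler(0.3)  # simulate the stall
    lags = await monitor
    return max(lags)


if __name__ == "__main__":
    print(f"worst event-loop lag: {asyncio.run(demo()):.3f}s")
```

A lag close to the blocking duration means the loop itself was held up, which would match REST calls sitting in the pending state.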

Current solution

  • restarting the api-server resets everything and connections become healthy again

Steps to reproduce the bug

I personally have not been able to reproduce it, but these are the steps according to @koonpeng:

Begin quote from @koonpeng

  1. Change the host in packages/api-server/api_server/default_config.py to 192.168.25.1
  2. On term1, run: sudo ip addr add 192.168.25.1/24 dev lo
  3. On term2, cd to packages/dashboard and start the api-server: pnpm run start:rmf-server
  4. On term3, start rmf demos with limited cpu: ros2 launch rmf_demos_gz office.launch.xml headless:=true server_uri:=ws://192.168.25.1:8000/_internal
  5. On term1, send a patrol task: ros2 launch rmf_demos office_patrol.launch.xml, then wait a few secs
  6. Remove the ip, simulating network down: sudo ip addr del 192.168.25.1/24 dev lo, then wait a few secs
  7. Add the ip back, simulating network recovered: sudo ip addr add 192.168.25.1/24 dev lo

After the last step, the output started getting spammed with a lot of

[fleet_adapter-15] [ERROR] [1707276604.567910352] [tinyRobot_fleet_adapter]: BroadcastClient unable to publish message: invalid state

which was finally followed by

[fleet_adapter-15] [WARN] [1707276605.301806431] [tinyRobot_fleet_adapter]: BroadcastClient unable to connect to [ws://192.168.25.1:8000/_internal]. Please make sure server is running. Error msg: invalid state
[fleet_adapter-15] [INFO] [1707276605.304396331] [tinyRobot_fleet_adapter]: BroadcastClient successfully connected to uri: [ws://192.168.25.1:8000/_internal]

Sometimes it stops after one reconnect, but sometimes it gets stuck in a loop like we see in prod.

I think we can say that the broadcast client cannot recover from a disconnect, but the question still remains what caused the initial disconnect and the token expiry.

End quote
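Steps 6 and 7 of the quoted procedure (remove the address, wait, add it back) could be scripted so the network flap is repeatable. This is only a sketch: the address and device come from the steps above, while the function names, the `dry_run` flag, and the 5 second default for "a few secs" are assumptions made for illustration. By default it only prints the commands, because actually running them needs sudo.

```python
import shlex
import subprocess
import time

ADDR = "192.168.25.1/24"  # address used in the reproduction steps
DEV = "lo"                # loopback device used in the reproduction steps


def flap_commands(addr: str = ADDR, dev: str = DEV) -> list:
    """Return the commands from steps 6 and 7: remove the address, then add it back."""
    return [
        ["sudo", "ip", "addr", "del", addr, "dev", dev],
        ["sudo", "ip", "addr", "add", addr, "dev", dev],
    ]


def flap(down_seconds: float = 5.0, dry_run: bool = True) -> list:
    """Run the flap, or only print it when dry_run is set; return the command strings."""
    issued = []
    for i, cmd in enumerate(flap_commands()):
        issued.append(shlex.join(cmd))
        if dry_run:
            print(issued[-1])
        else:
            subprocess.run(cmd, check=True)
            if i == 0:
                time.sleep(down_seconds)  # assumed length of the simulated outage
    return issued


if __name__ == "__main__":
    flap(dry_run=True)
```

Calling `flap(dry_run=False)` in a loop would exercise the disconnect path repeatedly, which may help when the reconnect loop only shows up some of the time.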

Expected behavior

  • server continues to serve REST requests without stalling
  • internal websocket connections remain at the expected number (1 or 2, depending on whether server_uri was provided to the task_dispatcher)
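The expected connection count above can be turned into a simple check. The class below is a hypothetical sketch, not the api-server's code: it tracks open internal connections by id and flags when the count exceeds the expected maximum of 2. The connection ids and the `expected_max` parameter are illustrative assumptions.

```python
class InternalConnectionRegistry:
    """Tracks open connections by id so a rising count can be detected."""

    def __init__(self, expected_max: int = 2) -> None:
        self.expected_max = expected_max
        self._open = set()

    def connected(self, conn_id: str) -> None:
        """Record a newly accepted connection."""
        self._open.add(conn_id)

    def disconnected(self, conn_id: str) -> None:
        """Record a closed connection; a duplicate disconnect is harmless."""
        self._open.discard(conn_id)

    @property
    def count(self) -> int:
        return len(self._open)

    def over_limit(self) -> bool:
        """True when more connections are open than expected."""
        return self.count > self.expected_max


if __name__ == "__main__":
    registry = InternalConnectionRegistry(expected_max=2)
    for conn_id in ("adapter-1", "dispatcher-1", "adapter-2"):
        registry.connected(conn_id)  # adapter-1 never reported a disconnect
    print(registry.count, registry.over_limit())
```

Logging a warning whenever `over_limit()` becomes true would surface the leak earlier than waiting for the dashboard to become unusable.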

Actual behavior

  • no known way to reproduce at the moment (currently investigating bad network as a cause)
  • spurious connections from BroadcastClient on the internal websocket route without proper disconnections, causing the websocket count to increase
  • fastapi stalls, all REST calls from the dashboard show as pending (in the network tab of the browser's Inspect tool)
  • dashboard becomes unusable
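A client that reconnects immediately and repeatedly can pile connections onto a server that is already struggling. One common mitigation is exponential backoff between attempts. The sketch below shows that general technique in Python; it is not the BroadcastClient's actual logic, and `connect`, the delay values, and the jitter are all assumptions chosen for illustration.

```python
import random
import time
from typing import Callable


def backoff_delays(base: float = 0.5, factor: float = 2.0, cap: float = 30.0):
    """Yield exponentially growing delays, capped at `cap` seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def connect_with_backoff(
    connect: Callable[[], bool],
    max_attempts: int = 8,
    sleep: Callable[[float], None] = time.sleep,
    jitter: float = 0.1,
) -> int:
    """Call `connect` until it returns True; return the number of attempts used.

    Raises ConnectionError if every attempt fails.
    """
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        if connect():
            return attempt
        # Wait longer after each failure; jitter spreads out simultaneous clients.
        sleep(next(delays) * (1.0 + random.uniform(0.0, jitter)))
    raise ConnectionError(f"gave up after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the function testable without real waiting, for example by passing `list.append` of a recording list.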

Additional information or screenshots

No response

@aaronchongth
Member Author

The preliminary observation is that this is caused by appending events/logs to the task phases when a task alert is acknowledged.

Removing this feature seems to make the server much more stable.

Will keep observing the performance before closing this
