Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DroneCAN: fixed deadlock and saturation of CAN bus #28218

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tridge
Copy link
Contributor

@tridge tridge commented Sep 24, 2024

This fixes a set of issues found on investigating a watchdog from @lthall

PR depends on: dronecan/libcanard#72

The situation was:

  • a H7 with mavlink over serial over DroneCAN setup, MissionPlanner connected to the serial port
  • SERIAL1_PROTOCOL set to GPS, with GPS1_TYPE=AUTO, but no GPS plugged in

There were several interacting bugs:

  • the canard layer could run out of buffer in its allocator part way through a message send, resulting in sending just some of the frames onto the bus, which results in a corrupt message, wasted bandwidth and broadcast() returning false
  • MissionPlanner sends CAN_FORWARD requests on all connections, regardless of type, which on a CAN serial port means it is requesting to send it's own packets over CAN, resulting in an "infinite" number of packets and complete bus saturation
  • MissionPlanner issue here: sending CAN_FORWARD requests when not in CAN UI (CRITICAL BUG) MissionPlanner#3417
  • once the bus was saturated, serial.update() in the DroneCAN thread would not make any progress, which meant it would keep trying to send the same bytes forever, but no full messages could ever be sent, so from the users point of view the CAN bus is dead (the UI shows no messages, as all messages are corrupt)
  • the lack of a sleep in the DroneCAN thread meant that all threads of lower priority would stop running, including UART threads, the main thread keeps running
  • the main thread then tries a begin() in AP_GPS for a new baudrate, the UART thread is still stuck in TX from the last set of GPS probe bytes and can't make progress due to the higher priority DroneCAN thread running non-stop
  • the main thread then gets stuck waiting on the _in_tx_timer boolean lock, causing the main thread to lockup
  • this triggers a watchdog

Copy link
Contributor

@tpwrules tpwrules left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By inspection the new locks seem okay and I don't think they are held the same time as the existing ones. I assume you are okay with the performance of grabbing them at 1kHz (I don't imagine they have high overhead?) We do need to keep the two locks to avoid making UART half duplex.

The timeout is strictly better than what we had before so that's good. Note that now begin/end can block the tx/rx thread but again that is probably better than before.

I wonder if there's a better way for UART drivers so we can avoid having to have locks to resize buffers and protect multiple readers and writers. It's kind of a strange and complex feature to be able to support.

this replaces the two booleans used to mediate TX and RX buffer
protection with mutexes.

The booleans were a hangover from the very early HAL_ChibiOS code, and
can lead to a deadlock. The sequence is as follows:

 - a very high CAN bus bandwidth usage, triggered by MissionPlanner
   requesting CAN_FORWARD on a CAN serial port. That causes a
   "infinite" number of CAN_FRAME messages which saturates the bus,
   and leads to the DroneCAN thread looping with no pause

 - a serial port configured as GPS type AUTO, auto-probing for a GPS
   that isn't there. This calls begin() periodically

 - the UART TX thread assocated with that UART not making progress as
   the TX thread priority is below the DroneCAN thread priority

 - this causes the begin() in main thread waiting for _in_tx_timer to
   loop forever, which triggers a watchdog
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

3 participants