Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Both ROS1 and ROS2] Bond Async Bringup Exception #66

Open
SteveMacenski opened this issue Jul 28, 2020 · 2 comments
Open

[Both ROS1 and ROS2] Bond Async Bringup Exception #66

SteveMacenski opened this issue Jul 28, 2020 · 2 comments

Comments

@SteveMacenski
Copy link
Member

Hi,

In my efforts to make navigation have some heartbeats I found a particular issue that plagues both ROS1 and ROS2 that would only show up when:

  • Have long running bringup processes ( greater than heartbeat timeout rate, 4 s by default)
  • Have remote servers that take some time to connect / discover (for ROS2) that can take more than the heartbeat timeout rate

The default state on bring up is WaitingForSister, which does not have a heartbeat timeout override (https://github.com/ros/bond_core/blob/kinetic-devel/bondcpp/src/BondSM_sm.cpp#L63-L147). Because of this, when the heartbeat timeout is triggered, it will call Default() and throw an exception (and probably crash). This is even though the Connection timeout hasn't yet been triggered because by default its set to 10 seconds, larger than the heartbeat timeout. That means in the case that it takes greater than the heartbeat timeout duration to connect, your liable to crash even though there's valid time left to connect via that timeout.

The 2 solutions without modifying bond are:

  • Increase your heartbeat timeout to always be larger than your connection timeout, which is a terrible idea. If you want to know that your servers are active but takes a while for everything to connect, you don't want to have to set your heart beat timeout any longer than necessary.
  • Guarantee that your server can always get up and running in less than heartbeat timeout period. Which given ROS2 and DDS discovery issues and use of remote servers, isn't necessary reliable to 100% of the time work when valid.

I think we need to fix bond in ROS1/ROS2 to deal with this bring up issue better. I think the waiting for sister state needs a valid heartbeat timeout override that won't throw an exception and won't do anything until after connection timeout or the first active status message is received.

@mjcarroll
Copy link
Member

It seems to me that there shouldn't be a heartbeat during the bring up because of the reason that you stated. It was introducing flakiness in the test_bond tests.

For the ROS 2 implementation, I moved the beginning of the heartbeat for once the connection is established, which I believe was the correct intended behavior in #87.

Do you think this is a valid approach? That changeset is still unreleased, so we can modify again before releasing into Humble/Rolling.

@SteveMacenski
Copy link
Member Author

SteveMacenski commented Jun 14, 2022

That sounds reasonable to me, though I'm pretty sure that Bond is already in Humble/Rolling since Nav2 has binaries out and depends on this. So there's already been a release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants