
Time to re-establish a connection varies between distributed machines after the network is back up #745

Open
robert-preissl opened this issue Feb 27, 2024 · 6 comments

robert-preissl commented Feb 27, 2024

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04.3
  • Installation type:
    • binaries (running the latest ros2/iron docker image)
  • Version or commit hash:
    • ros2 Iron
  • DDS implementation:
    • ros-iron-fastrtps = 2.10.3-1jammy
    • ros-iron-rmw-fastrtps-cpp = 7.1.3-1jammy
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

The setup is as follows, using two distributed computers (both running a ros2 iron docker container):

  • machine A runs a discovery server in one terminal:
    fastdds discovery -i 0 -l 127.0.0.1 -p 11811

  • in a second terminal, machine A runs the ros2 listener node (environment variables: the discovery server address/port, and RMW_IMPLEMENTATION=rmw_fastrtps_cpp):
    ros2 run demo_nodes_cpp listener --ros-args --remap __node:=simple_listener

  • in a terminal, machine B runs the ros2 talker node (same environment variables as above):
    ros2 run demo_nodes_cpp talker --ros-args --remap __node:=simple_talker

  • machine A and machine B are on the same wifi network.

  • the terminal on machine A where the listener runs prints Hello World messages with increasing IDs (1, 2, 3, etc.).

  • Now, we disconnect machine B from the wifi.

  • machine A stops printing new Hello World messages (as expected).

  • Now, we reconnect machine B to the wifi network (connectivity verified by pinging google.com).

  • machine A resumes printing Hello World messages only after a delay that varies from run to run.
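For anyone trying to reproduce this, the environment described in the steps can be captured in a small helper. This is a sketch under assumptions: the report does not name the variable that carries the discovery server address, so we assume the standard ROS_DISCOVERY_SERVER variable read by rmw_fastrtps; MACHINE_A_IP is a hypothetical placeholder, and the note about the list position matching the server ID reflects our understanding of the Fast DDS convention rather than anything stated in this issue.

```python
import os
from typing import Dict, List, Optional

# Hypothetical address of machine A on the wifi network; replace with the real one.
MACHINE_A_IP = "192.168.1.10"
DISCOVERY_PORT = 11811  # port passed to `fastdds discovery -p` in the report


def discovery_server_env(server_ip: str, port: int = DISCOVERY_PORT,
                         base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of `base` (default: os.environ) with the variables the
    talker/listener need to reach a single discovery server."""
    env = dict(os.environ if base is None else base)
    # Assumption: ROS_DISCOVERY_SERVER holds a ';'-separated list and, as we
    # understand it, the entry at index 0 pairs with the server started with `-i 0`.
    env["ROS_DISCOVERY_SERVER"] = f"{server_ip}:{port}"
    env["RMW_IMPLEMENTATION"] = "rmw_fastrtps_cpp"
    return env


def node_command(role: str) -> List[str]:
    """Build the `ros2 run` command from the report for 'talker' or 'listener'."""
    if role not in ("talker", "listener"):
        raise ValueError(f"unknown role: {role}")
    return ["ros2", "run", "demo_nodes_cpp", role,
            "--ros-args", "--remap", f"__node:=simple_{role}"]


if __name__ == "__main__":
    env = discovery_server_env(MACHINE_A_IP)
    print(env["ROS_DISCOVERY_SERVER"], env["RMW_IMPLEMENTATION"])
    print(" ".join(node_command("talker")))
    # To launch for real (needs a sourced ROS 2 Iron environment):
    #   import subprocess; subprocess.run(node_command("talker"), env=env, check=True)
```

Passing an explicit `base={}` makes the helper easy to inspect without inheriting the caller's shell environment.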

Expected behavior

  • as soon as machine B is reconnected to the network, machine A's listener should print B's messages again.

Actual behavior

  • the delay varies between experiments: we have seen 100+ seconds, sometimes 50 seconds, and sometimes 10 seconds between B coming back online and the first message from B arriving at A.
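To put comparable numbers on these delays across runs, one option is to save the listener's output and compare timestamps. The sketch below assumes the default rcutils console format ("[{severity}] [{time}] [{name}]: {message}", with time in epoch seconds) and the demo listener's "I heard: [Hello World: N]" message; reconnect_time is a value you would record yourself, for example when the first ping from B succeeds.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Assumed default console format, e.g.:
# [INFO] [1709041234.567890123] [simple_listener]: I heard: [Hello World: 12]
LINE_RE = re.compile(
    r"\[(?P<sev>\w+)\] \[(?P<t>\d+(?:\.\d+)?)\] \[[^\]]+\]: "
    r"I heard: \[Hello World: (?P<id>\d+)\]"
)


def parse_heard(lines: Iterable[str]) -> List[Tuple[float, int]]:
    """Return (timestamp, message id) for every 'I heard' line, in order."""
    out: List[Tuple[float, int]] = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            out.append((float(m.group("t")), int(m.group("id"))))
    return out


def reconnection_delay(lines: Iterable[str], reconnect_time: float) -> Optional[float]:
    """Seconds from `reconnect_time` (epoch seconds) to the first message A
    received at or after it; None if nothing arrived afterwards."""
    for t, _ in parse_heard(lines):
        if t >= reconnect_time:
            return t - reconnect_time
    return None


def largest_gap(lines: Iterable[str]) -> Optional[Tuple[float, int, int]]:
    """(gap in seconds, id before the gap, id after the gap) for the longest
    silence between consecutive received messages; None if fewer than two."""
    heard = parse_heard(lines)
    best: Optional[Tuple[float, int, int]] = None
    for (t0, id0), (t1, id1) in zip(heard, heard[1:]):
        if best is None or t1 - t0 > best[0]:
            best = (t1 - t0, id0, id1)
    return best
```

The id jump reported by `largest_gap` also shows how many talker messages were published while A was not receiving them.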

Additional information

  • running ros2 multicast send on B and ros2 multicast receive on A seems to work once B is back online (A receives and prints the Hello World message), even while A's listener has not yet printed any of B's messages.
  • running the same setup with Cyclone DDS (RMW_IMPLEMENTATION=rmw_cyclonedds_cpp) seems to work without any problems: when B reconnects to the wifi, the listener on machine A immediately prints B's messages.
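The multicast check above can also be scripted so it runs in a loop while waiting for the listener to recover. This is a stdlib sketch, not the ros2cli implementation: the group 225.0.0.1 and port 49150 are assumed to match the `ros2 multicast` defaults, and a passing result only shows that UDP multicast crosses the wifi link again, not that DDS discovery has recovered.

```python
import socket
import struct

# Assumed to mirror the `ros2 multicast send/receive` defaults; adjust if yours differ.
GROUP = "225.0.0.1"
PORT = 49150


def membership_request(group: str, iface: str = "0.0.0.0") -> bytes:
    """ip_mreq for IP_ADD_MEMBERSHIP: multicast group address + local interface."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface))


def make_sender(ttl: int = 1) -> socket.socket:
    """UDP socket for sending to the group; TTL 1 keeps packets on the local subnet."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return s


def make_receiver(group: str = GROUP, port: int = PORT,
                  timeout: float = 5.0) -> socket.socket:
    """UDP socket bound to `port` and joined to `group` on the default interface."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", port))
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership_request(group))
    s.settimeout(timeout)
    return s


def send_once(payload: bytes = b"Hello World!") -> None:
    """Run on machine B: send one datagram to the group."""
    with make_sender() as s:
        s.sendto(payload, (GROUP, PORT))


def receive_once() -> bool:
    """Run on machine A: True if one datagram arrived before the timeout."""
    with make_receiver() as s:
        try:
            data, addr = s.recvfrom(1024)
        except socket.timeout:
            return False
        print(f"Received from {addr[0]}: {data!r}")
        return True
```

Timestamping each successful `receive_once()` alongside the listener log makes it easier to see how long the link has been usable before the listener resumes.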

@fujitatomoya
Collaborator

@MiguelCompany @EduPonz any idea? This issue sounds like unreliable behavior.

@robert-preissl does this problem only happen with the discovery server? What if we start the application without the Fast DDS discovery server?

@robert-preissl
Author

@fujitatomoya thanks for your message. Without the discovery server, it seems difficult to get the talker and listener on these two machines to communicate at all.
(We use wifi to connect A and B; maybe we are running into the problem with simple discovery over wifi / multicast that is mentioned in the docs.)

@robert-preissl
Author

@fujitatomoya @MiguelCompany @EduPonz just a quick, friendly ping, since this is impacting some of our operations: we have rare, but still occasional, brief wifi outages. Thanks.

@robert-preissl
Author

@fujitatomoya @MiguelCompany @EduPonz one last ping to check whether there is anything you would recommend. Thanks.

@fujitatomoya
Collaborator

@robert-preissl I really do not have any clue about this behavior right now.

Just one question about this statement:

running the same with cyclone dds (RMW_IMPLEMENTATION=rmw_cyclonedds_cpp) seems to be without any problems. i.e., when B re-connects to wifi, the terminal from machine A immediately prints B's messages

This also works with a Cyclone DDS discovery server, right? I.e., not using multicast-based discovery.

@robert-preissl
Author

@fujitatomoya No, with Cyclone DDS no discovery server was used (as far as I know, Cyclone DDS does not have a discovery server, unless I am mistaken).
