Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce amcl_node nodes #2483

Merged
merged 1 commit into from
May 26, 2022

Conversation

gezp
Copy link
Contributor

@gezp gezp commented Jul 28, 2021

Signed-off-by: zhenpeng ge [email protected]


Basic Info

Info Please fill out this column
Ticket(s) this addresses (tickets: #816)
Primary OS tested on (Ubuntu20.04, ROS 2 Galactic)
Robotic platform tested on (gazebo simulation of turtlebot3)

Description of contribution in a few bullet points

As described in #816, rclcpp_node_ in class AmclNode need be removed, you can find more details(other nodes which need be removed) in #816 (comment)

  • After the PR (Add lifecycle node support ros2/message_filters#59) was merged, we can use message_filters::Subscriber in LifecycleNode. and update tools/underlay.repos to sync the feature.
  • There is no blocked callback in AmclNode, so just use shared_from_this() instead of rclcpp_node_ for Transform and MessageFilter

Test

colcon test

colcon test --packages-select nav2_amcl

#result:0 errors, 0 failures, 0 skipped

navigation demo test (turtlebot3 in gazebo)

  • success

Future work that may be required in bullet points

  • i notice that the constructor of TransformListener has a argument spin_thread with default value true, which create a internal node to subscribe tf topic and a dedicated thread to spin this node.
  • i think the best way is to use new callback group and executor with dedicated thread so that we could avoid creating internal node which is overhead. i will open a issue for ros2/geometry to discuss these.

For Maintainers:

  • Check that any new parameters added are updated in navigation.ros.org
  • Check that any significant change is added to the migration guide
  • Check that any new functions have Doxygen added
  • Check that any new features have test coverage
  • Check that any new plugins is added to the plugins page
  • If BT Node, Additionally: add to BT's XML index of nodes for groot, BT package's readme table, and BT library lists

@gezp gezp changed the title reduce amcl_node noddes Reduce amcl_node nodes Jul 28, 2021
@SteveMacenski
Copy link
Member

@ruffsl it looks like the underlay failed to build and use the message filters package in the underlay.repos file. I see in the CI job it recognizes that its there, but I don't see any logging that it was built

@ruffsl
Copy link
Member

ruffsl commented Jul 28, 2021

The underlay appears to have built correctly, but it seems that a dependency was added without also amending the respective package.xlm and CMake files. Also I think the message filter package already ships with rolling, so you could leave it commented out in the repos file, and let rosdep install it from apt for the overlay workspace.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jul 28, 2021

There is no blocked callback in AmclNode, so just use shared_from_this() instead of rclcpp_node_ for Transform and MessageFilter

AMCL does have the callback for the subscriber, laserReceivedwhich does a bit of work. This might be a good discussion point for @wjwwood please don't PM me on this PR so that he can see this discussion as well. @wjwwood when working with TF, what's your recommendation?

It looks like the TF buffer has an argument in the constructor for a node that is defaulted to making a new node just to create a getFrames service. This doesn't have as clear of a solution -- this feels out of place. Shouldn't that be in the TF Listener perhaps? It seems odd to me that the buffer would have a ROS interface at all. Seems like it should have the utility functions to get the set of frames, but then the actual service should reside in the listener using the buffer. That would also remove the need for a node in the TF buffer at all.

The TF listener then creates a node internally to spin and work with. This has a clearer solution described below.


i think the best way is to use new callback group and executor with dedicated thread so that we could avoid creating internal node which is overhead. i will open a issue for ros2/geometry to discuss these.

Agreed, for now, I think its worth letting TF do its own thing and control its own node for the time being until we find a solution on that front. I looked over these files https://github.com/ros2/geometry2/blob/ros2/tf2_ros/src/transform_listener.cpp#L68-L91 and I think there's a really reasonable change that can be made here to solve our problems, at least as relates to the Listener.

  • Have the TransformListener take in a node in the constructor
  • If spin_thread, then create the executor and handle all the same way
  • If not spin_thread, then its on the application to spin
  • Remove any ability for the listener to create a node

@ruffsl What do you mean? No new packages were added. We just want to use a newer master version of message_filters until it is synced with Rolling in order to continue moving this ticket forward. Changes were made we want to exploit.

@gezp
Copy link
Contributor Author

gezp commented Jul 29, 2021

@ruffsl could you give me more details? i don't know how to fixed the CI error.

@ruffsl
Copy link
Member

ruffsl commented Jul 29, 2021

What do you mean? No new packages were added. We just want to use a newer master version of message_filters until it is synced with Rolling in order to continue moving this ticket forward. Changes were made we want to exploit.

Ah, ok. I'm on my phone, so I didn't browes the code outside of the PR changes, and just assumed that the additional package to the repos file was still undelared in the manifests. Never the less, the underlay seems to be building fine:

Starting >>> message_filters
�]0;colcon cache [0/1 done] [1 ongoing]�Finished <<< message_filters [0.01s]
�]0;colcon cache [1/1 done] [0 ongoing]�
Summary: 1 package finished [0.22s]
BUILD_UNFINISHED: message_filters
BUILD_FAILED:
BUILD_INVALID: message_filters
BUILD_PACKAGES: message_filters
Starting >>> message_filters
�]0;colcon build [0/1 done] [1 ongoing]�[Processing: message_filters]
[Processing: message_filters]
[Processing: message_filters]
Finished <<< message_filters [1min 47s]
�]0;colcon build [1/1 done] [0 ongoing]�
Summary: 1 package finished [1min 48s]

@gezp , what distro or source are you developing this PR with locally? Are you using the latest Rolling release, or are you building everything from ros2 master? Check if the Nav2 project Dockerfile builds fine locally for your PR:

docker build --tag my/navigation2:tag .

@wjwwood
Copy link
Contributor

wjwwood commented Jul 29, 2021

I don't know if tf2 has been updated to make use of the fact that you can add callback groups to dedicated executors yet. You might want to bring it up on the geometry2 issue tracker. Or provide a pull request to that effect (rather than it having its own node). I don't use tf2 often enough to answer your architectural question about the buffer interface and how it should or shouldn't depend on ROS.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jul 29, 2021

Ok @gezp can you open the PR to implement what I describe above for the listener and open a ticket on the listener / buffer node usages? We can continue the discussion with Tully over there.

For now, lets leave TF alone to do its own thing and fix it over there instead.

I'd hold off on the getFrames service moving part, they might have a good reason for that and we should take their feedback and come up with a compromise that meets everyone's core needs.

@gezp
Copy link
Contributor Author

gezp commented Jul 29, 2021

@SteveMacenski I opened a ticket ros2/geometry2#440 ,and PR ros2/geometry2#442 for listener.

@gezp
Copy link
Contributor Author

gezp commented Jul 29, 2021

@ruffsl Thank you, i will wait for the PR (ros/rosdistro#30350) to be merged so that i could use the feature which i need on the latest Rolling release.

@SteveMacenski
Copy link
Member

SteveMacenski commented Aug 2, 2021

@gezp those don't get released to rolling immediately, it requires a sync. The last sync was only a couple of weeks ago so it might be more than another month before we get another one. We should add the master to underlay and get this building so we can keep moving forward

In parallel to our work, we should also continue nudging and pushing the TF issue / PR forward. But for now, we should list anywhere that needs to be addressed in the ticket bullet point list and move onto non-TF problems in Nav2 while working on the TF problems at the same time in TF itself.

@gezp
Copy link
Contributor Author

gezp commented Aug 3, 2021

I guess the message filter is a basic package in ros so we can't overlay this package, but i use a weird way so that i could use the newest message filter from source on my local computer.

I install ROS2 Galactic form binary on my local computer , and delete message filter package in /opt/ros/galactic before source /opt/ros/galactic/local_setup.bash, then i git clone the newest message filter from github, and build it so that i can use it in nav2. anyway i can build nav2 with this PR on my local computer with the newest message filter package, but i don't think this is a good way, and i'm not sure if there is any problem after i delete files.

@ruffsl i can't wait for the next ROS rolling version ,and i don't know how to pass CI , could you give me some suggestions?

@ruffsl
Copy link
Member

ruffsl commented Aug 3, 2021

is a basic package in ros so we can't overlay this package

I'm not sure I understand why the message filter package can't be overlay'ed in this particular case. It being a core package shouldn't it self prevent this. One can overload an existing packe by including it in a workspace, where upon setup, the package sourced from the top most overlay should take environmental president over any duplicates in the underlay workspaces.

Complications can occur when trying to link to dependencies that were compiled against a different version of the common dependency from a deeper underlay, but in this case, from how you described your current development with monkey patching galactic and overlay confined to navigation2, that doesn't seem to be an issue already.

Is there anything else you're including in your underlay besides the master branch of the message filter package, such as master branches of the message filter package's own dependencies?

@gezp
Copy link
Contributor Author

gezp commented Aug 3, 2021

Is there anything else you're including in your underlay besides the master branch of the message filter package, such as master branches of the message filter package's own dependencies?

I only put message_filters in my underlay workspace locally.

currently, I try to use Docker to build nav2. I add message_filters in underlay.repos , but i failed to build Dockerfile (as same as CI). and i get a warning with error output.

--- stderr: nav2_rviz_plugins
CMake Warning at CMakeLists.txt:50 (add_library):
  Cannot generate a safe runtime search path for target nav2_rviz_plugins
  because there is a cycle in the constraint graph:

    dir 0 is [/opt/overlay_ws/install/nav2_msgs/lib]
    dir 1 is [/opt/ros/rolling/lib]
      dir 7 must precede it due to runtime library [libmessage_filters.so]
    dir 2 is [/opt/overlay_ws/install/nav2_lifecycle_manager/lib]
    dir 3 is [/opt/overlay_ws/install/nav2_util/lib]
    dir 4 is [/opt/ros/rolling/opt/yaml_cpp_vendor/lib]
    dir 5 is [/opt/ros/rolling/opt/rviz_ogre_vendor/lib]
    dir 6 is [/opt/ros/rolling/lib/x86_64-linux-gnu]
    dir 7 is [/opt/underlay_ws/install/message_filters/lib]
      dir 1 must precede it due to runtime library [libmessage_filters.so]

  Some of these libraries may not be found correctly.
---

then i modify Dockerfile and add some cmds before line70 in Dockerfile

# berore . /opt/ros/$ROS_DISTRO/setup.sh
# remove message_filters
RUN rm -rf /opt/ros/rolling/include/message_filters && \
    rm -rf /opt/ros/rolling/share/message_filters &&\
    rm -rf /opt/ros/rolling/libmessage_filters.so 

above warning is still existed, but i can build the Dockerfile successfully by this way. so do i need change CI ? or do you have a better way? @ruffsl

@SteveMacenski
Copy link
Member

mhm, that is complicated. I think the only real solution would be building it all from absolute scratch and replace the message filters with our branch at that point. That's probably not realistic for us to do in CI, so the only realistic option I see is to wait for a new rolling sync on this PR and just move on to other topics in the meantime while we wait.

@gezp gezp mentioned this pull request Aug 4, 2021
6 tasks
@SteveMacenski
Copy link
Member

https://discourse.ros.org/t/preparing-for-rolling-sync-2021-08-05/21701

Good news! A sync is happening tomorrow

@gezp
Copy link
Contributor Author

gezp commented Aug 9, 2021

the new release of ROS Rolling is available ros rolling/2021-08-06 , but the docker image ros:rolling is not updated (https://hub.docker.com/_/ros) , i noticed you are contributor of osrf/docker_images, could you help me let CI use the latest ROS Rolling? @ruffsl

@ruffsl
Copy link
Member

ruffsl commented Aug 9, 2021 via email

@SteveMacenski
Copy link
Member

This is building now -- but I think we are still blocked here with respect to TF or no?

@gezp
Copy link
Contributor Author

gezp commented Aug 11, 2021

i don't think we should wait TF. because tf should spin with self's thread (to avoid delay or block), we needn't consider the implementation of tf (use internal Node or callback group+executor). if TF is updated, we just need a small update in Nav2 (use another constructor of TF listener)

@SteveMacenski
Copy link
Member

SteveMacenski commented Aug 12, 2021

Wouldn't the services / map / initial pose callbacks block the laser scan thread? As this PR stands.

I think I need to revisit the TF code. I'm not concerned about giving the buffer the clock from this node and the TF broadcaster is just making a publisher with the node, but I do want to understand what the create Timer ROS function is doing with the timer / base interface.

@gezp
Copy link
Contributor Author

gezp commented Aug 12, 2021

there is no block in services / map / initial pose callbacks
services:

  • callback of global_loc_srv _ (globalLocalizationCallback): call pf_init_model() and set some variables, so the callback won't be blocked
  • callback of nomotion_update_srv_ (nomotionUpdateCallback) : only set a variable, so the callback won't be blocked

subscriptions:

  • callback of initial_pose_sub_ (initialPoseReceived): call handleInitialPose() , it contains lookupTransform(), but tf use internal nodes with dedicated thread, so this callback won't be blocked
  • callback of map_sub_ (mapReceived ) : call handleMapMessage() (free map memory and convert msg to map) , so the callback won't be blocked

@SteveMacenski
Copy link
Member

Yeah it was an issue before, but its different now because the services are involved. If you followed my example, the main concern I had was longer running things (services) blocking long enough to cause a flood of message filter warnings / errors. The TF/laser scan issue is also there, but isn't concerning to me because what you mentioned, its been like that for about 2 years. The no-update service is pretty trivial and not an issue but the global relocalization one is a bit more so.

Can we separate the services onto another executor? Do you think that's a rational move (it would probably need then some shared resource protections)? Not the best solution in my mind though.

The best solution to me is to make TF isolated from the rest of the node's interfaces so that it isn't being interfered with based on user services or data callbacks. If that was the case, then we shouldn't have an issue with message filters not being able to transform and we'd just be dropping laser scans in the case of long-running services which is the right behavior.

@gezp
Copy link
Contributor Author

gezp commented Aug 17, 2021

Can we separate the services onto another executor? Do you think that's a rational move (it would probably need then some shared resource protections)? Not the best solution in my mind though.

which services i need separate ? do you mean global_loc_srv and nomotion_update_srv_ ? @SteveMacenski

@SteveMacenski
Copy link
Member

yes, though as I said below, I think the best solution is sufficiently isolating TF

@gezp
Copy link
Contributor Author

gezp commented Aug 17, 2021

do you mean global_loc_srv_ callback maybe block laser_scan_sub_ ? if so, how about moving the laser_scan_sub_ onto another executor ?

@SteveMacenski
Copy link
Member

SteveMacenski commented Aug 17, 2021

It doesn't have to do with that, it has to do with the services blocking the message filter from doing its thing, via the TF timer sharing execution with the services/topics/other interfaces, if my understanding of the situation is correct. This will be a similar issue with Costmap2D's PR, which is why I recommend looking for a solution related to TF / message filters instead so that we have a general solution to the core concern rather than patching it for this particular situation.

TFs waitForTransform used inside the message filters are using the given the timer from the node which would be blocked, I believe, by a service call on the same executor. That's why I asked @wjwwood if he understood the design intent behind giving the timer externally, rather than creating itself based on an input node to TF, so that maybe we could use another setup to isolate TF more sufficiently.

@gezp
Copy link
Contributor Author

gezp commented Aug 17, 2021

sorry, the message filter of tf is difficult to understand for me. in my understanding, when receive a laser_scan message:

  • if it's transformable , the laser_scan message will be handled Immediately (call laserReceived).
  • otherwise, the message will add to a map (transformable_callbacks_), then in the callback of tf listener, use testTransformableRequests() to check and process this map ( if item of map is transformable , call laserReceived for this item), in addition, timers will be used to remove item which is timeout from the map.

if you want to isolate TF timer , i have a idea to do it :

  • let all services and subscriptions use a new callback group with dedicated executor and thread.
  • let timer use default callback group.

in my opinion, we needn't isolate TF timer, because timer is only used to cancel TransformableRequests (maybe wrong), it isn't a issue to cancel with delay.

@SteveMacenski
Copy link
Member

SteveMacenski commented Aug 17, 2021

Thusly, a blocked executor from a service could easily mess up the message filter's ability to function, throwing exceptions from timeouts and returning lots of errors/warnings to the logs if the transform is not immediately available to transform, which is very well possible / probable at times that a message filter would have to wait a bit to get a valid transformation.

ros2/geometry2#446

@gezp
Copy link
Contributor Author

gezp commented Aug 18, 2021

I have updated PR to isolate TF timer :

  • let all services and subscriptions use a new callback group with dedicated executor and thread.
  • let timer use default callback group.

please review again. @SteveMacenski

@SteveMacenski
Copy link
Member

SteveMacenski commented Aug 18, 2021

What you suggested is not unreasonable, but I'd like to focus on ros2/geometry2#446 to actually fix the underlying issue. I'd be curious if you could find a clean way to fix this in TF itself (maybe take in a node and have the timer interfaces in its own separate callback group / executor background thread) like we did with the other TF PR.

That would have a significantly larger impact on the community longer term preventing issues like this from reappearing elsewhere. I would much rather solve this problem forever than have to rethink about this every time we add a new server to Nav2 or building applications.

It doesn't seem very clean to give TF the "main" executor for the node and relegate the actual application code to another executor to keep TF happy.

@mergify
Copy link
Contributor

mergify bot commented Oct 15, 2021

This pull request is in conflict. Could you fix it @gezp?

@ros-navigation ros-navigation deleted a comment from mergify bot Oct 15, 2021
@ros-navigation ros-navigation deleted a comment from mergify bot Nov 8, 2021
@SteveMacenski
Copy link
Member

@gezp where are we on this? Can you ping the ticket / PR that is blocking this work from being completed? I really don't want this to continue sitting around when it could be resolved

@gezp
Copy link
Contributor Author

gezp commented Mar 24, 2022

@SteveMacenski , it's blocked by ros2/geometry2#447.

@SteveMacenski
Copy link
Member

OK - I poked them and it should be merged within a day

@SteveMacenski
Copy link
Member

Merged!

@mergify
Copy link
Contributor

mergify bot commented Mar 28, 2022

@gezp, your PR has failed to build. Please check CI outputs and resolve issues.
You may need to rebase or pull in main due to API changes (or your contribution genuinely fails).

@gezp
Copy link
Contributor Author

gezp commented Mar 28, 2022

CI failed, we need wait for a new rolling sync to get feature ros2/geometry2#447.

Signed-off-by: zhenpeng ge <[email protected]>

add callback group and executor

Signed-off-by: zhenpeng ge <[email protected]>

update

Signed-off-by: zhenpeng ge <[email protected]>
@gezp
Copy link
Contributor Author

gezp commented May 26, 2022

CI passed, please review this pr again. @SteveMacenski

@SteveMacenski SteveMacenski merged commit e581eef into ros-navigation:main May 26, 2022
@SteveMacenski
Copy link
Member

Thanks! Nice to get this in finally 😄 I think there's just 1 more pending to complete this all. Thanks for being so on top of it once we got back on Rolling!

redvinaa pushed a commit to EnjoyRobotics/navigation2 that referenced this pull request Jun 30, 2022
Signed-off-by: zhenpeng ge <[email protected]>

add callback group and executor

Signed-off-by: zhenpeng ge <[email protected]>

update

Signed-off-by: zhenpeng ge <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants