-
Notifications
You must be signed in to change notification settings - Fork 193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multi-Machine Launching #255
Changes from all commits
91c6b41
9f3d7de
a4f956c
2e7634b
022e04d
b5ecbf7
4e45a96
b513248
bc31412
4ff0bed
63b2a8c
183d650
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,321 @@ | ||
--- | ||
layout: default | ||
title: ROS 2 Multi-Machine Launching | ||
permalink: articles/roslaunch_mml.html | ||
abstract: | ||
Robotic systems are often distributed across multiple networked machines. | ||
This document describes proposed modifications and enhancements to ROS2's launch system to facilitate launching, monitoring, and shutting down systems spread across multiple machines. | ||
author: '[Matt Lanting](https://github.com/mlanting)' | ||
published: false | ||
--- | ||
|
||
- This will become a table of contents (this text will be scraped). | ||
{:toc} | ||
|
||
# {{ page.title }} | ||
|
||
<div class="abstract" markdown="1"> | ||
{{ page.abstract }} | ||
</div> | ||
|
||
Authors: {{ page.author }} | ||
|
||
## Context | ||
|
||
This document elaborates on the details of launching remote operating system processes alluded to [here](https://github.com/ros2/design/blob/gh-pages/articles/150_roslaunch.md#remote-operating-system-processes) in the main ROS 2 ros_launch design document. | ||
|
||
## Goals | ||
|
||
Our primary goal is to eliminate the need for users to connect to multiple machines and manually launch different components of a system on each of them independently. | ||
The launch system in ROS 1 included a `<machine>` tag for launch files that allowed users to include information about networked machines and how to connect so that processes could be started remotely. | ||
We would like to replicate that capability in the launch system for ROS 2. | ||
|
||
We would like the launch system for ROS 2 to avoid becoming a single point of failure, while still having the capability to shut down the system as a whole on command. | ||
In ROS 1, communication among nodes was facilitated by roscore which roslaunch would start automatically if no instance was already running. | ||
As a result, the machine that roslaunch was run from became a core part of the system and the entire system would go down if it crashed or became disconnected. | ||
This has been problematic on occasion when working with machines running headlessly and interfacing with a laptop. | ||
The `launch` command either had to be run specifically on the computer that roscore was meant to run on, or other steps would need to be taken to launch `roscore` on a remote machine before running the `roslauch` command. | ||
In ROS 2, nodes use DDS to connect in a peer-to-peer fashion with no centralized naming and registration services to have to start up. | ||
|
||
Other issues that we've dealt with on multi-machine systems include ensuring all the machines are properly configured and set up and keeping files and packages synchronized and up to date across machines. | ||
These issues, while related to working with multiple machines, are a bit outside the scope of roslaunch. | ||
There are a number of third-part orchestration tools, such as Kubernetes, that could be leveraged to get some of this extra functionality in addition to using them to facilitate execution of nodes, but we felt that would be too large of a dependency to require of people. | ||
Resource constraind projects in particular don't need to be burdened with additional third-party tools, and some hardware architectures do not have strong Docker support. | ||
It might however make sense to consider including an optional API to facilitate such third-party tools, or at the very least be mindful of them so we can avoid doing anything to make integrating them later too much more difficult. | ||
|
||
## Capabilities | ||
|
||
In order to meet the above use goals, we will provide the following capabilities: | ||
|
||
- Connect to a remote host and running nodes on it | ||
- Support arbitrary remote execution or orchestration mechanisms (`ssh` by default) | ||
- Push configuration parameters for nodes to remote hosts | ||
- Monitor the status and managing the lifecycles of nodes across hosts | ||
- Gracefully shut down nodes across hosts | ||
- Command line tools for managing and monitoring systems across machines | ||
- A grouping mechanism allowing collections of nodes to be stopped/introspected as a unit with the commandline tools | ||
|
||
### Stretch-goals | ||
- API to facilitate integration of third party orchestration tools such as Kubernetes or Ansible | ||
- Load balancing nodes on distributed networks (Possibly outsource this capability to the previously mentioned third-party tools) | ||
- Sharing and synchronizing files across machines | ||
- Deployment and configuration of packages on remote machines | ||
|
||
## Considerations | ||
|
||
There are some outstanding issues that may complicate things: | ||
|
||
- How to group nodes/participants/processes is somewhat of an open issue with potential implications for this part of ROS2. | ||
- https://github.com/ros2/design/pull/250/files/8ccaac3d60d7a0ded50934ba6416550f8d2af332?short_path=dd776c0#diff-dd776c070ecf252bc4dcc4b86a97c888 | ||
- The number of domain participants is limited per vendor (Connext is 120 per domain). | ||
- No `rosmaster` means there is no central mechanism for controlling modes or distributing parameters | ||
- Machines may be running different operating systems | ||
- If we intend to do any kind of load balancing, certain types of resources may need to be transferred to other machines. | ||
- Calibration data, map files, training data, etc. | ||
- Need to keep track of which machine has the most recent version of such resources | ||
- Security: we'll need to manage credentials across numerous machines both for SSH and secure DDS. | ||
|
||
## Proposed Approach | ||
|
||
Following are some of the possible design approaches we have started considering. | ||
This section should evolve to describe a complete and homogenous solution as we iterate over time, but at the moment may be a bit piecemeal as we explore ideas. | ||
The point is to capture all of our ideas and approaches to different pieces of the problem, even rejected approaches, and to facilitate discussion and maintain a record of our reasoning. | ||
|
||
### Simple Remote Process Execution | ||
|
||
Create an action in `launch` called `ExecuteRemoteProcess` that extends the `ExecuteProcess` action but includes parameters for the information needed to connect to a remote host and executes the process there. | ||
|
||
### Spawn Remote LaunchServers | ||
|
||
The `LaunchServer` is the process that, given a `LaunchDescription`, visits all of the constituent `LaunchDescriptionEntities`, triggering them to perform their functions. | ||
Since the launch process involves more than simply executing nodes, it is unlikely that simply providing a way to execute nodes remotely will be adequate for starting non-trivial systems. | ||
The `LaunchServer` is responsible for things such as setting environment variables, registering listeners, emitting events, filling out file and directory paths, declaring arguments, etc. | ||
Remote machines will need to be made aware of any environment changes that are in-scope for nodes that they will be executing, and events may need to be handled across machines. | ||
|
||
One approach would be to add logic to the launch system allowing it to group `LaunchDescriptionEntities` containing the necessary actions and substitutions for successfully executing a node remotely, spawning a LaunchService on the remote machine, serializing the group of entities and sending them to the remote machine to be processed. | ||
This could turn out to be a recursive process depending on how `LaunchDescriptionEntity`'s are nested. | ||
Additional logic will be needed to detect cases where event emission and listener registration cross machine boundaries, and helper objects can be generated to forward events over the wire so handlers on other machines can react appropriately. | ||
|
||
LaunchServers would be the components with which the command line tools interact and need to have channels exposing information about the processes they've started, and for receiving user commands. | ||
|
||
### Define Remote Execution Mechanisms on a Per-Machine Basis | ||
|
||
Historically, ROS1 launched nodes by using `ssh` to connect to a remote machine and execute processes on it. | ||
This is still a reasonable way of doing it and is the expected remote execution mechanism in most environments. | ||
|
||
Some hosts or environments may use a different mechanism, such as Windows Remote Shell on Windows hosts or `kubectl` for Kubernetes clusters. | ||
There will be an abstract interface for remote execution mechanisms; | ||
it will be possible to write custom implementations that use arbitrary mechanisms, and the launch system can be configured to decide which mechanism to use on a per-machine basis. | ||
When a launch system is run, information about all of the nodes assigned to a machine will be passed to the remote execution mechanism implementation so that it can execute them appropriately. | ||
|
||
## Proposed Multi-Machine Launch Command Line Interface | ||
|
||
Launching is controlled through the `launch` command for the `ros2` command-line tool. | ||
|
||
### Commands | ||
|
||
```bash | ||
$ ros2 launch | ||
usage: ros2 launch (subcommand | [-h] [-d] [-D] [-p | -s] [-a] | ||
package_name [launch_file_name] | ||
[launch_arguments [launch_arguments ...]]) ... | ||
|
||
Without a subcommand, `ros2 launch` will run a launch file. Call | ||
`ros2 launch <subcommand> -h` for more detailed usage. | ||
|
||
positional arguments: | ||
package_name Name of the ROS package which contains the launch file | ||
launch_file_name Name of the launch file | ||
launch_arguments Arguments to the launch file; '<name>:=<value>' (for | ||
duplicates, last one wins) | ||
argv Pass arbitrary arguments to the launch file | ||
|
||
optional arguments: | ||
-h, --help Show this help message and exit. | ||
-d, --debug Put the launch system in debug mode, provides more verbose output. | ||
-D, --detach Detach from the launch process after it has started. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we would use if launch process kills itself after launch. after the launch just to leave everything else. for embedded platform, we might not need launch process, but we likely to use launch description to init the system. |
||
-p, --print, --print-description | ||
Print the launch description to the console without launching it. | ||
-s, --show-args, --show-arguments | ||
Show arguments that may be given to the launch file. | ||
-a, --show-all-subprocesses-output | ||
Show all launched subprocesses' output by overriding | ||
their output configuration using the | ||
OVERRIDE_LAUNCH_PROCESS_OUTPUT envvar. | ||
|
||
Subcommands: | ||
list Search for and list running launch systems | ||
attach Attach to a running launch system and wait for it to finish | ||
term Terminate a running launch system | ||
|
||
Call `ros2 launch <subcommand> -h` for more detailed usage. | ||
``` | ||
|
||
Example output: | ||
|
||
```bash | ||
$ ros2 launch demo_nodes_cpp talker_listener.launch.py | ||
[INFO] [launch]: All log files can be found below /home/preed/.ros/log/2019-09-11-20-54-30-715383-regulus-2799 | ||
[INFO] [launch]: Default logging verbosity is set to INFO | ||
[INFO] [launch]: Launch System ID is 50bda6fb-d451-4d53-8a2b-e8fcdce8170b | ||
[INFO] [talker-1]: process started with pid [2809] | ||
[INFO] [listener-2]: process started with pid [2810] | ||
[talker-1] [INFO] [talker]: Publishing: 'Hello World: 1' | ||
[listener-2] [INFO] [listener]: I heard: [Hello World: 1] | ||
[talker-1] [INFO] [talker]: Publishing: 'Hello World: 2' | ||
[listener-2] [INFO] [listener]: I heard: [Hello World: 2] | ||
[talker-1] [INFO] [talker]: Publishing: 'Hello World: 3' | ||
[listener-2] [INFO] [listener]: I heard: [Hello World: 3] | ||
[talker-1] [INFO] [talker]: Publishing: 'Hello World: 4' | ||
[listener-2] [INFO] [listener]: I heard: [Hello World: 4] | ||
^C[WARNING] [launch]: user interrupted with ctrl-c (SIGINT) | ||
[listener-2] [INFO] [rclcpp]: signal_handler(signal_value=2) | ||
[INFO] [talker-1]: process has finished cleanly [pid 2809] | ||
[INFO] [listener-2]: process has finished cleanly [pid 2810] | ||
[talker-1] [INFO] [rclcpp]: signal_handler(signal_value=2) | ||
``` | ||
|
||
Note how there is one difference from the old behavior of `ros2 launch`; the group of nodes is assigned a Launch System ID. | ||
This is a unique identifier that can be used to track all of the nodes launched by a particular command across a network. | ||
|
||
Additionally, it is possible to detach from a system and let it run in the background: | ||
|
||
```bash | ||
$ ros2 launch -D demo_nodes_cpp talker_listener.launch.py | ||
[INFO] [launch]: All log files can be found below /home/preed/.ros/log/2019-09-11-20-54-30-715383-regulus-2799 | ||
[INFO] [launch]: Default logging verbosity is set to INFO | ||
[INFO] [launch]: Launch System ID is 50bda6fb-d451-4d53-8a2b-e8fcdce8170b | ||
$ | ||
``` | ||
|
||
A crucial difference here is that with ROS 1, the launch process was tied to the life of the system. If the process exited, that would also terminate all of the nodes it had launched. With ROS 2, the launch process can exit and leave the system running. | ||
|
||
One reason for this is that it was at odds with ROS 2's decentralized design paradigm. Nodes do not need a `rosmaster` to communicate and can operate on physical networks that disconnect from or reconnect to each other. Requiring a single `roslaunch` process that terminates the entire system when it exits is also introducing a single point of failure that had previously been avoided. | ||
|
||
A more practical reason is that in a multi-machine environment, it is often the case that the host doing the launching is not a critical part of the system and the rest of the system should not depend on it. A common use case is that on a vehicle that has several headless hosts for running ROS nodes, you will have a separate laptop for monitoring or controlling those hosts; you will want to be able to launch your system from that laptop, but the system should not terminate just because your laptop goes to sleep or disconnects from the network. The ROS 1 `roslaunch` system would require that `roslaunch` run on one of the vehicle hosts, and to do that you would need to either get remote shell access to one of them or write a custom set of services and launch scripts; in ROS 2, being able to detach from and reattach to a system makes that possible by design. | ||
|
||
#### `list` | ||
|
||
Since it is possible to launch a system of nodes that spans a network and detach from it, it is necessary to be able to query the network to find which systems are active. | ||
|
||
```bash | ||
$ ros2 launch list -h | ||
usage: ros2 launch list [-h] [-v] [--spin-time SPIN_TIME] | ||
|
||
List running launch systems | ||
|
||
|
||
optional arguments: | ||
-h, --help Show this help message and exit. | ||
-v, --verbose Provides more verbose output. | ||
--spin-time SPIN_TIME | ||
Spin time in seconds to wait for discovery (only | ||
applies when not using an already running daemon) | ||
``` | ||
|
||
Here is a simple list that may be useful programmatically, but not so much for an end user: | ||
|
||
```bash | ||
$ ros2 launch list | ||
ab1e0138-bb22-4ec9-a590-cf377de42d0f | ||
50bda6fb-d451-4d53-8a2b-e8fcdce8170b | ||
5d186778-1f50-4828-9425-64cc2ed1342c | ||
$ | ||
``` | ||
|
||
Here is a more verbose list that contains information a user can use to identify a system: | ||
|
||
```bash | ||
$ ros2 launch list -v | ||
ab1e0138-bb22-4ec9-a590-cf377de42d0f: 5 nodes, 2 hosts | ||
Launch host: 192.168.10.5 | ||
Launch time: Fri Sep 13 15:39:45 CDT 2019 | ||
Launch command: ros2 launch package_foo bar.launch.py argument:=value | ||
50bda6fb-d451-4d53-8a2b-e8fcdce8170b: 2 nodes, 1 host | ||
Launch host: 192.168.10.15 | ||
Launch time: Fri Sep 13 12:39:45 CDT 2019 | ||
Launch command: ros2 launch demo_nodes_cpp talker_listener.launch.py | ||
5d186778-1f50-4828-9425-64cc2ed1342c: 16 nodes, 3 hosts | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mlanting does each launch file provide enough information to dispatch each node to the right host? If so, how does this cope with a single-machine launch to be launched in multiple hosts, unaware of each other? If not, how are nodes or even "systems" as you propose here associated with each host? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is just a theoretical design of how we would expect the CLI to look like, so there isn't a definite answer for those questions (yet). I expect that the launch files will, similar to ROS1's launch files, contain information about all of the hosts involved and which nodes should be launched on which host.
I'm not sure what this means, do you have an example? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My thought was that nodes which need to run on a certain machine can have that machine specified in the launch file, and those that do not have a machine specified would be sent to a host selected by the launch system (to allow for things like load balancing). The launch system would be informed of hosts by something like a "declare_launch_hosts_action" |
||
Launch host: 192.168.10.13 | ||
Launch time: Fri Sep 12 10:39:45 CDT 2019 | ||
Launch command: ros2 launch package_foo bar2.launch.py | ||
$ | ||
``` | ||
|
||
#### `attach` | ||
|
||
Since it is possible to detach from a launched system, it is useful for scripting or diagnostic purposes to be able to re-attach to it. | ||
|
||
```bash | ||
$ ros2 launch attach -h | ||
usage: ros2 launch attach [-h] [-v] [--spin-time SPIN_TIME] [system_id] | ||
|
||
Blocks until all nodes running under the specified Launch System ID have exited | ||
|
||
positional arguments: | ||
system_id Launch System ID of the nodes to attach to; if less than a full UUID is specified, it will attach to the first Launch System it finds whose ID begins with that sub-string | ||
|
||
optional arguments: | ||
-h, --help Show this help message and exit. | ||
-v, --verbose Provides more verbose output. | ||
--spin-time SPIN_TIME | ||
Spin time in seconds to wait for discovery (only applies when not using an already running daemon) | ||
``` | ||
|
||
Example output: | ||
|
||
```bash | ||
$ ros2 launch attach 50bda6fb-d451-4d53-8a2b-e8fcdce8170b | ||
Attached to Launch System 50bda6fb-d451-4d53-8a2b-e8fcdce8170b. | ||
(... in another terminal, run `ros2 launch term 50bda6fb`...) | ||
All nodes in Launch System 50bda6fb-d451-4d53-8a2b-e8fcdce8170b have exited. | ||
$ | ||
``` | ||
|
||
Verbose mode: | ||
|
||
```bash | ||
$ ros2 launch attach -v 50bda6fb | ||
Attached to Launch System 50bda6fb-d451-4d53-8a2b-e8fcdce8170b. | ||
Waiting for node /launch_ros | ||
Waiting for node /talker | ||
Waiting for node /listener | ||
(... in another terminal, run `ros2 launch term 50bda6fb`...) | ||
Node /launch_ros has exited | ||
Node /talker has exited | ||
Node /listener has exited | ||
All nodes in Launch System 50bda6fb-d451-4d53-8a2b-e8fcdce8170b have exited. | ||
$ | ||
``` | ||
|
||
#### `term` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These features all seem like extensions that can extend the current There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree it would possibly be a useful feature of |
||
|
||
Terminates all nodes that were launched under a specific Launch System ID. | ||
|
||
```bash | ||
$ ros2 launch term -h | ||
usage: ros2 launch term [-h] [-v] [--spin-time SPIN_TIME] [system_id] | ||
|
||
Terminates all nodes that were launched under a specific Launch System ID | ||
|
||
positional arguments: | ||
system_id Launch System ID of the nodes to terminate; if less than | ||
a full UUID is specified, it will terminate nodes | ||
belonging to the first Launch System it finds whose ID | ||
begins with that sub-string | ||
|
||
optional arguments: | ||
-h, --help Show this help message and exit. | ||
-v, --verbose Provides more verbose output. | ||
--spin-time SPIN_TIME | ||
Spin time in seconds to wait for discovery (only | ||
applies when not using an already running daemon) | ||
``` | ||
|
||
Example output: | ||
|
||
```bash | ||
$ ros2 launch term 50bda6fb-d451-4d53-8a2b-e8fcdce8170b | ||
Terminating Launch System 50bda6fb-d451-4d53-8a2b-e8fcdce8170b. | ||
$ | ||
``` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: newline There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note to resolve this comment: EOF newlines are present unless stated otherwise in github viewer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious about the motivation of these features here.
I think things should only be included in the design document in one of two cases:
It's not clear to me that each of those meet that standard, but if they do then I think they need to be separately motivated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean. Some of these such as
Command line tools for managing and monitoring systems across machines
is probably best suited by a community package than one provided by ros2.Additionally, what is mean by "load balancing" in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in my opinion, i believe
these are totally off tipic from launch system, why this is integrated into launcher? could we think about these features more generic if necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I would say they're "totally off topic"; synchronizing a built workspace between machines is something I've had to deal with constantly when launch on multi-machine systems, and I've spent a considerably amount of time writing scripts to handle efficiently deploying large workspaces (I've got one that's 5.3 GB right now) across multiple ROS hosts.
But it is true that the concept of synchronizing files isn't tightly coupled to launching; there are people who will be interested in multi-machine launching but don't need to sync anything, and there are probably others who will want to sync data but aren't launching anything, so it might make more sense to break that out into its own system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I think we need to consider what constitutes things for multi-machine launching vs. tools for multi-machine ros2 setups. Our best bet might be to consolidate the features that would help multi-machine launching in the
launch
command within this guide and then think about maybe a suite of packages containing features helpful to multi-machine setups.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ROS 1, files were looked up via something like "find file X in package Y", and in multi machine launching it was required that packages and files existed in both machines (maybe not in the same place, but they had to be discoverable).
That seemed to work ok, though obviously there are situations, especially in development, where this isn't ideal. However, I do think we'd do well to keep the tools "small and sharp" and deal with this problem outside of launch. If there appears to be a really good default way to solve this problem, then we can consider adding tool support.
However, I'd encourage you guys to get it working with the assumption that all the packages existing on the local and remote machines (or in the "remote" containers), and then we can workout mechanisms for synchronization or container setup as needed.