
Multi-Machine Launching #255

Closed
wants to merge 12 commits into from

Conversation

mlanting

This is an initial draft of a design document outlining our plans for adding multi-machine launching capability to ros launch for ROS2. We'd like to put it out early in the design process so we can take community feedback into account as we move forward.

Distribution Statement A; OPSEC #2893

Contributor

@hidmic hidmic left a comment

Left a couple comments, though it looks like it's very early in the design process.


## Justification

Nodes can need to run on different hosts for a variety of reasons. Some possible
Contributor

@mlanting "can need"? Would "may need" sound better?


"May need" sounds better to me

## Justification

Nodes can need to run on different hosts for a variety of reasons. Some possible
use cases:
Contributor

@mlanting nit:

Suggested change
use cases:
use cases are:

Member

Or "Some possible use cases include:".


Will is right, "include" sounds better.

Nodes can need to run on different hosts for a variety of reasons. Some possible
use cases:

- A large robot could with hosts located physically near the hardware they are
Contributor

@mlanting hmm, not sure what you mean here.

Contributor

For example, on a vehicle that has many different types of sensors (cameras, lidars, radars, etc.), you may want to have a separate computer for processing the data from each one of those sets of sensors, and in order to simplify cable routing and network design, you may want to have those computers physically positioned next to them.

Alternately, depending on the chassis of your vehicle, you could have limitations on where it's actually possible to place computers, and so they could be scattered around by necessity.

Member

I also don't understand, seems like a typo or something as "could with hosts" doesn't parse for me...


Suggested change
- A large robot could with hosts located physically near the hardware they are
- A large robot could require hosts located physically near the hardware they are

@wjwwood Is this sufficient to make the explanation clearer? We could go further by describing the physical layout more, possibly talk about doing edge processing for sensors when you have a robot that's more than a few meters in overall size.

@zmk5 zmk5 Sep 20, 2019

If you guys are going for a specifically multi-robot/multi-agent approach, we should probably adopt language that suits the field.

Suggested change
- A large robot could with hosts located physically near the hardware they are
- A robot could request from neighboring static or dynamic agents control of their hardware, such as cameras, sensors, etc.

- Calibration data, map files, training data, etc.
- Need to keep track of which machine has the most recent version of such
resources
- Security: we'll need to manage credentials across numerous machines both for SSH
Contributor

@mlanting nit:

Suggested change
- Security: we'll need to manage credentials across numerous machines both for SSH
- Security credentials will have to be managed across numerous machines both for SSH


## Proposed Multi-Machine Launch Command Line Interface

The multi-machine launching interface is controlled through the `launcher`
Contributor

@mlanting meta: I wonder if launcher is the right naming, as opposed to host, machine or even system.

Contributor

It was the best thing I could think of at the time -- I am also not a big fan of it and am open to suggestions. :-)

I think it would make sense to provide the same interface for both single-machine and multi-machine launching and also make it obvious from the verb what the command does. system might work for that.

Author

launcher is just a new command we decided to propose since the existing launch command doesn't have any verbs. Running something like ros2 launcher launch would do the same thing as ros2 launch, either by calling the existing launch, or replacing it (in which case ros2 launch could just be an alias).

This is basically just a workaround for the fact that the existing launch command doesn't have verbs and we weren't sure it'd be a good idea to try to introduce verbs to a verbless command.

$ ros2 launcher list
ab1e0138-bb22-4ec9-a590-cf377de42d0f: 5 nodes, 2 hosts
50bda6fb-d451-4d53-8a2b-e8fcdce8170b: 2 nodes, 1 host
5d186778-1f50-4828-9425-64cc2ed1342c: 16 nodes, 3 hosts
Contributor

@mlanting does each launch file provide enough information to dispatch each node to the right host? If so, how does this cope with a single-machine launch to be launched in multiple hosts, unaware of each other? If not, how are nodes or even "systems" as you propose here associated with each host?

Contributor

@pjreed pjreed Sep 17, 2019

This is just a theoretical design of how we would expect the CLI to look, so there isn't a definite answer for those questions (yet). I expect that the launch files will, similar to ROS1's launch files, contain information about all of the hosts involved and which nodes should be launched on which host.

If so, how does this cope with a single-machine launch to be launched in multiple hosts, unaware of each other?

I'm not sure what this means, do you have an example?

Author

My thought was that nodes which need to run on a certain machine can have that machine specified in the launch file, and those that do not have a machine specified would be sent to a host selected by the launch system (to allow for things like load balancing). The launch system would be informed of hosts by something like a "declare_launch_hosts_action"
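To make that idea concrete, here is a minimal, purely illustrative sketch of the placement behavior described above (nodes pinned to a machine keep it, unpinned nodes are assigned by the launch system). The class names, the `machine` field, and the round-robin policy are all hypothetical stand-ins, not part of any existing ros2 launch API:

```python
# Hypothetical sketch only: "LaunchHost", "NodeToLaunch", the "machine" field,
# and the placement policy are illustrative stand-ins, not real ros2 launch API.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LaunchHost:
    """A host the launch system knows about (e.g. via a declare-hosts action)."""
    name: str
    address: str


@dataclass
class NodeToLaunch:
    """A node entry; machine=None means 'let the launch system pick a host'."""
    package: str
    executable: str
    machine: Optional[str] = None


def place_nodes(nodes: List[NodeToLaunch],
                hosts: List[LaunchHost]) -> List[Tuple[NodeToLaunch, str]]:
    """Honor explicit machine pins; spread the remaining nodes round-robin."""
    placements = []
    next_host = 0
    for node in nodes:
        if node.machine is not None:
            placements.append((node, node.machine))
        else:
            placements.append((node, hosts[next_host % len(hosts)].name))
            next_host += 1
    return placements


if __name__ == '__main__':
    hosts = [LaunchHost('nav_pc', '192.168.10.5'), LaunchHost('sensor_pc', '192.168.10.6')]
    nodes = [
        NodeToLaunch('camera_driver', 'driver_node', machine='sensor_pc'),  # pinned
        NodeToLaunch('demo_nodes_cpp', 'talker'),                           # unpinned
        NodeToLaunch('demo_nodes_cpp', 'listener'),                         # unpinned
    ]
    for node, host in place_nodes(nodes, hosts):
        print(f'{node.package}/{node.executable} -> {host}')
```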

@ruffsl
Member

ruffsl commented Sep 16, 2019

Are there any lessons or infrastructure we'd like to integrate/piggyback with? A long while ago we had a similar idea of integrating orchestration software to manage roslaunch for ROS1: ros/ros_comm#643

Rather than reinventing the wheel, perhaps we could open up an interface to make it simpler to plug roslaunch in with kubernetes, swarm, nomad, docker, or other orchestration models outside of containers?

@ivanpauno
Member

ivanpauno commented Sep 16, 2019

I agree that it would be good to build this on top of an existing orchestration tool.

It's not clear to me from the document whether you would be able to run processes on different machines from the same launch file. That was possible in ROS 1 using the machine tag.
There has been some discussion here on how to refactor the current ExecuteProcess action to make this easier.

@piraka9011

Thanks for the contribution @mlanting

Here are my suggestions:

  1. Let's get this into an outline like the other docs. Suggested outline below:

    • Preface/Background
    • Goals
      • In Scope
      • Out of scope
    • Features/Capabilities
    • Proposed Approach
      • Implementation
      • Risks/Issues
    • Alternatives? (Docker, Kubernetes...)
  2. Is there a reason you aren't proposing extending the previous launch verb? Wouldn't it be better to extend ros2launch with the ability to read a list of package_name/launch_files and pass the attach/detach args as needed:

$ ros2 launch [package_name [launch_file_name] [launch_arguments]]

That, or specifying a YAML configuration file with a system tag which specifies where the system should run:

# Command
$ ros2 launch --yaml-file <filename>
# YAML File
launch:
  package: <pkg>
    node: <node_name>
      arg1: [1, 2, 3]
      condition1: ['a', 'b', 'c']
      system: `my_robot.local`
  3. We are releasing a Docker plugin for launch (as suggested by @ruffsl) for the ROSCon security workshop. This allows you to run nodes/launch files in Docker containers and specify Docker arguments accordingly. Do you think that would support your ability to orchestrate nodes across systems and simplify the design?


Authors: {{ page.author }}

## Purpose


See my comment on the outline.


## Proposed Multi-Machine Launch Command Line Interface

The multi-machine launching interface is controlled through the `launcher`


What's the reason behind adding a whole new verb as opposed to extending the previous launch command? (Again, see my comments for proposed alternatives.)

Contributor

If possible, I think it'd be good to do this in a way that doesn't break backwards-compatibility, so just running ros2 launch pkg launch_file should behave the same way as it always has. On the other hand, none of the current ros2 verbs can both execute a command and have subcommands, so to do so here would be breaking with existing conventions.

The thought here is that the cleanest way to preserve backwards compatibility and preserve conventions is to add a new command that provides all of the necessary subcommands and have one of those subcommands provide the same functionality as the original ros2 launch. Do you think that either replacing the current launch command or having one command that both performs an action and has sub-commands would be preferable?

Member

Docker has done something similar, like where docker images was extended to docker image list, or docker ps vs docker container list. So we could nest the original verb into the new syntax, while keeping the old syntax. Though I'm a fan of reworking the launch verb as it's already singular. Perhaps ros2 launch -> ros2 launch start|stop|restart <pkg> <file>?

Contributor

Actually a little more:

Wouldn't it be better to extend ros2launch with the ability to read a list package_name/launch_files and pass the attach/detach args is needed:
$ ros2 launch [package_name [launch_file_name] [launch_arguments]]

You may need more information than just a package and launch file name to attach to / terminate a launched system; in fact, that will definitely not be enough if it is possible to run the same launch file more than once on a network. For example, imagine you've got a lab of developers, all of whom are on the same network and are independently launching/terminating nodes and don't want to mess up anybody else's work.

In fact, not needing to know the package/launch file could be a desirable feature; you might want to be able to launch a system from one machine but stop it from a different one.

Actually, it looks like it didn't make it into the revision of this file that got pushed here, but here's an example of some more verbose output from list -v that could be useful for introspecting running systems:

$ ros2 launcher list -v
ab1e0138-bb22-4ec9-a590-cf377de42d0f: 5 nodes, 2 hosts
    Launch host: 192.168.10.5
    Launch time: Fri Sep 13 15:39:45 CDT 2019
    Launch command: ros2 launcher launch package_foo bar.launch.py argument:=value
4c9dc7d1-e6c4-49cd-bc3d-c8aa2d5a95e0: 5 nodes, 2 hosts
    Launch host: 192.168.10.5
    Launch time: Fri Sep 13 16:39:45 CDT 2019
    Launch command: ros2 launcher launch package_foo bar.launch.py argument:=value
50bda6fb-d451-4d53-8a2b-e8fcdce8170b: 2 nodes, 1 host
    Launch host: 192.168.10.15
    Launch time: Fri Sep 13 12:39:45 CDT 2019
    Launch command: ros2 launcher launch demo_nodes_cpp talker_listener.launch.py
5d186778-1f50-4828-9425-64cc2ed1342c: 16 nodes, 3 hosts
    Launch host: 192.168.10.13
    Launch time: Fri Sep 12 10:39:45 CDT 2019
    Launch command: ros2 launcher launch package_foo bar2.launch.py
$

Member

Why do you need a command to start/stop/restart launched things? Can we not just keep ros2 launch blocking as it is, and you simply ctrl-c the original command when you want to stop it?

If you want to list existing launch services, then we can add that as a ros2 launch --list-launch-services kind of thing. If you absolutely need to asynchronously interact with an already running launch service then the command name that would make more sense to me is ros2 launchctrl...

By the way, the architecture of the launch system is not at all clear at this point in the article: how many processes are there, how are things executed on remote machines, is it a blocking process or a daemon?

Without context on the architecture, this section about the command line tools doesn't make sense and is completely unmotivated. It just comes off as a bit arbitrary. I designed launch originally, and have discussed this feature with people extensively and even I couldn't really guess at why the tools were proposed like this. Don't take this as negative criticism, it's fine we'll iterate on it, but my point is that you need a lot more supporting rationale before you jump into how the command line tools will work.

Contributor

The thought behind this is that since ROS2 has no rosmaster and can operate in a decentralized fashion, it makes sense for a multi-machine launch system to match that paradigm rather than having a central process whose failure can bring everything down.

This is actually a minor annoyance I've dealt with a lot in ROS1. Let's say your robot is a vehicle that has multiple headless computers, you have a laptop that you use for interacting with the vehicle but does not have any critical software running on it, and you want to launch a ROS system across all of them. Your options are:

  • roslaunch everything from your laptop, which creates an unnecessary single point of failure that can bring everything down
  • ssh into a computer on the vehicle, start a terminal manager like screen or tmux, roslaunch everything, then detach and log out; when you want to stop everything, you then have to log in, attach, kill it, then log out again

The first case can unexpectedly cause big problems, such as one time when we had a demo running on a vehicle and a bump in the road caused the laptop lid to close, the laptop to go to sleep, and the whole system to die. The second case is less problematic but adds several repetitive steps. You can abstract away the process by writing services and scripts to automate everything, but to me that feels like something the launch system should be able to do.

In addition, ROS2's decentralized design means it's possible to have machines that lose and regain connectivity to each other and continue functioning. This can happen accidentally as a result of network failures, but it could also be intentional in the case of a swarm of drones that move out of and back into range of each other. If the launch paradigm is to have a central launch process that shuts everything down when it's interrupted, how do you handle shutting down hosts that you've temporarily lost connection to? After they've reconnected to the network, is there even a way to do so?

Designing the launch system in this way addresses those problems; you can:

  • Launch a ROS2 system from a machine that is not a critical part of the system
  • Monitor or stop groups of nodes that were started on a different machine than the current one
  • Monitor or stop nodes that disconnected from the network they were launched from

Contributor

For what it's worth, we have some ideas and plans for the actual launch architecture but are still pretty early in the design process; the desire behind designing the user interface first is to figure out what features we want and how people want to interact with it and then build an architecture that is capable of that rather than creating a system first and then trying to figure out how to control it in an intuitive way.

As mentioned in another comment, the desire behind creating a new command (and I like launchctrl better than launcher) was to preserve compatibility with the existing launch command while also not breaking the current paradigm of commands that perform a single action not also having sub-commands. If you don't have any problem with adding sub-commands to launch or with tweaking launch's arguments so that it is capable of other actions, I also think that's preferable.

@piraka9011 piraka9011 Sep 19, 2019

Thanks for the explanation.
I think all of this is important to outline in the design doc as it provides a lot of necessary context and justification as to why one should go about implementing this new feature.
I agree that leveraging the decentralized paradigm of ROS2 to launch processes on various systems and monitor them from other systems is useful.

To move forward, I'd suggest having very specific, actionable items of what this proposition will and will not achieve, as well as justification as to why previous commands/features don't achieve what you're proposing (as you outlined to wjwwood).

Edit:
Just wanted to say this is hilarious and sorry it happened 😆 (I know the pain...)

The first case can unexpectedly cause big problems, such as one time that we had a demo running on a vehicle and a bump in the road caused the laptop to close, go to sleep, and then the whole system died.

Member

@ruffsl ruffsl Sep 19, 2019

as to why previous commands/features don't achieve what your proposing

I would also like to see some arguments on why existing orchestration tools just sitting on top of ROS2 launch would not be as suitable, and to what extent we should implement distributed host process management into roslaunch itself. The unix side of me isn't sure if ROS itself would be the best place to add this kind of infrastructure management. There's a lot more to distributed process management for robots than roslaunch, such as rotating logs, key exchange/enrolment, port mapping, health monitoring, etc. For example, I don't even think using SSH is still appropriate for managing swarms of robots, given how easily ssh pipes break over spotty wifi connections; even something more robust like mosh isn't well purposed for programmatic process management.

However, it might be neat to use ROS as the transport for other orchestration tools, e.g. where kubernetes or nomad could send client/server exchanges over DDS; though I'm not sure if they'd operate under anything but a reliable transport.


mosh might be a better option because it was made for intermittent connectivity.

For the attach/detach feature, couldn't you just extend the ros2 node ... command to include an attach/detach action (i.e. ros2 node attach name-of-node)?

$ ros2 launcher term 50bda6fb-d451-4d53-8a2b-e8fcdce8170b
Terminating Launch System 50bda6fb-d451-4d53-8a2b-e8fcdce8170b.
$
```


nit: newline

Contributor

Note to resolve this comment: EOF newlines are present unless stated otherwise in github viewer

$
```

#### `term`


These features all seem like extensions that could be added to the current ros2run and ros2launch CLI.
They are also something I would like to see in ROS2 as well :)

Member

I agree it would possibly be a useful feature of ros2 launch (not sure about ros2 run, but maybe...), but again I don't see what it has to do with multi-machine launch. We could add attach/detach/term without adding multi-machine launching actions, and vice versa (unless I miss some core reason for it being included here).

@pjreed
Contributor

pjreed commented Sep 17, 2019

Providing the ability to work on top of an existing orchestration tool could be very useful, but I'm hesitant about requiring it. Adding something like docker or kubernetes is a very significant dependency that I know some people won't want to be required as part of their ROS installation, and they may not be feasible for some low-resource platforms or available for some architectures.

In ROS1 systems, I've done multi-machine launching in docker environments through abuse of the env-loader attribute on machines, but it would be very neat to have a formalized interface for integrating different orchestration systems.

@dirk-thomas dirk-thomas changed the title from "Roslaunch" to "Multi-Machine Launching" on Sep 17, 2019
Member

@wjwwood wjwwood left a comment

Thanks for starting the process on this!

I have to say though I think we need a lot more details on what the system will be capable of, how it might be implemented (ssh vs. a daemon that needs to be running on each machine ahead of time vs. something else), and how users will specify what is local and what is remote, and any limitations on what can be remote.

Rather than, what I take to be, unrelated features surrounding detachable launch services.

Also, please update this section as part of this pull request, to at least point to this new article:

https://github.com/ros2/design/blob/gh-pages/articles/150_roslaunch.md#remote-operating-system-processes


Allow a system of ROS nodes to be launched on a hardware architecture that is
spread across multiple networked computers and facilitate introspection and
management of the system from a single machine.

- Connecting to a remote host and running nodes on it
- Pushing configuration parameters for nodes to remote hosts
- Monitoring the status and tracking the lifecycle of nodes
- Recovering from failures by optionally restarting nodes
Member

These two (tracking lifecycle and recovering from failures) are not special for distributed launching, imo. They sometimes have to be done in a single launch file as well.

- Load balancing nodes on distributed networks
- Command line tools for managing and monitoring systems across machines
- Mechanisms for locating files and executables across machines
- Sharing and synchronizing files across machines
Member

I'm curious about the motivation of these features here.

I think things should only be included in the design document in one of two cases:

  • it's a feature that has a concrete/existing use case (e.g. feature we had in ROS 1 and want to emulate)
  • it's a feature we might want in the future, but needs to be considered now so that it's possible to add it later
    • i.e. trying to avoid designing ourselves into a box where adding the new feature would require major redesigning of the system

It's not clear to me that each of those meets that standard, but if they do then I think they need to be separately motivated.


I see what you mean. Some of these, such as "Command line tools for managing and monitoring systems across machines", are probably better suited to a community package than one provided by ros2.

Additionally, what is meant by "load balancing" in this case?

Collaborator

In my opinion, I believe

  • Mechanisms for locating files and executables across machines
  • Sharing and synchronizing files across machines

these are totally off topic from the launch system. Why is this integrated into the launcher? Could we think about these features more generically if necessary?

Contributor

I don't think I would say they're "totally off topic"; synchronizing a built workspace between machines is something I've had to deal with constantly when launching on multi-machine systems, and I've spent a considerable amount of time writing scripts to handle efficiently deploying large workspaces (I've got one that's 5.3 GB right now) across multiple ROS hosts.

But it is true that the concept of synchronizing files isn't tightly coupled to launching; there are people who will be interested in multi-machine launching but don't need to sync anything, and there are probably others who will want to sync data but aren't launching anything, so it might make more sense to break that out into its own system.
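As an illustration of the kind of deployment script described above, here is a minimal sketch. It assumes rsync and passwordless SSH access to each remote host; the workspace path and host names are made up for the example and are not part of the proposal:

```python
# Sketch of a workspace-sync helper like the scripts described above.
# Assumes rsync and passwordless SSH to each host; paths/hosts are hypothetical.
import subprocess

INSTALL_SPACE = '/home/robot/ros2_ws/install/'   # hypothetical built workspace
HOSTS = ['nav_pc.local', 'sensor_pc.local']      # hypothetical remote hosts


def sync_workspace(host: str) -> None:
    """Mirror the local install space onto one remote host."""
    subprocess.run(
        ['rsync', '-az', '--delete', INSTALL_SPACE, f'{host}:{INSTALL_SPACE}'],
        check=True,
    )


if __name__ == '__main__':
    for host in HOSTS:
        print(f'Syncing {INSTALL_SPACE} to {host}...')
        sync_workspace(host)
```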


Well, I think we need to consider what constitutes things for multi-machine launching vs. tools for multi-machine ros2 setups. Our best bet might be to consolidate the features that would help multi-machine launching in the launch command within this guide and then think about maybe a suite of packages containing features helpful to multi-machine setups.

Member

In ROS 1, files were looked up via something like "find file X in package Y", and in multi-machine launching it was required that packages and files existed on both machines (maybe not in the same place, but they had to be discoverable).

That seemed to work ok, though obviously there are situations, especially in development, where this isn't ideal. However, I do think we'd do well to keep the tools "small and sharp" and deal with this problem outside of launch. If there appears to be a really good default way to solve this problem, then we can consider adding tool support.

However, I'd encourage you guys to get it working with the assumption that all the packages exist on the local and remote machines (or in the "remote" containers), and then we can work out mechanisms for synchronization or container setup as needed.
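To illustrate the "find file X in package Y" style lookup this assumption relies on, here is a small sketch of resolving a package resource against the local ament index. The lookup only succeeds if the package is installed and sourced on the machine doing the resolving, which is why each host would need the packages present; the package and file names below are only examples:

```python
# Sketch of a local "find file X in package Y" lookup via the ament index.
# The package name and relative path below are examples, not part of the proposal.
import os

from ament_index_python.packages import get_package_share_directory


def resolve_shared_file(package_name: str, relative_path: str) -> str:
    """Resolve a file from a package's share directory on *this* machine."""
    # Raises PackageNotFoundError if the package is not installed locally,
    # which is exactly why remote hosts need the packages installed too.
    share_dir = get_package_share_directory(package_name)
    path = os.path.join(share_dir, relative_path)
    if not os.path.exists(path):
        raise FileNotFoundError(f'{relative_path} not found in {package_name}')
    return path


if __name__ == '__main__':
    # Hypothetical package/file; substitute something installed on your machine.
    print(resolve_shared_file('my_robot_bringup', 'config/params.yaml'))
```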




## Proposed Multi-Machine Launch Command Line Interface

The multi-machine launching interface is controlled through the `launcher`
Member

For example, why do I need to attach/detach? Just because docker has it? Do we really need that feature?

#### `launch`

The `ros2 launcher launch` is equivalent to `ros2 launch`, which is preserved
for backwards compatibility and ease of use. It is used to run a launch file.
Member

It's been stated above, but just to drive it home here where it's clearer, why not just ros2 launch?

What is the technical justification for needing a new sub-verb inside a new verb if they behave the same?

across a network.

Additionally, it is possible to detach from a system and let it run in the
background:
Member

Why is this just a feature of the ros2 launcher launch version of this? Why is it needed at all and why can it not be done for ros2 launch as well? Put another way, what is it about multi machine launch that requires this ability?


@piraka9011

Providing the ability to work on top of an existing orchestration tool could be very useful, but I'm hesitant about requiring it.
...

+1, I'm totally on board with this, just wanted to bring up the fact that there may be similar implementations that we can leverage so we don't reinvent the wheel.

As before, I suggest writing this out in the design doc as a background/justification as it explains why such a tool is needed.

processing hardware
- A robot with a cluster of machines that do distributed processing of data
- A network of multiple virtual hosts for testing purposes
- A swarm of independent drones that can cooperate but do not require

Suggested change
- A swarm of independent drones that can cooperate but do not require
- A swarm of cooperative, independent agents that do not require intercommunication


Why limit to a swarm and not a team of more than 1 agent?

How is agent intercommunication relevant to multi-machine launching? And why limit to systems without it?

Why not just: Cooperative Multi-Agent Systems?


It's just a specific use case example. I don't think it was meant in an all-encompassing manner for multi-machine launching.

- A robot with a cluster of machines that do distributed processing of data
- A network of multiple virtual hosts for testing purposes
- A swarm of independent drones that can cooperate but do not require
communication with each other

I suggest removing this line

Suggested change
communication with each other


- Connecting to a remote host and running nodes on it
- Pushing configuration parameters for nodes to remote hosts
- Monitoring the status and tracking the lifecycle of nodes
Collaborator

what exactly is the difference from the lifecycle command?

-h, --help Show this help message and exit.
-d, --debug Put the launch system in debug mode, provides more
verbose output.
-D, --detach Detach from the launch process after it has started.
Collaborator

We would use this if the launch process kills itself after the launch, just leaving everything else running. For embedded platforms, we might not need a launch process, but we are likely to use a launch description to init the system.

@mlanting
Author

mlanting commented Oct 4, 2019

I tried pushing an update a couple days ago to my fork, but it doesn't seem to be coming through. Possibly because the roslaunch branch was merged into gh-pages and then removed a few days after I originally submitted this PR?

@ivanpauno
Member

I tried pushing an update a couple days ago to my fork, but it doesn't seem to be coming through. Possibly because the roslaunch branch was merged into gh-pages and then removed a few days after I originally submitted this PR?

I'm not sure, but that's probably the case.
Consider rebasing with master and re-targeting the PR to it.

mlanting and others added 12 commits October 4, 2019 14:53
- Added a 'Context' section to describe the multi-machine launch
  capabilities of ROS1 and point to remote launch section of the main
  launch design document.
- Changed linebreaks to one sentence per line.
- Added a 'Proposed Approach' section where we can start to describe our
  design(s) from a technical perspective.
- Added a 'Goals' section. Not much useful content here yet though.

Signed-off-by: matthew.lanting <[email protected]>
- Also did a bit more reformatting/reworking of some of the main
  sections based on some of the feedback we've gotten so far

Signed-off-by: matthew.lanting <[email protected]>
Distro A, OPSEC

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC

Signed-off-by: Jacob Hassold <[email protected]>
Distribution Statement A; OPSEC #2893

Signed-off-by: P. J. Reed <[email protected]>
Distribution Statement A; OPSEC #2893

Signed-off-by: matthew.lanting <[email protected]>
Distribution Statement A; OPSEC #2893

Signed-off-by: P. J. Reed <[email protected]>
Distribution Statement A; OPSEC #2893

Signed-off-by: matthew.lanting <[email protected]>
@mlanting mlanting changed the base branch from roslaunch to gh-pages October 4, 2019 19:08
@wjwwood wjwwood added the "question (Further information is requested)" and "enhancement (New feature or request)" labels and removed the "question (Further information is requested)" label on Oct 24, 2019
@ros-discourse

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-tsc-meeting-minutes-2020-07-16/15468/1

@mlanting
Author

We've created a new version of the design document with some much more concrete ideas to discuss, but since the document has changed entirely I figured it'd be more appropriate to create a new PR: #297

@mlanting mlanting closed this Aug 20, 2020