Move --host ordering fix to v3.0.x, v3.1.x, v4.0.x? #6501

Closed

jsquyres opened this issue Mar 19, 2019 · 12 comments

@jsquyres (Member) commented Mar 19, 2019

Per #6298, we had an accidental change in the behavior of mpirun --host aaa,bbb between v2.1.x and v3.0.x. A fix just went into master in #6493.

The question is: should we put this fix on any of v3.0.x, v3.1.x, and/or v4.0.x?

Here's what happened:

Summary of behavior change

Behavior X

The ordering of hosts in the --host list matters:

$ mpirun --host aaa,bbb rank_test
aaa: MCW rank 0
bbb: MCW rank 1
$ mpirun --host bbb,aaa rank_test
aaa: MCW rank 1
bbb: MCW rank 0

Behavior Y

The ordering of hosts in the --host list does not matter (note: this behavior was unintentional; it was always intended that we honor the ordering of hosts in the --host list):

$ mpirun --host aaa,bbb rank_test
aaa: MCW rank 0
bbb: MCW rank 1
$ mpirun --host bbb,aaa rank_test
aaa: MCW rank 0
bbb: MCW rank 1
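
For reference, a quick way to check which behavior a given installation exhibits, without compiling a test program, is to echo the hostname together with the OMPI_COMM_WORLD_RANK environment variable that mpirun sets for each launched process (a sketch, reusing the hostnames from the examples above):

$ mpirun -np 2 --host bbb,aaa sh -c 'echo "$(hostname): MCW rank $OMPI_COMM_WORLD_RANK"'

Behavior X prints "bbb: MCW rank 0" first; Behavior Y prints "aaa: MCW rank 0" first.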

Discussion points

We need to discuss this and decide what to do. Points (in no particular order):

  1. This is a fairly minor change in behavior.
  2. Apparently no one noticed this change in behavior between v2.1.x and v3.0.x. It was only discovered recently by @bturrubiates, a Cisco employee (while using Open MPI for other / unrelated testing).
  3. The fix is probably not worth putting into v3.0.x or v3.1.x.
  4. But it might be worthwhile to put it into v4.0.x...?
  5. That being said, even putting it into v4.0.x is at least somewhat breaking backwards compatibility. You could squint at this and call it a bug and therefore allow it in. Or you could say that it was effectively the behavior of all of the v3.x/v4.x releases, which are backwards compatible with each other, so we should maintain that behavior in v4.0.x.
@jjhursey (Member)

I think we noticed this in PR #4327, but never got the cycles to get back to it. I'd like to verify that this change also fixes that issue (I think it should).

@jsquyres (Member, Author)

Two good points were made on the webex:

  1. We should be consistent across all of our releases; i.e., we should acknowledge that the v3.0.x, v3.1.x, and v4.0.0 behavior was a bug and was not intended (the v2.1.x behavior is the intended behavior).
    • Meaning: we should port these fixes back to v3.0.x, v3.1.x, and v4.0.x.
    • Meaning: the last releases of all of these series will make Open MPI's --host (and --hostfile) ordering semantics consistent across all of these release series.
  2. Users who depend on the default ordering given by v3.0.0, v3.1.0, and v4.0.0 can still get that same ordering after the bug fixes are applied. The ordering they want may not be the default, but the desired order will definitely be obtainable.

With these arguments, it seems to make sense to back-port the commits from #6493 to v3.0.x, v3.1.x, and v4.0.x.
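
To illustrate point 2: with the fix applied, the placement that the v3.0.0/v3.1.0/v4.0.0 default produced can still be requested explicitly, simply by listing the hosts in that order (a sketch, reusing the hostnames from the summary above):

$ mpirun --host aaa,bbb rank_test
aaa: MCW rank 0
bbb: MCW rank 1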

@jsquyres (Member, Author)

I attempted to make PRs for v3.0.x, v3.1.x, and v4.0.x. Unfortunately, master's nidmap.c has diverged quite a bit from each of these 3 branches. @rhc54 graciously said he would work on the ports for these branches.

@jsquyres (Member, Author)

#6508 is the PR for v4.0.x. It's becoming a bit of a bear, both in terms of size and complexity. We're working the issue, with the intent that it'll catch whatever v4.0.x train it can. After that, we can hopefully apply a similar back-port to v3.1.x and v3.0.x, and trigger new releases there, too (i.e., with the goal that we can close out the v3.x.y series with this fix).

@jsquyres (Member, Author)

Removed the "Target v2.x" label -- the fix is not needed for the v2.x series.

@marmistrz

I came across this issue today in a case where I need mpirun to place the final rank on a specific node. On 3.1.x it currently seems impossible to achieve this in any way other than writing out the whole rankfile; am I right about that?
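
For reference, the rankfile approach I mean looks roughly like this sketch (the hostnames, slot numbers, and application name are placeholders; the point is that the final rank is pinned to a specific node):

$ cat myrankfile
rank 0=node01 slot=0
rank 1=node02 slot=0
rank 2=node03 slot=0
$ mpirun -np 3 --rankfile myrankfile ./my_app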

Please also note that it's not that easy to build OpenMPI >= 4.0 on some recent-ish distributions, such as Ubuntu 18.04, because 4.0 requires a newer version of hwloc than the distribution provides. It's achievable, but upgrading the system libraries is definitely a considerable effort.

@jsquyres (Member, Author)

@marmistrz I'm curious about your statement: why can't you build Open MPI >v4.0 on some recent Linux distributions? Open MPI comes with its own embedded hwloc that satisfies Open MPI's requirements -- meaning that even if the distro's hwloc is old, the embedded hwloc should be sufficient. Is that not working properly?

@marmistrz

@jsquyres when I built OpenMPI with --with-hwloc, the configure script complained that it could not build the external hwloc. Using --with-hwloc=internal did the trick, but it was counterintuitive.

@jsquyres (Member, Author)

FWIW: you can simply not specify --with-hwloc at all; Open MPI should first try to use an external hwloc and, if that fails, fall back to the internal hwloc.

@marmistrz

Anyway, this is probably a bug in the m4 scripts, since the documentation explicitly says that --with-hwloc should use the internal hwloc:

"internal" (or no DIR value) forces Open MPI to use its internal copy of hwloc.

@jsquyres (Member, Author)

Ah, that might be a bug in the docs. FWIW: the use of hwloc has evolved over time in Open MPI:

  1. hwloc used to be optional. It is now required.
  2. We used to default to using the internal hwloc, and only use an external hwloc if requested. It's now what I said before: by default (i.e., if --with-hwloc is not specified), we try external first, and if that fails, fall back to internal.
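
A sketch of the configure variants discussed here (the install prefix and hwloc path are placeholders, and as noted above the exact --with-hwloc semantics have varied across Open MPI versions):

# Default (no --with-hwloc): newer Open MPI tries an external hwloc first
# and falls back to the internal copy if that fails.
$ ./configure --prefix=$HOME/openmpi

# Force the internal copy (the workaround reported above):
$ ./configure --prefix=$HOME/openmpi --with-hwloc=internal

# Build against an external hwloc installed under a specific prefix:
$ ./configure --prefix=$HOME/openmpi --with-hwloc=/opt/hwloc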

@hppritcha added the RTE label ("Issue likely is in RTE or PMIx areas") on Mar 27, 2021
@hppritcha (Member)

Fixed in the 5.0.x release stream. No plan to back-port the fix to 4.1.x and older.
