
v1.10 and master: mpirun without enough slots returns $status==0 #1344

Closed
jsquyres opened this issue Feb 5, 2016 · 11 comments

@jsquyres
Member

jsquyres commented Feb 5, 2016

With the HEAD of v1.10 and master (probably v2.x, too, but I didn't check it): if mpirun complains about a lack of slots, it still returns an exit status of 0. Note that the test below was run on a 16-core machine with the ssh launcher, not under a job scheduler (i.e., not in a SLURM job):

$ mpirun --version
mpirun (Open MPI) 1.10.3a1

Report bugs to http://www.open-mpi.org/community/help/
$ mpirun --host mpi001 -np 4 uptime
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots 
that were requested by the application:
  uptime

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
$ echo $status
0
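
For comparison, the same check in a Bourne-style shell (a sketch; the result should be identical, since the problem is mpirun's exit code, not the shell):

$ mpirun --host mpi001 -np 4 uptime
[... same "not enough slots" error ...]
$ echo $?
0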

A secondary question: did we decide that this is the behavior we wanted? The mpi001 machine has 16 cores. I remember we talked about this, but I don't remember what we decided...

Thanks to @rfaucett for identifying the problem.

@jsquyres jsquyres added the bug label Feb 5, 2016
@jsquyres jsquyres added this to the v1.10.3 milestone Feb 5, 2016
@ggouaillardet
Contributor

I discussed the secondary question with @rhc54 in another ticket, and this is the behavior we want.
The behavior changed in the middle of a stable series, but that is considered a bug fix.

Note that on master you can do
mpirun --host mpi001:16 ...
but not on v1.10.

@rhc54
Contributor

rhc54 commented Feb 6, 2016

Let me just clarify a bit. I personally dislike this behavior as I find it annoying. However, the issue that was originally raised was the following:

$ mpirun -host foo hostname

would automatically run a number of procs equal to the number of cores on host foo, and not just a single process. This is due to our decision to treat the absence of a specified number of procs as a directive to run one proc per "slot" in the allocation, where the number of slots defaults to the number of cores if it isn't specified in a hostfile.

However, when you do specify the number of procs, then we still view this command line as having only one slot, which is what triggers the error. I don't believe that should be the behavior, but it is consistent with what we defined.

I propose that we change this behavior. If someone doesn't specify the number of procs, then we treat it as a single slot. However, if they do specify the number of procs, then we fall back to setting the number of slots equal to the number of processing elements (cores or threads, depending on what they requested).
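
A sketch of what the proposal would mean on a hypothetical 16-core host foo (not current behavior):

$ mpirun --host foo hostname          # no -np: treated as one slot, launches 1 proc
$ mpirun --host foo -np 4 hostname    # -np given: slots default to the 16 processing elements, launches 4 procs
$ mpirun --host foo -np 32 hostname   # -np exceeds the 16 slots: presumably hits the oversubscription error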

Would that make sense?

@ggouaillardet
Contributor

That sounds much better to me.

Can you clarify what will happen if we run
mpirun --host foo spawn

It will start one MPI task (so far, so good), but what happens when the MPI task invokes MPI_Comm_spawn?
Will it succeed because there are, let's say, 16 cores on foo?
Will it fail because foo is seen as having only one slot?

some random thoughts ...

With the previous behavior, it was possible to run
mpirun --host foo -np 1 a.out
or
mpirun --host foo --npernode 1 a.out

Another way to see things:
mpirun --hostfile machinefile a.out
in which machinefile contains only foo, should start 16 MPI tasks.
A possible refinement is to run 16 tasks unless the hostname appears more than once in the hostfile.

@rhc54
Contributor

rhc54 commented Feb 6, 2016

In today's code, your example would error out due to a lack of slots, which likely isn't what the user expected.

Hostfile parsing is a different subject. In that case, the behavior is more as you would expect. If you list a hostname once and don't explicitly set the number of slots, then we default to the number of discovered cores. If you list a hostname more than once, then we take that as equivalent to having included a "slots=N" statement and set the number of slots equal to the number of times you list the hostname.
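
To make those hostfile rules concrete, here is a sketch with hypothetical hosts foo, bar, and baz:

$ cat machinefile
foo slots=4     # explicit slot count: 4 slots
bar             # listed once, no count: slots default to the discovered cores
baz
baz             # listed twice, no count: treated as slots=2
$ mpirun --hostfile machinefile a.out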

It is only the -host option where things differ, and I'm not convinced it is doing what people would expect.

@jsquyres
Member Author

jsquyres commented Feb 6, 2016

Just to be clear: this issue is specifically about mpirun returning a status of 0, even when there's an error. That clearly seems to be incorrect behavior. I'm guessing it should be easy to fix; I just don't know where to look in the code.


The secondary question, I agree, is also quite important. @rhc54 let me see if I understand current behavior:

  1. In a SLURM job (and other RMs):
    • Hosts and slot counts are explicitly specified by SLURM
  2. In a hostfile (only) specified job (i.e., we're not running in an RM job, and no --host options were specified), hosts are explicitly specified in the hostfile:
    • If a host is listed once:
      • If a slot count is specified for that host, it is used
      • If a slot count is not specified for that host, the hwloc-discovered num cores (or HTs, depending on the user's setting) is used as the slot count
    • If a host is listed more than once:
      • If a slot count is specified for that host, it is added to the total number of slots for that host
      • If a slot count is not specified for that host, 1 is added to the total number of slots for that host
  3. In a --host (only) specified job (i.e., we're not running in an RM job, and no hostfile was specified), hosts are explicitly listed via the --host option:
    • If a host is listed once:
      • (master/v2.x only) If a slot count is specified for that host, it is used
      • If a slot count is not specified for that host, the slot count is set to 1 (THIS IS THE BEHAVIOR WE'RE DISCUSSING)
    • If a host is listed more than once:
      • (master/v2.x only) If a slot count is specified for that host, it is added to the total number of slots for that host
      • If a slot count is not specified for that host, 1 is added to the total number of slots for that host
  4. In all cases, mpirun ... foo (with no -np value) essentially invokes:
for i = 0 .. num_hosts
  for j = 0 .. num_slots(host[i])
    launch foo on host[i]

Is that correct?

@rhc54
Contributor

rhc54 commented Feb 6, 2016

I agree about this issue being more limited, but I'd rather not dive into that code multiple times if it can be avoided. So let's resolve the desired behavior.

Your detailing of the current behavior is correct.

@jsquyres
Member Author

jsquyres commented Feb 7, 2016

@rhc54 Agreed that we might as well dive into this code once; I just wanted to make sure that the original issue that was reported wasn't lost in the discussion.

If the description at #1344 (comment) is correct, then it seems like a no-brainer to make the --hostfile and --host behavior the same in terms of what happens when no slot count is specified: use the hwloc-discovered num cores (or HTs) as the slot count.

Agreed?

@ggouaillardet
Contributor

IMHO, --hostfile and --host should do exactly the same thing
(i.e., --host ... is the command-line form and --hostfile is the file form, and these two options could/should be mutually exclusive).

There are currently two differences between --host and --hostfile on a machine with n slots:

mpirun --host localhost a.out
runs one MPI task, but (with a hostfile named localhost that contains localhost)
mpirun --hostfile localhost a.out
runs n MPI tasks.

mpirun --hostfile localhost -np n a.out
runs n MPI tasks, but
mpirun --host localhost -np n a.out
fails because of oversubscription.

It seems everyone agrees that failing because of oversubscription is annoying and that it is OK to change that.
(IMHO, I would call this a bug and set the milestone to v1.10.3.)

I can only guess there is a rationale for starting only one MPI task when the --host option is used, and I did not find it. I am fine with changing this too, though we might start with v2.0.0 (it is harder to call this a bug...).

FWIW,
mpirun --host localhost --hostfile localhost a.out
runs n MPI tasks
(and I cannot tell whether this is a bug, a feature, or the result of undefined behavior).

@jsquyres
Member Author

jsquyres commented Feb 7, 2016

I think I agree with everything @ggouaillardet is saying. To summarize:

  1. We all agree that an mpirun -np X ... that results in oversubscription and causes mpirun to abort should not exit with status 0.
    • Fix for v1.10.3
  2. We all agree that --host should behave like --hostfile in terms of counting slots (i.e., the definition of oversubscription).
    • Fix for v1.10.3
  3. There is a difference between --host and --hostfile behavior if -np is not specified: --host launches 1, --hostfile launches num_cores (or HTs).
    • It would be good to make these the same (we can argue which way it should go).
    • Fix for v2.0.0.

I believe the rationale for making --hostfile launch num_cores (or HTs) is that this is what will happen in the RM case (i.e., if you launch mpirun --host foo a.out in a SLURM job, it'll run as many processes as SLURM says there are slots on host foo). Hence, this made --hostfile behavior consistent with the RM case. Perhaps we should make --host behavior consistent with it, as well.

I do recall that there was some pushback on making --host launch more than 1ppn by default, however. I think the reason for the pushback was that it was unexpected behavior for users -- i.e., OMPI trained people for years to expect that --host would launch 1ppn. So perhaps v2.0.0 is a good time to change that behavior and make --hostfile, --host, and the RM case all the same...? Just my $0.02.

@rhc54
Contributor

rhc54 commented Feb 7, 2016

😩 The behavior you describe is what we -did- have prior to deciding we needed to change it to the current behavior. Why don't we at least raise this at the telecon on Tuesday, and then let's agree to quit going back and forth on this topic, as it is getting rather confusing to users and frankly frustrating to me.

@rhc54
Contributor

rhc54 commented Feb 11, 2016

@jsquyres @ggouaillardet Please see #1353 and tell me if this is okay

hjelmn pushed a commit to hjelmn/ompi that referenced this issue Sep 13, 2016
bosilca pushed a commit to bosilca/ompi that referenced this issue Oct 3, 2016
…ly run one instance if no slots were provided and the user didn't specify #procs to run. However, if no slots are given and the user does specify #procs, then let the number of slots default to the #found processing elements

Ensure the returned exit status is non-zero if we fail to map

If no -np is given, but either -host and/or -hostfile was given, then error out with a message telling the user that this combination is not supported.

If -np is given, and -host is given with only one instance of each host, then default the #slots to the detected #pe's and enforce oversubscription rules.

If -np is given, and -host is given with more than one instance of a given host, then set the #slots for that host to the number of times it was given and enforce oversubscription rules. Alternatively, the #slots can be specified via "-host foo:N". I therefore believe that row #7 on Jeff's spreadsheet is incorrect.

With that one correction, this now passes all the given use-cases on that spreadsheet.

Make things behave under unmanaged allocations more like their managed cousins - if the #slots is given, then no-np shall fill things up.

Fixes open-mpi#1344
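
Reading the commit message above, the post-fix behavior would look roughly like this (a sketch, assuming a hypothetical 16-core host foo; the "foo:N" and "-np omitted" rules are taken directly from the description):

$ mpirun --host foo -np 4 a.out        # foo listed once: slots default to the detected PEs, runs 4 procs
$ mpirun --host foo,foo -np 4 a.out    # foo listed twice: 2 slots, oversubscription rules apply
$ mpirun --host foo:8 -np 4 a.out      # explicit slot count via "-host foo:N"
$ mpirun --host foo a.out              # -np omitted with -host: errors out per the new rule
$ echo $?                              # non-zero whenever mapping fails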