Hangs on failures on master #1379

Closed
jsquyres opened this issue Feb 18, 2016 · 6 comments

@jsquyres (Member)

In the Cisco MTT cluster, we're seeing a large number of hangs in tests that are supposed to fail (e.g., they call MPI_ABORT). Specifically, the test MPI processes do not die, even after their HNP and local orted are gone. The MPI processes keep spinning and consuming CPU cycles.

I'm seeing this across a variety of configure command-line options, i.e., it doesn't seem to be specific to a single problematic configure option.

It looks like the hangs are of two flavors:

  1. An MPI process is stuck in an MPI collective that never completes
  2. An MPI process is stuck in a PMIx collective

The Intel test MPI_Abort_c is an example of case 1. In this test, MPI_COMM_WORLD rank 0 calls MPI_ABORT, and everyone else calls MPI_ALLREDUCE.

It looks like the MCW rank 0 process is gone/dead, and all the others are stuck in the MPI_ALLREDUCE. The HNP and local orted are gone, too. I.e., the RTE thread in the MPI processes somehow didn't kill these processes, either when they got the abort signal or when the HNP / local orted went away.

I see the same pattern in the IBM test environment/abort: MCW 0 calls abort, and everyone else calls sleep. In this case, MCW 0, the HNP, and the local orted are all gone, but all the other processes are stuck looping in sleep().
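
For illustration, here's a minimal sketch of the case-1 pattern; this is a hypothetical reconstruction, not the actual Intel or IBM test source:

/* Hypothetical reconstruction of the case-1 hang: rank 0 aborts,
   everyone else blocks in a collective. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, in = 1, out = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        /* The runtime is supposed to kill all other processes when this fires. */
        MPI_Abort(MPI_COMM_WORLD, 1);
    } else {
        /* Observed hang: these ranks never leave the collective (or, in the
           IBM environment/abort variant, keep looping in sleep()), even
           after rank 0, the HNP, and the local orted are gone. */
        MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}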


The Intel test MPI_Errhandler_fatal_f is an example of case 2. In this test, processes don't seem to get past MPI_INIT:

#0  0x0000003ca8caccdd in nanosleep () from /lib64/libc.so.6  
#1  0x0000003ca8ce1e54 in usleep () from /lib64/libc.so.6 
#2  0x00002aaaac3ec99e in OPAL_PMIX_PMIX112_PMIx_Fence ()
   from /home/mpiteam/scratches/community/2016-02-17cron/owk4/installs/QD8g/install/lib/libopen-pal.so.0
#3  0x00002aaaac3cccee in pmix1_fence ()
   from /home/mpiteam/scratches/community/2016-02-17cron/owk4/installs/QD8g/install/lib/libopen-pal.so.0 
#4  0x00002aaaab4f1ab6 in ompi_mpi_init ()
   from /home/mpiteam/scratches/community/2016-02-17cron/owk4/installs/QD8g/install/lib/libmpi.so.0 
#5  0x00002aaaab527167 in PMPI_Init ()
   from /home/mpiteam/scratches/community/2016-02-17cron/owk4/installs/QD8g/install/lib/libmpi.so.0  
#6  0x00002aaaab25b602 in pmpi_init__ ()
   from /home/mpiteam/scratches/community/2016-02-17cron/owk4/installs/QD8g/install/lib/libmpi_mpifh.so.0 
#7  0x0000000000401744 in MAIN__ ()

I see a bunch of tests like this (hung in MPI_INIT) -- not just Fortran tests, and not just tests that are supposed to fail. In these cases, it looks like the server gets overloaded with CPU load and everything slows down, and then even tests that are supposed to pass start getting stuck in the PMIx fence in MPI_INIT.


I've also seen similar stack traces where PMIx is stuck on a fence, but in MPI_FINALIZE. E.g., in the t_winerror test:

(gdb) bt
#0  0x0000003ca8caccdd in nanosleep () from /lib64/libc.so.6
#1  0x0000003ca8ce1e54 in usleep () from /lib64/libc.so.6
#2  0x00002aaaab60988e in OPAL_PMIX_PMIX112_PMIx_Fence ()
   from /home/mpiteam/scratches/usnic/2016-02-17cron/6reD/installs/4xnK/install/lib/libopen-pal.so.0
#3  0x00002aaaab5e9bde in pmix1_fence ()
   from /home/mpiteam/scratches/usnic/2016-02-17cron/6reD/installs/4xnK/install/lib/libopen-pal.so.0
#4  0x00002aaaaab306c5 in ompi_mpi_finalize ()
   from /home/mpiteam/scratches/usnic/2016-02-17cron/6reD/installs/4xnK/install/lib/libmpi.so.0
#5  0x00002aaaaab5a1c1 in PMPI_Finalize ()
   from /home/mpiteam/scratches/usnic/2016-02-17cron/6reD/installs/4xnK/install/lib/libmpi.so.0
#6  0x0000000000401cc4 in main ()
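
(For reference, backtraces like the ones above can be grabbed by attaching gdb to one of the spinning processes and running bt; the PID below is made up:)

$ gdb -p 12345
(gdb) bt
(gdb) detach
(gdb) quit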
@jsquyres jsquyres added the bug label Feb 18, 2016
@jsquyres jsquyres added this to the v2.0.0 milestone Feb 18, 2016
@jsquyres (Member, Author)

@rhc54 and I chatted about this on the phone. He's looking into it.

@rhc54 (Contributor) commented Feb 18, 2016

@jsquyres I think I have this fixed -- or at least the problem associated with MPI_Abort. I'm not sure if/why it would show up in 2.x, as it was due to a change in the IOF a couple of days ago.

Let me know if you see any continuing problems, and if there is something going on in the 2.x branch.

@jsquyres (Member, Author)

I think my MTT problems were all caused by this master issue, but these failures increased the load on my servers, thereby causing a cascade of other failures. So it's kinda hard to tell whether the real root cause was originally on master. We'll let it percolate through MTT over the next several days and see what happens.

@rhc54 (Contributor) commented Feb 19, 2016

I found another error that was causing problems in certain cases and fixed it as well. It should have made it into tonight's tarball, so hopefully we'll see the impact soon.

@rhc54 (Contributor) commented Feb 19, 2016

My MTT tonight is looking very clean, so hopefully this has resolved the problem.

@rhc54 (Contributor) commented Feb 19, 2016

+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+
| Phase       | Section         | MPI Version | Duration | Pass | Fail | Time out | Skip | Detailed report                                                          |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+
| MPI Install | my installation | 3.0.0a1     | 00:00    | 1    |      |          |      | MPI_Install-my_installation-my_installation-3.0.0a1-my_installation.html |
| Test Build  | trivial         | 3.0.0a1     | 00:00    | 1    |      |          |      | Test_Build-trivial-my_installation-3.0.0a1-my_installation.html          |
| Test Build  | ibm             | 3.0.0a1     | 00:57    | 1    |      |          |      | Test_Build-ibm-my_installation-3.0.0a1-my_installation.html              |
| Test Build  | intel           | 3.0.0a1     | 00:27    | 1    |      |          |      | Test_Build-intel-my_installation-3.0.0a1-my_installation.html            |
| Test Build  | onesided        | 3.0.0a1     | 00:04    | 1    |      |          |      | Test_Build-onesided-my_installation-3.0.0a1-my_installation.html         |
| Test Build  | java            | 3.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-java-my_installation-3.0.0a1-my_installation.html             |
| Test Build  | orte            | 3.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-orte-my_installation-3.0.0a1-my_installation.html             |
| Test Run    | trivial         | 3.0.0a1     | 00:02    | 2    |      |          |      | Test_Run-trivial-my_installation-3.0.0a1-my_installation.html            |
| Test Run    | ibm             | 3.0.0a1     | 07:04    | 371  |      |          | 3    | Test_Run-ibm-my_installation-3.0.0a1-my_installation.html                |
| Test Run    | spawn           | 3.0.0a1     | 00:04    | 7    |      |          |      | Test_Run-spawn-my_installation-3.0.0a1-my_installation.html              |
| Test Run    | loopspawn       | 3.0.0a1     | 03:41    | 1    |      |          |      | Test_Run-loopspawn-my_installation-3.0.0a1-my_installation.html          |
| Test Run    | intel           | 3.0.0a1     | 16:27    | 242  |      |          | 2    | Test_Run-intel-my_installation-3.0.0a1-my_installation.html              |
| Test Run    | intel_skip      | 3.0.0a1     | 07:40    | 222  |      |          | 22   | Test_Run-intel_skip-my_installation-3.0.0a1-my_installation.html         |
| Test Run    | onesided        | 3.0.0a1     | 00:18    | 32   |      |          |      | Test_Run-onesided-my_installation-3.0.0a1-my_installation.html           |
| Test Run    | java            | 3.0.0a1     | 00:02    | 1    |      |          |      | Test_Run-java-my_installation-3.0.0a1-my_installation.html               |
| Test Run    | orte            | 3.0.0a1     | 00:34    | 19   |      |          |      | Test_Run-orte-my_installation-3.0.0a1-my_installation.html               |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+

@rhc54 rhc54 removed this from the v2.0.0 milestone Feb 24, 2016
@rhc54 rhc54 closed this as completed Feb 24, 2016
jsquyres added a commit to jsquyres/ompi that referenced this issue Sep 19, 2016