CI/Jenkins failures: timeouts #4686

Closed · jsquyres opened this issue Jan 9, 2018 · 32 comments
@jsquyres (Member) commented Jan 9, 2018

A bunch of CI tests have been failing recently, particularly in the Mellanox Jenkins.

Per discussion on the 2018-01-09 teleconf, we were reminded that most of the segvs that we see in the Mellanox Jenkins (e.g., #4683 Mellanox Jenkins master build http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/7030/) are actually timeouts.

Meaning: it looks like some threaded and/or vader-based tests are timing out, but not 100% of the time. @artpol84 confirmed that he can reproduce if he runs the tests multiple times (i.e., sometimes the test passes, sometimes it fails). Given that the failures have typically involved vader and/or multi-threaded tests, @rhc54 points to 8b8aae3, which was a recent ASM commit.

Investigation is required. @jsquyres volunteered to try to reproduce as well.

@jsquyres (Member, Author) commented Jan 9, 2018

I am unable to reproduce this issue manually with the bw_th test. I have tried both debug and optimized builds at the head. I also tried a few different hardware platforms, with both the TCP and usNIC BTLs, and with a variety of CLI arguments to bw_th (2, 4, 8, 20, 40, etc.). I also tried with the vader BTL on a single server (also with different CLI params -- 2-40, etc.).

Any suggestions on how to replicate would be appreciated.

@bwbarrett (Member):

Are you pinning to cores? Their setup is interesting: they taskset mpirun to 2 cores, which I think means the MPI processes will inherit that, and then each of the 2 processes spawns 4 threads.

@artpol84 (Contributor) commented Jan 9, 2018

Yes, this pinning introduces resource contention, and it once helped us identify an issue in code that went in unnoticed (#1813).

@jsquyres (Member, Author) commented Jan 9, 2018

Good call -- no, I wasn't. Didn't help, though:

$ lstopo -p
Machine (128GB total)
  NUMANode P#0 (64GB)
    Package P#0 + L3 (20MB)
      L2 (256KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
      L2 (256KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
...
$ taskset -c 0,1 mpirun --bind-to none -np 2 --mca btl vader,self bw_th 4
Bandwidth (Mbits/s)
2491.610374
2299.635567
2187.437563
2211.887889
$

This is an optimized build of master head, obviously using vader on a single node. Let me try a more recent machine (although it's got a slightly slower clock speed)... Nope, it still runs fine there, too:

$ lstopo -p
Machine (63GB total)
  Package P#0
    NUMANode P#0 (31GB)
    L3 (11MB)
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
...
$ taskset -c 0,1 mpirun --bind-to none -np 2 --mca btl vader,self bw_th 4
Bandwidth (Mbits/s)
2727.412732
2510.088930
2467.584226
2437.714611
$

@artpol84 Just out of curiosity, since you're binding to Linux virtual IDs 0,1, what's the load on the machine at the time? I.e., can you change your script to output the load before running each of these tests? Also, is any other Jenkins test also binding to LVIDs 0,1? I.e., are we just banging the hell out of LVIDs 0 and 1 via the union of all currently-running Jenkins tests, and therefore they're just running incredibly slowly? As you noted, testing for contention is good, but are we testing for too much contention sometimes?
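For concreteness, here's roughly the kind of logging I mean -- just a sketch of a hypothetical wrapper around the existing Jenkins command; the exact tools and flags are illustrative, and mpstat assumes sysstat is installed:

# Hypothetical wrapper: record system load and per-core activity for LVIDs 0/1
# right before each test invocation.
echo "=== load before test: $(date) ==="
uptime                # 1/5/15-minute load averages
cat /proc/loadavg     # raw load figures
mpstat -P 0,1 1 1     # utilization of CPUs 0 and 1 over 1 second

taskset -c 0,1 mpirun --bind-to none -np 2 --mca btl vader,self bw_th 4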

Although I do notice that in the Mellanox output we see the stdout from the entire bw_th test -- so we don't know whether it was the MPI processes or mpirun that hung. @artpol84 -- you said you could replicate -- could you clarify which process(es) are hanging?

@artpol84 (Contributor) commented Jan 10, 2018

With openib

/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 2 -bind-to none -mca orte_tmpdir_base /tmp/tmp.4940jlMQbm -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm -mca pml ob1 -mca btl self,openib --debug-daemons --mca odls_base_verbose 100 /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/thread_tests/thread-tests-1.1/bw_th 4

Sender side:
(gdb) bt
#0  0x00007ffff780e213 in pthread_spin_lock () from /usr/lib64/libpthread.so.0
#1  0x00007fffe9c91ee7 in mlx5_poll_cq () from /lib64/libmlx5-rdmav2.so
#2  0x00007fffe8a085ba in ibv_poll_cq (cq=0x8162e0, num_entries=256, wc=0x7fffdcb41c00) at /usr/include/infiniband/verbs.h:1458
#3  0x00007fffe8a1468f in poll_device (device=0x7601e0, count=0) at btl_openib_component.c:3696
#4  0x00007fffe8a14c01 in progress_one_device (device=0x7601e0) at btl_openib_component.c:3829
#5  0x00007fffe8a14cbd in btl_openib_component_progress () at btl_openib_component.c:3853
#6  0x00007ffff6e0983f in opal_progress () at runtime/opal_progress.c:222
#7  0x00007ffff6e12013 in ompi_sync_wait_mt (sync=0x7fffdcb44d30) at threads/wait_sync.c:85
#8  0x00007fffe83b6095 in ompi_request_wait_completion (req=0x7f8800) at ../../../../ompi/request/request.h:405
#9  0x00007fffe83b8598 in mca_pml_ob1_send (buf=0x7fffdb343010, count=8388608, datatype=0x6024c0 <ompi_mpi_byte>, dst=1, tag=1, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x6022c0 <ompi_mpi_comm_world>) at pml_ob1_isend.c:266
#10 0x00007ffff7aeed60 in PMPI_Send (buf=0x7fffdb343010, count=8388608, type=0x6024c0 <ompi_mpi_byte>, dest=1, tag=1, comm=0x6022c0 <ompi_mpi_comm_world>) at psend.c:78
#11 0x0000000000400ce4 in runfunc ()
#12 0x00007ffff7809dc5 in start_thread () from /usr/lib64/libpthread.so.0
#13 0x00007ffff75371cd in clone () from /usr/lib64/libc.so.6

Receiver side:
(gdb) bt
#0  0x00007ffff6ecd455 in opal_sys_timer_get_cycles () at ../../../../opal/include/opal/sys/x86_64/timer.h:39
#1  0x00007ffff6ecd9d8 in opal_timer_linux_get_cycles_sys_timer () at timer_linux_component.c:232
#2  0x00007ffff6e097bb in opal_progress () at runtime/opal_progress.c:196
#3  0x00007ffff6e12013 in ompi_sync_wait_mt (sync=0x7fffda33fd60) at threads/wait_sync.c:85
#4  0x00007fffe83b2494 in ompi_request_wait_completion (req=0x7f9100) at ../../../../ompi/request/request.h:405
#5  0x00007fffe83b3b58 in mca_pml_ob1_recv (addr=0x7fffd8b3e010, count=8388608, datatype=0x6024c0 <ompi_mpi_byte>, src=0, tag=2, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x0) at pml_ob1_irecv.c:135
#6  0x00007ffff7ae1a65 in PMPI_Recv (buf=0x7fffd8b3e010, count=8388608, type=0x6024c0 <ompi_mpi_byte>, source=0, tag=2, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x0) at precv.c:79
#7  0x0000000000400d73 in runfunc ()
#8  0x00007ffff7809dc5 in start_thread () from /usr/lib64/libpthread.so.0
#9  0x00007ffff75371cd in clone () from /usr/lib64/libc.so.6

@artpol84 (Contributor):

And with TCP:

/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 2 -bind-to none -mca orte_tmpdir_base /tmp/tmp.4940jlMQbm -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm -mca pml ob1 -mca btl self,tcp /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/thread_tests/thread-tests-1.1/bw_th 4

(gdb) bt
#0  0x00007ffff75377a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6e6ed93 in epoll_dispatch (base=0x653880, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff6e727e0 in opal_libevent2022_event_base_loop (base=0x653880, flags=2) at event.c:1630
#3  0x00007ffff6e0981b in opal_progress () at runtime/opal_progress.c:204
#4  0x00007ffff6e12013 in ompi_sync_wait_mt (sync=0x7fffdc833d60) at threads/wait_sync.c:85
#5  0x00007fffe880f494 in ompi_request_wait_completion (req=0x7abb00) at ../../../../ompi/request/request.h:405
#6  0x00007fffe8810b58 in mca_pml_ob1_recv (addr=0x7fffda831010, count=8388608, datatype=0x6024c0 <ompi_mpi_byte>, src=0, tag=3, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x0) at pml_ob1_irecv.c:135
#7  0x00007ffff7ae1a65 in PMPI_Recv (buf=0x7fffda831010, count=8388608, type=0x6024c0 <ompi_mpi_byte>, source=0, tag=3, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x0) at precv.c:79
#8  0x0000000000400d73 in runfunc ()
#9  0x00007ffff7809dc5 in start_thread () from /usr/lib64/libpthread.so.0
#10 0x00007ffff75371cd in clone () from /usr/lib64/libc.so.6

(gdb) bt
#0  0x00007ffff75377a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6e6ed93 in epoll_dispatch (base=0x653880, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff6e727e0 in opal_libevent2022_event_base_loop (base=0x653880, flags=2) at event.c:1630
#3  0x00007ffff6e0981b in opal_progress () at runtime/opal_progress.c:204
#4  0x00007ffff6e12013 in ompi_sync_wait_mt (sync=0x7fffdb831d30) at threads/wait_sync.c:85
#5  0x00007fffe8813095 in ompi_request_wait_completion (req=0x7a9f80) at ../../../../ompi/request/request.h:405
#6  0x00007fffe8815598 in mca_pml_ob1_send (buf=0x7fffd982f010, count=8388608, datatype=0x6024c0 <ompi_mpi_byte>, dst=1, tag=2, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x6022c0 <ompi_mpi_comm_world>) at pml_ob1_isend.c:266
#7  0x00007ffff7aeed60 in PMPI_Send (buf=0x7fffd982f010, count=8388608, type=0x6024c0 <ompi_mpi_byte>, dest=1, tag=2, comm=0x6022c0 <ompi_mpi_comm_world>) at psend.c:78
#8  0x0000000000400ce4 in runfunc ()
#9  0x00007ffff7809dc5 in start_thread () from /usr/lib64/libpthread.so.0
#10 0x00007ffff75371cd in clone () from /usr/lib64/libc.so.6

@artpol84 (Contributor):

@jsquyres when I was reproducing manually, I saw that no processes were running on the node at the time.

@artpol84 (Contributor):

I mean no CPU-intensive processes were there except mine.

@artpol84 (Contributor):

I just noticed that the MPI_Send and MPI_Recv tags are different. It reminds me of the issue I cited here previously.

@artpol84 (Contributor):

And note that I wasn't binding at all, so binding is not the root cause!

@artpol84 (Contributor):

I checked the test: the tag is incremented on each iteration by both ranks. It looks like the receiver has moved on to the next iteration while the sender is still waiting for the previous message to complete.
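To illustrate the pattern, here is a simplified, single-threaded sketch of the kind of loop each bw_th thread runs (not the actual test source; names and sizes are made up):

/* Simplified sketch of a bw_th-style loop: both ranks bump the tag on every
 * iteration.  The backtraces above show the receiver already blocked on
 * tag N+1 while the sender is still waiting for its tag-N send to complete. */
#include <mpi.h>

#define NITER   10
#define MSGSIZE (8 * 1024 * 1024)

static char buf[MSGSIZE];

static void bw_loop(int rank)
{
    int tag = 1;
    for (int i = 0; i < NITER; i++, tag++) {
        if (rank == 0)
            MPI_Send(buf, MSGSIZE, MPI_BYTE, 1, tag, MPI_COMM_WORLD);
        else
            MPI_Recv(buf, MSGSIZE, MPI_BYTE, 0, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    bw_loop(rank);
    MPI_Finalize();
    return 0;
}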

@artpol84 (Contributor):

@vspetrov @jladd-mlnx it seems that this issue is also related to the hcoll issue we observed here.

If I disable hcoll, I can't reproduce on bgate anymore.
I'm going to temporarily disable hcoll and see how it goes.

One thing concerns me: in other branches (like #4685, for which I provided a Jenkins link), the test finishes successfully and then hangs at MPI_Finalize, while on master we see the issue much earlier, during an unrelated point-to-point phase.

@artpol84 (Contributor) commented Jan 10, 2018

@jsquyres I've fixed the timeout handling as we discussed on the call:

We were automatically detecting the availability of the --timeout option, but at some point the mpirun help interface changed: instead of mpirun -h you now need to run mpirun --help debug. This broke the scripts for recent branches, so the default external timeout agent was used instead.
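Roughly what the detection looks like now (a sketch only, not the actual Jenkins script; variable names and the fallback command are illustrative):

# Hypothetical detection: probe both the old and the new mpirun help
# interfaces for --timeout before falling back to an external timeout agent.
if mpirun -h 2>&1 | grep -q -- '--timeout' || \
   mpirun --help debug 2>&1 | grep -q -- '--timeout'; then
    TIMEOUT_ARGS="--timeout 900"
    EXTERNAL_TIMEOUT=""
else
    TIMEOUT_ARGS=""
    EXTERNAL_TIMEOUT="timeout -s SIGSEGV 15m"
fi

$EXTERNAL_TIMEOUT mpirun $TIMEOUT_ARGS -np 2 -mca btl self,vader ./bw_th 4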

@artpol84 (Contributor) commented Jan 10, 2018

The --timeout option provides more useful output, but it looks like hcoll is not the only problem: I see multiple vader-related issues with the ring_c test.

I'm going to add a similar fix to the oshmem code to address http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/7044/console.

@jsquyres (Member, Author):

@artpol84 Excellent -- that makes the Mellanox Jenkins output much easier to parse/understand. Thanks!

Just to be clear: are you saying that there's still a vader issue on master? Or is this also related to hcoll? (I ask because in build 7042 -- linked above -- it's hanging in an hcoll-enabled run)

@jsquyres (Member, Author):

@artpol84 In #4683 (comment), you cited leftover files in /dev/shm -- are those the cause of the vader issues?

@artpol84 (Contributor) commented Jan 10, 2018

@jsquyres

  • Yes, I think vader is an issue; there is a new report failing on it.
  • In both 7047 and 7042, hcoll is disabled: --mca coll ^hcoll

@artpol84 (Contributor):

Regarding the vader leftovers: yes, I see these files again in /dev/shm.

$ pwd && ls                                                                                                                                                                                                                                   
/dev/shm
vader_segment.jenkins03.0  vader_segment.jenkins03.1

However, the issue that the leftover was causing looked like this:

05:14:55 + taskset -c 12,13 timeout -s SIGSEGV 15m /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/bin/mpirun -np 8 -bind-to none -mca orte_tmpdir_base /tmp/tmp.r9k63Y7KRi -mca coll '^hcoll' -mca pml ob1 -mca btl self,vader taskset -c 12,13 /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/examples/hello_c
05:14:55 --------------------------------------------------------------------------
05:14:55 A system call failed during shared memory initialization that should
05:14:55 not have.  It is likely that your MPI job will now either abort or
05:14:55 experience performance degradation.
05:14:55 
05:14:55   Local host:  jenkins03
05:14:55   System call: open(2) 
05:14:55   Error:       Permission denied (errno 13)

while the errors in 7042 and 7047 occur later, when Send/Recv is performed.

@artpol84 (Contributor) commented Jan 10, 2018

And it seems that after 7047 failed, it led to subsequent failures because of the leftover files.

@artpol84 (Contributor):

7054 finished OK (http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/7054/) after removing the vader leftover files.

@jsquyres (Member, Author):

So is the real problem the vader leftover in /dev/shm?

@artpol84 (Contributor):

No, it seems like there are 2 real problems.

@rhc54 (Contributor) commented Jan 10, 2018

FWIW: there is a cleanup mechanism in master to resolve the leftover problem. However, when you kill mpirun by hitting it with SIGKILL, you don't give it a chance to invoke the cleanup code. My best guess is that is what caused your second case.

I'll check to ensure that the mpirun timeout option does invoke the cleanup - pretty sure it does, but worth a check.

@artpol84 (Contributor) commented Jan 10, 2018

According to the output of http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/7047/console, the hang was interrupted by the mpirun timeout:

11:24:34 --------------------------------------------------------------------------
11:24:34 The user-provided time limit for job execution has been reached:
11:24:34 
11:24:34   Timeout: 900 seconds
11:24:34 
11:24:34 The job will now be aborted.  Please check your code and/or
11:24:34 adjust/remove the job execution time limit (as specified by --timeout
11:24:34 command line option or MPIEXEC_TIMEOUT environment variable).
11:24:34 --------------------------------------------------------------------------

And if you check the list of tested PRs (http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/) between 7047 and 7053 (both failed), you will see that the successful runs in between were targeting non-master branches. So it could be a problem with this cleanup feature.

@artpol84 (Contributor):

@rhc54, am I right that #4698 fixes the cleanup issue?

@rhc54 (Contributor) commented Jan 11, 2018

so far as I can determine, yes

@artpol84 (Contributor) commented Jan 11, 2018

@hjelmn One thing that concerns me: since the leftover files were causing issues for sequentially launched applications, they could also cause issues if multiple MPI programs run on the same node.
According to what I saw in /dev/shm:

$ pwd && ls                                                                                                                                                                                                                                   
/dev/shm
vader_segment.jenkins03.0  vader_segment.jenkins03.1

The file names do not contain any identifying information, and this may cause issues when a node is shared by multiple users. It makes sense to add some ID (like the jobid or PID) to the filename to protect against that.
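Something along these lines, for illustration (just a sketch -- the real naming logic lives in the vader BTL, and these function and parameter names are made up):

/* Hypothetical helper: embed the jobid and PID in the backing-file name so
 * that segments from different jobs (or users) cannot collide or be mistaken
 * for stale leftovers, e.g.
 *   /dev/shm/vader_segment.jenkins03.<jobid>.<pid>.<rank> */
#include <stdio.h>
#include <unistd.h>

static void make_segment_path(char *buf, size_t len, const char *hostname,
                              unsigned int jobid, int local_rank)
{
    snprintf(buf, len, "/dev/shm/vader_segment.%s.%u.%d.%d",
             hostname, jobid, (int)getpid(), local_rank);
}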

@rhc54 (Contributor) commented Jan 11, 2018

Just checked on another system and the timeout on master still isn't removing the /dev/shm files. Will play with it more as time permits.

@jsquyres (Member, Author):

@artpol84 @jladd-mlnx Per discussion on the 2018-01-16 webex, Mellanox agreed to do a manual build on the Jenkins machine (since it seems to be the only machine that can replicate this problem ☹️) and try reverting 8b8aae3 (the recent ASM commit) to see if that makes the problem go away.

@artpol84 (Contributor):

Which of the 2 issues are we tracking here?
The vader hang, or the improper leftover handling?

@jsquyres (Member, Author):

Improper leftovers should now be handled (via #4701). The only issue left should be the vader hang.

@jsquyres (Member, Author) commented Feb 3, 2018

Vader hangs should now be fixed via #4767.

jsquyres closed this as completed on Feb 3, 2018