CI/Jenkins failures: timeouts #4686

Closed · jsquyres opened this issue Jan 9, 2018 · 32 comments
@jsquyres (Member) commented Jan 9, 2018

A bunch of CI tests have been failing recently, particularly in the Mellanox Jenkins.

Per discussion on the 2018-01-09 teleconf, we were reminded that most of the segvs that we see in the Mellanox Jenkins (e.g., #4683 Mellanox Jenkins master build http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/7030/) are actually timeouts.

Meaning: it looks like some threaded and/or vader-based tests are timing out, but not 100% of the time. @artpol84 confirmed that he can reproduce if he runs the tests multiple times (i.e., sometimes the test passes, sometimes it fails). Given that the failures have typically involved vader and/or multi-threaded tests, @rhc54 points to 8b8aae3, which was a recent ASM commit.

Investigation is required. @jsquyres volunteered to try to reproduce as well.

@jsquyres (Member, Author) commented Jan 9, 2018

I am unable to reproduce this issue manually with the bw_th test. I have tried both debug and optimized builds at the head. I also tried a few different hardware platforms, with both the TCP and usNIC BTLs, and with a variety of CLI arguments to bw_th (2, 4, 8, 20, 40, etc.). I also tried with the vader BTL on a single server (also with different CLI params -- 2-40, etc.).

Any suggestions on how to replicate would be appreciated.

@bwbarrett (Member):

Are you pinning to cores? Their setup is interesting: they taskset mpirun to 2 cores, which I think means the MPI processes will inherit that, and then each of the 2 processes spawns 4 threads.

@artpol84 (Contributor) commented Jan 9, 2018

Yes, this pinning introduces resource contention, and it once helped us identify an issue in code that went in unnoticed (#1813).

@jsquyres (Member, Author) commented Jan 9, 2018

Good call -- no, I wasn't. Didn't help, though:

$ lstopo -p
Machine (128GB total)
  NUMANode P#0 (64GB)
    Package P#0 + L3 (20MB)
      L2 (256KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
      L2 (256KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
...
$ taskset -c 0,1 mpirun --bind-to none -np 2 --mca btl vader,self bw_th 4
Bandwidth (Mbits/s)
2491.610374
2299.635567
2187.437563
2211.887889
$

This is an optimized build of master head, obviously using vader on a single node. Let me try a more recent machine (although it's got a slightly slower clock speed)... Nope, it still runs fine there, too:

$ lstopo -p
Machine (63GB total)
  Package P#0
    NUMANode P#0 (31GB)
    L3 (11MB)
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
      L2 (1024KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
...
$ taskset -c 0,1 mpirun --bind-to none -np 2 --mca btl vader,self bw_th 4
Bandwidth (Mbits/s)
2727.412732
2510.088930
2467.584226
2437.714611
$

@artpol84 Just out of curiosity, since you're binding to Linux virtual IDs 0,1, what's the load on the machine at the time? I.e., can you change your script to output the load before running each of these tests? Also, is any other Jenkins test also binding to LVIDs 0,1? I.e., are we just banging the hell out of LVIDs 0 and 1 via the union of all currently-running Jenkins tests, and therefore they're just running incredibly slowly? As you noted, testing for contention is good, but are we testing for too much contention sometimes?
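For concreteness, here's roughly the kind of logging I mean -- just a sketch of a hypothetical wrapper around the existing Jenkins command; the exact tools and flags are illustrative, and mpstat assumes sysstat is installed:

# Hypothetical wrapper: record system load and per-core activity for LVIDs 0/1
# right before each test invocation.
echo "=== load before test: $(date) ==="
uptime                # 1/5/15-minute load averages
cat /proc/loadavg     # raw load figures
mpstat -P 0,1 1 1     # utilization of CPUs 0 and 1 over 1 second

taskset -c 0,1 mpirun --bind-to none -np 2 --mca btl vader,self bw_th 4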

Although I do notice that in the Mellanox output we see the stdout from the entire bw_th test -- so we don't know whether it was the MPI processes or mpirun that hung. @artpol84 -- you said you could replicate -- could you clarify which process(es) are hanging?

@artpol84 (Contributor) commented Jan 10, 2018

With openib

/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 2 -bind-to none -mca orte_tmpdir_base /tmp/tmp.4940jlMQbm -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm -mca pml ob1 -mca btl self,openib --debug-daemons --mca odls_base_verbose 100 /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/thread_tests/thread-tests-1.1/bw_th 4

Sender side:
(gdb) bt
#0  0x00007ffff780e213 in pthread_spin_lock () from /usr/lib64/libpthread.so.0
#1  0x00007fffe9c91ee7 in mlx5_poll_cq () from /lib64/libmlx5-rdmav2.so
#2  0x00007fffe8a085ba in ibv_poll_cq (cq=0x8162e0, num_entries=256, wc=0x7fffdcb41c00) at /usr/include/infiniband/verbs.h:1458
#3  0x00007fffe8a1468f in poll_device (device=0x7601e0, count=0) at btl_openib_component.c:3696
#4  0x00007fffe8a14c01 in progress_one_device (device=0x7601e0) at btl_openib_component.c:3829
#5  0x00007fffe8a14cbd in btl_openib_component_progress () at btl_openib_component.c:3853
#6  0x00007ffff6e0983f in opal_progress () at runtime/opal_progress.c:222
#7  0x00007ffff6e12013 in ompi_sync_wait_mt (sync=0x7fffdcb44d30) at threads/wait_sync.c:85
#8  0x00007fffe83b6095 in ompi_request_wait_completion (req=0x7f8800) at ../../../../ompi/request/request.h:405
#9  0x00007fffe83b8598 in mca_pml_ob1_send (buf=0x7fffdb343010, count=8388608, datatype=0x6024c0 <ompi_mpi_byte>, dst=1, tag=1, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x6022c0 <ompi_mpi_comm_world>) at pml_ob1_isend.c:266
#10 0x00007ffff7aeed60 in PMPI_Send (buf=0x7fffdb343010, count=8388608, type=0x6024c0 <ompi_mpi_byte>, dest=1, tag=1, comm=0x6022c0 <ompi_mpi_comm_world>) at psend.c:78
#11 0x0000000000400ce4 in runfunc ()
#12 0x00007ffff7809dc5 in start_thread () from /usr/lib64/libpthread.so.0
#13 0x00007ffff75371cd in clone () from /usr/lib64/libc.so.6

Receiver side:
(gdb) bt
#0  0x00007ffff6ecd455 in opal_sys_timer_get_cycles () at ../../../../opal/include/opal/sys/x86_64/timer.h:39
#1  0x00007ffff6ecd9d8 in opal_timer_linux_get_cycles_sys_timer () at timer_linux_component.c:232
#2  0x00007ffff6e097bb in opal_progress () at runtime/opal_progress.c:196
#3  0x00007ffff6e12013 in ompi_sync_wait_mt (sync=0x7fffda33fd60) at threads/wait_sync.c:85
#4  0x00007fffe83b2494 in ompi_request_wait_completion (req=0x7f9100) at ../../../../ompi/request/request.h:405
#5  0x00007fffe83b3b58 in mca_pml_ob1_recv (addr=0x7fffd8b3e010, count=8388608, datatype=0x6024c0 <ompi_mpi_byte>, src=0, tag=2, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x0) at pml_ob1_irecv.c:135
#6  0x00007ffff7ae1a65 in PMPI_Recv (buf=0x7fffd8b3e010, count=8388608, type=0x6024c0 <ompi_mpi_byte>, source=0, tag=2, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x0) at precv.c:79
#7  0x0000000000400d73 in runfunc ()
#8  0x00007ffff7809dc5 in start_thread () from /usr/lib64/libpthread.so.0
#9  0x00007ffff75371cd in clone () from /usr/lib64/libc.so.6

@artpol84 (Contributor):

And with TCP:

/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 2 -bind-to none -mca orte_tmpdir_base /tmp/tmp.4940jlMQbm -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm -mca pml ob1 -mca btl self,tcp /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/thread_tests/thread-tests-1.1/bw_th 4

(gdb) bt
#0  0x00007ffff75377a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6e6ed93 in epoll_dispatch (base=0x653880, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff6e727e0 in opal_libevent2022_event_base_loop (base=0x653880, flags=2) at event.c:1630
#3  0x00007ffff6e0981b in opal_progress () at runtime/opal_progress.c:204
#4  0x00007ffff6e12013 in ompi_sync_wait_mt (sync=0x7fffdc833d60) at threads/wait_sync.c:85
#5  0x00007fffe880f494 in ompi_request_wait_completion (req=0x7abb00) at ../../../../ompi/request/request.h:405
#6  0x00007fffe8810b58 in mca_pml_ob1_recv (addr=0x7fffda831010, count=8388608, datatype=0x6024c0 <ompi_mpi_byte>, src=0, tag=3, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x0) at pml_ob1_irecv.c:135
#7  0x00007ffff7ae1a65 in PMPI_Recv (buf=0x7fffda831010, count=8388608, type=0x6024c0 <ompi_mpi_byte>, source=0, tag=3, comm=0x6022c0 <ompi_mpi_comm_world>, status=0x0) at precv.c:79
#8  0x0000000000400d73 in runfunc ()
#9  0x00007ffff7809dc5 in start_thread () from /usr/lib64/libpthread.so.0
#10 0x00007ffff75371cd in clone () from /usr/lib64/libc.so.6

(gdb) bt
#0  0x00007ffff75377a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6e6ed93 in epoll_dispatch (base=0x653880, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff6e727e0 in opal_libevent2022_event_base_loop (base=0x653880, flags=2) at event.c:1630
#3  0x00007ffff6e0981b in opal_progress () at runtime/opal_progress.c:204
#4  0x00007ffff6e12013 in ompi_sync_wait_mt (sync=0x7fffdb831d30) at threads/wait_sync.c:85
#5  0x00007fffe8813095 in ompi_request_wait_completion (req=0x7a9f80) at ../../../../ompi/request/request.h:405
#6  0x00007fffe8815598 in mca_pml_ob1_send (buf=0x7fffd982f010, count=8388608, datatype=0x6024c0 <ompi_mpi_byte>, dst=1, tag=2, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x6022c0 <ompi_mpi_comm_world>) at pml_ob1_isend.c:266
#7  0x00007ffff7aeed60 in PMPI_Send (buf=0x7fffd982f010, count=8388608, type=0x6024c0 <ompi_mpi_byte>, dest=1, tag=2, comm=0x6022c0 <ompi_mpi_comm_world>) at psend.c:78
#8  0x0000000000400ce4 in runfunc ()
#9  0x00007ffff7809dc5 in start_thread () from /usr/lib64/libpthread.so.0
#10 0x00007ffff75371cd in clone () from /usr/lib64/libc.so.6

@artpol84 (Contributor):

@jsquyres when I was reproducing manually, I saw that no processes were running on the node at the time.

@artpol84 (Contributor):

I mean no CPU-intensive processes were there except mine.

@artpol84 (Contributor):

I just noticed that the MPI_Send and MPI_Recv tags are different. It reminds me of the issue I cited here previously.

@artpol84 (Contributor):

And note that I wasn't binding at all, so binding is not the root cause!

@artpol84 (Contributor):

I checked the test: the tag is incremented on each iteration by both ranks. It looks like the receiver has moved on to the next iteration while the sender is still waiting for the previous message to complete.
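To illustrate the pattern, here is a simplified, single-threaded sketch of the kind of loop each bw_th thread runs (not the actual test source; names and sizes are made up):

/* Simplified sketch of a bw_th-style loop: both ranks bump the tag on every
 * iteration.  The backtraces above show the receiver already blocked on
 * tag N+1 while the sender is still waiting for its tag-N send to complete. */
#include <mpi.h>

#define NITER   10
#define MSGSIZE (8 * 1024 * 1024)

static char buf[MSGSIZE];

static void bw_loop(int rank)
{
    int tag = 1;
    for (int i = 0; i < NITER; i++, tag++) {
        if (rank == 0)
            MPI_Send(buf, MSGSIZE, MPI_BYTE, 1, tag, MPI_COMM_WORLD);
        else
            MPI_Recv(buf, MSGSIZE, MPI_BYTE, 0, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    bw_loop(rank);
    MPI_Finalize();
    return 0;
}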

@artpol84 (Contributor):

@vspetrov @jladd-mlnx it seems that this issue is also related to the hcoll issue we observed here.

If I disable hcoll, I can't reproduce on bgate anymore.
I'm going to temporarily disable hcoll and see how it goes.

One thing concerns me: in other branches (like #4685, for which I provided a Jenkins link), the test finishes successfully and then hangs at MPI_Finalize, while on master we see the issue much earlier, during an unrelated point-to-point phase.

@artpol84 (Contributor) commented Jan 10, 2018

@jsquyres I've fixed the timeout handling as we discussed on the call:

We were automatically detecting the availability of the --timeout option, but at some point the mpirun help interface changed: instead of mpirun -h you now need to run mpirun --help debug. This broke the scripts for recent branches, so the default external timeout agent was used instead.
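Roughly what the detection looks like now (a sketch only, not the actual Jenkins script; variable names and the fallback command are illustrative):

# Hypothetical detection: probe both the old and the new mpirun help
# interfaces for --timeout before falling back to an external timeout agent.
if mpirun -h 2>&1 | grep -q -- '--timeout' || \
   mpirun --help debug 2>&1 | grep -q -- '--timeout'; then
    TIMEOUT_ARGS="--timeout 900"
    EXTERNAL_TIMEOUT=""
else
    TIMEOUT_ARGS=""
    EXTERNAL_TIMEOUT="timeout -s SIGSEGV 15m"
fi

$EXTERNAL_TIMEOUT mpirun $TIMEOUT_ARGS -np 2 -mca btl self,vader ./bw_th 4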

@artpol84 (Contributor) commented Jan 10, 2018

The --timeout option provides more useful output, but it looks like hcoll is not the only problem: I see multiple vader-related issues with the ring_c test.

I'm going to add a similar fix to the oshmem code to address http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/7044/console.

@jsquyres (Member, Author):

@artpol84 Excellent -- that makes the Mellanox Jenkins output much easier to parse/understand. Thanks!

Just to be clear: are you saying that there's still a vader issue on master? Or is this also related to hcoll? (I ask because in build 7042 -- linked above -- it's hanging in an hcoll-enabled run)

@jsquyres (Member, Author):

@artpol84 In #4683 (comment), you cited leftover files in /dev/shm -- are those the cause of the vader issues?

@artpol84 (Contributor) commented Jan 10, 2018

@jsquyres

  • Yes, I think vader is an issue; there is a new report failing on it.
  • In both 7047 and 7042, hcoll is disabled: --mca coll ^hcoll

@artpol84 (Contributor):

Regarding the vader leftovers: yes, I see these files again in /dev/shm.

$ pwd && ls                                                                                                                                                                                                                                   
/dev/shm
vader_segment.jenkins03.0  vader_segment.jenkins03.1

However, the issue that the leftover was causing looked like this:

05:14:55 + taskset -c 12,13 timeout -s SIGSEGV 15m /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/bin/mpirun -np 8 -bind-to none -mca orte_tmpdir_base /tmp/tmp.r9k63Y7KRi -mca coll '^hcoll' -mca pml ob1 -mca btl self,vader taskset -c 12,13 /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-5/ompi_install1/examples/hello_c
05:14:55 --------------------------------------------------------------------------
05:14:55 A system call failed during shared memory initialization that should
05:14:55 not have.  It is likely that your MPI job will now either abort or
05:14:55 experience performance degradation.
05:14:55 
05:14:55   Local host:  jenkins03
05:14:55   System call: open(2) 
05:14:55   Error:       Permission denied (errno 13)

while the errors in 7042 and 7047 occur later, when Send/Recv is performed.

@artpol84 (Contributor) commented Jan 10, 2018

And it seems that after 7047 failed, it led to subsequent failures because of the leftover files.

@artpol84 (Contributor):

7054 finished OK (http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/7054/) after removing the vader leftover files.

@jsquyres (Member, Author):

So is the real problem the vader leftover in /dev/shm?

@artpol84 (Contributor):

No, it seems like there are 2 real problems.

@rhc54 (Contributor) commented Jan 10, 2018

FWIW: there is a cleanup mechanism in master to resolve the leftover problem. However, when you kill mpirun by hitting it with SIGKILL, you don't give it a chance to invoke the cleanup code. My best guess is that is what caused your second case.

I'll check to ensure that the mpirun timeout option does invoke the cleanup - pretty sure it does, but worth a check.

@artpol84 (Contributor) commented Jan 10, 2018

According to the output of http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/7047/console, the hang was interrupted by the mpirun timeout:

11:24:34 --------------------------------------------------------------------------
11:24:34 The user-provided time limit for job execution has been reached:
11:24:34 
11:24:34   Timeout: 900 seconds
11:24:34 
11:24:34 The job will now be aborted.  Please check your code and/or
11:24:34 adjust/remove the job execution time limit (as specified by --timeout
11:24:34 command line option or MPIEXEC_TIMEOUT environment variable).
11:24:34 --------------------------------------------------------------------------

And if you check the list of tested PRs (http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/) between 7047 and 7053 (both failed), you will see that the successful runs in between were targeting non-master branches. So it could be a problem with this cleanup feature.

@artpol84 (Contributor):

@rhc54, am I right that #4698 fixes the cleanup issue?

@rhc54 (Contributor) commented Jan 11, 2018

so far as I can determine, yes

@artpol84 (Contributor) commented Jan 11, 2018

@hjelmn One thing that concerns me: since the leftover files were causing issues for sequentially launched applications, they could also cause issues if multiple MPI programs run on the same node.
According to what I saw in /dev/shm:

$ pwd && ls                                                                                                                                                                                                                                   
/dev/shm
vader_segment.jenkins03.0  vader_segment.jenkins03.1

The file names do not contain any identifying information, and this may cause issues when a node is shared by multiple users. It makes sense to add some ID (like the jobid or PID) to the filename to protect against that.
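Something along these lines, for illustration (just a sketch -- the real naming logic lives in the vader BTL, and these function and parameter names are made up):

/* Hypothetical helper: embed the jobid and PID in the backing-file name so
 * that segments from different jobs (or users) cannot collide or be mistaken
 * for stale leftovers, e.g.
 *   /dev/shm/vader_segment.jenkins03.<jobid>.<pid>.<rank> */
#include <stdio.h>
#include <unistd.h>

static void make_segment_path(char *buf, size_t len, const char *hostname,
                              unsigned int jobid, int local_rank)
{
    snprintf(buf, len, "/dev/shm/vader_segment.%s.%u.%d.%d",
             hostname, jobid, (int)getpid(), local_rank);
}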

@rhc54 (Contributor) commented Jan 11, 2018

Just checked on another system and the timeout on master still isn't removing the /dev/shm files. Will play with it more as time permits.

@jsquyres (Member, Author):

@artpol84 @jladd-mlnx Per discussion on the 2018-01-16 webex, Mellanox agreed to do a manual build on the Jenkins machine (since it seems to be the only machine that can replicate this problem ☹️) and try reverting 8b8aae3 (the recent ASM commit) to see if that makes the problem go away.

@artpol84 (Contributor):

Which of the 2 issues are we tracking here?
The vader hang, or the improper leftover handling?

@jsquyres (Member, Author):

Improper leftovers should now be handled (via #4701). The only issue left should be the vader hang.

@jsquyres (Member, Author) commented Feb 3, 2018

Vader hangs should now be fixed via #4767.

jsquyres closed this as completed on Feb 3, 2018