btl/openib: delay UCX warning to add_procs() #6137

Merged
merged 1 commit into from
Dec 5, 2018

ggouaillardet
Contributor

If UCX is available, then pml/ucx will be used instead of
pml/ob1 + btl/openib, so there is no need to warn about
btl/openib not supporting Infiniband.
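The reasoning above can be sketched in C. This is a minimal illustration, not the actual btl/openib code: `sketch_btl_t`, `sketch_btl_init` and `sketch_btl_add_procs` are hypothetical names. It assumes that initialization runs for every candidate component, while the add_procs step is only reached for a BTL that the selected PML really uses, so a warning deferred to that step stays silent when pml/ucx is chosen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a BTL module; NOT the real Open MPI structure. */
typedef struct {
    const char *device_name;
    bool allowed;   /* false when the port is InfiniBand and IB is not allowed */
    bool warned;    /* true once the warning has been printed */
} sketch_btl_t;

/* Init time: record the policy decision only, print nothing. */
static void sketch_btl_init(sketch_btl_t *btl, const char *name,
                            bool port_is_ib, bool allow_ib)
{
    assert(btl != NULL);
    btl->device_name = name;
    btl->allowed = !(port_is_ib && !allow_ib);
    btl->warned = false;
}

/* add_procs time: assumed to run only when this BTL is actually used,
 * so this is where the deferred warning is finally shown (at most once). */
static void sketch_btl_add_procs(sketch_btl_t *btl)
{
    assert(btl != NULL);
    if (!btl->allowed && !btl->warned) {
        fprintf(stderr, "warning: IB port on %s not selected\n",
                btl->device_name);
        btl->warned = true;
    }
}
```

If `sketch_btl_add_procs` is never called for a module (the pml/ucx case in this sketch), its `warned` flag stays false and nothing is printed.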

Signed-off-by: Gilles Gouaillardet [email protected]

Member

@hjelmn hjelmn left a comment

Looks ok to me.

@ggouaillardet ggouaillardet merged commit fccb3e7 into open-mpi:master Dec 5, 2018
@hoopoepg
Contributor

hoopoepg commented Dec 6, 2018

hi all

not sure how, but this PR completely breaks OMPI over UCX (and btl???) - OMPI just crashes into core

@ggouaillardet
Contributor Author

@hoopoepg can you please provide more details on the crash? It obviously did not crash in my environment.

What if you mpirun --mca btl ^openib ... ?

@hoopoepg
Contributor

hoopoepg commented Dec 7, 2018

@ggouaillardet it works fine without openib (using your command line)

@hoopoepg
Contributor

hoopoepg commented Dec 7, 2018

@ggouaillardet it crashes at a different location every time: it could be in UCX init (ibv_open_device), or on a memory access.
It looks like memory corruption.

@ggouaillardet
Contributor Author

valgrind reports many errors. I will sort out which ones are caused by this PR, since I am unable to reproduce any crash in my environment (mlx4 + centos7 + ucx master)

@hoopoepg
Contributor

hoopoepg commented Dec 7, 2018

@ggouaillardet I'm using osu_bw to reproduce the issue. Environment: Red Hat Enterprise Linux Server release 7.4 (Maipo):
./configure --with-ucx= --enable-debug --enable-oshmem --enable-oshmem-profile --enable-oshmem-fortran=no && make && make install

mpirun -np 2 ./osu_bw

ofed_info -s
MLNX_OFED_LINUX-4.4-2.0.7.0

@ggouaillardet
Contributor Author

I found a gross bug of mine :-(

can you please give the attached patch a try

diff --git a/opal/mca/btl/openib/btl_openib.c b/opal/mca/btl/openib/btl_openib.c
index 9ec57c0..cc7a982 100644
--- a/opal/mca/btl/openib/btl_openib.c
+++ b/opal/mca/btl/openib/btl_openib.c
@@ -1048,6 +1048,7 @@ int mca_btl_openib_add_procs(
         opal_show_help("help-mpi-btl-openib.txt", "ib port not selected",
                        true, opal_process_info.nodename,
                        ibv_get_device_name(openib_btl->device->ib_dev), openib_btl->port_num);
+        return OPAL_SUCCESS;
     }
 
     btl_rank = get_openib_btl_params(openib_btl, &lcl_subnet_id_port_cnt);
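The control-flow problem this patch targets can be shown with a small, self-contained sketch. It is not the real `mca_btl_openib_add_procs()`: `sketch_module_t`, `sketch_add_procs` and `SKETCH_SUCCESS` are hypothetical names, and the `prepared` flag merely stands in for whatever setup follows the warning. The point is only that reporting "port not selected" without returning lets execution fall through into the setup path for a port that was just rejected. As the next comment reports, the return alone did not resolve the crash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define SKETCH_SUCCESS 0   /* hypothetical stand-in for OPAL_SUCCESS */

/* Hypothetical module state; NOT the real openib BTL structure. */
typedef struct {
    bool allowed;    /* false when the port must not be used */
    bool prepared;   /* set once device/resource setup has run */
} sketch_module_t;

static int sketch_add_procs(sketch_module_t *m)
{
    assert(m != NULL);
    if (!m->allowed) {
        fprintf(stderr, "ib port not selected\n");
        /* Without this early return, the code below would still run
         * and set up resources for a port that was just rejected. */
        return SKETCH_SUCCESS;
    }
    m->prepared = true;   /* stands in for the setup done after the check */
    return SKETCH_SUCCESS;
}
```

With the early return in place, a rejected module leaves `prepared` false, while an allowed module is set up as before.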

@hoopoepg
Contributor

@ggouaillardet doesn't help - same issue
without openib (-mca btl ^openib) works fine

@ggouaillardet
Contributor Author

thanks for the report.

can you please run the following commands and tell me which works and which crashes

$ mpirun --mca pml ucx --mca btl_openib_allow_ib false -np 2 ./osu_bw
$ mpirun --mca pml ucx --mca btl_openib_allow_ib true -np 2 ./osu_bw
$ mpirun --mca pml ob1 --mca btl_openib_allow_ib true -np 2 ./osu_bw
$ mpirun --mca pml ob1 --mca btl_openib_allow_ib false -np 2 ./osu_bw

also, could you please post a stack trace? I understand the crash location varies between runs, but I'd like to at least understand whether it crashes during MPI_Init(), MPI_Finalize() or during the benchmark itself.

last but not least, which UCX version are you using ?

My system runs the same OFED release (with mlx4 hardware) but I am unable to reproduce a crash, even with osu_bw and the same configure command line.

valgrind reports tons of warnings, but I cannot tell the legitimate ones from noise (e.g. verbs writes some results back directly into user space, and valgrind incorrectly reports that as use of uninitialized memory ...)

@hoopoepg
Contributor

UCX - master
mpirun -mca pml ob1 -mca btl_openib_allow_ib false -np 2 ./osu_bw crashed on MPI_Finalize:

#0  0x00007f3919df0118 in ?? () from /usr/lib64/libgcc_s.so.1
#1  0x00007f3919df1019 in _Unwind_Backtrace () from /usr/lib64/libgcc_s.so.1
#2  0x00007f391990e376 in backtrace () from /usr/lib64/libc.so.6
#3  0x00007f390454704a in ucs_debug_backtrace_create () at debug/debug.c:317
#4  0x00007f39045476da in ucs_debug_show_innermost_source_file (stream=0x7f3919bbf1c0 <_IO_2_1_stderr_>) at debug/debug.c:512
#5  0x00007f3904548914 in ucs_handle_error (error_type=0x7f39046011c4 "address not mapped to object", message=0x7f3904601330 "Caught signal %d (%s: %s%s)")
    at debug/debug.c:993
#6  0x00007f3904548589 in ucs_debug_handle_error_signal (signo=11, cause=0x7f39046011c4 "address not mapped to object", fmt=0x7f390460134d " at address %p")
    at debug/debug.c:934
#7  0x00007f39045486cf in ucs_error_signal_handler (signo=11, info=0x7f39140cb5f0, context=0x7f39140cb4c0) at debug/debug.c:957
#8  <signal handler called>
#9  0x00007f39077b4580 in ?? ()
#10 0x00007f390d16d6e2 in device_destruct (device=0x1d68a70) at btl_openib_component.c:1026
#11 0x00007f390d161a85 in opal_obj_run_destructors (object=0x1d68a70) at ../../../../opal/class/opal_object.h:462
#12 0x00007f390d167c26 in mca_btl_openib_finalize_resources (btl=0x1d69fd0) at btl_openib.c:1727
#13 0x00007f390d167d2c in mca_btl_openib_finalize (btl=0x1d69fd0) at btl_openib.c:1754
#14 0x00007f391922ae3c in mca_btl_base_close () at base/btl_base_frame.c:203
#15 0x00007f391920e51a in mca_base_framework_close (framework=0x7f39194dfd00 <opal_btl_base_framework>) at mca_base_framework.c:218
#16 0x00007f391a6e6a56 in mca_bml_base_close () at base/bml_base_frame.c:130
#17 0x00007f391920e51a in mca_base_framework_close (framework=0x7f391a9bdd20 <ompi_bml_base_framework>) at mca_base_framework.c:218
#18 0x00007f391a663115 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:449
#19 0x00007f391a6a1299 in PMPI_Finalize () at pfinalize.c:54
#20 0x00000000004019fe in main (argc=1, argv=0x7ffe75da5618) at osu_bw.c:152

mpirun -mca pml ob1 -mca btl_openib_allow_ib true -np 2 ./osu_bw:

#0  0x00007f2fa2f87118 in ?? () from /usr/lib64/libgcc_s.so.1
#1  0x00007f2fa2f88019 in _Unwind_Backtrace () from /usr/lib64/libgcc_s.so.1
#2  0x00007f2fa2aa5376 in backtrace () from /usr/lib64/libc.so.6
#3  0x00007f2f8d76e04a in ucs_debug_backtrace_create () at debug/debug.c:317
#4  0x00007f2f8d76e6da in ucs_debug_show_innermost_source_file (stream=0x7f2fa2d561c0 <_IO_2_1_stderr_>) at debug/debug.c:512
#5  0x00007f2f8d76f914 in ucs_handle_error (error_type=0x7f2f8d8281e8 "invalid permissions for mapped object", message=0x7f2f8d828330 "Caught signal %d (%s: %s%s)") at debug/debug.c:993
#6  0x00007f2f8d76f589 in ucs_debug_handle_error_signal (signo=11, cause=0x7f2f8d8281e8 "invalid permissions for mapped object", fmt=0x7f2f8d82834d " at address %p") at debug/debug.c:934
#7  0x00007f2f8d76f6cf in ucs_error_signal_handler (signo=11, info=0x7f2f9c03b5f0, context=0x7f2f9c03b4c0) at debug/debug.c:957
#8  <signal handler called>
#9  0x00007f2f94aa6ab0 in ?? ()
#10 0x00007f2f95b42abe in __ibv_common_reg_mr () from /usr/lib64/libibverbs.so.1
#11 0x00007f2f95b42b94 in ibv_reg_mr () from /usr/lib64/libibverbs.so.1
#12 0x00007f2f96392a38 in openib_reg_mr (reg_data=0x860aa0, base=0x8c7000, size=4096, reg=0x8c8080) at btl_openib_component.c:556
#13 0x00007f2f96becc61 in mca_rcache_grdma_register (rcache=0x861c80, addr=0x8c7000, size=4096, flags=17, access_flags=15, reg=0x7ffc7b0c9008) at rcache_grdma_module.c:345
#14 0x00007f2fa2350942 in opal_free_list_grow_st (flist=0x860ee0, num_elements=64, item_out=0x0) at class/opal_free_list.c:224
#15 0x00007f2fa235060e in opal_free_list_init (flist=0x860ee0, frag_size=488, frag_alignment=64, frag_class=0x7f2f965d6ac0 <mca_btl_openib_send_control_frag_t_class>, payload_buffer_size=28, payload_buffer_alignment=64, num_elements_to_alloc=8, max_elements_to_alloc=-1, num_elements_per_alloc=32,
    mpool=0x7f2fa2678b80 <mca_mpool_malloc_module>, rcache_reg_flags=0, rcache=0x861c80, item_init=0x7f2f963a21d4 <mca_btl_openib_frag_init>, ctx=0x8bea00) at class/opal_free_list.c:158
#16 0x00007f2f9638babd in prepare_device_for_use (device=0x860aa0) at btl_openib.c:744
#17 0x00007f2f9638ca86 in mca_btl_openib_add_procs (btl=0x862070, nprocs=2, procs=0x8b4590, peers=0x8b4570, reachable=0x7ffc7b0c9350) at btl_openib.c:1070
#18 0x00007f2f967dfe8f in mca_bml_r2_add_procs (nprocs=2, procs=0x8b4590, reachable=0x7ffc7b0c9350) at bml_r2.c:521
#19 0x00007f2f94ee16a4 in mca_pml_ob1_add_procs (procs=0x8b7080, nprocs=2) at pml_ob1.c:335
#20 0x00007f2fa37f8ae6 in ompi_mpi_init (argc=1, argv=0x7ffc7b0c96a8, requested=0, provided=0x7ffc7b0c952c, reinit_ok=false) at runtime/ompi_mpi_init.c:854
#21 0x00007f2fa3845697 in PMPI_Init (argc=0x7ffc7b0c957c, argv=0x7ffc7b0c9570) at pinit.c:67
#22 0x000000000040153c in main (argc=1, argv=0x7ffc7b0c96a8) at osu_bw.c:39

mpirun --mca pml ucx --mca btl_openib_allow_ib false -np 2 ./osu_bw:

#0  0x00007f5cceb9ff0d in pause () from /usr/lib64/libpthread.so.0
#1  0x00007f5cbb475d29 in ucs_debug_freeze () at debug/debug.c:710
#2  0x00007f5cbb4761c2 in ucs_error_freeze (error_type=0x7f5cbb52f024 "illegal operand",
    message=0x7f5cc8072d60 "Caught signal 4 (Illegal instruction: illegal operand)") at debug/debug.c:829
#3  0x00007f5cbb476958 in ucs_handle_error (error_type=0x7f5cbb52f024 "illegal operand", message=0x7f5cbb52f330 "Caught signal %d (%s: %s%s)")
    at debug/debug.c:997
#4  0x00007f5cbb476589 in ucs_debug_handle_error_signal (signo=4, cause=0x7f5cbb52f024 "illegal operand", fmt=0x7f5cbb52f34c "") at debug/debug.c:934
#5  0x00007f5cbb47660b in ucs_error_signal_handler (signo=4, info=0x7f5cc80735f0, context=0x7f5cc80734c0) at debug/debug.c:945
#6  <signal handler called>
#7  0x00007f5cc088fcd2 in ucp_rndv_progress_rma_get_zcopy_inner (self=0x114ef80) at tag/rndv.c:396
#8  ucp_rndv_progress_rma_get_zcopy (self=0x7f5cceb8a7b8 <main_arena+88>) at tag/rndv.c:323
#9  0x0000000000000000 in ?? ()

mpirun --mca pml ucx --mca btl_openib_allow_ib true -np 2 ./osu_bw:

#0  0x00007fde13057f0d in pause () from /usr/lib64/libpthread.so.0
#1  0x00007fddff88ad29 in ucs_debug_freeze () at debug/debug.c:710
#2  0x00007fddff88b1c2 in ucs_error_freeze (error_type=0x7fddff944024 "illegal operand",
    message=0x7fde0c027d60 "Caught signal 4 (Illegal instruction: illegal operand)") at debug/debug.c:829
#3  0x00007fddff88b958 in ucs_handle_error (error_type=0x7fddff944024 "illegal operand", message=0x7fddff944330 "Caught signal %d (%s: %s%s)")
    at debug/debug.c:997
#4  0x00007fddff88b589 in ucs_debug_handle_error_signal (signo=4, cause=0x7fddff944024 "illegal operand", fmt=0x7fddff94434c "") at debug/debug.c:934
#5  0x00007fddff88b60b in ucs_error_signal_handler (signo=4, info=0x7fde0c0285f0, context=0x7fde0c0284c0) at debug/debug.c:945
#6  <signal handler called>
#7  0x00007fde04e93cd2 in ucp_rndv_progress_rma_get_zcopy_inner (self=0x19aff50) at tag/rndv.c:396
#8  ucp_rndv_progress_rma_get_zcopy (self=0x7fde130427b8 <main_arena+88>) at tag/rndv.c:323
#9  0x0000000000000000 in ?? ()

@hoopoepg
Contributor

MPI is master + your patch

@ggouaillardet
Contributor Author

Thanks for the traces ! I am still unable to reproduce the issue, but I might have a lead on what is going on.
That being said, I doubt this PR is the real issue, and I am starting to think that the issue was already there but was hidden by default.

Anyway, could you please apply the following patch (it only collects traces) on top of master and run:

$ mpirun --mca pml ob1 --mca btl_openib_allow_ib false --mca btl_base_verbose 1 -np 2 ./osu_bw 2>&1 | grep openib
$ mpirun --mca pml ob1 --mca btl_openib_allow_ib true --mca btl_base_verbose 1 -np 2 ./osu_bw 2>&1 | grep openib

out of curiosity, does the command below crash?

$ mpirun --mca pml ob1 --mca btl_openib_allow_ib true --mca btl ^uct -np 2 ./osu_bw

if you get a chance, could you check out master, revert 0a2ce58 and run the same command?
I have a hard time believing this PR can break anything if InfiniBand is allowed by btl/openib.

@hoopoepg
Contributor

@ggouaillardet
mpirun --mca pml ob1 --mca btl_openib_allow_ib true --mca btl ^uct -np 2 ./osu_bw works fine

[user]$ mpirun --mca pml ob1 --mca btl_openib_allow_ib false --mca btl_base_verbose 1 -np 2 ./osu_bw 2>&1 | grep openib
[jazz18][[3227,1],0][btl_openib_component.c:684:init_one_port] looking for mlx5_1:1 GID index 0
[jazz18][[3227,1],1][btl_openib_component.c:684:init_one_port] looking for mlx5_1:1 GID index 0
[jazz18][[3227,1],1][btl_openib_component.c:715:init_one_port] my IB subnet_id for HCA mlx5_1 port 1 is 0000000000000000
[jazz18][[3227,1],0][btl_openib_component.c:715:init_one_port] my IB subnet_id for HCA mlx5_1 port 1 is 0000000000000000
[jazz18][[3227,1],1][btl_openib_ip.c:363:add_rdma_addr] Adding addr 2.1.3.18 (0x12030102) subnet 0x2010300 as mlx5_1:1
[jazz18][[3227,1],0][btl_openib_ip.c:363:add_rdma_addr] Adding addr 2.1.3.18 (0x12030102) subnet 0x2010300 as mlx5_1:1
[jazz18][[3227,1],1][btl_openib_ip.c:363:add_rdma_addr] Adding addr 1.1.3.18 (0x12030101) subnet 0x1010300 as mlx5_0:1
[jazz18][[3227,1],0][btl_openib_ip.c:363:add_rdma_addr] Adding addr 1.1.3.18 (0x12030101) subnet 0x1010300 as mlx5_0:1
[jazz18][[3227,1],1][btl_openib_component.c:1401:setup_qps] srq: rd_num is 256 rd_low is 192 sd_max is 128 rd_max is 64 srq_limit is 12
[jazz18][[3227,1],1][btl_openib_component.c:1401:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[jazz18][[3227,1],1][btl_openib_component.c:1401:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[jazz18][[3227,1],1][btl_openib_component.c:1401:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[jazz18][[3227,1],0][btl_openib_component.c:1401:setup_qps] srq: rd_num is 256 rd_low is 192 sd_max is 128 rd_max is 64 srq_limit is 12
[jazz18][[3227,1],1][connect/btl_openib_connect_udcm.c:454:udcm_component_query] UD CPC only supported on InfiniBand; skipped on mlx5_1:1
[jazz18][[3227,1],1][connect/btl_openib_connect_udcm.c:503:udcm_component_query] unavailable for use on mlx5_1:1; skipped
[jazz18][[3227,1],0][btl_openib_component.c:1401:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[jazz18][[3227,1],0][btl_openib_component.c:1401:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[jazz18][[3227,1],0][btl_openib_component.c:1401:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[jazz18][[3227,1],0][connect/btl_openib_connect_udcm.c:454:udcm_component_query] UD CPC only supported on InfiniBand; skipped on mlx5_1:1
[jazz18][[3227,1],0][connect/btl_openib_connect_udcm.c:503:udcm_component_query] unavailable for use on mlx5_1:1; skipped
used on a specific port.  As such, the openib BTL (OpenFabrics
You can override this policy by setting the btl_openib_allow_ib MCA parameter
[hpchead:08929] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[hpchead:08929] 1 more process has sent help message help-mpi-btl-openib.txt / ib port not selected
[user]$ mpirun --mca pml ob1 --mca btl_openib_allow_ib true --mca btl_base_verbose 1 -np 2 ./osu_bw 2>&1 | grep openib
[jazz18][[845,1],0][btl_openib_component.c:684:init_one_port] looking for mlx5_1:1 GID index 0
[jazz18][[845,1],1][btl_openib_component.c:684:init_one_port] looking for mlx5_1:1 GID index 0
[jazz18][[845,1],1][btl_openib_component.c:715:init_one_port] my IB subnet_id for HCA mlx5_1 port 1 is 0000000000000000
[jazz18][[845,1],0][btl_openib_component.c:715:init_one_port] my IB subnet_id for HCA mlx5_1 port 1 is 0000000000000000
[jazz18][[845,1],1][btl_openib_component.c:684:init_one_port] looking for mlx5_0:1 GID index 0
[jazz18][[845,1],0][btl_openib_component.c:684:init_one_port] looking for mlx5_0:1 GID index 0
[jazz18][[845,1],0][btl_openib_component.c:715:init_one_port] my IB subnet_id for HCA mlx5_0 port 1 is fe80000000000000
[jazz18][[845,1],1][btl_openib_component.c:715:init_one_port] my IB subnet_id for HCA mlx5_0 port 1 is fe80000000000000
[jazz18][[845,1],0][btl_openib_ip.c:363:add_rdma_addr] Adding addr 2.1.3.18 (0x12030102) subnet 0x2010300 as mlx5_1:1
[jazz18][[845,1],1][btl_openib_ip.c:363:add_rdma_addr] Adding addr 2.1.3.18 (0x12030102) subnet 0x2010300 as mlx5_1:1
[jazz18][[845,1],0][btl_openib_ip.c:363:add_rdma_addr] Adding addr 1.1.3.18 (0x12030101) subnet 0x1010300 as mlx5_0:1
[jazz18][[845,1],1][btl_openib_ip.c:363:add_rdma_addr] Adding addr 1.1.3.18 (0x12030101) subnet 0x1010300 as mlx5_0:1
[jazz18][[845,1],1][btl_openib_component.c:1401:setup_qps] srq: rd_num is 256 rd_low is 192 sd_max is 128 rd_max is 64 srq_limit is 12
[jazz18][[845,1],1][btl_openib_component.c:1401:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[jazz18][[845,1],1][btl_openib_component.c:1401:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[jazz18][[845,1],1][btl_openib_component.c:1401:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[jazz18][[845,1],0][btl_openib_component.c:1401:setup_qps] srq: rd_num is 256 rd_low is 192 sd_max is 128 rd_max is 64 srq_limit is 12
[jazz18][[845,1],0][btl_openib_component.c:1401:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[jazz18][[845,1],0][btl_openib_component.c:1401:setup_qps] [jazz18][[845,1],1][connect/btl_openib_connect_udcm.c:454:udcm_component_query] UD CPC only supported on InfiniBand; skipped on mlx5_1:1
[jazz18][[845,1],0][btl_openib_component.c:1401:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[jazz18][[845,1],1][connect/btl_openib_connect_udcm.c:503:udcm_component_query] unavailable for use on mlx5_1:1; skipped
[jazz18][[845,1],0][connect/btl_openib_connect_udcm.c:454:udcm_component_query] UD CPC only supported on InfiniBand; skipped on mlx5_1:1
[jazz18][[845,1],0][connect/btl_openib_connect_udcm.c:503:udcm_component_query] unavailable for use on mlx5_1:1; skipped
used on a specific port.  As such, the openib BTL (OpenFabrics
[jazz18][[845,1],1][connect/btl_openib_connect_udcm.c:685:udcm_module_init] created cpc module 0x2065330 for btl 0x2057080
[jazz18][[845,1],0][connect/btl_openib_connect_udcm.c:685:udcm_module_init] created cpc module 0x1e9e380 for btl 0x1e900d0
[jazz18][[845,1],1][connect/btl_openib_connect_udcm.c:929:udcm_module_create_listen_qp] creating listen QP on port 1
[jazz18][[845,1],0][connect/btl_openib_connect_udcm.c:929:udcm_module_create_listen_qp] creating listen QP on port 1
[jazz18][[845,1],1][connect/btl_openib_connect_udcm.c:990:udcm_module_create_listen_qp] listening for connections on lid 7, qpn 252885
[jazz18][[845,1],0][connect/btl_openib_connect_udcm.c:990:udcm_module_create_listen_qp] listening for connections on lid 7, qpn 252884
[jazz18][[845,1],1][connect/btl_openib_connect_udcm.c:754:udcm_module_init] my modex = LID: 7, Port: 1, QPN: 252885, GID: 039a0dec 000080fe
[jazz18][[845,1],0][connect/btl_openib_connect_udcm.c:754:udcm_module_init] my modex = LID: 7, Port: 1, QPN: 252884, GID: 039a0dec 000080fe
[jazz18][[845,1],1][connect/btl_openib_connect_udcm.c:494:udcm_component_query] available for use on mlx5_0:1
[jazz18][[845,1],0][connect/btl_openib_connect_udcm.c:494:udcm_component_query] available for use on mlx5_0:1
[jazz18:19595] [rank=0] openib: using port mlx5_0:1
[jazz18:19596] [rank=1] openib: using port mlx5_0:1
[hpchead:11575] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port

@ggouaillardet
Contributor Author

Thanks ! I just realized I forgot to upload my patch with extended traces ...

Anyway, I think I see what could be causing the issue, and I will upload a fix tomorrow.

ggouaillardet added a commit to ggouaillardet/ompi that referenced this pull request Dec 12, 2018
…led.

Fixes an issue introduced in open-mpi/ompi@0a2ce58

Refs. open-mpi#6137

Signed-off-by: Gilles Gouaillardet <[email protected]>
ggouaillardet added a commit to ggouaillardet/ompi that referenced this pull request Dec 12, 2018
Many thanks to Sergey Oblomov for reporting this issue
and the countless traces provided when troubleshooting it.

Refs. open-mpi#6137

Signed-off-by: Gilles Gouaillardet <[email protected]>
ggouaillardet added a commit to ggouaillardet/ompi that referenced this pull request Mar 19, 2019
…led.

Fixes an issue introduced in open-mpi/ompi@0a2ce58

This is a one-off commit for the v4.0.x branch since btl/openib has been removed from master.

Refs. open-mpi#6137

Signed-off-by: Gilles Gouaillardet <[email protected]>
ggouaillardet added a commit to ggouaillardet/ompi that referenced this pull request Mar 19, 2019
Many thanks to Sergey Oblomov for reporting this issue
and the countless traces provided when troubleshooting it.

This is a one-off commit for the v4.0.x branch since btl/openib has been removed
 from master.

Refs. open-mpi#6137

Signed-off-by: Gilles Gouaillardet <[email protected]>