Treematch issues in master #4303

Closed · rhc54 opened this issue Oct 4, 2017 · 13 comments

@rhc54
Contributor

rhc54 commented Oct 4, 2017

The treematch topology component is segfaulting in master when running MTT:

$ mpirun --oversubscribe --bind-to none   -np 16  topology/distgraph1 
using graph layout 'deterministic complete graph'
testing MPI_Dist_graph_create_adjacent
testing MPI_Dist_graph_create w/ outgoing only
Using default
Using default
Using default
Using default
Using default
Using default
Using default
Using default
Using default
Using default
Using default
Using default
Using default
Using default
Using default
Using default
Using default
Using default
Using default
Using default
Using default
distgraph1: topo_treematch_dist_graph_create.c:656: mca_topo_treematch_dist_graph_create: Assertion `(int)sol->k_length == size' failed.
[rhc001:14020] *** Process received signal ***
[rhc001:14020] Signal: Aborted (6)
[rhc001:14020] Signal code:  (-6)
[rhc001:14020] [ 0] /lib64/libpthread.so.0(+0xf370)[0x7f385b8e8370]
[rhc001:14020] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f385b54d1d7]
[rhc001:14020] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f385b54e8c8]
[rhc001:14020] [ 3] /lib64/libc.so.6(+0x2e146)[0x7f385b546146]
[rhc001:14020] [ 4] /lib64/libc.so.6(+0x2e1f2)[0x7f385b5461f2]
[rhc001:14020] [ 5]
/home/common/openmpi/build/foobar/lib/openmpi/mca_topo_treematch.so(mca_topo_treematch_dist_graph_create+0x21e9)[0x7f384396e702]
[rhc001:14020] [ 6]
/home/common/openmpi/build/foobar/lib/libmpi.so.0(PMPI_Dist_graph_create+0x44d)[0x7f385bb7a83e]
[rhc001:14020] [ 7] topology/distgraph1[0x40219e]
[rhc001:14020] [ 8] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f385b539b35]
[rhc001:14020] [ 9] topology/distgraph1[0x400fc9]
[rhc001:14020] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node rhc001 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
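
For reference, the crash happens inside MPI_Dist_graph_create with reordering enabled, which is the path handled by the treematch component. Below is a minimal sketch of the kind of complete-graph call the distgraph1 test appears to exercise (illustrative only, not the actual test source):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Complete graph: this rank declares an edge to every rank. */
    int *dests   = malloc(size * sizeof(int));
    int *weights = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) { dests[i] = i; weights[i] = 1; }

    int degree = size;
    MPI_Comm dist_comm;
    /* reorder = 1 asks the library to remap ranks, which is what
     * routes the request through mca_topo_treematch_dist_graph_create. */
    MPI_Dist_graph_create(MPI_COMM_WORLD, 1, &rank, &degree, dests,
                          weights, MPI_INFO_NULL, 1, &dist_comm);

    MPI_Comm_free(&dist_comm);
    free(dests);
    free(weights);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with an mpirun line like the one above, something of this shape should hit the same code path.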
@bosilca
Member

bosilca commented Oct 5, 2017

We're looking into it, but I can't replicate it on any of my machines. The execution path in treematch depends on the local architecture of the platform. Can we have a description of the setting where this assert triggered?

@rhc54
Contributor Author

rhc54 commented Oct 5, 2017

Here is the topology (local.txt is the XML version).

local.pdf

local.txt

Let me know if I can provide any debug output.
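
For reference, hwloc's lstopo tool produces this kind of XML dump, along the lines of:

$ lstopo local.xml

(the output format is inferred from the .xml extension; exact options may vary across hwloc versions).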

@GuillaumeMercier
Contributor

I'm unable to replicate this issue on our machines. The test program works fine. However, it fails when I try your topology (local.txt) with this error:

topology discovery failed
--> Returned value Not supported (-8) instead of ORTE_SUCCESS

I need to investigate this more.

@rhc54
Contributor Author

rhc54 commented Dec 12, 2017

Here's a little more info from a failure on AWS:

using graph layout 'deterministic complete graph'
testing MPI_Dist_graph_create_adjacent
testing MPI_Dist_graph_create w/ outgoing only
========== Centralized Reordering ========= 
*** Error: Core numbering not between 0 and 27: tab_node[18]=33    <<<<====== NOTE
nb_constraints = 0, N= 2; nb_processing units = 27                 <<<<====== NOTE
Error : More processes (2) than number of constraints (0)!         <<<<====== NOTE
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[40635,1],0]
  Exit code:    255

@jsquyres
Member

This same distgraph1 test is doing odd things for me at Cisco on a pair of 16-core nodes.

Here's a gist (https://gist.github.com/jsquyres/098f256cead9d20d2ad1c3aea0e6b0be) showing:

  • lstopo for the 2 nodes where the test was run
  • output from the test

The test doesn't actually fail for me, but it does give ~21K lines like this:

[mpi025:15395] Unable to extract peer [[30746,1],2] nodeid from the modex.

And yes, I mean approximately twenty-one thousand lines like this.

Here's the exact mpirun command I used (inside a SLURM allocation containing these 2 nodes):

mpirun --mca btl tcp,vader,self distgraph1

@GuillaumeMercier
Contributor

@jsquyres: George and I are seeing the same output.
@rhc54: can you give me your mpiexec command line as well? Thanks.

@rhc54
Contributor Author

rhc54 commented Dec 12, 2017

For that AWS error report? It was from the v3.0 branch, and simply:

$ mpirun -n 2 topology/distgraph1

Here's another one, this from the v3.1.x branch:

$ mpirun -n 2 topology/distgraph1 
testing MPI_Dist_graph_create_adjacent
testing MPI_Dist_graph_create w/ outgoing only
Error: Cannot partition 2 elements in 3 parts
[ip-172-31-72-214:68988] *** Process received signal ***
[ip-172-31-72-214:68988] Signal: Segmentation fault (11)
[ip-172-31-72-214:68988] Signal code: Address not mapped (1)
[ip-172-31-72-214:68988] Failing at address: (nil)
Using default
Using default
[ip-172-31-72-214:68988] [ 0] /lib64/libpthread.so.0(+0xf5a0)[0x7f06376965a0]
[ip-172-31-72-214:68988] [ 1]
/home/ec2-user/mtt-scratch/installs/xnS2/install/lib/openmpi/mca_topo_treematch.so(split_com_mat+0x10c)[0x7f0624c787c3]
[ip-172-31-72-214:68988] [ 2]
/home/ec2-user/mtt-scratch/installs/xnS2/install/lib/openmpi/mca_topo_treematch.so(kpartition_build_level_topology+0x141)[0x7f0624c78ed0]
[ip-172-31-72-214:68988] [ 3]
/home/ec2-user/mtt-scratch/installs/xnS2/install/lib/openmpi/mca_topo_treematch.so(kpartition_build_level_topology+0x2a7)[0x7f0624c79036]
[ip-172-31-72-214:68988] [ 4]
/home/ec2-user/mtt-scratch/installs/xnS2/install/lib/openmpi/mca_topo_treematch.so(kpartition_build_level_topology+0x2a7)[0x7f0624c79036]
[ip-172-31-72-214:68988] [ 5]
/home/ec2-user/mtt-scratch/installs/xnS2/install/lib/openmpi/mca_topo_treematch.so(kpartition_build_tree_from_topology+0x280)[0x7f0624c79360]
[ip-172-31-72-214:68988] [ 6]
/home/ec2-user/mtt-scratch/installs/xnS2/install/lib/openmpi/mca_topo_treematch.so(tm_build_tree_from_topology+0x1cb)[0x7f0624c75339]
[ip-172-31-72-214:68988] [ 7]
/home/ec2-user/mtt-scratch/installs/xnS2/install/lib/openmpi/mca_topo_treematch.so(mca_topo_treematch_dist_graph_create+0x213e)[0x7f0624c67526]
[ip-172-31-72-214:68988] [ 8]
/home/ec2-user/mtt-scratch/installs/xnS2/install/lib/libmpi.so.0(MPI_Dist_graph_create+0x45e)[0x7f063792ac20]
[ip-172-31-72-214:68988] [ 9] topology/distgraph1[0x40209a]
[ip-172-31-72-214:68988] [10] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f06372e4c05]
[ip-172-31-72-214:68988] [11] topology/distgraph1[0x400ec9]
[ip-172-31-72-214:68988] *** End of error message ***
-------------------------------------------------------

And this is from master:

$ mpirun -n 2 topology/distgraph1 
using graph layout 'deterministic complete graph'
testing MPI_Dist_graph_create_adjacent
testing MPI_Dist_graph_create w/ outgoing only
Using default
Using default
Using default
Using default
Using default
Using default
Using default
Using default
Using default
distgraph1: topo_treematch_dist_graph_create.c:656: mca_topo_treematch_dist_graph_create: Assertion
`(int)sol->k_length == size' failed.
[ip-172-31-72-214:98463] *** Process received signal ***
[ip-172-31-72-214:98463] Signal: Aborted (6)
[ip-172-31-72-214:98463] Signal code:  (-6)
[ip-172-31-72-214:98463] [ 0] /lib64/libpthread.so.0(+0xf5a0)[0x7f65f19515a0]
[ip-172-31-72-214:98463] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f65f15b31f7]
[ip-172-31-72-214:98463] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f65f15b48e8]
[ip-172-31-72-214:98463] [ 3] /lib64/libc.so.6(+0x2e266)[0x7f65f15ac266]
[ip-172-31-72-214:98463] [ 4] /lib64/libc.so.6(+0x2e312)[0x7f65f15ac312]
[ip-172-31-72-214:98463] [ 5]
/home/ec2-user/mtt-scratch/installs/zFDF/install/lib/openmpi/mca_topo_treematch.so(mca_topo_treematch_dist_graph_create+0x219e)[0x7f65def4e562]
[ip-172-31-72-214:98463] [ 6]
/home/ec2-user/mtt-scratch/installs/zFDF/install/lib/libmpi.so.0(MPI_Dist_graph_create+0x45e)[0x7f65f1be638d]
[ip-172-31-72-214:98463] [ 7] topology/distgraph1[0x40209a]
[ip-172-31-72-214:98463] [ 8] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f65f159fc05]
[ip-172-31-72-214:98463] [ 9] topology/distgraph1[0x400ec9]
[ip-172-31-72-214:98463] *** End of error message ***
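
A note on the "Cannot partition 2 elements in 3 parts" error in the v3.1.x trace above: n elements can only be split into at most n non-empty parts, so the recursion in kpartition_build_level_topology presumably needs to stop, or clamp its branching factor, once a group gets smaller than the requested number of parts. A generic sketch of that invariant (hypothetical code, not the actual treematch source):

/* Hypothetical guard illustrating the invariant only:
 * splitting n elements into k non-empty parts requires 0 < k <= n. */
static int can_partition(int n_elems, int k_parts)
{
    return (k_parts > 0) && (k_parts <= n_elems);
}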

@GuillaumeMercier
Contributor

@rhc54: the issue on master is resolved: that assert should not have been there in the first place.
For v3.1.x, I still need to check.

@bwbarrett
Member

bwbarrett commented Dec 19, 2017

When I compile ompi with ./configure --prefix=<blah> from a git build, the error doesn't happen. When I compile ompi with ./configure CFLAGS=-pipe --enable-picky --enable-debug --prefix=$HOME/install, the error Ralph copy-and-pasted occurs.
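
That difference is consistent with the failing check being a plain assert(): builds without --enable-debug presumably define NDEBUG, and standard C makes assert() a no-op in that case. A minimal illustration of that standard behavior (generic C, not Open MPI code):

/* If NDEBUG is defined before <assert.h> is included, assert()
 * expands to ((void)0) and the check vanishes from the binary. */
#include <assert.h>

void check_solution(int k_length, int size)
{
    assert(k_length == size);  /* fires only in builds where NDEBUG is unset */
}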

@bosilca
Member

bosilca commented Dec 20, 2017

I have created PR #4644 to address the issue highlighted here.

@rhc54
Contributor Author

rhc54 commented Dec 20, 2017

Thanks George!

@jsquyres
Member

jsquyres commented Jan 9, 2018

Per 2018-01-09 teleconf, @bwbarrett will follow up with @bosilca on this one.

@bwbarrett
Member

Reading the history in this ticket, clearly a fix is needed for v3.1.x, and it looks like it was never pulled. @bosilca, can you file a PR to merge the changes into v3.1.x? I'm not seeing the failure on v3.0.x, so I believe it's not needed there?
