
Problem with multinode running with iimpi-2020a #10899

Closed · jhein32 opened this issue Jul 1, 2020 · 16 comments

@jhein32
Collaborator

jhein32 commented Jul 1, 2020

As already mentioned on Slack, we have issues with MPI executables built against iimpi-2020a starting up across multiple nodes. Within a single node I am not aware of any issues.

The problem seems to be associated with the UCX/1.8.0 dependency. Executables using iimpi/2020.00, which uses Intel MPI 2019.6 without a UCX dependency, do work across nodes. Also, if I "massage" the easyconfig impi-2019.7.217-iccifort-2020.1.217.eb and comment out the line

 ('UCX', '1.8.0'),

in the dependencies list, a basic hello-world code or the HPL for intel/2020a will run. The performance, however, is 10% poorer than an HPL built with intel/2017b. Using the HPL from PR #10864, the performance is within 1% of the intel/2017b build.

A few details on our cluster: the system uses Intel Xeon E5-2650 v3 (Haswell) CPUs and 4x FDR InfiniBand. We run CentOS 7 (currently 7.6 or 7.8) with Linux kernel 3.10 and the InfiniBand stack that ships with CentOS. Slurm is set up with cgroups for process control and accounting
(TaskPlugin=task/cgroup, ProctrackType=proctrack/cgroup). Our Slurm is quite old: 17.02.

To get Intel MPI started, I add (by hand, in an editor)

setenv("I_MPI_PMI_LIBRARY", "/lib64/libpmi.so")

to the impi modules (we have versions as far back as iimpi/7.3.5, predating iimpi/2016b). I tested multiple times, but libpmi2.so does not work for us. Of the methods for starting an Intel MPI job described in the Slurm guide, only srun works for us; we never got Hydra or MPD to work. I also tested setting 'I_MPI_HYDRA_TOPOLIB': 'ipl', which does not help at all.
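
For what it is worth, instead of hand-editing the generated module files, the same setenv could be carried in the easyconfig itself via modextravars; a minimal sketch, assuming /lib64/libpmi.so is the correct path on the system:

 modextravars = {'I_MPI_PMI_LIBRARY': '/lib64/libpmi.so'}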

When running I load:

ml iccifort/2020.1.217 impi/2019.7.217

The modules are built with unmodified easyconfigs from EB 4.2.1. When compiling and running a simple MPI hello-world code, I get the following in stdout:

[1593610029.017424] [au220:9811 :0]         select.c:433  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy
[1593610029.017913] [au219:19723:0]         select.c:433  UCX  ERROR no active messages transport to <no debug data>: posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory - Destination is unreachable, sockcm/sockaddr - no am bcopy, rdmacm/sockaddr - no am bcopy, cma/memory - no am bcopy

and this in stderr:

Abort(1091215) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........: 
MPID_Init(904)...............: 
MPIDI_OFI_mpi_init_hook(1471): OFI get address vector map failed
In: PMI_Abort(1091215, Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........: 
MPID_Init(904)...............: 
MPIDI_OFI_mpi_init_hook(1471): OFI get address vector map failed)
slurmstepd: error: *** STEP 4574138.0 ON au219 CANCELLED AT 2020-07-01T15:27:09 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
Abort(1091215) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........: 
MPID_Init(904)...............: 
MPIDI_OFI_mpi_init_hook(1471): OFI get address vector map failed
In: PMI_Abort(1091215, Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........: 
MPID_Init(904)...............: 
MPIDI_OFI_mpi_init_hook(1471): OFI get address vector map failed)
srun: error: au219: task 0: Killed
srun: error: au220: task 1: Exited with exit code 143

Ok, that went long. Any suggestions would be highly appreciated.

@Micket
Contributor

Micket commented Jul 1, 2020

@jhein32 I have similar hardware. I'll see if I can reproduce this when I'm back from vacation (during July).

@boegel boegel added this to the 4.x milestone Jul 1, 2020
@boegel
Member

boegel commented Jul 1, 2020

@jhein32 We added UCX because it was recommended by Intel, see #10280 and https://software.intel.com/content/www/us/en/develop/articles/improve-performance-and-stability-with-intel-mpi-library-on-infiniband.html (where UCX is even listed as required).

Have you reported this to Intel support?

@jhein32
Collaborator Author

jhein32 commented Jul 2, 2020

@boegel Thanks for getting involved. As written above, the performance of Intel MPI 2019.7 without UCX is poor, so the decision to include UCX is correct. If you don't want UCX, in my current view Intel MPI 18.5 is the choice (which can't be your choice forever). From the error messages I am wondering whether the issue sits in UCX rather than in Intel MPI. OFI is also mentioned; who provides that: Intel MPI, UCX, or CentOS? Does anyone here have any clues?

@jhein32
Collaborator Author

jhein32 commented Jul 2, 2020

We haven't yet engaged with Intel.

@jhein32
Collaborator Author

jhein32 commented Jul 2, 2020

Hi,

We (LUNAC team members) had a virtual six-hands-one-keyboard session (via Zoom, thanks to COVID-19) and went over the error messages and the information available in the docs shared here and in the easyconfigs.

Our hardware is a bit old (2015 or 2016), so we get:

-bash-4.2$ ucx_info -d | grep Transport
#   Transport: posix
#   Transport: sysv
#   Transport: self
#   Transport: tcp
#   Transport: tcp
#   Transport: tcp
#   Transport: tcp
#   Transport: rc_verbs
#   Transport: ud_verbs
#   Transport: cma

Intel writes that the output should include the dc, rc, and ud transports. Our hardware lacks dc, which, according to Intel, is a common issue with older hardware. They recommend setting:

export UCX_TLS=rc,ud,sm,self

Intel calls this a workaround.
When we set it, we can run an MPI hello world; if we unset it, the run fails again. I still need to do a performance test to see how it compares to older impi versions.

Assuming that goes well, here are two questions/tasks for EB:

  • Do we include an automatic check of ucx_info -d for the dc layer and set the variable? (A rough sketch follows below.)
  • If so, where should that go? UCX module or impi module?
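
A rough sketch of what such an automatic check could look like; the helper below is purely illustrative (not existing EasyBuild code) and, notably, only probes the node it runs on:

import subprocess

def dc_transport_available():
    """Return True if ucx_info reports a 'dc' transport on this host."""
    out = subprocess.run(['ucx_info', '-d'], capture_output=True, text=True).stdout
    return any('Transport: dc' in line for line in out.splitlines())

# Restrict UCX_TLS only on hardware without dc (e.g. ConnectX-3/ConnectX-4):
if not dc_transport_available():
    modextravars = {'UCX_TLS': 'rc,ud,sm,self'}

That also shows the catch: the result depends on where the check runs, so a check at build time would only reflect the build node, not the compute nodes.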

@bartoldeman
Contributor

Yes "dc" is a little complex. It used to only work if you use MOFED, but now dc support is upstream and backported in newer CentOS (7.7 has it, not sure about 7.6, 7.5 and older definitely not). We have a cluster without dc as well running CentOS 7.8, will check it there later today (lspci | grep Mellanox reports

$ lspci | grep Mell
02:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

there.

Setting UCX_TLS would be appropriate in the UCX module (as Open MPI uses UCX too and can't use dc either); ideally, though, UCX should auto-detect the lack of dc so that the env var isn't needed.

@lexming
Contributor

lexming commented Jul 2, 2020

Setting UCX_TLS is documented by Intel in https://software.intel.com/content/www/us/en/develop/articles/improve-performance-and-stability-with-intel-mpi-library-on-infiniband.html

AFAIK disabling dc with UCX_TLS=rc,ud,sm,self is only necessary with ConnectX-4 and older. Since this setting depends on the capabilities of the hardware, defining anything in EB will cause issues one way or another. I recommend not setting anything on the EB side.

For instance, we have nodes with ConnectX-4 and ConnectX-5. We only disable dc through UCX_TLS on the older ConnectX-4.

@jhein32
Collaborator Author

jhein32 commented Jul 3, 2020

Hi, the performance test was ok.

We noticed that the Intel MPI library is happy with UCX_TLS set to either dc,rc,ud,sm,self or rc,ud,sm,self; it is unhappy with the variable unset.

When it runs, we get a warning:
WARNING: release_mt library was used but no multi-ep feature was enabled. Please use release library instead.
which we still need to get to the bottom of.

I also looked at the foss/2020a Linpack, which uses the same UCX for OpenMPI. foss runs with UCX_TLS unset, set to dc,rc,ud,sm,self, or set to rc,ud,sm,self. If dc is on the list, it gives a warning. So this appears smarter than the Intel MPI.

Based on this, I feel UCX_TLS should be set in the Intel MPI module.

@lexming
Contributor

lexming commented Jul 3, 2020

@jhein32 The warning message regarding release_mt has been recently fixed in easybuilders/easybuild-easyblocks#2080

@Micket
Contributor

Micket commented Jul 8, 2020

I can just confirm that we see the same issues on our older cluster.
So, the options are

  1. Setting UCX_TLS=rc,ud,sm,self in the UCX module.
  2. Not doing anything and relying on sysadmins to set this environment variable on their older machines by some other means.
    (It does not belong in the Intel MPI module.)

Putting in something like this

# For systems with ConnectX-4 (or older) interconnect, you need to disable "dc", else Intel MPI will try to use it and fail.
# Uncomment the line to disable dc:
# modextravars = {'UCX_TLS': 'rc,ud,sm,self'}

into the UCX config sounds about right to me.

@jhein32
Collaborator Author

jhein32 commented Aug 10, 2020

This is still open - I didn't get around to finishing this off before the summer.

Based on my current understanding of the issue, I would like to add a comment, as proposed by @Micket, to a relevant config. However, I feel that the UCX module is not the correct place; to me this looks like an Intel MPI issue. With the standard UCX module, as reported, the OpenMPI in foss seems to work well. It is only the Intel MPI that needs this kind of help. My proposal would be to amend the Intel MPI module.

In addition, when issues are encountered with Intel MPI, that module is where a user would look for hints first; it would take some poking around before they got to examining the UCX config.

Any opinions on the above?

@lexming
Contributor

lexming commented Sep 25, 2020

@jhein32 Can you test this again with the new version of IMPI in #11337? I think that you won't have any issues now. As far as I can tell, Intel has disabled the mlx provider and verbs is now used for ConnectX HCAs, so UCX is not used at all and there is no need to set UCX_TLS.

@boegel
Member

boegel commented Oct 9, 2020

@jhein32 How should we proceed with this?

@lexming
Contributor

lexming commented Oct 9, 2020

I have to correct my previous statement. After further investigation, the mlx provider is still used by IMPI even though it is not reported by fi_info (see #11337 (comment)). However, setting UCX_TLS no longer seems to be strictly necessary. I tested it on systems with ConnectX-4 and ConnectX-5 and, in both cases, IMPI selects the appropriate transport seamlessly.

@jhein32
Collaborator Author

jhein32 commented Oct 12, 2020

Hi,

I installed HPL and its prerequisites from PR #11337. I added the setenv("I_MPI_PMI_LIBRARY", "/lib64/libpmi.so") by hacking the impi module file, as I have always done for the Intel MPI modules for something like 5 years, so I am fine with that. Following this, I could run the HPL on 2 nodes without any further modification. I couldn't do that with intel/2020a without massaging UCX.

So I am happy to proceed.

@lexming
Contributor

lexming commented Nov 27, 2020

@jhein32 The impi easyblock has been updated to set UCX_TLS=all if UCX is in its dependency list (easybuilders/easybuild-easyblocks#2253). This fix works with all hardware configurations. Just be sure to reinstall impi from intel/2020a onwards with the updated easyblock. Thanks for reporting this issue.
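
For reference, the gist of that change is roughly as follows; this is a simplified sketch of a make_module_extra override in the impi easyblock, not the actual code from easybuilders/easybuild-easyblocks#2253, and the exact helper calls are assumptions on my part:

from easybuild.tools.modules import get_software_root

def make_module_extra(self):
    """When UCX is among the dependencies, let it pick any transport the hardware supports."""
    txt = super(EB_impi, self).make_module_extra()
    if get_software_root('UCX'):
        # 'all' leaves transport selection entirely to UCX at run time
        txt += self.module_generator.set_environment('UCX_TLS', 'all')
    return txt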
