
netcvode test tolerances may be overly generous to accommodate NVIDIA compilers #1823

Open
olupton opened this issue May 17, 2022 · 1 comment
Labels
bug coverage gpu improvement Improvement over existing implementation testing

Comments

olupton commented May 17, 2022

Context

The test introduced in #1752 did not pass when the NVIDIA HPC compilers nvc and nvc++ were used. #1822 was introduced to make the tests pass by tolerating larger differences.

Overview of the issue

Some tolerances, in particular this 25% margin:

# Note the very large tolerance, this seems relatively unstable with the
# NVIDIA compilers
chk(key + " tvec size", tvec.size(), tol=0.25)

seem worthy of further investigation. Some numerical divergence between compilers is expected, but a 25% margin is unusually large.
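For context, here is a minimal sketch of what a relative-tolerance check like the quoted `chk` call could amount to. This is a hypothetical reimplementation: the real helper in the NEURON test suite is more elaborate and compares against stored reference data, so the signature and the reference value below are assumptions.

```python
def chk(key, value, reference, tol=1e-10):
    """Hypothetical sketch of a relative-tolerance check.

    The real `chk` helper in the NEURON test suite reads its reference
    values from stored data; `reference` here is an assumption.
    """
    rel_err = abs(value - reference) / max(abs(reference), 1.0)
    assert rel_err <= tol, (
        f"{key}: {value} vs {reference} (relative error {rel_err:.3%})"
    )

# With tol=0.25, a reference tvec of 1000 entries would tolerate anything
# from 750 to 1250 entries -- a very wide window for a size check.
chk("tvec size", 1249, 1000, tol=0.25)
```

This illustrates why a 0.25 tolerance on a vector size is striking: the number of recorded time points should not depend that strongly on the compiler.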

See the discussion on the follow-up PR here: #1822 (comment).

Expected result/behaviour

The results obtained with the NVIDIA compilers should be compatible with those obtained with GCC/Intel/Clang, with a margin plausibly attributable to compiler optimisations (e.g. FMA contraction and fast-math transformations).

Assuming the difference is understood and fixed at source, the test tolerance should be reduced again.
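As an illustration of the scale of plausible optimisation effects (not a claim about the actual cause of this issue): merely re-associating a floating-point sum, which fast-math style optimisations are free to do, typically shifts the result by something on the order of machine epsilon scaled by the conditioning of the sum, nowhere near 25%. A quick sketch:

```python
import math
import random

# Summing identical data in two different orders models the kind of
# re-association that fast-math / FMA-contracting compilers may perform.
random.seed(42)
values = [random.uniform(0.0, 1.0) for _ in range(100_000)]

forward = sum(values)            # left-to-right summation
reordered = sum(sorted(values))  # same data, different order
reference = math.fsum(values)    # exactly rounded sum for comparison

rel_diff = abs(forward - reordered) / reference
print(f"relative difference from reordering: {rel_diff:.2e}")
```

For a well-conditioned sum like this one, the relative difference is many orders of magnitude below 0.25, which is why such a large margin suggests a bug rather than ordinary compiler-level numerical noise.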

NEURON setup

  • Version: master
  • Installation method: CMake build
  • OS + Version: BB5
  • Compiler + Version: NVHPC 22.3

Minimal working example - MWE

Modify the tolerance linked above to be smaller again, run the relevant test and investigate its failure.

@nrnhines (Member) commented:

I did not succeed in installing a working nvcc/nvc++ on my desktop (CUDA refused to install). What is the hostname of BB5 these days? Last time I logged in it was [email protected]. I guess I'm out of date with:

Hostname: bbpv1.epfl.ch
User: hines
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'hpe-mpi'
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'gcc/6.4.0'
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'python/3.6.5'

Do I need a list of module loads? And what is a reasonable CMake invocation to look into this issue? If I've entered a minefield, I'm happy to step back.
