Tests that run model test cases under valgrind. #148

Closed
nichannah opened this issue Apr 27, 2015 · 9 comments

@nichannah
Collaborator

Valgrind has proven to be a useful tool for finding uses of uninitialized variables. Reading an uninitialized variable often leads to irreproducible results, because whatever garbage happens to be in memory gets read.

This issue proposes an automated way to run the test cases under valgrind. This will allow bugs of this kind to be found quickly.

See also #149

@nichannah nichannah changed the title Tests that run the model in under valgrind. Tests that run the model under valgrind. Apr 27, 2015
@nichannah nichannah changed the title Tests that run the model under valgrind. Tests that run model test cases under valgrind. Apr 27, 2015
@adcroft
Collaborator

adcroft commented Apr 28, 2015

Do you have an automated way in mind?

@nichannah
Collaborator Author

I do, although I'd like to discuss it. I'm doing #147 in Python (there's a bit of string and file manipulation, which I think Python is good at). I can see this one fitting into a similar setup. I'll show you once I've got #147 in shape.

@nichannah
Collaborator Author

Valgrind can be run manually on gaea for any test case that will run on a single PE. For example:

  1. build a debug executable (otherwise the exe will contain instructions that valgrind doesn't understand, probably vector math stuff)
  2. module load valgrind
  3. export TMPDIR=/lustre/f1/$USER/tmp
  4. aprun -n 1 valgrind --gen-suppressions=all --log-file=valgrind_log.txt --suppressions=../../../MOM6.supp ../../build/gnu/ocean_only/debug/MOM6
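
A minimal sketch of how steps 3 and 4 might be scripted from Python, as a possible starting point for the automation proposed in this issue. The function names `valgrind_command` and `run_under_valgrind` are hypothetical; the valgrind flags and default paths are the ones from the command line above. It assumes `module load valgrind` has already been done in the calling shell.

```python
import os
import subprocess


def valgrind_command(exe, suppressions, log_file="valgrind_log.txt",
                     launcher=("aprun", "-n", "1")):
    """Return the argument list for a single-PE valgrind run (step 4)."""
    return [
        *launcher,
        "valgrind",
        "--gen-suppressions=all",
        f"--log-file={log_file}",
        f"--suppressions={suppressions}",
        exe,
    ]


def run_under_valgrind(exe, suppressions, workdir, tmpdir):
    """Run one test case from workdir and return the launcher's exit code."""
    # Step 3: point TMPDIR at a directory the compute node can write to.
    env = dict(os.environ, TMPDIR=tmpdir)
    cmd = valgrind_command(exe, suppressions)
    return subprocess.run(cmd, cwd=workdir, env=env).returncode
```

Keeping the command construction separate from the run makes it easy to swap `aprun` for a different launcher on another machine.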

The suppressions file tells valgrind which errors to ignore. I'll share mine once I've completed it.

Then look in valgrind_log.txt to see memory errors.

For test cases that need multiple PEs, valgrind generates millions of false positives from within MPI. I'm in the process of figuring out how to filter these out properly (making a suppressions file by hand is not feasible).

@nichannah
Collaborator Author

From what I can gather, it's going to be tricky to properly run multi-PE test cases with valgrind on gaea. Usually Valgrind would handle calls to MPI by replacing the MPI library with wrappers that do certain checks before making the actual calls. This replacement is only possible if the MPI library is dynamically linked. It seems that running executables with dynamic libraries is not easy or supported on gaea. For a start, the compute nodes don't have access to the filesystem where most dynamic libraries reside (netcdf, hdf, z, math, etc). Also, I can't find a way to make the ftn compiler link some libraries dynamically and others statically.

I still have a couple of things to try.

@nichannah
Collaborator Author

I've given up on running valgrind on gaea due to gaea's limitations with shared libraries. Instead I'll try to run it on raijin, a supercomputer in Canberra, Australia.

@nichannah nichannah reopened this Aug 6, 2015
@nichannah
Collaborator Author

I'll run these tests on the Aus computer. The output will be published here:

https://climate-cms.nci.org.au/jenkins/job/mom-ocean.org/

This is what it looks like for MOM5. I think it can be cleaned up a lot (this file is ~300 MB).

https://climate-cms.nci.org.au/jenkins/job/mom-ocean.org/job/MOM5_valgrind/lastBuild/console

@nichannah
Collaborator Author

The Valgrind tests are not yet all running, but I thought it would be good to document any errors as I see them.

In global_ALE/z:

==20891== Invalid read of size 8
==20891== at 0x53FA19: mom_tracer_hor_diff_mp_tracer_hordiff_ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== by 0x989017: mom_mp_step_mom_ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== by 0x72148F: MAIN__ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== by 0x40C23B: main (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== Address 0x25349540 is 16 bytes before a block of size 256 free'd
==20891== at 0x4C27C44: free (vg_replace_malloc.c:473)
==20891== by 0xF0231A: for__free_vm (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== by 0xF099AD: for_write_int_fmt_xmit (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== by 0xB6CBF8: fms_io_mod_mp_get_file_name_ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== by 0xB497CE: fms_io_mod_mp_read_data_2d_new_ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== by 0x735E82: mom_surface_forcing_mp_buoyancy_forcing_from_files_ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== by 0x72B5A2: mom_surface_forcing_mp_set_forcing_ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== by 0x7213A0: MAIN__ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== by 0x40C23B: main (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==
{
<insert_a_suppression_name_here>
Memcheck:Addr8
fun:mom_tracer_hor_diff_mp_tracer_hordiff_
fun:mom_mp_step_mom_
fun:MAIN__
fun:main
}

==20891== Conditional jump or move depends on uninitialised value(s)
==20891== at 0x5EA68C: mom_restart_mp_save_restart_ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== by 0x9A1DF1: mom_mp_initialize_mom_ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== by 0x72062C: MAIN__ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== by 0x40C23B: main (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== Uninitialised value was created by a heap allocation
==20891== at 0x4C2826A: malloc (vg_replace_malloc.c:296)
==20891== by 0xF024D3: for_allocate (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== by 0x5EEAD5: mom_restart_mp_restart_init_ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== by 0x99F542: mom_mp_initialize_mom_ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== by 0x72062C: MAIN__ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== by 0x40C23B: main (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==
{
<insert_a_suppression_name_here>
Memcheck:Cond
fun:mom_restart_mp_save_restart_
fun:mom_mp_initialize_mom_
fun:MAIN__
fun:main
}
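
The `{ ... }` blocks that `--gen-suppressions=all` writes into the log can be collected into a suppressions file once each placeholder name is replaced. A minimal sketch of that step, assuming the blocks appear in the log without the `==PID==` prefix as above; `extract_suppressions` and the `MOM6_<n>` naming scheme are hypothetical, not something used in this issue.

```python
PLACEHOLDER = "<insert_a_suppression_name_here>"


def extract_suppressions(log_text, prefix="MOM6"):
    """Return the unique suppression blocks in a valgrind log, with names."""
    blocks = []
    seen = set()
    current = None  # lines of the block being read, or None outside a block
    for raw in log_text.splitlines():
        line = raw.strip()
        if line == "{":
            current = []
        elif line == "}" and current is not None:
            body = tuple(current)
            if body not in seen:  # keep only the first copy of a block
                seen.add(body)
                blocks.append(body)
            current = None
        elif current is not None:
            current.append(line)

    out = []
    for n, body in enumerate(blocks, start=1):
        named = [f"{prefix}_{n}" if entry == PLACEHOLDER else entry
                 for entry in body]
        out.append("\n".join(["{"] + ["   " + entry for entry in named] + ["}"]))
    return "\n".join(out) + "\n" if out else ""
```

The result could be written to a file and passed to `--suppressions=`, but each block should be reviewed first: suppressing a real error such as the two above would hide the bug instead of fixing it.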

gustavo-marques pushed a commit to gustavo-marques/MOM6 that referenced this issue May 18, 2020