Some tests fails on i586, ppc64, ppc64le and 390x #2258

Closed
junghans opened this issue Sep 12, 2018 · 28 comments · Fixed by #2401

Comments

@junghans
Member

npt:

[ 1250s]  82/110 Test  #82: npt .....................................***Failed    5.26 sec
[ 1250s] terminate called after throwing an instance of 'std::bad_alloc'
[ 1250s]   what():  std::bad_alloc
[ 1250s] [obs-power8-05:07978] *** Process received signal ***
[ 1250s] [obs-power8-05:07978] Signal: Aborted (6)
[ 1250s] [obs-power8-05:07978] Signal code:  (-6)
[ 1250s] [obs-power8-05:07978] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x7fff831304a8]
[ 1250s] [obs-power8-05:07978] [ 1] /lib64/libc.so.6(gsignal+0x13c)[0x7fff82ef692c]
[ 1250s] [obs-power8-05:07978] [ 2] /lib64/libc.so.6(abort+0x178)[0x7fff82ed4120]
[ 1250s] [obs-power8-05:07978] [ 3] /usr/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x158)[0x7fff6f1ebfa8]
[ 1250s] [obs-power8-05:07978] [ 4] /usr/lib64/libstdc++.so.6(+0xb6d84)[0x7fff6f1e6d84]
[ 1250s] [obs-power8-05:07978] [ 5] /usr/lib64/libstdc++.so.6(_ZSt9terminatev+0x20)[0x7fff6f1e6e40]
[ 1250s] [obs-power8-05:07978] [ 6] /usr/lib64/libstdc++.so.6(__cxa_throw+0x78)[0x7fff6f1e7318]
[ 1250s] [obs-power8-05:07978] [ 7] /home/abuild/rpmbuild/BUILDROOT/python-espressomd-4.0.0-46.1.ppc64le//usr/lib64/libEspressoScriptInterface.so.4(+0x1d7920)[0x7fff701e7920]

and test_elc_vs_mmm2d in elc_vs_mmm2d_neutral:

[ 1262s]  83/110 Test  #83: elc_vs_mmm2d_neutral ....................***Failed   11.53 sec
[ 1262s] Features:  ['BOND_ANGLE', 'BUCKINGHAM', 'COLLISION_DETECTION', 'ELECTROSTATICS', 'EXCLUSIONS', 'EXTERNAL_FORCES', 'FFTW', 'GAUSSIAN', 'GHOSTS_HAVE_BONDS', 'GSL', 'HERTZIAN', 'LANGEVIN_PER_PARTICLE', 'LATTICE', 'LB', 'LB_BOUNDARIES', 'LENNARD_JONES', 'LENNARD_JONES_GENERIC', 'LJCOS', 'LJCOS2', 'MASS', 'MORSE', 'NPT', 'P3M', 'PARTIAL_PERIODIC', 'SOFT_SPHERE', 'TABULATED']
[ 1262s] b'P3M tune parameters: Accuracy goal = 1.00000e-06 prefactor = 1.00000e+00 \nSystem: box_l = 1.00000e+01 # charged part = 4 Sum[q_i^2] = 1.33333e+00\nfixed mesh 16 16 24\nfixed cao 6\nmesh cao r_cut_iL     alpha_L      err          rs_err     ks_err     time [ms]\n16   6   4.88086e-01 6.47910e+00 9.93385e-07 7.071e-07 6.977e-07 1.58    \n\nresulting parameters:\n16   16   24   6   4.88086e-01 6.47910e+00 9.93385e-07 1.58    \n'
[ 1262s] F
[ 1262s] ======================================================================
[ 1262s] FAIL: test_elc_vs_mmm2d (__main__.ELC_vs_MMM2D_neutral)
[ 1262s] ----------------------------------------------------------------------
[ 1262s] Traceback (most recent call last):
[ 1262s]   File "/home/abuild/rpmbuild/BUILD/espresso-4.0.0/build/testsuite/elc_vs_mmm2d_neutral.py", line 143, in test_elc_vs_mmm2d
[ 1262s]     mmm2d_res[run], elc_res[run], rtol=0, atol=1e-4) is None)
[ 1262s]   File "/usr/lib64/python3.6/site-packages/numpy/testing/nose_tools/utils.py", line 1396, in assert_allclose
[ 1262s]     verbose=verbose, header=header, equal_nan=equal_nan)
[ 1262s]   File "/usr/lib64/python3.6/site-packages/numpy/testing/nose_tools/utils.py", line 779, in assert_array_compare
[ 1262s]     raise AssertionError(msg)
[ 1262s] AssertionError: 
[ 1262s] Not equal to tolerance rtol=0, atol=0.0001
[ 1262s] 
[ 1262s] (mismatch 5.454545454545453%)
[ 1262s]  x: array([[ 9.500000e+000, -2.360929e-003, -1.422413e-002, -6.259504e-002,
[ 1262s]         -2.643891e-002],
[ 1262s]        [ 8.600000e+000, -3.905825e-003, -2.350280e-002, -5.846164e-002,...
[ 1262s]  y: array([[ 9.500000e+00, -2.360927e-03, -1.422401e-02, -6.259497e-02,
[ 1262s]         -2.645121e-02],
[ 1262s]        [ 8.600000e+00, -3.905771e-03, -2.350236e-02, -5.846157e-02,...
[ 1262s] 
[ 1262s] ----------------------------------------------------------------------
[ 1262s] Ran 1 test in 8.798s
[ 1262s] 
[ 1262s] FAILED (failures=1)

and test_elc_vs_mmm2d in elc_vs_mmm2d_nonneutral:

[ 1271s]  84/110 Test  #84: elc_vs_mmm2d_nonneutral .................***Failed    9.72 sec
[ 1271s] Features:  ['BOND_ANGLE', 'BUCKINGHAM', 'COLLISION_DETECTION', 'ELECTROSTATICS', 'EXCLUSIONS', 'EXTERNAL_FORCES', 'FFTW', 'GAUSSIAN', 'GHOSTS_HAVE_BONDS', 'GSL', 'HERTZIAN', 'LANGEVIN_PER_PARTICLE', 'LATTICE', 'LB', 'LB_BOUNDARIES', 'LENNARD_JONES', 'LENNARD_JONES_GENERIC', 'LJCOS', 'LJCOS2', 'MASS', 'MORSE', 'NPT', 'P3M', 'PARTIAL_PERIODIC', 'SOFT_SPHERE', 'TABULATED']
[ 1271s] b'P3M tune parameters: Accuracy goal = 1.00000e-06 prefactor = 1.00000e+00 \nSystem: box_l = 1.00000e+01 # charged part = 4 Sum[q_i^2] = 9.33333e+00\nfixed mesh 20 20 32\nfixed cao 7\nmesh cao r_cut_iL     alpha_L      err          rs_err     ks_err     time [ms]\n20   7   4.71816e-01 7.33084e+00 9.92661e-07 7.071e-07 6.967e-07 2.92    \n\nresulting parameters:\n20   20   32   7   4.71816e-01 7.33084e+00 9.92661e-07 2.92    \n'
[ 1271s] F
[ 1271s] ======================================================================
[ 1271s] FAIL: test_elc_vs_mmm2d (__main__.ELC_vs_MMM2D_neutral)
[ 1271s] ----------------------------------------------------------------------
[ 1271s] Traceback (most recent call last):
[ 1271s]   File "/home/abuild/rpmbuild/BUILD/espresso-4.0.0/build/testsuite/elc_vs_mmm2d_nonneutral.py", line 131, in test_elc_vs_mmm2d
[ 1271s]     mmm2d_res[run], elc_res[run], rtol=0, atol=1e-4) is None)
[ 1271s]   File "/usr/lib64/python3.6/site-packages/numpy/testing/nose_tools/utils.py", line 1396, in assert_allclose
[ 1271s]     verbose=verbose, header=header, equal_nan=equal_nan)
[ 1271s]   File "/usr/lib64/python3.6/site-packages/numpy/testing/nose_tools/utils.py", line 779, in assert_array_compare
[ 1271s]     raise AssertionError(msg)
[ 1271s] AssertionError: 
[ 1271s] Not equal to tolerance rtol=0, atol=0.0001
[ 1271s] 
[ 1271s] (mismatch 5.454545454545453%)
[ 1271s]  x: array([[ 9.500000e+00, -7.082787e-03, -4.267240e-02, -1.877851e-01,
[ 1271s]         -9.847615e-01],
[ 1271s]        [ 8.600000e+00, -1.171748e-02, -7.050841e-02, -1.753849e-01,...
[ 1271s]  y: array([[ 9.500000e+00, -7.082801e-03, -4.267245e-02, -1.877852e-01,
[ 1271s]         -9.847716e-01],
[ 1271s]        [ 8.600000e+00, -1.171749e-02, -7.050842e-02, -1.753849e-01,...
[ 1271s] 
[ 1271s] ----------------------------------------------------------------------
[ 1271s] Ran 1 test in 6.897s
[ 1271s] 
[ 1271s] FAILED (failures=1)

Details here

CC @mkuron

@mkuron
Member

mkuron commented Sep 12, 2018

These errors are actually on ppc64le, and I have no idea what is happening.

The errors on ppc64 and s390x (I have no idea who would run Espresso on an IBM mainframe, but sure, we can have binary packages for it...) are more obvious though: there the periodicity check does not work, probably because these are big-endian architectures and we are doing some incorrect bitwise operations.
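To illustrate the kind of mistake suspected here, a minimal sketch follows. It is not the actual ESPResSo periodicity check, which is not shown in this thread; the flag name and the byte-level read are hypothetical. It assumes code that stores flags in a multi-byte integer but reads them back through the first byte in memory, which works on little-endian machines and silently yields zero on big-endian ones.

```python
# Hypothetical illustration; not the actual ESPResSo periodicity check.
PERIODIC_Z = 0b100  # made-up flag bit meaning "periodic in z"


def first_byte_in_memory(mask: int, byteorder: str) -> int:
    """Mimic reading a 32-bit int through a char pointer (first byte only)."""
    return mask.to_bytes(4, byteorder)[0]


def flag_set_via_byte_read(mask: int, flag: int, byteorder: str) -> bool:
    """Test a flag using only the first byte of the stored integer."""
    return bool(first_byte_in_memory(mask, byteorder) & flag)


# Same value, same check, different memory layout.
little = flag_set_via_byte_read(PERIODIC_Z, PERIODIC_Z, "little")
big = flag_set_via_byte_read(PERIODIC_Z, PERIODIC_Z, "big")
print(little, big)
```

On a little-endian layout the low-order byte comes first, so the flag is found; on a big-endian layout the first byte is the high-order one and the flag appears unset.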

@junghans junghans changed the title Some tests fails on ppc64: Some tests fails on ppc64le: Sep 12, 2018
@junghans
Member Author

Not sure, I fixed it now...

@mkuron
Member

mkuron commented Sep 12, 2018

What did you change?

Also, please re-open this issue as the periodicity issue on the big-endian architectures still exists.

@KaiSzuttor KaiSzuttor reopened this Sep 12, 2018
@junghans
Member Author

junghans commented Sep 12, 2018

I think it is a parallel test issue; running make check without -j5 helped.

ppc64 with le is still failing, but that is a different issue.

@junghans junghans changed the title Some tests fails on ppc64le: Some tests fails on ppc64{,le} Sep 12, 2018
@junghans
Member Author

On ppc64:

[ 1015s]  55/110 Test  #55: mmm1d ...................................***Failed    0.78 sec
[ 1015s] Features:  ['BOND_ANGLE', 'BUCKINGHAM', 'COLLISION_DETECTION', 'ELECTROSTATICS', 'EXCLUSIONS', 'EXTERNAL_FORCES', 'FFTW', 'GAUSSIAN', 'GHOSTS_HAVE_BONDS', 'GSL', 'HERTZIAN', 'LANGEVIN_PER_PARTICLE', 'LATTICE', 'LB', 'LB_BOUNDARIES', 'LENNARD_JONES', 'LENNARD_JONES_GENERIC', 'LJCOS', 'LJCOS2', 'MASS', 'MORSE', 'NPT', 'P3M', 'PARTIAL_PERIODIC', 'SOFT_SPHERE', 'TABULATED']
[ 1015s] ERROR: MMM1D requires periodicity 0 0 1
[ 1015s] ERROR: MMM1D requires periodicity 0 0 1
[ 1015s] ERROR: MMM1D requires periodicity 0 0 1
[ 1015s] ERROR: MMM1D requires periodicity 0 0 1
[ 1015s] ERROR: MMM1D requires periodicity 0 0 1
[ 1015s] ERROR: MMM1D requires periodicity 0 0 1
[ 1015s] ERROR: MMM1D requires periodicity 0 0 1
[ 1015s] ERROR: MMM1D requires periodicity 0 0 1
[ 1015s] ERROR: MMM1D requires periodicity 0 0 1
[ 1015s] ERROR: MMM1D requires periodicity 0 0 1
[ 1015s] ERROR: MMM1D requires periodicity 0 0 1
[ 1015s] ERROR: MMM1D requires periodicity 0 0 1
[ 1015s] ERROR: MMM1D requires periodicity 0 0 1
[ 1015s] ERROR: MMM1D requires periodicity 0 0 1
[ 1015s] ERROR: MMM1D requires periodicity 0 0 1
[ 1015s] E
[ 1015s] ======================================================================
[ 1015s] ERROR: test_mmm1d (__main__.ElectrostaticInteractionsTests)
[ 1015s] ----------------------------------------------------------------------
[ 1015s] Traceback (most recent call last):
[ 1015s]   File "/home/abuild/rpmbuild/BUILD/espresso-4.0.0/build/testsuite/mmm1d.py", line 76, in func
[ 1015s]     self.system.actors.add(Inter)
[ 1015s]   File "actors.pyx", line 186, in espressomd.actors.Actors.add
[ 1015s]   File "actors.pyx", line 50, in espressomd.actors.Actor._activate
[ 1015s]   File "utils.pyx", line 252, in espressomd.utils.handle_errors
[ 1015s]   File "utils.pyx", line 269, in espressomd.utils.handle_errors
[ 1015s] Exception: Activation of an actor: b'ERROR: MMM1D requires periodicity 0 0 1'
[ 1015s] 
[ 1015s] ----------------------------------------------------------------------
[ 1015s] Ran 1 test in 0.004s
[ 1015s] 
[ 1015s] FAILED (errors=1)

and

[ 1039s]  66/110 Test  #66: coulomb_mixed_periodicity ...............***Failed    0.99 sec
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] EERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] ERROR: MMM2D requires periodicity 1 1 0
[ 1039s] E
[ 1039s] ======================================================================
[ 1039s] ERROR: test_MMM2D (__main__.CoulombMixedPeriodicity)
[ 1039s] ----------------------------------------------------------------------
[ 1039s] Traceback (most recent call last):
[ 1039s]   File "/home/abuild/rpmbuild/BUILD/espresso-4.0.0/build/testsuite/coulomb_mixed_periodicity.py", line 134, in test_MMM2D
[ 1039s]     self.S.actors.add(mmm2d)
[ 1039s]   File "actors.pyx", line 186, in espressomd.actors.Actors.add
[ 1039s]   File "actors.pyx", line 49, in espressomd.actors.Actor._activate
[ 1039s]   File "electrostatics.pyx", line 676, in espressomd.electrostatics.MMM2D._activate_method
[ 1039s]   File "electrostatics.pyx", line 669, in espressomd.electrostatics.MMM2D._set_params_in_es_core
[ 1039s]   File "utils.pyx", line 269, in espressomd.utils.handle_errors
[ 1039s] Exception: MMM2d setup: b'ERROR: MMM2D requires periodicity 1 1 0'
[ 1039s] 
[ 1039s] ======================================================================
[ 1039s] ERROR: test_zz_p3mElc (__main__.CoulombMixedPeriodicity)
[ 1039s] ----------------------------------------------------------------------
[ 1039s] Traceback (most recent call last):
[ 1039s]   File "/home/abuild/rpmbuild/BUILD/espresso-4.0.0/build/testsuite/coulomb_mixed_periodicity.py", line 55, in setUp
[ 1039s]     del self.S.actors[0]
[ 1039s]   File "actors.pyx", line 225, in espressomd.actors.Actors.__delitem__
[ 1039s]   File "actors.pyx", line 199, in espressomd.actors.Actors.remove
[ 1039s]   File "actors.pyx", line 54, in espressomd.actors.Actor._deactivate
[ 1039s]   File "electrostatics.pyx", line 68, in espressomd.electrostatics.ElectrostaticInteraction._deactivate_method
[ 1039s]   File "utils.pyx", line 269, in espressomd.utils.handle_errors
[ 1039s] Exception: Coulom method deactivation: b'ERROR: MMM2D requires periodicity 1 1 0'
[ 1039s] 
[ 1039s] ----------------------------------------------------------------------
[ 1039s] Ran 2 tests in 0.190s
[ 1039s] 
[ 1039s] FAILED (errors=2)

@junghans junghans changed the title Some tests fails on ppc64{,le} Some tests fails on ppc64, ppc64le and 390x Sep 12, 2018
@junghans
Member Author

Same on s390x

@junghans junghans changed the title Some tests fails on ppc64, ppc64le and 390x Some tests fails on i586, ppc64, ppc64le and 390x Sep 13, 2018
@junghans
Member Author

There is still an issue on Tumbleweed i586:

[  800s] 38/41 Test #38: field_coupling_coulplings ........***Failed    0.01 sec
[  800s] Running 5 test cases...
[  800s] /home/abuild/rpmbuild/BUILD/espresso-4.0.0/src/core/unit_tests/field_coupling_couplings_test.cpp(84): error: in "scaled": check (default_val * 5.) == scaled_coupling(Particle(3), 5.) has failed
[  800s] 
[  800s] *** 1 failure is detected in the test module "AutoParameter test"
[  800s] 
[  800s] 

@junghans
Member Author

Ok, #2259 fixed big endian!

@mkuron
Member

mkuron commented Sep 13, 2018

https://build.opensuse.org/package/show/home:cjunghans:branches:devel:languages:python/python-espressomd

SLE_12_SP4: needs to be switched from openmpi2 to openmpi, @junghans.

armv7l:

[ 5566s]  63/110 Test  #63: analyze_energy ..........................***Failed    3.22 sec
[ 5566s] Features:  ['BOND_ANGLE', 'BUCKINGHAM', 'COLLISION_DETECTION', 'ELECTROSTATICS', 'EXCLUSIONS', 'EXTERNAL_FORCES', 'FFTW', 'GAUSSIAN', 'GHOSTS_HAVE_BONDS', 'GSL', 'HERTZIAN', 'LANGEVIN_PER_PARTICLE', 'LATTICE', 'LB', 'LB_BOUNDARIES', 'LENNARD_JONES', 'LENNARD_JONES_GENERIC', 'LJCOS', 'LJCOS2', 'MASS', 'MORSE', 'NPT', 'P3M', 'PARTIAL_PERIODIC', 'SOFT_SPHERE', 'TABULATED']
[ 5566s] [armbuild15:10817] *** Process received signal ***
[ 5566s] [armbuild15:10817] Signal: Segmentation fault (11)
[ 5566s] [armbuild15:10817] Signal code: Address not mapped (1)
[ 5566s] [armbuild15:10817] Failing at address: 0xffffff08
[ 5566s] [armbuild15:10817] *** End of error message ***

i586:

[ 1086s] 38/41 Test #38: field_coupling_coulplings ........***Failed    0.01 sec
[ 1086s] Running 5 test cases...
[ 1086s] /home/abuild/rpmbuild/BUILD/espresso-4.0.0/src/core/unit_tests/field_coupling_couplings_test.cpp(84): error: in "scaled": check (default_val * 5.) == scaled_coupling(Particle(3), 5.) has failed
[ 1086s] 
[ 1086s] *** 1 failure is detected in the test module "AutoParameter test"
[ 1086s] 

@junghans
Member Author

SLE_12_SP4 has another problem:

[   47s] Preparing...                          ########################################
[   47s] 	file /usr/lib64/mpi/gcc/openmpi/lib64/mpi.mod conflicts between attempted installs of openmpi-devel-1.10.7-1.15.x86_64 and openmpi-compat-1.8.1-3.1.x86_64
[   48s] exit ...

@junghans
Member Author

@mkuron any idea about that i586 issue?

@mkuron
Member

mkuron commented Sep 14, 2018

Patch in #2265. That only leaves us with the segfault on 32-bit ARM.

@junghans
Member Author

I guess you need to run QEMU again ;-)

@junghans
Member Author

Hmm, on i586 the serial tests take a very long time.

@mkuron
Member

mkuron commented Sep 16, 2018

I think the i586 build machine just got stuck. Before you uploaded my patches, the tests ran just fine, and the patches shouldn't slow down any tests. They also still run fine in my Docker container.

Regarding ARM: I couldn't reproduce the analyze_energy failure. Trying to reproduce it was a nightmare though: the QEMU emulation is missing a syscall that OpenMPI 2 uses, so I first had to work around that (openpmix/openpmix#836, open-mpi/ompi#5716). openSUSE doesn't provide an arm32v7 Docker image anymore, so I used Ubuntu instead, where the issue does not occur, even when using the same GCC version and the same compiler flags. Then I created my own openSUSE arm32v7 Docker image, where I couldn't reproduce the issue either. I even tried rpmbuild, but to no avail; the compiler flags weren't exactly the same, though, as rpmbuild seems to adapt them to the CPU automatically.

So unless you can get us direct access to these build machines, we won't be able to fix whatever issue this is. It doesn't occur in "regular" builds on the respective architecture.

@junghans
Member Author

Maybe @kkaempf knows how to do that!

@kkaempf

kkaempf commented Sep 16, 2018

Reach out to [email protected], they should be able to help.

@junghans
Member Author

The ppc64le issue is back:

[ 1723s]  83/110 Test  #83: elc_vs_mmm2d_neutral ....................***Failed    9.91 sec
[ 1723s] Features:  ['BOND_ANGLE', 'BUCKINGHAM', 'COLLISION_DETECTION', 'ELECTROSTATICS', 'EXCLUSIONS', 'EXTERNAL_FORCES', 'FFTW', 'GAUSSIAN', 'GHOSTS_HAVE_BONDS', 'GSL', 'HERTZIAN', 'LANGEVIN_PER_PARTICLE', 'LATTICE', 'LB', 'LB_BOUNDARIES', 'LENNARD_JONES', 'LENNARD_JONES_GENERIC', 'LJCOS', 'LJCOS2', 'MASS', 'MORSE', 'NPT', 'P3M', 'PARTIAL_PERIODIC', 'SOFT_SPHERE', 'TABULATED']
[ 1723s] b'P3M tune parameters: Accuracy goal = 1.00000e-06 prefactor = 1.00000e+00 \nSystem: box_l = 1.00000e+01 # charged part = 4 Sum[q_i^2] = 1.33333e+00\nfixed mesh 16 16 24\nfixed cao 6\nmesh cao r_cut_iL     alpha_L      err          rs_err     ks_err     time [ms]\n16   6   4.88086e-01 6.47910e+00 9.93385e-07 7.071e-07 6.977e-07 1.43    \n\nresulting parameters:\n16   16   24   6   4.88086e-01 6.47910e+00 9.93385e-07 1.43    \n'
[ 1723s] F
[ 1723s] ======================================================================
[ 1723s] FAIL: test_elc_vs_mmm2d (__main__.ELC_vs_MMM2D_neutral)
[ 1723s] ----------------------------------------------------------------------
[ 1723s] Traceback (most recent call last):
[ 1723s]   File "/home/abuild/rpmbuild/BUILD/espresso-4.0.0/build/testsuite/elc_vs_mmm2d_neutral.py", line 143, in test_elc_vs_mmm2d
[ 1723s]     mmm2d_res[run], elc_res[run], rtol=0, atol=1e-4) is None)
[ 1723s]   File "/usr/lib64/python3.6/site-packages/numpy/testing/nose_tools/utils.py", line 1396, in assert_allclose
[ 1723s]     verbose=verbose, header=header, equal_nan=equal_nan)
[ 1723s]   File "/usr/lib64/python3.6/site-packages/numpy/testing/nose_tools/utils.py", line 779, in assert_array_compare
[ 1723s]     raise AssertionError(msg)
[ 1723s] AssertionError: 
[ 1723s] Not equal to tolerance rtol=0, atol=0.0001
[ 1723s] 
[ 1723s] (mismatch 5.454545454545453%)
[ 1723s]  x: array([[ 9.500000e+000, -1.027968e-003, -6.689425e-003,  9.434845e-001,
[ 1723s]         -2.881663e-001],
[ 1723s]        [ 8.600000e+000, -3.085796e-003, -1.934990e-002,  7.870464e-002,...
[ 1723s]  y: array([[ 9.500000e+00, -1.027966e-03, -6.689300e-03,  9.434846e-01,
[ 1723s]         -2.881786e-01],
[ 1723s]        [ 8.600000e+00, -3.085742e-03, -1.934946e-02,  7.870471e-02,...
[ 1723s] 
[ 1723s] ----------------------------------------------------------------------
[ 1723s] Ran 1 test in 7.242s
[ 1723s] 
[ 1723s] FAILED (failures=1)
[ 1723s] -------------------------------------------------------
[ 1723s] Primary job  terminated normally, but 1 process returned
[ 1723s] a non-zero exit code.. Per user-direction, the job has been aborted.
[ 1723s] -------------------------------------------------------
[ 1723s] --------------------------------------------------------------------------
[ 1723s] mpiexec detected that one or more processes exited with non-zero status, thus causing
[ 1723s] the job to be terminated. The first process to do so was:
[ 1723s] 
[ 1723s]   Process name: [[56796,1],0]
[ 1723s]   Exit code:    1
[ 1723s] --------------------------------------------------------------------------
[ 1723s] 

and

[ 1667s]  58/110 Test  #58: lb ......................................***Failed   17.81 sec
[ 1667s] .WARNING: Recalculating forces, so the LB coupling forces are not included in the particle force the first time step. This only matters if it happens frequently during sampling.
[ 1667s] 
[ 1667s] F.WARNING: Recalculating forces, so the LB coupling forces are not included in the particle force the first time step. This only matters if it happens frequently during sampling.
[ 1667s] 
[ 1667s] .ssss
[ 1667s] ======================================================================
[ 1667s] FAIL: test_mass_momentum_thermostat (__main__.TestLBCPU)
[ 1667s] ----------------------------------------------------------------------
[ 1667s] Traceback (most recent call last):
[ 1667s]   File "/home/abuild/rpmbuild/BUILD/espresso-4.0.0/build/testsuite/lb.py", line 146, in test_mass_momentum_thermostat
[ 1667s]     self.params['mass_prec_per_node']))
[ 1667s] AssertionError: False is not true : fluid mass deviation too high
[ 1667s] deviation: 1.422165896325378e-05   accepted deviation: 5e-08
[ 1667s] 
[ 1667s] ----------------------------------------------------------------------
[ 1667s] Ran 8 tests in 15.126s
[ 1667s] 
[ 1667s] FAILED (failures=1, skipped=4)
[ 1667s] -------------------------------------------------------
[ 1667s] Primary job  terminated normally, but 1 process returned
[ 1667s] a non-zero exit code.. Per user-direction, the job has been aborted.
[ 1667s] -------------------------------------------------------
[ 1667s] --------------------------------------------------------------------------
[ 1667s] mpiexec detected that one or more processes exited with non-zero status, thus causing
[ 1667s] the job to be terminated. The first process to do so was:
[ 1667s] 
[ 1667s]   Process name: [[57959,1],0]
[ 1667s]   Exit code:    1
[ 1667s] --------------------------------------------------------------------------
[ 1667s] 

@kkaempf

kkaempf commented Sep 20, 2018

Hmm, can we track which build hosts are affected? That might be a hardware, architecture, or CPU-type problem.
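One rough way to track this from the build logs is sketched below. It assumes that crashes show up as Open MPI "Process received signal" lines of the form seen earlier in this thread, where the bracketed prefix carries the host name and PID. The function name and the regular expression are illustrative, and tests that fail without a signal (such as the array mismatches) would not be counted.

```python
import re
from collections import Counter

# Matches lines such as "[obs-power8-05:07978] *** Process received signal ***".
SIGNAL_LINE = re.compile(
    r"\[(?P<host>[^\]:\s]+):(?P<pid>\d+)\] \*\*\* Process received signal \*\*\*"
)


def crashing_hosts(log_text: str) -> Counter:
    """Count crash reports per build host in a build log."""
    return Counter(m.group("host") for m in SIGNAL_LINE.finditer(log_text))


# Sample lines copied from the logs posted in this thread.
sample = """\
[ 1250s] [obs-power8-05:07978] *** Process received signal ***
[ 1713s] [obs-power8-05:08378] *** Process received signal ***
[ 5566s] [armbuild15:10817] *** Process received signal ***
"""
counts = crashing_hosts(sample)
print(counts)
```

Feeding several job logs through the same function would show whether crashes cluster on particular hosts.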

@mkuron
Member

mkuron commented Sep 20, 2018

Looking at https://build.opensuse.org/packages/python-espressomd/job_history/home:cjunghans:branches:devel:languages:python/openSUSE_Factory_PowerPC/ppc64le, there is no pattern. The job can succeed one day and fail on the same host the next day. These issues are also not reproducible in QEMU emulation, which is too bad as we only have x86_64 hardware on site.

This one is also interesting:

[ 1713s]  82/110 Test  #82: npt .....................................***Failed   13.65 sec
[ 1713s] terminate called after throwing an instance of 'std::bad_alloc'
[ 1713s]   what():  std::bad_alloc
[ 1713s] [obs-power8-05:08378] *** Process received signal ***
[ 1713s] [obs-power8-05:08378] Signal: Aborted (6)
[ 1713s] [obs-power8-05:08378] Signal code:  (-6)
[ 1713s] [obs-power8-05:08378] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x7fffb3e404a8]
[ 1713s] [obs-power8-05:08378] [ 1] /lib64/libc.so.6(gsignal+0x13c)[0x7fffb3c0692c]
[ 1713s] [obs-power8-05:08378] [ 2] /lib64/libc.so.6(abort+0x178)[0x7fffb3be4120]
[ 1713s] [obs-power8-05:08378] [ 3] /usr/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x158)[0x7fff9fefbfa8]
[ 1713s] [obs-power8-05:08378] [ 4] /usr/lib64/libstdc++.so.6(+0xb6d84)[0x7fff9fef6d84]
[ 1713s] [obs-power8-05:08378] [ 5] /usr/lib64/libstdc++.so.6(_ZSt9terminatev+0x20)[0x7fff9fef6e40]
[ 1713s] [obs-power8-05:08378] [ 6] /usr/lib64/libstdc++.so.6(__cxa_throw+0x78)[0x7fff9fef7318]
[ 1713s] [obs-power8-05:08378] [ 7] /home/abuild/rpmbuild/BUILDROOT/python-espressomd-4.0.0-62.3.ppc64le//usr/lib64/libEspressoScriptInterface.so.4(+0x1d7920)[0x7fffa0ef7920]
[ 1713s] [obs-power8-05:08378] [ 8] /home/abuild/rpmbuild/BUILDROOT/python-espressomd-4.0.0-62.3.ppc64le//usr/lib64/libEspressoScriptInterface.so.4(_ZNK5boost7archive6detail11iserializerINS_3mpi15packed_iarchiveE8ParticleE16load_object_dataERNS1_14basic_iarchiveEPvj+0x2f0)[0x7fffa0f47240]
[ 1713s] [obs-power8-05:08378] [ 9] /usr/lib64/libboost_serialization.so.1.68.0(_ZN5boost7archive6detail14basic_iarchive11load_objectEPvRKNS1_17basic_iserializerE+0x18c)[0x7fffa09c8a6c]
[ 1713s] [obs-power8-05:08378] [10] /home/abuild/rpmbuild/BUILDROOT/python-espressomd-4.0.0-62.3.ppc64le//usr/lib64/libEspressoCore.so.4(_ZNK5boost7archive6detail11iserializerINS_3mpi15packed_iarchiveE12ParticleListE16load_object_dataERNS1_14basic_iarchiveEPvj+0xbc)[0x7fffa07cb01c]
[ 1713s] [obs-power8-05:08378] [11] /usr/lib64/libboost_serialization.so.1.68.0(_ZN5boost7archive6detail14basic_iarchive11load_objectEPvRKNS1_17basic_iserializerE+0x18c)[0x7fffa09c8a6c]
[ 1713s] [obs-power8-05:08378] [12] /home/abuild/rpmbuild/BUILDROOT/python-espressomd-4.0.0-62.3.ppc64le//usr/lib64/libEspressoCore.so.4(_Z14recv_particlesP12ParticleListi+0xe8)[0x7fffa07c3138]
[ 1713s] [obs-power8-05:08378] [13] /home/abuild/rpmbuild/BUILDROOT/python-espressomd-4.0.0-62.3.ppc64le//usr/lib64/libEspressoCore.so.4(_Z30dd_exchange_and_sort_particlesi+0x350)[0x7fffa0747180]
[ 1713s] [obs-power8-05:08378] [14] /home/abuild/rpmbuild/BUILDROOT/python-espressomd-4.0.0-62.3.ppc64le//usr/lib64/libEspressoCore.so.4(_Z22cells_resort_particlesi+0xd8)[0x7fffa0725a38]
[ 1713s] [obs-power8-05:08378] [15] /home/abuild/rpmbuild/BUILDROOT/python-espressomd-4.0.0-62.3.ppc64le//usr/lib64/libEspressoCore.so.4(_Z19cells_update_ghostsv+0x30)[0x7fffa0725a90]
[ 1713s] [obs-power8-05:08378] [16] /home/abuild/rpmbuild/BUILDROOT/python-espressomd-4.0.0-62.3.ppc64le//usr/lib64/libEspressoCore.so.4(_Z12integrate_vvii+0x204)[0x7fffa077e1a4]
[ 1713s] [obs-power8-05:08378] [17] /home/abuild/rpmbuild/BUILDROOT/python-espressomd-4.0.0-62.3.ppc64le//usr/lib64/libEspressoCore.so.4(_Z13mpi_integrateii+0x48)[0x7fffa07301e8]
[ 1713s] [obs-power8-05:08378] [18] /home/abuild/rpmbuild/BUILDROOT/python-espressomd-4.0.0-62.3.ppc64le//usr/lib64/libEspressoCore.so.4(_Z16python_integrateibb+0x35c)[0x7fffa077c04c]
[ 1713s] [obs-power8-05:08378] [19] /home/abuild/rpmbuild/BUILD/espresso-4.0.0/build/src/python/espressomd/integrate.so(+0x14c14)[0x7fff95494c14]
[ 1713s] [obs-power8-05:08378] [20] /home/abuild/rpmbuild/BUILD/espresso-4.0.0/build/src/python/espressomd/script_interface.so(+0x123e8)[0x7fff9fbb23e8]
[ 1713s] [obs-power8-05:08378] [21] /home/abuild/rpmbuild/BUILD/espresso-4.0.0/build/src/python/espressomd/script_interface.so(+0x148ac)[0x7fff9fbb48ac]
[ 1713s] [obs-power8-05:08378] [22] /usr/lib64/libpython3.6m.so.1.0(_PyObject_FastCallDict+0xcc)[0x7fffb3867afc]
[ 1713s] [obs-power8-05:08378] [23] /usr/lib64/libpython3.6m.so.1.0(_PyObject_FastCallKeywords+0x60)[0x7fffb391b190]
[ 1713s] [obs-power8-05:08378] [24] /usr/lib64/libpython3.6m.so.1.0(+0x1b2100)[0x7fffb3932100]
[ 1713s] [obs-power8-05:08378] [25] /usr/lib64/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x674)[0x7fffb397e504]
[ 1713s] [obs-power8-05:08378] [26] /usr/lib64/libpython3.6m.so.1.0(PyEval_EvalFrameEx+0x34)[0x7fffb383cc74]
[ 1713s] [obs-power8-05:08378] [27] /usr/lib64/libpython3.6m.so.1.0(+0x13e82c)[0x7fffb38be82c]
[ 1713s] [obs-power8-05:08378] [28] /usr/lib64/libpython3.6m.so.1.0(+0x1b203c)[0x7fffb393203c]
[ 1713s] [obs-power8-05:08378] [29] /usr/lib64/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x674)[0x7fffb397e504]
[ 1713s] [obs-power8-05:08378] *** End of error message ***

@kkaempf

kkaempf commented Sep 20, 2018

> Looking at https://build.opensuse.org/packages/python-espressomd/job_history/home:cjunghans:branches:devel:languages:python/openSUSE_Factory_PowerPC/ppc64le, there is no pattern. The job can succeed one day and fail on the same host the next day. These issues are also not reproducible in QEMU emulation, which is too bad as we only have x86_64 hardware on site.

Are you saying the failures (like the array mismatch) are ppc64le-specific?

> This one is also interesting:
>
> [ 1713s]  82/110 Test  #82: npt .....................................***Failed   13.65 sec
> [ 1713s] terminate called after throwing an instance of 'std::bad_alloc'
> [ 1713s]   what():  std::bad_alloc

Naa, that's just an "out of memory". This can be avoided by adding a _constraints file next to the .spec to tell buildservice "this needs a bigger build host". See libreoffice as an example.
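As a sketch of what such a file could contain, the snippet below generates a `_constraints` document that requests a minimum amount of memory. The 8 GB figure is a placeholder rather than a measured requirement, the helper name is made up for illustration, and whether more memory actually resolves the npt failure is not established in this thread.

```python
import xml.etree.ElementTree as ET


def build_constraints(memory_gb: int) -> str:
    """Return an OBS _constraints document requesting at least memory_gb of memory."""
    root = ET.Element("constraints")
    hardware = ET.SubElement(root, "hardware")
    memory = ET.SubElement(hardware, "memory")
    size = ET.SubElement(memory, "size", unit="G")
    size.text = str(memory_gb)
    return ET.tostring(root, encoding="unicode")


# 8 is an assumed placeholder value, not a measured requirement.
constraints_xml = build_constraints(8)
print(constraints_xml)
```

The printed XML would be saved as `_constraints` next to the .spec file in the package.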

@mkuron
Member

mkuron commented Sep 20, 2018

> Are you saying the failures (like the array mismatch) are ppc64le specific?

Yes, this specific set of errors only appears on ppc64le. Some tests have rather tight tolerances that might not be valid on different hardware implementations of floating-point arithmetic, but the deviations appearing here are a bit big to blame on that.
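For reference, the failing assertion uses `assert_allclose` with `rtol=0, atol=1e-4`, which NumPy documents as checking `|actual - desired| <= atol + rtol * |desired|` element by element. The sketch below re-implements that criterion in plain Python (the helper name is made up) and applies it to the one row printed in full in the first elc_vs_mmm2d_neutral log in this thread.

```python
def mismatches(actual, desired, rtol=0.0, atol=1e-4):
    """Indices where |actual - desired| > atol + rtol * |desired|."""
    return [
        i
        for i, (a, d) in enumerate(zip(actual, desired))
        if abs(a - d) > atol + rtol * abs(d)
    ]


# First row of x and y as printed in the elc_vs_mmm2d_neutral failure.
x_row = [9.5, -2.360929e-3, -1.422413e-2, -6.259504e-2, -2.643891e-2]
y_row = [9.5, -2.360927e-3, -1.422401e-2, -6.259497e-2, -2.645121e-2]

bad = mismatches(x_row, y_row)
largest = max(abs(a - d) for a, d in zip(x_row, y_row))
print(bad, largest)
```

The printed row is within tolerance (its largest difference is about 1.2e-5), so the entries that trip the check are among the rows the log elides with "...".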

@kkaempf

kkaempf commented Sep 20, 2018

> Yes, this specific set of errors only appears on ppc64le.

Can you give me a minimal test case? I have a (large) ppc64le machine available to run tests and narrow the problem down. It might be a compiler bug after all.

@mkuron
Member

mkuron commented Sep 20, 2018

While I can't provide a minimal test case as I don't know what is specifically causing the issue, here is how I built and ran the failing tests manually:

git clone [email protected]:espressomd/espresso
cd espresso
mkdir build
cd build
cmake ..
make -j 16
mpiexec -n 2 ./pypresso ../testsuite/elc_vs_mmm2d_neutral.py
mpiexec -n 2 ./pypresso ../testsuite/lb.py

@junghans
Member Author

@kkaempf on i586 it reproducibly gets stuck at:

[  985s] Test project /home/abuild/rpmbuild/BUILD/espresso-4.0.0/build/testsuite
[  985s]         Start   1: save_checkpoint
[29793s] qemu-system-x86_64: terminating on signal 15 from pid 5721 (<unknown process>)


Job seems to be stuck here, killed. (after 28800 seconds of inactivity)
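When reproducing a hang like this locally, it can help to run the test under a wall-clock limit so that it is reported within minutes rather than after the 28800 seconds the build service waits. The following is a minimal sketch using only the Python standard library; the helper name is made up, and the demo commands and limits are placeholders that stand in for the real test invocation.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd and report 'passed', 'failed', or 'timeout'."""
    try:
        proc = subprocess.run(
            cmd,
            timeout=seconds,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return "timeout"
    return "passed" if proc.returncode == 0 else "failed"


# Stand-in commands; a real invocation would start the test script instead.
quick = run_with_timeout([sys.executable, "-c", "pass"], 30)
hung = run_with_timeout([sys.executable, "-c", "import time; time.sleep(10)"], 0.5)
print(quick, hung)
```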

@RudolfWeeber
Contributor

What's the state of this?

@mkuron
Member

mkuron commented Dec 5, 2018

It's still broken and I can't debug it due to lack of hardware. It does not happen in QEMU-emulated Docker images of the respective architecture. I was unable to set up the official openSUSE QEMU environment, so I can't tell whether it occurs there. So whatever bug this is, it won't get fixed.

I do want to set up weekly CI jobs for other architectures, but before I can do that someone needs to merge espressomd/docker#41.

@junghans
Member Author

npt seems to be fixed on aarch64 and i586, but persists on ppc64* (#2468)


5 participants