semop lock error during 3D classification #1177

Open

DrJesseHansen opened this issue Aug 23, 2024 · 8 comments

@DrJesseHansen

Running this command interactively on a GPU node with two 2080Ti cards. The same error occurs when submitting to the Slurm cluster on our HPC.

Running RELION 5 beta 3, commit 6331fe.

command:

mpirun --np 5 --oversubscribe relion_refine_mpi --o Class3D/job055/run --ios Extract/job025/optimisation_set.star --gpu "" --ref InitialModel/box40_bin8_invert.mrc --firstiter_cc --trust_ref_size --ini_high 60 --dont_combine_weights_via_disc --pool 3 --pad 2 --ctf --iter 25 --tau2_fudge 1 --particle_diameter 440 --K 1 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale --j 1 --pipeline_control Class3D/job055/

error:



jhansen@gpu148:/mnt/beegfs/schurgrp/jhansen/HTT/RELION5$ ./07_classify1class.job 
RELION version: 5.0-beta-3-commit-6331fe 
Precision: BASE=double, CUDA-ACC=single 

 === RELION MPI setup ===
 + Number of MPI processes                 = 5
 + Leader      (0) runs on host            = gpu148
 + Follower     1  runs on host            = gpu148
 + Follower     2  runs on host            = gpu148
 + Follower     3  runs on host            = gpu148
 + Follower     4  runs on host            = gpu148
 ==========================
 uniqueHost gpu148 has 4 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 2 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 3 mapped to device 1
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 4 mapped to device 1
Device 0 on gpu148 is split between 2 followers
Device 1 on gpu148 is split between 2 followers
 Running CPU instructions in double precision. 
 WARNING:  The reference pixel size is 1 A/px, but the pixel size of the first optics group of the data is 11.056 A/px! 
WARNING: Although the requested resized pixel size is 11.056 A/px, the actual resized pixel size of the reference will be 10 A/px due to rounding of the box size to an even number. 
WARNING: Resizing input reference(s) to pixel_size= 10 and box size= 40 ...
 Estimating initial noise spectra from at most 10 particles 
   0/   0 sec ............................................................~~(,_,">
 CurrentResolution= 57.1429 Angstroms, which requires orientationSampling of at least 14.4 degrees for a particle of diameter 440 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 373248
 OrientationalSampling= 15 NrOrientations= 4608
 TranslationalSampling= 20 NrTranslations= 81
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 23887872
 OrientationalSampling= 7.5 NrOrientations= 36864
 TranslationalSampling= 10 NrTranslations= 648
=============================
 Expectation iteration 1 of 25
4.30/4.30 hrs ............................................................~~(,_,">
 Maximization...
   0/   0 sec ............................................................~~(,_,">
in: /dev/shm/schloegl-src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
ERROR: 
semop lock error
in: /dev/shm/schloegl-src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
ERROR: 
semop lock error
in: /dev/shm/schloegl-src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
ERROR: 
semop lock error
=== Backtrace  ===
=== Backtrace  ===
=== Backtrace  ===
relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x6a) [0x55bb0ec3942a]
relion_refine_mpi(+0x5e60c) [0x55bb0eb8f60c]
relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x81b) [0x55bb0ee2cabb]
relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x55bb0ee48a2c]
relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e9) [0x55bb0ec60069]
relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xbc) [0x55bb0ec7710c]
relion_refine_mpi(main+0x52) [0x55bb0ec249c2]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7f14bdc4624a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f14bdc46305]
relion_refine_mpi(_start+0x21) [0x55bb0ec28251]
==================
ERROR: 
semop lock error

 RELION version: 5.0-beta-3-commit-6331fe
 exiting with an error ...
relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x6a) [0x56136c52842a]
relion_refine_mpi(+0x5e60c) [0x56136c47e60c]
relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x81b) [0x56136c71babb]
relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x56136c737a2c]
relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e9) [0x56136c54f069]
relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xbc) [0x56136c56610c]
relion_refine_mpi(main+0x52) [0x56136c5139c2]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7f266aa4624a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f266aa46305]
relion_refine_mpi(_start+0x21) [0x56136c517251]
==================
ERROR: 
semop lock error

 RELION version: 5.0-beta-3-commit-6331fe
 exiting with an error ...
relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x6a) [0x56542492742a]
relion_refine_mpi(+0x5e60c) [0x56542487d60c]
relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x81b) [0x565424b1aabb]
relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x565424b36a2c]
relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e9) [0x56542494e069]
relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xbc) [0x56542496510c]
relion_refine_mpi(main+0x52) [0x5654249129c2]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7faedd64624a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7faedd646305]
relion_refine_mpi(_start+0x21) [0x565424916251]
==================
ERROR: 
semop lock error

 RELION version: 5.0-beta-3-commit-6331fe
 exiting with an error ...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[gpu148:295268] 2 more processes have sent help message help-mpi-api.txt / mpi-abort
[gpu148:295268] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages



@biochem-fan
Member

Did you read and try the suggestions in #738?

@huwjenkins
Contributor

If this is the same issue as described in #738, it is caused by the OS destroying the semaphores. Are you logging out of the machine whilst RELION is running? And on the cluster, did you log into and then log out of the node the job was running on?

ipcs -s is useful to diagnose what's going on.
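For example, a minimal check (assuming util-linux's ipcs; run it from a second login while the job is active):

ipcs -s        # RELION's semaphore arrays should be listed under your user
# now log out of ALL other sessions for that user, log back in, and repeat:
ipcs -s        # if the arrays have vanished, logind removed them on logout (RemoveIPC=yes)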

@biochem-fan this was not an issue in RELION-4 because this code was omitted due to the #ifdef CUDAs in src/projector.cpp being missed in f453d2c, but it came back in RELION-5 when they were changed to #ifdef _CUDA_ENABLED in 38f0c4f.
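The failure mode is easy to reproduce outside RELION (a sketch, assuming util-linux's ipcmk/ipcrm; the semaphore id 42 below is hypothetical):

ipcmk -S 1     # create a System V semaphore set; prints e.g. "Semaphore id: 42"
ipcs -s        # the new set is listed under your user
ipcrm -s 42    # remove it -- this is what RemoveIPC=yes does to all of a user's IPC on final logout
# any process still calling semop() on that set now fails with EIDRM/EINVAL,
# which RELION reports as "semop lock error"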

@DrJesseHansen
Author

Thanks for the response. #738 suggests setting the coarse search option to yes? I had that turned off; I have set it to on and am re-running the job. I'll update my post when I know whether it worked.

Regarding the logging in/out: yes I am, sort of. I have a GPU node reserved which I log into using TurboVNC, so it's a remote-access node that runs constantly. The session runs continuously and I log in from home/work to check on my job; however, to regain access to the node each time, I must SSH directly into the node and reset my password. When the job ran on the cluster, I don't recall whether I logged into the node, but I doubt it. I'll try running it again that way and ensure I do not log in to that node.

I ran ipcs -s; results below. I am running on two GPUs. Should I run this again if/when the job fails?

------ Semaphore Arrays --------
key         semid   owner     perms   nsems
0x8ba79abb  2       jhansen   666     1
0x8ba79aba  3       jhansen   666     1

@huwjenkins
Contributor

Should I run this again if/when the job fails?

You should try running this when you "login from home/work to check on my job". In the case I saw in #738 what I did was:

  1. submit job from the workstation with a delayed start (via SLURM).
  2. log out of all sessions on the workstation
  3. log back in once the job started and verify the semaphores were present
  4. log out and log back in and see that the semaphores were destroyed
  5. wait for the job to crash
  6. repeat steps 1 and 2 but never log in whilst the job was running; the result was successful completion of the job.

The workaround was to use screen to keep a session open on the workstation (a sketch follows below). But I'm surprised this is also happening on a cluster, because you're unlikely to ever log into the node where the job is running.
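Something along these lines (a sketch; the session name and the RELION command are placeholders):

screen -S relion                        # start a persistent session on the workstation
mpirun --np 5 relion_refine_mpi ...     # launch the job inside it
# detach with Ctrl-a d; while this session exists, logind will not reap the user's semaphores
screen -r relion                        # reattach later to check on the job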

@huwjenkins
Contributor


If you have admin rights, you could also test whether adding RemoveIPC=no in logind.conf and restarting logind fixes it. See systemd/systemd#2039 (comment).
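For reference, a minimal sketch of that test (run as root; assumes systemd-logind, and the config path may vary by distribution):

grep RemoveIPC /etc/systemd/logind.conf   # the default shipped setting is "#RemoveIPC=yes"
# edit the file so it reads: RemoveIPC=no
systemctl restart systemd-logind          # then rerun the job and check that ipcs -s survives a logout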

@DrJesseHansen
Author

Update: it made it to the second iteration! So turning on the coarse search option made a difference. Nice.

@DrJesseHansen
Author

Update: this just happened again with the same dataset, this time during ab initio model generation. It got to iteration 113 and then crashed. See below.

I definitely did NOT log into the node this time while it was processing.

Gradient optimisation iteration 112 of 200 with 3457 particles (Step size 0.5)
50.52/50.52 min ............................................................~~(,_,">
Maximization...
  0/   0 sec ............................................................~~(,_,">
CurrentResolution= 23.2758 Angstroms, which requires orientationSampling of at least 6.66667 degrees for a particle of diameter 400 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 7483392
OrientationalSampling= 7.5 NrOrientations= 36864
TranslationalSampling= 3 NrTranslations= 203
=============================
Oversampling= 1 NrHiddenVariableSamplingPoints= 478937088
OrientationalSampling= 3.75 NrOrientations= 294912
TranslationalSampling= 1.5 NrTranslations= 1624
=============================
Gradient optimisation iteration 113 of 200 with 3518 particles (Step size 0.5)
52.13/52.13 min ............................................................~~(,_,">
Maximization...
  0/   0 sec ............................................................~~(,_,">
in: /dev/shm/src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
ERROR: 
semop lock error
=== Backtrace  ===
relion_refine(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x6a) [0x56442ac3fd2a]
relion_refine(+0x7050c) [0x56442abb350c]
relion_refine(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x81b) [0x56442ae8873b]
relion_refine(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x56442ac696ac]
relion_refine(_ZN11MlOptimiser11expectationEv+0x21) [0x56442acae421]
relion_refine(_ZN11MlOptimiser7iterateEv+0x86) [0x56442acbfc26]
relion_refine(main+0x3c) [0x56442ac2de4c]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x14c9249fe24a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x14c9249fe305]
relion_refine(_start+0x21) [0x56442ac31671]
==================
ERROR: 
semop lock error

RELION version: 5.0-beta-3-commit-6331fe
exiting with an error ...

@DrJesseHansen
Author

This is now happening for all of my refinement jobs. At least I get to about iteration 10 before they crash. Any idea what is causing this?
