semop lock error during 2D Classification #738

Closed · heejongkim opened this issue Feb 18, 2021 · 12 comments
@heejongkim commented Feb 18, 2021

Hi,

I'm getting a sporadic "semop lock error" during 2D classification.

Let me try my best to provide as much info as I can.
System:
2x Xeon (64 threads total), 252 GB RAM, 4x RTX 2080 Ti
OS:
Ubuntu 18.04 LTS
RELION versions:
I tried both 3.1.1 and 3.1.1-commit-9f3bf1
CUDA version:
tried both CUDA 11 and CUDA 9.2 (RELION compiled separately against each)

Dataset:
Pixel size: 1.12 Å, voltage: 300 kV, Cs: 2.7 mm
Particle box size:
300 px, with 2x binning

2D classification settings (a reconstructed command line is sketched below):
Optimisation tab: number of classes: 50-100, regularisation parameter T: 2, number of iterations: 25-30, mask diameter (Å): 300
Sampling tab: perform image alignment: Yes, in-plane angular sampling: 6, offset search range: 10, offset search step: 1, allow coarser sampling: No
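
For reference, these GUI settings correspond roughly to a relion_refine_mpi invocation like the one below; the paths, MPI/thread counts, and GPU assignment are illustrative placeholders, not my exact command:

    mpirun -n 5 relion_refine_mpi \
        --i Extract/job012/particles.star --o Class2D/job013/run \
        --dont_combine_weights_via_disc --pool 30 --pad 2 --ctf \
        --iter 25 --tau2_fudge 2 --particle_diameter 300 --K 50 \
        --flatten_solvent --zero_mask --oversampling 1 \
        --psi_step 12 --offset_range 10 --offset_step 2 \
        --norm --scale --j 6 --gpu ""
    # The GUI writes --psi_step and --offset_step at twice the tab values to
    # compensate for --oversampling 1; "allow coarser sampling: No" simply
    # means --allow_coarser_sampling is omitted.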

The job fails stochastically at a random iteration, but consistently at the end of an expectation step, either at the beginning of or somewhere in the middle of the maximization step.

Here's one example of output and error:

>  Expectation iteration 3 of 25
> 41.20/41.20 min ............................................................~~(,_,">
>  Maximization ...
>    1/   1 sec ............................................................~~(,_,">
>  Estimating accuracies in the orientational assignment ... 
>    1/   1 sec ............................................................~~(,_,">
>  Auto-refine: Estimated accuracy angles= 12.2 degrees; offsets= 7.392 Angstroms
>  Coarser-sampling: Angular step= 9.47368 degrees.
>  Coarser-sampling: Offset search range= 22.4 Angstroms; offset step= 5.9136 Angstroms
>  CurrentResolution= 12 Angstroms, which requires orientationSampling of at least 5.21739 degrees for a particle of diameter 260 Angstroms
>  Oversampling= 0 NrHiddenVariableSamplingPoints= 17100
>  OrientationalSampling= 18.9474 NrOrientations= 19
>  TranslationalSampling= 11.8272 NrTranslations= 9
> =============================
>  Oversampling= 1 NrHiddenVariableSamplingPoints= 547200
>  OrientationalSampling= 9.47368 NrOrientations= 152
>  TranslationalSampling= 5.9136 NrTranslations= 36
> =============================
>  Expectation iteration 4 of 25
> 51.28/51.28 min ............................................................~~(,_,">
>  Maximization ...
>    0/   0 sec ............................................................~~(,_,">
> 
>  RELION version: 3.1.1-commit-9f3bf1
>  exiting with an error ...


-----------------------------

>  in: /home/heejong/relion_dev/src/projector.cpp, line 202
> ERROR: 
> semop lock error
> === Backtrace  ===
> /opt/relion-3.1_dev/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x77) [0x55d048657de7]
> /opt/relion-3.1_dev/bin/relion_refine_mpi(_ZN9Projector26computeFourierTransformMapER13MultidimArrayIdES2_iibbiPKS1_b+0x3bfd) [0x55d0486abcdd]
> /opt/relion-3.1_dev/bin/relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0xa41) [0x55d0487c90d1]
> /opt/relion-3.1_dev/bin/relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x55d0487e225c]
> /opt/relion-3.1_dev/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x2fc) [0x55d04867361c]
> /opt/relion-3.1_dev/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x2b2) [0x55d048683c62]
> /opt/relion-3.1_dev/bin/relion_refine_mpi(main+0x73) [0x55d048643dc3]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f228e238b97]
> /opt/relion-3.1_dev/bin/relion_refine_mpi(_start+0x2a) [0x55d048646eaa]
> ==================
> ERROR: 
> semop lock error
> Abort(1) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
> 

I even tried MPICH 3.4.1 instead of OpenMPI and am still getting the same error at random.

Any suggestions to get through this would be highly appreciated.

Thank you.

best,
hee jong kim

@biochem-fan (Member)

@arom4github Can you investigate this? This is happening in the code you introduced in Projector.

@heejongkim (Author)

I just updated the original post: this is most likely related to "allow coarser sampling".
When it's set to No, the error is more likely to happen than when it's set to Yes.

@biochem-fan (Member)

@heejongkim

1. Does this happen on all datasets (such as our tutorial dataset) or only on this particular dataset?
2. How many MPI processes and threads are you running per GPU?
3. Can you tell @arom4github the full command line you used?

We need to reproduce the problem locally to fix it.

@heejongkim (Author)

Let me get the tutorial dataset and see if I can reproduce the issue with it.
I will get back to you soon.
Thanks.

@heejongkim (Author)

I ran the tutorial dataset up to Class2D after auto-picking with LoG and did not get the error.
I wonder whether it's due to a different OS with different dependency versions, or an issue with the dataset itself.
What would be the best information I could provide? @biochem-fan

@biochem-fan (Member) commented Feb 25, 2021

Did you process the tutorial dataset on the same machine using the same binary as the failing dataset? What if you extract the tutorial dataset to the same box size (in pixels) as the failing one?

@heejongkim (Author)

Yes, same Ubuntu and same RELION.
I tried a bigger box to match the one I used for the failing dataset, and it was fine.
However, I should note that 1) the failing dataset is super-resolution, so it was binned during motion correction, and 2) the particle coordinates were transferred from cryoSPARC.

@biochem-fan (Member)

Both sound like unlikely causes.

Are particles centered well? (Are rlnOriginX/YAngst large?)
What happens if you re-extract particles in RELION with re-centering after successful Class2D?
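
For example, one way to inspect the largest offsets, assuming a RELION 3.1-style data STAR file (the job path here is hypothetical):

    relion_star_printtable Class2D/job013/run_it025_data.star \
        data_particles rlnOriginXAngst rlnOriginYAngst | sort -g | tail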

@biochem-fan (Member)

No response for a while; if this still happens in 4.0, please reopen.

@huwjenkins (Contributor) commented May 5, 2022

I just encountered this in RELION 3.1.2 on Ubuntu 20.04. Adding some debug info to src/projector.cpp revealed that when the call to semop() here:

if (semop(semid, &op_lock[0], 2) < 0)

fails, errno is set to EINVAL because the semaphore with that semid was removed while relion_refine_mpi was running. This blog post appears relevant, but as @biochem-fan implies, updating to RELION 4 will also fix the problem.
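
What follows is not RELION's code, but a minimal standalone sketch of the failure mode: a System V semaphore set removed out from under a process that still holds its id.

    #include <cstdio>
    #include <cerrno>
    #include <cstring>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    int main()
    {
        // Create a private semaphore set with one semaphore, as an IPC lock.
        int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        if (semid < 0) { perror("semget"); return 1; }

        // Simulate logind's RemoveIPC= cleanup: the set is removed while the
        // id is still in use (normally this happens from another process).
        semctl(semid, 0, IPC_RMID);

        // The next semop() on the stale id fails with errno == EINVAL,
        // which RELION reports as "semop lock error" in src/projector.cpp.
        struct sembuf op = {0, -1, SEM_UNDO};
        if (semop(semid, &op, 1) < 0)
            fprintf(stderr, "semop failed: %s (errno=%d)\n",
                    strerror(errno), errno);
        return 0;
    }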

@arom4github (Contributor)

Do you know what/who deletes the semaphore at the wrong time?
The semaphore can be removed by the user from the command line or by an administrator.
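
For example, from a shell (the semid below is made up):

    ipcs -s         # list System V semaphore sets with their owners and semids
    ipcrm -s 1234   # remove the set with semid 1234, invalidating any held id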

@huwjenkins (Contributor)

I think they are removed when the user logs out - that's what the man page for logind.conf suggests:

       RemoveIPC=
           Controls whether System V and POSIX IPC objects belonging to the user shall be removed when the user fully
           logs out. Takes a boolean argument. If enabled, the user may not consume IPC resources after the last of the
           user's sessions terminated. This covers System V semaphores, shared memory and message queues, as well as
           POSIX shared memory and message queues. Note that IPC objects of the root user and other system users are
           excluded from the effect of this setting. Defaults to "yes".

The jobs where I saw these failures were submitted using SLURM, but from the node (workstation) the jobs were running on (i.e. not from a login node on a cluster). I was logged in when the jobs started and had logged out before they failed. Presumably setting RemoveIPC=no will fix this, but I haven't tested it as I don't have admin rights on this machine; see the sketch below.
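
For reference, the (untested) change would look like this, assuming a stock systemd layout:

    # /etc/systemd/logind.conf
    [Login]
    RemoveIPC=no

    # then apply it with:
    #   sudo systemctl restart systemd-logind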
