semop lock error during 2D Classification #738

Closed · heejongkim opened this issue Feb 18, 2021 · 12 comments
@heejongkim commented Feb 18, 2021

Hi,

I'm getting a sporadic "semop lock error" during 2D classification.

Let me try my best to provide as much info as I can.
System:
2x Xeon (64 threads total), 252 GB RAM, 4x RTX 2080 Ti
OS:
Ubuntu 18.04 LTS
RELION versions:
I tried both 3.1.1 and 3.1.1-commit-9f3bf1
CUDA version:
tried both CUDA 11 and CUDA 9.2 (RELION compiled separately against each)

Dataset:
Pixel size: 1.12 Å, voltage: 300 kV, Cs: 2.7 mm
Particle box size:
300 px, with 2x binning

2D classification settings (a reconstructed command line is sketched below):
Optimisation tab: number of classes: 50-100, regularisation parameter T: 2, number of iterations: 25-30, mask diameter (Å): 300
Sampling tab: perform image alignment: Yes, in-plane angular sampling: 6, offset search range: 10, offset search step: 1, allow coarser sampling: No
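
For reference, these GUI settings correspond roughly to a relion_refine_mpi invocation like the one below; the paths, MPI/thread counts, and GPU assignment are illustrative placeholders, not my exact command:

    mpirun -n 5 relion_refine_mpi \
        --i Extract/job012/particles.star --o Class2D/job013/run \
        --dont_combine_weights_via_disc --pool 30 --pad 2 --ctf \
        --iter 25 --tau2_fudge 2 --particle_diameter 300 --K 50 \
        --flatten_solvent --zero_mask --oversampling 1 \
        --psi_step 12 --offset_range 10 --offset_step 2 \
        --norm --scale --j 6 --gpu ""
    # The GUI writes --psi_step and --offset_step at twice the tab values to
    # compensate for --oversampling 1; "allow coarser sampling: No" simply
    # means --allow_coarser_sampling is omitted.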

The job fails stochastically at a random iteration, but consistently at the end of an expectation step, either at the beginning of or somewhere in the middle of the maximization step.

Here's one example of output and error:

>  Expectation iteration 3 of 25
> 41.20/41.20 min ............................................................~~(,_,">
>  Maximization ...
>    1/   1 sec ............................................................~~(,_,">
>  Estimating accuracies in the orientational assignment ... 
>    1/   1 sec ............................................................~~(,_,">
>  Auto-refine: Estimated accuracy angles= 12.2 degrees; offsets= 7.392 Angstroms
>  Coarser-sampling: Angular step= 9.47368 degrees.
>  Coarser-sampling: Offset search range= 22.4 Angstroms; offset step= 5.9136 Angstroms
>  CurrentResolution= 12 Angstroms, which requires orientationSampling of at least 5.21739 degrees for a particle of diameter 260 Angstroms
>  Oversampling= 0 NrHiddenVariableSamplingPoints= 17100
>  OrientationalSampling= 18.9474 NrOrientations= 19
>  TranslationalSampling= 11.8272 NrTranslations= 9
> =============================
>  Oversampling= 1 NrHiddenVariableSamplingPoints= 547200
>  OrientationalSampling= 9.47368 NrOrientations= 152
>  TranslationalSampling= 5.9136 NrTranslations= 36
> =============================
>  Expectation iteration 4 of 25
> 51.28/51.28 min ............................................................~~(,_,">
>  Maximization ...
>    0/   0 sec ............................................................~~(,_,">
> 
>  RELION version: 3.1.1-commit-9f3bf1
>  exiting with an error ...


-----------------------------

>  in: /home/heejong/relion_dev/src/projector.cpp, line 202
> ERROR: 
> semop lock error
> === Backtrace  ===
> /opt/relion-3.1_dev/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x77) [0x55d048657de7]
> /opt/relion-3.1_dev/bin/relion_refine_mpi(_ZN9Projector26computeFourierTransformMapER13MultidimArrayIdES2_iibbiPKS1_b+0x3bfd) [0x55d0486abcdd]
> /opt/relion-3.1_dev/bin/relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0xa41) [0x55d0487c90d1]
> /opt/relion-3.1_dev/bin/relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x55d0487e225c]
> /opt/relion-3.1_dev/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x2fc) [0x55d04867361c]
> /opt/relion-3.1_dev/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x2b2) [0x55d048683c62]
> /opt/relion-3.1_dev/bin/relion_refine_mpi(main+0x73) [0x55d048643dc3]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f228e238b97]
> /opt/relion-3.1_dev/bin/relion_refine_mpi(_start+0x2a) [0x55d048646eaa]
> ==================
> ERROR: 
> semop lock error
> Abort(1) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
> 

I even tried MPICH 3.4.1 instead of OpenMPI and am still getting the same error at random.

Any suggestions to get through this would be highly appreciated.

Thank you.

best,
hee jong kim

@biochem-fan (Member)

@arom4github Can you investigate this? This is happening in the code you introduced in Projector.

@heejongkim (Author)

I just updated the original post: this is most likely related to "allow coarser sampling".
When it's set to No, the error is more likely to happen than when it's set to Yes.

@biochem-fan (Member)

@heejongkim

1. Does this happen on all datasets (such as our tutorial dataset) or only on this particular dataset?
2. How many MPI processes and threads are you running per GPU?
3. Can you tell @arom4github the full command line you used?

We need to reproduce the problem locally to fix it.

@heejongkim (Author)

Let me get the tutorial dataset and see if I can reproduce the issue with it.
I will get back to you soon.
Thanks.

@heejongkim (Author)

I ran the tutorial dataset up to Class2D after auto-picking with LoG and did not get the error.
I wonder whether it's due to a different OS with different dependency versions, or an issue with the dataset itself.
What would be the best information I could provide? @biochem-fan

@biochem-fan (Member) commented Feb 25, 2021

Did you process the tutorial dataset on the same machine using the same binary as the failing dataset? What if you extract the tutorial dataset to the same box size (in pixels) as the failing one?

@heejongkim (Author)

Yes, same Ubuntu and same RELION.
I tried a bigger box to match the one I used for the failing dataset, and it was fine.
However, I should note that 1) the failing dataset is super-resolution, so it was binned during motion correction, and 2) the particle coordinates were transferred from cryoSPARC.

@biochem-fan (Member)

Both sound like unlikely causes.

Are particles centered well? (Are rlnOriginX/YAngst large?)
What happens if you re-extract particles in RELION with re-centering after successful Class2D?
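
For example, one way to inspect the largest offsets, assuming a RELION 3.1-style data STAR file (the job path here is hypothetical):

    relion_star_printtable Class2D/job013/run_it025_data.star \
        data_particles rlnOriginXAngst rlnOriginYAngst | sort -g | tail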

@biochem-fan (Member)

No response for a while; if this still happens in 4.0, please reopen.

@huwjenkins (Contributor) commented May 5, 2022

I just encountered this in RELION 3.1.2 on Ubuntu 20.04. Adding some debug info to src/projector.cpp revealed that when the call to semop() here:

if (semop(semid, &op_lock[0], 2) < 0)

fails, errno is set to EINVAL because the semaphore with that semid was removed while relion_refine_mpi was running. This blog post appears relevant, but as @biochem-fan implies, updating to RELION 4 will also fix the problem.
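
What follows is not RELION's code, but a minimal standalone sketch of the failure mode: a System V semaphore set removed out from under a process that still holds its id.

    #include <cstdio>
    #include <cerrno>
    #include <cstring>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    int main()
    {
        // Create a private semaphore set with one semaphore, as an IPC lock.
        int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        if (semid < 0) { perror("semget"); return 1; }

        // Simulate logind's RemoveIPC= cleanup: the set is removed while the
        // id is still in use (normally this happens from another process).
        semctl(semid, 0, IPC_RMID);

        // The next semop() on the stale id fails with errno == EINVAL,
        // which RELION reports as "semop lock error" in src/projector.cpp.
        struct sembuf op = {0, -1, SEM_UNDO};
        if (semop(semid, &op, 1) < 0)
            fprintf(stderr, "semop failed: %s (errno=%d)\n",
                    strerror(errno), errno);
        return 0;
    }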

@arom4github (Contributor)

Do you know what/who deletes the semaphore at the wrong time?
The semaphore can be removed by the user from the command line or by an administrator.
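
For example, from a shell (the semid below is made up):

    ipcs -s         # list System V semaphore sets with their owners and semids
    ipcrm -s 1234   # remove the set with semid 1234, invalidating any held id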

@huwjenkins (Contributor)

I think they are removed when the user logs out - that's what the man page for logind.conf suggests:

       RemoveIPC=
           Controls whether System V and POSIX IPC objects belonging to the user shall be removed when the user fully
           logs out. Takes a boolean argument. If enabled, the user may not consume IPC resources after the last of the
           user's sessions terminated. This covers System V semaphores, shared memory and message queues, as well as
           POSIX shared memory and message queues. Note that IPC objects of the root user and other system users are
           excluded from the effect of this setting. Defaults to "yes".

The jobs where I saw these failures were submitted using SLURM, but from the node (workstation) the jobs were running on (i.e. not from a login node on a cluster). I was logged in when the jobs started and had logged out before they failed. Presumably setting RemoveIPC=no will fix this, but I haven't tested it as I don't have admin rights on this machine; see the sketch below.
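
For reference, the (untested) change would look like this, assuming a stock systemd layout:

    # /etc/systemd/logind.conf
    [Login]
    RemoveIPC=no

    # then apply it with:
    #   sudo systemctl restart systemd-logind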
