MotionCor error with TIFFOpen "Too many open files" leads to broken frame #819

Closed
KiSchnelle opened this issue Oct 15, 2021 · 12 comments

@KiSchnelle

A MotionCor job using RELION's own implementation failed after around 3k of 4k EER files had been processed. Continuing the job works fine, though, and it finishes normally.

Environment:

  • OS: Ubuntu 20.04 LTS
  • MPI runtime: Intel MPI 2021.4
  • RELION version 4.0-beta-1-commit-cf2dc7
  • Number of Nodes: 2
  • Memory: 196GB/node
  • GPU: 1080
  • idk if needed, libtiff-dev version: 4.1.0+git191117-2ubuntu0.20.04.2

Dataset:

  • 4300 eer files from F4 (1603 internal frames)

Job options:

  • Type of job: MotionCor Relion version
  • Number of MPI processes: 4
  • Number of threads: 6
  • Full command (see note.txt in the job directory):
srun -n 4 `which relion_run_motioncorr_mpi` --i Import/job007/movies.star --o MotionCorr/job011/ --first_frame_sum 1 --last_frame_sum -1 --use_own  --j 6 --bin_factor 1 --bfactor 150 --dose_per_frame 0.75 --preexposure 0 --patch_x 5 --patch_y 5 --eer_grouping 40 --gainref ../../rawdata/2021-09-10_HOPS_6s/20210527_105657_EER_GainReference.gain --gain_rot 0 --gain_flip 0 --dose_weighting  --grouping_for_ps 5  --eer_upsampling 1 --pipeline_control MotionCorr/job011/
TIFFOpen: Movies/FoilHole_10311198_Data_10288223_10288225_20210911_135024_EER.eer: Too many open files.
in: /sbdata/software/apps/relion_beta_git/relion/src/renderEER.cpp, line 214
ERROR: 
Broken frame in file Movies/FoilHole_10311198_Data_10288223_10288225_20210911_135024_EER.eer
=== Backtrace  ===
/sbdata/software/apps/relion_install/relion_beta_gpu_2_61/bin/relion_run_motioncorr_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x5d) [0x43505d]
/sbdata/software/apps/relion_install/relion_beta_gpu_2_61/bin/relion_run_motioncorr_mpi(_ZN11EERRenderer10readLegacyEP8_IO_FILE+0x670) [0x4f78b0]
/sbdata/software/apps/relion_install/relion_beta_gpu_2_61/bin/relion_run_motioncorr_mpi(_ZN11EERRenderer4readE8FileNamei+0x659) [0x4f6569]
/sbdata/software/apps/relion_install/relion_beta_gpu_2_61/bin/relion_run_motioncorr_mpi(_ZN16MotioncorrRunner26executeOwnMotionCorrectionER10Micrograph+0x605) [0x486765]
/sbdata/software/apps/relion_install/relion_beta_gpu_2_61/bin/relion_run_motioncorr_mpi(_ZN19MotioncorrRunnerMpi3runEv+0x760) [0x4e11d0]
/sbdata/software/apps/relion_install/relion_beta_gpu_2_61/bin/relion_run_motioncorr_mpi(main+0x1a7) [0x429d27]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fc5860cc0b3]
/sbdata/software/apps/relion_install/relion_beta_gpu_2_61/bin/relion_run_motioncorr_mpi(_start+0x2e) [0x429abe]

ERROR: 
Broken frame in file Movies/FoilHole_10311198_Data_10288223_10288225_20210911_135024_EER.eer
Abort(1) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 33.0 ON ernie104 CANCELLED AT 2021-10-14T22:49:26 ***
srun: error: ernie105: task 3: Killed
srun: launch/slurm: _step_signal: Terminating StepId=33.0
srun: error: ernie105: task 2: Killed
srun: error: ernie104: task 0: Killed
srun: error: ernie104: task 1: Killed

P.S. A little off topic and not breaking:
When building with Intel 2021.4 you get the following warning:

icpc: command line remark #10412: option '-mkl=parallel' is deprecated and will be removed in a future release. Please use the replacement option '-qmkl=parallel'
@biochem-fan
Member

As the message suggests, Movies/FoilHole_10311198_Data_10288223_10288225_20210911_135024_EER.eer is probably broken. Remove this from your movie STAR file and continue the job.

(I am not sure what is going on with Too many open files, though)

@biochem-fan
Member

option '-mkl=parallel' is deprecated and will be removed in a future release. Please use the replacement option '-qmkl=parallel'

We cannot change this at the moment, in order to remain compatible with earlier compiler versions. But we will take note of this. Thanks.

@KiSchnelle
Author

KiSchnelle commented Oct 15, 2021

As the message suggests, Movies/FoilHole_10311198_Data_10288223_10288225_20210911_135024_EER.eer is probably broken. Remove this from your movie STAR file and continue the job.

(I am not sure what is going on with Too many open files, though)

I did continue without removing it and it finished fine.

Since it's exactly the same file as in the TIFFOpen error, I thought the TIFFOpen failure may have "corrupted" the read in a sense, which then led to the broken frame error. And since continuing without removing it worked, I guess that may be the case?

@biochem-fan
Member

I don't know. My colleagues have processed thousands of EER movies without this error. If there are file handle leaks, they should have seen this error. Since I cannot reproduce this issue and you managed to process your dataset, I will close this issue. If you encounter the same issue again, please reopen.

@mokca

mokca commented Feb 11, 2022

I can't add anything helpful to this, but I have seen this problem both with 3.1.2 and 4.0-beta. It typically happens with a few thousand EER files. We also see the "TiffOpen: Too many open files" error before a broken frame is reported.

The file reported as being broken is fine - if it's removed a different file is reported as being broken.

I've found that upping the number of MPI processes can get rid of the error, presumably because fewer files are opened per process and the 'too many files' limit isn't hit.
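
For anyone who wants to check how close a process is to that limit, here is a minimal Linux-only sketch (not part of RELION). It reads the per-process descriptor limit with getrlimit and counts the currently open descriptors by listing /proc/self/fd. The function names openFileLimit and countOpenDescriptors are made up for illustration.

```cpp
#include <dirent.h>
#include <sys/resource.h>

// Soft limit on open file descriptors for this process (0 if the query fails).
rlim_t openFileLimit() {
    struct rlimit lim{};
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return 0;
    return lim.rlim_cur;
}

// Number of descriptors currently open in this process, or -1 if
// /proc/self/fd cannot be read (e.g. on a non-Linux system).
int countOpenDescriptors() {
    DIR *dir = opendir("/proc/self/fd");
    if (dir == nullptr) return -1;
    int count = 0;
    while (dirent *entry = readdir(dir)) {
        if (entry->d_name[0] != '.') ++count;  // skip "." and ".."
    }
    closedir(dir);
    return count - 1;  // opendir itself held one descriptor while we counted
}
```

Printing countOpenDescriptors() once per processed movie would show whether the count grows steadily, which is what a handle leak looks like. If each process leaks a fixed number of handles per movie, spreading the movies over more processes keeps every process further from openFileLimit().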

@biochem-fan
Member

biochem-fan commented Feb 11, 2022

I don't know how to fix this. We do close TIFF files in https://github.com/3dem/relion/blob/ver4.0/src/renderEER.cpp#L264.

@KiSchnelle
Author

Could it be something similar to this problem here?

https://stackoverflow.com/questions/15262542/tiffopen-too-many-open-files

As far as I understand the code, your function is also called read?

@biochem-fan
Member

biochem-fan commented Feb 15, 2022

@KiSchnelle

That is an interesting case, but our code should be OK. Ours is a C++ member function, so its name is mangled and cannot collide with the C library's read: https://en.wikipedia.org/wiki/Name_mangling#Complex_example.
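
This can be checked against the backtrace in the original report, which contains the symbol _ZN11EERRenderer4readE8FileNamei. A minimal sketch, assuming GCC or Clang (abi::__cxa_demangle comes from their cxxabi.h; the helper name demangle is only for illustration):

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Returns the human-readable form of an Itanium-ABI mangled symbol,
// or the input unchanged if it cannot be demangled.
std::string demangle(const std::string &mangled) {
    int status = 0;
    char *buf = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || buf == nullptr) return mangled;
    std::string readable(buf);
    std::free(buf);  // __cxa_demangle allocates the result with malloc
    return readable;
}
```

Calling demangle("_ZN11EERRenderer4readE8FileNamei") gives EERRenderer::read(FileName, int), so the linker sees a symbol distinct from the plain C read that libtiff calls.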

@mokca

mokca commented Feb 23, 2022

@biochem-fan

I put some diagnostic printfs in renderEER.cpp and something odd appears to be happening:

fopen Movies/test_0191.eer (0xdc2000)
TIFFOpen Movies/test_0191.eer (0xdc3700)
fclose 0xdc2000
fopen Movies/test_0191.eer (0xdca0e0)
TIFFOpen Movies/test_0191.eer (0xdca310)
fclose 0xdca0e0
TIFFClose 0xdca310
fopen Movies/test_0192.eer (0xdc2000)
TIFFOpen Movies/test_0192.eer (0xde4620)
fclose 0xdc2000
fopen Movies/test_0192.eer (0xdeaff0)
TIFFOpen Movies/test_0192.eer (0xdeb220)
fclose 0xdeaff0
TIFFClose 0xdeb220
fopen Movies/test_0193.eer (0xdc2000)
TIFFOpen Movies/test_0193.eer (0xde52c0)
fclose 0xdc2000
fopen Movies/test_0193.eer (0xde8a10)
TIFFOpen Movies/test_0193.eer (0xde8c40)
fclose 0xde8a10
TIFFClose 0xde8c40

Looking at test_0192.eer, for example, there are:

  • 2 fopen
  • 2 fclose
  • 2 TIFFOpen
  • 1 TIFFClose

I don't understand the code yet, but it seems that the first TIFFOpen doesn't have an associated TIFFClose. The second TIFFOpen does.
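
The same tally can be done mechanically. A minimal sketch, assuming the log lines have exactly the format printed above and that paths contain no spaces; unclosedTiffHandles is a hypothetical helper, not RELION code:

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Returns the handles that were passed to TIFFOpen but never to TIFFClose.
std::set<std::string> unclosedTiffHandles(const std::vector<std::string> &log_lines) {
    std::set<std::string> open_handles;
    for (const std::string &line : log_lines) {
        std::istringstream iss(line);
        std::string event;
        iss >> event;
        if (event == "TIFFOpen") {
            std::string path, handle;
            iss >> path >> handle;  // handle looks like "(0xde4620)"
            if (handle.size() > 2) {
                // Strip the surrounding parentheses.
                open_handles.insert(handle.substr(1, handle.size() - 2));
            }
        } else if (event == "TIFFClose") {
            std::string handle;
            iss >> handle;  // handle looks like "0xdeb220"
            open_handles.erase(handle);
        }
        // fopen/fclose lines are ignored; they are balanced in the log.
    }
    return open_handles;
}
```

Fed the seven lines for test_0192.eer, this leaves only 0xde4620 in the returned set, i.e. the handle from the first TIFFOpen.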

The modified code was:

    // First of all, check the file size
    FILE *fh = fopen(fn_movie.c_str(), "r");
    printf("fopen %s (%p)\n", fn_movie.c_str(), (void *) fh);

    if (fh == NULL)
        REPORT_ERROR("Failed to open " + fn_movie);

    fseek(fh, 0, SEEK_END);
    file_size = ftell(fh);
    fseek(fh, 0, SEEK_SET);

    silenceTIFFWarnings();

    // Try reading as TIFF; this handle is kept open
    ftiff = TIFFOpen(fn_movie.c_str(), "r");
    printf("TIFFOpen %s (%p)\n",fn_movie.c_str(), (void *) ftiff);

...

    fclose(fh);
    printf("fclose %p\n", (void *) fh);
    ready = true;

...

    TIFFClose(ftiff);
    printf("TIFFClose %p\n", (void *) ftiff);

@mokca

mokca commented Feb 23, 2022

I've made a little more progress.

MotioncorrRunner::executeOwnMotionCorrection calls EERRenderer::read, which calls TIFFOpen. It later calls EERRenderer::renderFrames, which calls EERRenderer::lazyReadFrames, which calls TIFFClose. This is fine.

However, at some point before this, Micrograph::SetMovie is called. This also calls EERRenderer::read, which results in a call to TIFFOpen. There is no corresponding call to TIFFClose.
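
To illustrate the pattern with a toy model (this is not RELION code, and not a claim about how it should be fixed there): if the handle opened by read is owned by an RAII wrapper, a caller that only calls read and never reaches the closing call no longer leaks a descriptor. ToyRenderer and FileCloser are hypothetical names, and a plain FILE* stands in for the TIFF handle.

```cpp
#include <cstdio>
#include <memory>
#include <string>

// Deleter so that unique_ptr closes the file when the owner goes away.
struct FileCloser {
    void operator()(std::FILE *f) const {
        if (f != nullptr) std::fclose(f);
    }
};

// Hypothetical stand-in for a renderer whose read() keeps a handle open.
class ToyRenderer {
public:
    // Opens the file and keeps the handle for later frame reading.
    bool read(const std::string &path) {
        handle_.reset(std::fopen(path.c_str(), "rb"));
        return handle_ != nullptr;
    }

    // Plays the role of the later step that consumes and closes the file.
    void lazyReadFrames() { handle_.reset(); }

    bool isOpen() const { return handle_ != nullptr; }

private:
    // Closed automatically in the destructor, even if lazyReadFrames()
    // is never called on this object.
    std::unique_ptr<std::FILE, FileCloser> handle_;
};
```

With a raw pointer member and no destructor, constructing a ToyRenderer per movie and calling only read would consume one descriptor per movie, which matches the symptom of failing after a few thousand files.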

@biochem-fan
Member

@mokca Thank you very much for the great investigation!
Now I understand what is wrong. I will take care of this in a few days.

@biochem-fan
Member

I fixed this problem in commit 505e422 for both 3.1 (the master and ver3.1 branches) and 4.0 (the ver4.0 branch).

Thank you very much for your report and investigation.
