
GenotypeGVCFs memory issues on GATK 4.6.0.0 #8918

Open

jin0008 opened this issue Jul 17, 2024 · 15 comments

@jin0008
jin0008 commented Jul 17, 2024

Bug Report

Affected tool(s) or class(es)

GenotypeGVCFs

Affected version(s)

4.6.0.0

Description

When I ran GenotypeGVCFs on a GenomicsDB workspace of 420 samples, the process was interrupted due to significant memory issues; it consumed memory continuously. I ran the same process in 4.5.0.0 and confirmed that it works fine.

@gokalpcelik
Contributor

Can you provide the logs that show the error message?

@jin0008
Author

jin0008 commented Jul 19, 2024 via email

@gokalpcelik
Contributor

Can you provide more details on what operating system you are using and other related information, such as the Java version?

Even if the process gets interrupted by the system, there should be a Java segfault message thrown by the process at some point. Did you observe any files named ERR around the output file?

@jin0008
Author

jin0008 commented Jul 19, 2024 via email

@gokalpcelik
Contributor

Can you tell us how large your heap size is for this task? (-Xmx? -Xms?)
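For context, GATK's Java heap flags are passed through the gatk wrapper's --java-options. A minimal sketch, with placeholder file names and a hypothetical workspace name:

gatk --java-options "-Xmx8g -Xms8g" GenotypeGVCFs \
    -R ref.fa \
    -V gendb://genomicsdb_workspace \
    -O output.vcf.gz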

@broadinstitute deleted a comment from SaarGirl Aug 2, 2024
@icemduru

I have a similar issue. Weirdly, -Xmx does not help.

@gokalpcelik
Contributor

@icemduru
Can you provide more details on your issue? How many samples do you have? How did you combine them, and what are your command lines for this process?
Can you provide more details on the system you are running these commands on?

GenotypeGVCFs is not known to have memory-leak issues. Our tests indicated that it only needs around 4–6 GB of total memory to genotype 120 whole-genome samples (per contig).

@icemduru

Thanks for the reply. I have 370 samples. I ran HaplotypeCaller on each of them, then ran GenomicsDBImport for each chromosome (it is a plant genome, about 420 Mb in total). Then I tried to run GenotypeGVCFs for each chromosome. I have attached the log file for chr1.
slurm-22616776.out_text.txt

@gokalpcelik
Contributor

Hi @icemduru
It looks like your Slurm workload manager is configured with a limit of 48 GB of maximum process memory per execution. Your Java instance is set with -Xmx45G, which covers most of this limit and leaves only a handful of memory for the native GenomicsDB library. Native libraries allocate outside the heap, so it is better to set your -Xmx to a more sensible 8–12 GB and leave the rest of the memory for the native library to use.

Keep in mind that this memory limit on Slurm could be set per user, not per task, so you may need to run a single contig at a time, or perhaps two simultaneously. Otherwise, Slurm may interfere with all the tasks and cancel all your jobs.

One final reminder: we strongly recommend that users set the temporary directory to somewhere other than /tmp. The Slurm workload manager interferes with that location and sometimes causes premature termination of GATK processes because the extracted native library and accessory files get deleted.

I hope this helps.
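Putting that advice together, a sketch of the suggested per-chromosome invocation; the workspace name, reference, and tmp path are placeholders:

gatk --java-options "-Xmx10g" GenotypeGVCFs \
    -R ref.fa \
    -V gendb://chr1_workspace \
    -O chr1.vcf.gz \
    --tmp-dir /scratch/mytmp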

@icemduru

Thank you for your help, but unfortunately it didn't resolve the issue. I've already tried allocating 10GB of memory using the -Xmx10g flag and redirecting the temporary directory away from /tmp. However, GATK is still attempting to consume more than 48GB of RAM, resulting in the termination of my run.
slurm-22680938.out_text.txt

@gokalpcelik
Contributor

Hi again.
Did you add the --consolidate true parameter to GenomicsDBImport during the import stage? This step collapses each import layer into a single layer, which prevents tools from opening too many files at once, though it may take some time at the end of the import stage. It also reduces the amount of bookkeeping the genotyper has to do.
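For reference, a minimal sketch of an import with consolidation enabled; the sample map, interval, and workspace path are placeholders:

gatk --java-options "-Xmx8g" GenomicsDBImport \
    --sample-name-map samples.map \
    -L chr1 \
    --genomicsdb-workspace-path chr1_workspace \
    --consolidate true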

@icemduru

Hi,
Thanks for the suggestion. I used the --consolidate true parameter with GenomicsDBImport during the import stage, but it did not help. However, I solved my problem by using large-memory machines. For future reference, the required memory was 95.11 GB for the 370-sample dataset using -Xmx8G and --disable-bam-index-caching true.
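For reference, a sketch of the configuration just described, with hypothetical workspace and output names:

gatk --java-options "-Xmx8g" GenotypeGVCFs \
    -R ref.fa \
    -V gendb://chr1_workspace \
    -O chr1.vcf.gz \
    --disable-bam-index-caching true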

@Wangchangsh

Same problem. Any solution or update?

@gokalpcelik
Contributor

Hi @Wangchangsh
Yes, there is an update on this issue. We were able to reproduce this problem on our end, and it looks like there is a memory management issue somewhere in the GenomicsDB-related code inside GenotypeGVCFs.

Our temporary solution, until we make an updated release, is to convert imported GenomicsDB instances to GVCF using

gatk SelectVariants -V gendb://instancename -O GVCF_export.g.vcf.gz -R ref.fa -L whateverintervalusedinGDBimport

and then use this GVCF file as input for the GenotypeGVCFs tool. This keeps memory usage from climbing to unreasonable levels and avoids any apparent leaks.
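A sketch of the follow-up genotyping step on the exported GVCF; the file names mirror the command above:

gatk --java-options "-Xmx8g" GenotypeGVCFs \
    -R ref.fa \
    -V GVCF_export.g.vcf.gz \
    -O genotyped.vcf.gz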

I hope this helps.

Regards.

@Wangchangsh

Thank you for your prompt response. I used the -L parameter to split tasks into Mb-scale intervals to prevent memory issues.
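For illustration, each scattered job in that interval-splitting approach might look like the following; the interval, workspace, and file names are placeholders:

gatk --java-options "-Xmx8g" GenotypeGVCFs \
    -R ref.fa \
    -V gendb://chr1_workspace \
    -L chr1:1-5000000 \
    -O chr1_1-5000000.vcf.gz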
