Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

haplotypecaller ArrayIndexOutOfBoundsException : org.broadinstitute.hellbender.tools.walkers.annotator.TandemRepeat.getNumTandemRepeatUnits #8106

Open
1 of 2 tasks
lindenb opened this issue Nov 25, 2022 · 5 comments
Assignees

Comments

@lindenb
Copy link

lindenb commented Nov 25, 2022

Instructions

Bug Report

Hi GATK team , I'm afraid I found an exception in gatk HC related to #6516

Affected tool(s) or class(es)

GATK HC v4.3.0.0

Affected version(s)

  • Latest public release version [version?]
  • Latest master branch as of [date of test?]

Description

+ gatk --java-options '-Xmx5g -Djava.io.tmpdir=TMP' HaplotypeCaller -R /LAB-DATA/BiRD/resources/species/human/cng.fr/hs38me/hs38me_all_chr.fasta --minimum-mapping-quality 10 --sample-ploidy 2 --do-not-run-physical-phasing --alleles TMP/jeter.vcf.gz -L TMP/jeter.vcf.gz -I /SCRATCH-BIRD/users/lindenbaum-p/work/NEXTFLOW/20221123.hs38me.NTS299.ultrares/work/87/5fa0df303dc4f06212547353be621c/BAMS/cluster.aaaaaaacx.bam.list -O TMP/jeter2.vcf.gz                                                                                Using GATK jar /LAB-DATA/BiRD/users/lindenbaum-p/packages/gatk/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar
Running:                                                                               
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx5g -Dj
ava.io.tmpdir=TMP -jar /LAB-DATA/BiRD/users/lindenbaum-p/packages/gatk/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar HaplotypeCaller -R /LAB-DATA/BiRD/resources/species/human/cng.fr/hs38me/hs38me_all_chr.fasta --minimum-mapping-quality 10 --sample-ploidy 2 --do-not-run-physical-phasing --alleles TMP/jeter.vcf.gz -L TMP/jeter.vcf.gz -I /SCRATCH-BIRD/users/lindenbaum-p/work/NEXTFLOW/20221123.hs38me.NTS299.ultrares/work/87/5fa0df303dc4f06212547353be621c/BAMS/cluster.aaaaaaacx.bam.list -O TMP/jeter2.vcf.gz
18:15:19.107 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/LAB-DATA/BiRD/users/lindenbaum-p/packages/gatk/gatk-4.3.0.0/gatk-package-4.3.0.0-local.j
ar!/com/intel/gkl/native/libgkl_compression.so                                                                                                                                
18:15:21.727 INFO  HaplotypeCaller - ------------------------------------------------------------                                                                   
18:15:21.728 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.3.0.0                                                                                              
18:15:21.728 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
18:15:21.728 INFO  HaplotypeCaller - Executing as lindenbaum-p@bigmem002 on Linux v3.10.0-1160.66.1.el7.x86_64 amd64
18:15:21.728 INFO  HaplotypeCaller - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_131-b11   
18:15:21.728 INFO  HaplotypeCaller - Start Date/Time: November 24, 2022 6:15:19 PM CET                                                                                        
18:15:21.728 INFO  HaplotypeCaller - ------------------------------------------------------------
18:15:21.728 INFO  HaplotypeCaller - ------------------------------------------------------------                                               
18:15:21.728 INFO  HaplotypeCaller - HTSJDK Version: 3.0.1
18:15:21.728 INFO  HaplotypeCaller - Picard Version: 2.27.5
18:15:21.728 INFO  HaplotypeCaller - Built for Spark Version: 2.4.5
18:15:21.728 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2                                                                                                    
18:15:21.729 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false                                      
18:15:21.729 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true                                            
18:15:21.729 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false                           
18:15:21.729 INFO  HaplotypeCaller - Deflater: IntelDeflater                                                                                                                  
18:15:21.729 INFO  HaplotypeCaller - Inflater: IntelInflater                                                                                                                  
18:15:21.729 INFO  HaplotypeCaller - GCS max retries/reopens: 20                   
18:15:21.729 INFO  HaplotypeCaller - Requester pays: disabled                                                                                                                 
18:15:21.729 INFO  HaplotypeCaller - Initializing engine                                                                                                                      
18:15:23.068 INFO  FeatureManager - Using codec VCFCodec to read file file:///SCRATCH-BIRD/users/lindenbaum-p/work/NEXTFLOW/20221123.hs38me.NTS299.ultrares/work/34/f410396038
a7bb49eaece47a172e2d/TMP/jeter.vcf.gz                                                  
18:15:23.077 WARN  IntelInflater - Zero Bytes Written : 0             
18:15:23.361 INFO  FeatureManager - Using codec VCFCodec to read file file:///SCRATCH-BIRD/users/lindenbaum-p/work/NEXTFLOW/20221123.hs38me.NTS299.ultrares/work/34/f410396038
18:15:23.361 INFO  FeatureManager - Using codec VCFCodec to read file file:///SCRATCH-BIRD/users/lindenbaum-p/work/NEXTFLOW/20221123.hs38me.NTS299.ultrares/work/34/f4[0/1667]a7bb49eaece47a172e2d/TMP/jeter.vcf.gz 
18:15:23.374 WARN  IntelInflater - Zero Bytes Written : 0                              
18:15:23.385 WARN  IntelInflater - Zero Bytes Written : 0                              
18:15:23.403 INFO  IntervalArgumentCollection - Processing 1028 bp from intervals      
18:15:23.411 INFO  HaplotypeCaller - Done initializing engine                          
18:15:23.430 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/LAB-DATA/BiRD/users/lindenbaum-p/packages/gatk/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so                                                                                                                                            
18:15:23.475 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/LAB-DATA/BiRD/users/lindenbaum-p/packages/gatk/gatk-4.3.0.0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so                                                                                                                                
18:15:23.651 INFO  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM  
18:15:23.651 INFO  IntelPairHmm - Available threads: 4                                                                                                                        
18:15:23.651 INFO  IntelPairHmm - Requested threads: 4                                                                                                                        
18:15:23.651 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation                                                                    
18:15:23.671 INFO  ProgressMeter - Starting traversal                                                                                                                         
18:15:23.671 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute                                                               
18:15:26.788 WARN  InbreedingCoeff - InbreedingCoeff will not be calculated at position chr1:30191420 and possibly subsequent; at least 10 samples must have called genotypes 
18:15:27.190 WARN  DepthPerSampleHC - Annotation will not be calculated at position chr1:30477350 and possibly subsequent; genotype for sample B00I9EL is not called
18:15:35.547 INFO  ProgressMeter -        chr1:32128426              0.2                    40            202.1                                                               
18:15:48.416 INFO  ProgressMeter -        chr1:36398656              0.4                    80            194.0   
18:15:51.025 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.012874514                            
18:15:51.026 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 5.818477527000001
18:15:51.026 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 1.35 sec                                                                                
18:15:51.027 INFO  HaplotypeCaller - Shutting down engine                                                                                                                     
[November 24, 2022 6:15:51 PM CET] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.53 minutes.
Runtime.totalMemory()=3212836864                                                       
java.lang.ArrayIndexOutOfBoundsException                                               
        at java.util.Arrays.copyOfRange(Arrays.java:3521)          
        at org.broadinstitute.hellbender.tools.walkers.annotator.TandemRepeat.getNumTandemRepeatUnits(TandemRepeat.java:54)                                                   
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.AssemblyRegionTrimmer.trim(AssemblyRegionTrimmer.java:189)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerEngine.callRegion(HaplotypeCallerEngine.java:655)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller.apply(HaplotypeCaller.java:271)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:200)                                                          
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:173)                                                                  
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1095)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)                                                                      
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)                                                    
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)                                                                 
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)     
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)                                                                                                             

the same command ran fin with another intervals.

@lindenb
Copy link
Author

lindenb commented Nov 27, 2022

and here is a minimal example:

minimal.tar.gz

with

+ gatk --java-options '-Xmx5g -Djava.io.tmpdir=.' HaplotypeCaller -R /LAB-DATA/BiRD/resources/species/human/cng.fr/hs38me/hs38me_all_chr.fasta --minimum-mapping-quality 10 --sample-ploidy 2 --do-not-run-physical-phasing --alleles jeter2.vcf.gz -L jeter2.vcf.gz -I jeter.bam -O jeter3.vcf.gz
java.lang.ArrayIndexOutOfBoundsException
        at java.util.Arrays.copyOfRange(Arrays.java:3521)
        at org.broadinstitute.hellbender.tools.walkers.annotator.TandemRepeat.getNumTandemRepeatUnits(TandemRepeat.java:54)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.AssemblyRegionTrimmer.trim(AssemblyRegionTrimmer.java:189)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerEngine.callRegion(HaplotypeCallerEngine.java:655)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller.apply(HaplotypeCaller.java:271)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:200)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:173)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1095)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)

@lindenb
Copy link
Author

lindenb commented Jan 23, 2023

I'm trying to explore this bug:

my cmd is :

$ java -Xmx5g -Djava.io.tmpdir=TMP -jar ./build/libs/gatk-package-unspecified-SNAPSHOT-local.jar   HaplotypeCaller -R /LAB-DATA/BiRD/resources/species/human/cng.fr/hs38me/hs38me_all_chr.fasta --minimum-mapping-quality 10 --sample-ploidy 2 --do-not-run-physical-phasing --alleles jeter2.vcf.gz -L jeter2.vcf.gz -I jeter.bam -O TMP/jeter3.vcf.gz   

in : ./src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/TandemRepeat.getNumTandemRepeatUnits

the variant is VC Unknown @ chr1:36957826-36958130 Q109.21 of type=INDEL alleles=[CCTGTGTGTCCCATGACTCTGTGCCCCGTGAGTCTGTGTGCCCCGTGACTCTGTGCCCCTGAGTCTGTGTGCCCTGTGAGTCTGTGCCCCGTGAGCCTGTGTGCCCCATGAGCCTGTGTGTCCCATGACTCTGTGCCCCGTGAGTCTGTGTGCCCCATGACTCTGTGCCCCCTGAGTCTGTGTGCCCGGTGAGTCTGTGCCCCATGAGCCTGTGTGCTCTTTGAATCTTTGCTCCGTGAGCCTGTGTGCTCTTTGAATCTGCGCCCTGTGAGCCTGTGTGCCCCGTGAGTCTGTGTCCCATGAGT*, C] ...

with ref.getWindow().getStart() = 36958026

so startIndex is lower than zero when = final int startIndex = vc.getStart() + 1 - ref.getWindow().getStart(); = -199

which rises a java.lang.ArrayIndexOutOfBoundsException with: Arrays.copyOfRange(refBases, startIndex, refBases.length)

so it looks like the ReferenceContext used by getNumTandemRepeatUnits was too narrow for the given variant.

@davidbenjamin
Copy link
Contributor

I think I see what's going on.

Since the force calling allele is so huge (a deletion from 36957826 to 36958130) the GATK engine does not create an AssemblyRegion that spans it. Rather, it creates a big region from 36957826 to 36958125 and a tiny one from 36958125 to 36958130. This is silly and worth fixing but the bug hasn't occurred yet.

The GATK goes through both assembly regions, the big one and the small one, and the force calling allele is genotyped in both. This happens because in HaplotyeCallerEngine, line 607, the call to features.getValues(hcArgs.alleles) grabs all overlapping variants in the force calling VCF, thus leading to its appearance in both assembly regions.

I think the fix might be as simple as counting only force calling alleles that begin in an assembly region. A different solution might be to guarantee upstream that force calling alleles fit completely in a single assembly region, perhaps skipping them with a warning if they are too big for the GATK to handle.

@droazen I can get my hands dirty and probably fix this reasonably quickly at this point, but could you weigh in on the two possible solutions?

@droazen
Copy link
Contributor

droazen commented Jan 26, 2023

@davidbenjamin Counting only alleles that begin in an assembly region seems more consistent with our past solutions to problems of this nature. See also the PR #6388, which we might be able to use here.

@davidbenjamin
Copy link
Contributor

Great, in that case I am optimistic that this will be a quick fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants