HaplotypeCaller is reporting DP in HOMREF region differently when ploidy is set to 1 with different Interval inputs #8943

gokalpcelik · 2024-08-06T14:31:11Z

Hi all.
This looks like an issue presented by a forum post

Haplotypecaller FORMAT:DP is affected by the interval in WGS

User uploaded a toy data for us to test and I was able to recreate this issue under GATK 4.6.0.0. I have not tested it with any other versions.

When whole contig is given as interval all variant sites in the multisample VCF is reported with DP value much less than what it is supposed to be in samples where no variation occur. Samples with variants are shown as expected DP. Setting ploidy 2 for this analysis restores the expected DP value for samples with HOMREF sites no matter what interval is used. Numbers can be seen in the figure as well as those variant contexts.

This is what user and I got with the whole contig given as interval

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	DA10_DA10	TDY1754_TDY1754
CP022325.1	69348	.	T	A	2920.63	.	AC=1;AF=0.500;AN=2;DP=85;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=29.56;SOR=0.824	GT:AD:DP:GQ:PL	0:4,0:4:99:0,119	1:0,79:79:99:2931,0

This is what comes when -L is set to CP022325.1:69347-69349. This is the same DP reported when ploidy is set to 2 no matter what interval is used. This is also the expected DP value.

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	DA10_DA10	TDY1754_TDY1754
CP022325.1	69348	.	T	A	2920.63	.	AC=1;AF=0.500;AN=2;DP=98;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=25.36;SOR=0.824	GT:AD:DP:GQ:PL	0:17,0:17:99:0,685	1:0,79:79:99:2931,0

User data is in the incoming folder with name cmateusiak_20240805.tar.gz

The reference genome is a fungal one from the below link

Fungi reference C.NeoformansKN99

The text was updated successfully, but these errors were encountered:

droazen · 2024-08-06T15:12:31Z

@jamesemery will look into this later on in the week

cmatKhan · 2024-08-06T15:16:29Z

Also tested with 4.3.0.0 -- same result

cmatKhan · 2024-08-29T17:23:07Z

@jamesemery -- did you happen to get a change to look at this?

gokalpcelik · 2024-08-29T18:28:25Z

Hi @cmatKhan
Haploid confidence emission of HaplotypeCaller is a little more aggressive than diploid emission therefore there is really no easy way to change this behavior without changing the major parts of the reference confidence emission engine.

Currently the only way to fix this issue for you is to enable -ERC BP_RESOLUTION parameter which will eventually emit every single site on the genome but will emit close to what you will expect to see per SNP site.

I hope this helps.

jamesemery · 2024-08-29T19:06:56Z

Sorry for the delayed response @cmatKhan. The fix here is a little complicated because its a confluence of expected behaviors adding up to this.

Specifically in -ERC GVCF mode in HaplotypeCaller we merge adjacent regions with similar GQ scores but within those merged reference blocks we track the DP (mean dp) and the MIN_DP. When we go to genotype in GenotypeGVCFs we fall back to Min_DP when reporting the DP since it must be at least that high at the sight in question. For Diploid samples and with the default reference blocking we use, this works fine and for most of the range up to 30+ bases you are able to recover a pretty close approximation of what your DP is in most cases. However for Haploid data the assumptions at play mean that the GQ gets very high and starts to max out the field (99) so you end up with extremely long reference confidence blocks with very high confidence and a min_DP of some low value (in your case for the data you shared with us of 4, which is the threshold for hitting GQ=90 in the haploid model). This is also part of why adjusting the intervals for traversal was relevant, as if they were too long they were pulling in bases 1000+ bp away that only have 4fold coverage with reads and torpedoing the reported DP.

So far nothing seems to obviously be bugged (we don't test/use haploid calling mode very often so its less well tested and this sort of issue is not surprising). We might be over-confident about reporting GQ for haploid reference blocks (though mathematically its sound that you only need a few reference observations to be sure about the ref call). Short of adjusting that model drastically there will always be some threshold where "after X fold coverage of reads everything is merged into a big adjacent ref block". Unfortunately with block merging on we never expect the DP to be exact (since it should vary over the block and its expensive to store that information exactly) but the issue is much more pronounced in your haploid sample.

We didn't identify a good fix for this issue on our end but you might as well know that we looked into it and have considered why you are seeing this behavior.

cmatKhan · 2024-08-30T00:10:16Z

Thank you for the explanation. For what it's worth, that matches very well with what I observed in the gvcf. I appreciate the better insight.

I think it is worth putting a note to this effect in the --ploidy argument documentation, and printing a warning in the standard out at runtime that this artifact exists in the data. A less visible, but still useful, change would be to add it to the AD / DP blog post from not too far back.

droazen assigned jamesemery Aug 6, 2024

droazen added HaplotypeCaller bug labels Aug 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HaplotypeCaller is reporting DP in HOMREF region differently when ploidy is set to 1 with different Interval inputs #8943

HaplotypeCaller is reporting DP in HOMREF region differently when ploidy is set to 1 with different Interval inputs #8943

gokalpcelik commented Aug 6, 2024

droazen commented Aug 6, 2024

cmatKhan commented Aug 6, 2024

cmatKhan commented Aug 29, 2024

gokalpcelik commented Aug 29, 2024

jamesemery commented Aug 29, 2024

cmatKhan commented Aug 30, 2024

HaplotypeCaller is reporting DP in HOMREF region differently when ploidy is set to 1 with different Interval inputs #8943

HaplotypeCaller is reporting DP in HOMREF region differently when ploidy is set to 1 with different Interval inputs #8943

Comments

gokalpcelik commented Aug 6, 2024

droazen commented Aug 6, 2024

cmatKhan commented Aug 6, 2024

cmatKhan commented Aug 29, 2024

gokalpcelik commented Aug 29, 2024

jamesemery commented Aug 29, 2024

cmatKhan commented Aug 30, 2024