Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HaplotypeCaller is reporting DP in HOMREF region differently when ploidy is set to 1 with different Interval inputs #8943

Open
gokalpcelik opened this issue Aug 6, 2024 · 6 comments

Comments

@gokalpcelik
Copy link
Contributor

Hi all.
This looks like an issue presented by a forum post

Haplotypecaller FORMAT:DP is affected by the interval in WGS

User uploaded a toy data for us to test and I was able to recreate this issue under GATK 4.6.0.0. I have not tested it with any other versions.

When whole contig is given as interval all variant sites in the multisample VCF is reported with DP value much less than what it is supposed to be in samples where no variation occur. Samples with variants are shown as expected DP. Setting ploidy 2 for this analysis restores the expected DP value for samples with HOMREF sites no matter what interval is used. Numbers can be seen in the figure as well as those variant contexts.

This is what user and I got with the whole contig given as interval

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	DA10_DA10	TDY1754_TDY1754
CP022325.1	69348	.	T	A	2920.63	.	AC=1;AF=0.500;AN=2;DP=85;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=29.56;SOR=0.824	GT:AD:DP:GQ:PL	0:4,0:4:99:0,119	1:0,79:79:99:2931,0

This is what comes when -L is set to CP022325.1:69347-69349. This is the same DP reported when ploidy is set to 2 no matter what interval is used. This is also the expected DP value.

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	DA10_DA10	TDY1754_TDY1754
CP022325.1	69348	.	T	A	2920.63	.	AC=1;AF=0.500;AN=2;DP=98;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=25.36;SOR=0.824	GT:AD:DP:GQ:PL	0:17,0:17:99:0,685	1:0,79:79:99:2931,0

image

User data is in the incoming folder with name cmateusiak_20240805.tar.gz

The reference genome is a fungal one from the below link

Fungi reference C.NeoformansKN99

@droazen
Copy link
Collaborator

droazen commented Aug 6, 2024

@jamesemery will look into this later on in the week

@cmatKhan
Copy link

cmatKhan commented Aug 6, 2024

Also tested with 4.3.0.0 -- same result

@cmatKhan
Copy link

@jamesemery -- did you happen to get a change to look at this?

@gokalpcelik
Copy link
Contributor Author

Hi @cmatKhan
Haploid confidence emission of HaplotypeCaller is a little more aggressive than diploid emission therefore there is really no easy way to change this behavior without changing the major parts of the reference confidence emission engine.

Currently the only way to fix this issue for you is to enable -ERC BP_RESOLUTION parameter which will eventually emit every single site on the genome but will emit close to what you will expect to see per SNP site.

I hope this helps.

@jamesemery
Copy link
Collaborator

Sorry for the delayed response @cmatKhan. The fix here is a little complicated because its a confluence of expected behaviors adding up to this.

Specifically in -ERC GVCF mode in HaplotypeCaller we merge adjacent regions with similar GQ scores but within those merged reference blocks we track the DP (mean dp) and the MIN_DP. When we go to genotype in GenotypeGVCFs we fall back to Min_DP when reporting the DP since it must be at least that high at the sight in question. For Diploid samples and with the default reference blocking we use, this works fine and for most of the range up to 30+ bases you are able to recover a pretty close approximation of what your DP is in most cases. However for Haploid data the assumptions at play mean that the GQ gets very high and starts to max out the field (99) so you end up with extremely long reference confidence blocks with very high confidence and a min_DP of some low value (in your case for the data you shared with us of 4, which is the threshold for hitting GQ=90 in the haploid model). This is also part of why adjusting the intervals for traversal was relevant, as if they were too long they were pulling in bases 1000+ bp away that only have 4fold coverage with reads and torpedoing the reported DP.

So far nothing seems to obviously be bugged (we don't test/use haploid calling mode very often so its less well tested and this sort of issue is not surprising). We might be over-confident about reporting GQ for haploid reference blocks (though mathematically its sound that you only need a few reference observations to be sure about the ref call). Short of adjusting that model drastically there will always be some threshold where "after X fold coverage of reads everything is merged into a big adjacent ref block". Unfortunately with block merging on we never expect the DP to be exact (since it should vary over the block and its expensive to store that information exactly) but the issue is much more pronounced in your haploid sample.

We didn't identify a good fix for this issue on our end but you might as well know that we looked into it and have considered why you are seeing this behavior.

@cmatKhan
Copy link

Thank you for the explanation. For what it's worth, that matches very well with what I observed in the gvcf. I appreciate the better insight.

I think it is worth putting a note to this effect in the --ploidy argument documentation, and printing a warning in the standard out at runtime that this artifact exists in the data. A less visible, but still useful, change would be to add it to the AD / DP blog post from not too far back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants