Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issues with complex haplotypes #272

Closed
brentp opened this issue Feb 14, 2020 · 14 comments
Closed

issues with complex haplotypes #272

brentp opened this issue Feb 14, 2020 · 14 comments

Comments

@brentp
Copy link

brentp commented Feb 14, 2020

hi, I am using DV 0.9 to find de novo variants in trios and so I am enriching for weird stuff.
I am sure your team is aware of some/all of these, but I'll document at least 1 case here for the record.

Nearly all obvious false positive de novo calls follow the same pattern. Mostly there is a haplotype from mom , and a haplotype from dad that are different. The kid inherits both the variable haplotypes, but the combination of DV and glnexus gentoype such that the variant appears as de novo.

But, here is an example where there is just an incorrect call from DV that leads to 3 neighboring spurious de novo calls in the kid (top row):
bad-dn

note that dad in the 3rd row has a single read with a 1-base deletion (dash) followed by an insertion (purple tick). I can't show all of the reads in this image, but I have scrolled through and verified that is the only read.

here is the content of the dad's VCF for that region (the mom's is actually very similar):

chr8	75144980	.	CT	C	44.7	PASS	.	GT:GQ:DP:AD:VAF:PL	0/1:44:34:16,18:0.529412:44,0,53
chr8	75144983	.	T	TG	49.5	PASS	.	GT:GQ:DP:AD:VAF:PL	0/1:49:35:18,17:0.485714:49,0,64

note that that is the single-base del and the insertion that occurs in only 1 read. Since the mom's is the same, maybe there's something akin to realignment going on, but
by contrast, here is the kid's (seemingly more sensible) VCF for that region:

chr8	75144981	.	T	A	71.9	PASS	.	GT:GQ:DP:AD:VAF:PL	0/1:66:27:11,16:0.592593:71,0,67
chr8	75144982	.	A	T	63.6	PASS	.	GT:GQ:DP:AD:VAF:PL	0/1:61:27:11,16:0.592593:63,0,63
chr8	75144983	.	T	G	67.9	PASS	.	GT:GQ:DP:AD:VAF:PL	0/1:66:27:11,16:0.592593:67,0,71

here is the content of the gvcf for dad:

chr8	75144980	.	CT	C,<*>	44.7	PASS	.	GT:GQ:DP:AD:VAF:PL	0/1:44:34:16,18,0:0.529412,0:44,0,53,990,990,990
chr8	75144982	.	A	<*>	0	.	END=75144982	GT:GQ:MIN_DP:PL	0/0:48:16:0,48,479
chr8	75144983	.	T	TG,<*>	49.5	PASS	.	GT:GQ:DP:AD:VAF:PL	0/1:49:35:18,17,0:0.485714,0:49,0,64,990,990,990
chr8	75144984	.	G	<*>	0	.	END=75145000	GT:GQ:MIN_DP:PL	0/0:50:31:0,105,1049

and kid:

chr8	75144981	.	T	A,<*>	71.9	PASS	.	GT:GQ:DP:AD:VAF:PL	0/1:66:27:11,16,0:0.592593,0:71,0,67,990,990,990
chr8	75144982	.	A	T,<*>	63.6	PASS	.	GT:GQ:DP:AD:VAF:PL	0/1:61:27:11,16,0:0.592593,0:63,0,63,990,990,990
chr8	75144983	.	T	G,<*>	67.9	PASS	.	GT:GQ:DP:AD:VAF:PL	0/1:66:27:11,16,0:0.592593,0:67,0,71,990,990,990
chr8	75144984	.	G	<*>	0	.	END=75145000	GT:GQ:MIN_DP:PL	0/0:50:25:0,75,749

I am attaching a small sam for kid and dad aligned to hg38
kid.sam.gz
dad.sam.gz

I have other scenarios, but this one is one that seems clearly a deep variant issue and not a problem with glnexus.

@tedyun
Copy link
Collaborator

tedyun commented Feb 14, 2020

Thank you Brent! We started an internal discussion about it. Will keep you posted.

@akolesnikov
Copy link
Collaborator

akolesnikov commented Feb 14, 2020

I ran make_examples with dad.sam.gz and recorded output of the realigner (upper part is original BAM and lower part show realigned reads). You are right Brent, realinged reads are mapped to support 1 del and 1 ins. But in the end the haplotype would be the same with 1 DEL and 1 INS or 3 SNPs, it is just a different representation. This behavior depends on penalty scores that we use for Smith–Waterman alignment.

image

@akolesnikov
Copy link
Collaborator

akolesnikov commented Feb 15, 2020

And child reads should have been aligned the same way as dad's. But we have a special algorithm that decides whether to run realignment. In the case of a child this algorithm decided against realignment.
This algorithm can be turned off in which case reads will be realigned for all samples the same way. In order to do that make_examples needs to run with --nows_use_window_selector_model flag.
Below is the example command I used to run make_example:

make_examples
--ref GRCh38.genome.fa
--reads kid_sorted.bam
--examples /tmp/examples.tfrecord.gz
--mode calling
--regions chr8:75144950-75145050
--nows_use_window_selector_model

@pichuan
Copy link
Collaborator

pichuan commented Feb 15, 2020

Thanks @akolesnikov !

@brentp if you're using run_deepvariant to run, you should be able to use

--make_examples_extra_args "ws_use_window_selector_model=false"

@brentp
Copy link
Author

brentp commented Feb 15, 2020

hi, thanks for the replies. so --make_examples_extra_args "ws_use_window_selector_model=false" (since I am using run_deepvariant) will accomplish the same as what @akolesnikov suggests with --nows_use_window_selector_model ?

and I don't see much documentation on that option, is there any downside to that? (I assume it just affects run time?)

@pichuan
Copy link
Collaborator

pichuan commented Feb 15, 2020

Yes, if you use run_deepvariant (which is a wrapper for the 3 separate steps), the *_extra_args flags are just a more flexible way to allow you specify flags for each step.
In run_deepvariant, we hard-coded some of the commonly used flags but not all of them. For example, you can't directly specify --ws_use_window_selector_model to run_deepvariant.
We thought people might eventually have use cases for all other less known flags. (Which is why the *_extra_args flags exist, but haven't been advertised or documented other than just the flag description.

make_examples_extra_args description can be found here:

flags.DEFINE_string(
'make_examples_extra_args', None,
'A comma-separated list of flag_name=flag_value. "flag_name" has to be '
'valid flags for make_examples.py. If the flag_value is boolean, it has to '
'be flag_name=true or flag_name=false.')

Regarding the downside of turning off ws_use_window_selector_model in make_examples -- yes, it'll be slower. It's a small model where we used to decide whether we need to realign a window or not. When we set it to default, we found that it significantly decreased runtime, with negligible trade-offs.

You can find the description of this feature when it's first released in v0.7:
https://github.com/google/deepvariant/releases/tag/v0.7.0

"Changed window selector to use a linear decision model for choosing realignment candidates. This can be controlled by a flag. -ws_use_window_selector_model which is now on by default."

@kokyriakidis
Copy link

@pichuan Is this option gonna increase accuracy and sensitivity in general?

@pichuan
Copy link
Collaborator

pichuan commented Feb 17, 2020

@kokyriakidis to your question, having ws_use_window_selector_model was to improve runtime, not to improve accuracy or sensitivity. But empirically we expect the trade-off on accuracy to be small. Below you can see my comparison between turning it on/off for WGS and WES on v0.9. (On PACBIO setting, this doesn't affect anything because we don't run realigner for PACBIO.)


I ran some numbers based on the v0.9 WGS and WES case study data. I used this type of CPU machine for the runtime.

The following results were done in two settings:
BASE: This is the default (i.e., ws_use_window_selector_model is true)
EXPT: Turn off window selector (i.e., set ws_use_window_selector_model to false)

On WGS case study:

Settings make_examples runtime Indel F1 SNP F1
BASE 81m29.320s 0.998112 0.999633
EXPT 107m3.893s 0.998156 0.999642

On WES case study:

Settings make_examples runtime Indel F1 SNP F1
BASE 10m5.515s 0.973295 0.999318
EXPT 19m25.869s 0.974056 0.999362

For the detailed hap.py output, you can see here.

@kokyriakidis
Copy link

Hi @pichuan ,

Thank you very much for this wonderful analysis! Turning it off MAY be worth the increased runtime in cases where someone is looking for rare SNPs in a population or in a clinical environment where you prioritize accuracy over runtime.

@brentp
Copy link
Author

brentp commented Feb 18, 2020

I re-ran DV+glnexus on a cohort of ~150 WGS trios. Using --make_examples_extra_args "ws_use_window_selector_model=false"
does reduce the apparent mendelian violation rate. Here is the plot of that before (left) and with that argument (right). This is after some sane filtering on GQ and rare in gnomad.

original

The right side is lower; good. But, then I further filtered that set of variants on your recently released thousand genomes DV VCF, requiring an allele frequency less than 1% there. Now, even though the RHS as dropped by ~20 DNs per trio, the "before" LHS, has a lower number of putative de novos and the end result is that it is worse to run with ws_use_window_selector_model=false

filtering

This must be because you ran the thousand genomes samples without turning off the window selector and I am annotating on exact REF and ALT allele matches, not using haplotype enumeration as does hap.py, but AFAIK, this is how all (allele frequency) annotation software will work.

Any recommendations on how to proceed? I did bcftools norm on my set and on the deep variant thousand G calls and I suspect something like vcfallelelicprimitives will not help here either.
A thousand G callset run without the window selector would help, but is only a band-aid. Is there an annotation software that does haplotype matching? Presumably this will be more of a problem as the GATK juggernaut fades.

@AndrewCarroll
Copy link
Collaborator

Hi @brentp

This is a very good question, but I am not sure I have a good answer for you. For you and @kokyriakidis the fact that window selector can cause inconsistency across a pedigree was not something we previously appreciated when considering the trade-offs between speed and accuracy. In the intermediate future, we hope to profile performance of make_examples to improve speed and at that time we will revisit the decision to enable this by default. This will not occur for the upcoming release, but possibly the following one.

It is possible to create a BED file covering de novo sites and force window selector=false across all of them, this should be fast and it could be an exercise we would want to do on the 1KG pVCF.

We are also working on some more involved methods for calling in a trio, but that is also in the intermediate timescale.

@brentp
Copy link
Author

brentp commented Feb 19, 2020

Hi Andrew, thanks for the reply. I hadn't previously appreciated the extent of this problem before either--not just in DV, but everywhere. The "kid" and "dad" variants I posted above can be made to be the same variant-set if we know they are on the same haplotype--and running with the window selection off does this, but it still creates a problem with annotating across different call-sets when different representations are used. gnomad prefers the representation of an insertion and a deletion rather than the 3 individual SNPs.

So, even if this is further resolved in DV (running with ws_use_window_selector_model=false is sufficient for me), then it will be, in cases like this, impossible to correctly annotate across cohorts without phasing information.

edit: feel free to close this issue as my original issue is addressed.

@brentp
Copy link
Author

brentp commented Feb 27, 2020

I can open a separate issue if it's helpful, but just a couple more things related to this...
First, while the haplotype stuff like in the images above is mostly gone with ws_use_window_selector_model=false, I still see the problem in some false positive calls.

Another thing that happens with things that DV calls de novos but obviously are not is that the kid will just meet some threshold and have a number of MQ ~40 reads with the de novo, where as the parent will have a number of reads with the allele that are MQ ~18 or lower. But the VCF reports AD[1] == 0 for many of these in the parent. If the count of low-quality alleles were reported in the sample fields in the VCF, it would be simpler to filter to make sure the allele was absent from the parent.

pichuan added a commit that referenced this issue Mar 26, 2020
…e to increase the accuracy and runtime of Illumina WGS and WES runs, but doesn't affect PacBio.

This is a follow-up from #272.

PiperOrigin-RevId: 298676661
@tedyun
Copy link
Collaborator

tedyun commented Mar 27, 2020

Hi @brentp @kokyriakidis ,

We (genomics team in Google Health) have released DeepVariant 0.10 on 3/26, which includes improving consistency and accuracy by turning off ws_use_window_selector_model by default, along with the following other changes:

  • Updated to Python3 and TensorFlow2
  • Improved PacBio model for amplified libraries

For more information, please read our release notes: https://github.com/google/deepvariant/releases/tag/v0.10.0
If you have any feedback about your experience with v0.10, please let us know.

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants