Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor het genotyping code in ModelSegments. #3915

Open
samuelklee opened this issue Dec 5, 2017 · 7 comments
Open

Refactor het genotyping code in ModelSegments. #3915

samuelklee opened this issue Dec 5, 2017 · 7 comments

Comments

@samuelklee
Copy link
Contributor

samuelklee commented Dec 5, 2017

@davidbenjamin We might be able to share some code with contamination calculation, etc. Tangentially related, we also should unify the pileup-based tools at some point. Low priority, we can discuss after release.

@davidbenjamin
Copy link
Contributor

@samuelklee These are good goals.

@samuelklee
Copy link
Contributor Author

Making a note of #4001 here.

@samuelklee
Copy link
Contributor Author

Perhaps add a mode to ModelSegments that takes in a VCF of genotyped hets for the case (rather than allelic counts for the case and optionally for the matched normal). This would remove the responsibility for genotyping hets to an external tool, which could be better suited for handling cases like high purity LOH with no matched normal (in which case the naive genotyping done by ModelSegments is inappropriate). This is somewhat related to @sooheelee's concerns in #4717.

@samuelklee
Copy link
Contributor Author

samuelklee commented Aug 10, 2018

After looking more at the data from an hg38 NovaSeq run by @kcibul, I think a better strategy would be to use the normal allelic counts as a prior on whether a site is homozygous, rather than hard filtering on these sites (and pulling down corresponding counts from the tumor---this strategy was held over from GetHetCoverage/AllelicCapSeg). The main reason is that the normal will typically be sequenced at lower coverage (~30x), so this strategy will cause us to miss obvious hets in the tumor (~80x).

This is now relevant for two reasons: 1) it seems that we will want to run the filter with more stringent parameters, as higher base error rates are causing homs to leak past the filter, which in turn affects the fit of the allele-fraction model (which only attempts to model hets) by biasing normal segments towards unbalanced, and 2) we now want to run ModelSegments separately on the normal to allow for the filtering of germline events. So we want to be more stringent with low-coverage normals without affecting our high-coverage tumors.

For example, here's some hg38 NovaSeq FFPE WGS data from a ~40x normal:

download

Compare to an hg19 TCGA WGS ~40x normal:

download 1

The hom-ref tail in the first plot is much fatter and clearly leaks into the het cloud. Also curious is that the het cloud is far less binomial (or even beta-binomial---note also the absence of the tail extending to the origin).

I am still not sure why the incoming data looks different. There are several confounding factors: NovaSeq vs. HiSeq, hg38 vs. hg19, AF > 2% gnomAD sites vs. AF > 10% 1000G sites, FFPE vs. frozen, etc. I have not seen enough examples/combinations to be able to say which are the most important factors. Changing the genotyping/filtering strategy can get around this change in the data without a corresponding change in the allele-fraction model for now, but getting the data to look as good as possible upstream would be even better.

Another thought: would be nice if the strategy was easily compatible with an eventual implementation of multi-sample segmentation, which would require that the same sites are used in both the tumor and the normal. We would want to strike a balance between maximizing the number of sites and including questionable sites from the normal.

Will add more details later. @davidbenjamin @LeeTL1220 @eitanbanks @sooheelee may be interested.

@LeeTL1220
Copy link
Contributor

@samuelklee Are the bins in the hist2D logarithmic? Could you post an updated plot with a colorbar?

FYI... @yfarjoun

@samuelklee
Copy link
Contributor Author

Sure, here you go:

kristian
tcga

@samuelklee samuelklee removed their assignment Feb 1, 2019
@samuelklee
Copy link
Contributor Author

@bhanugandham @fleharty this issue touches upon our discussion of https://gatkforums.broadinstitute.org/gatk/discussion/24335/loh-detection-using-gatk4s-somatic-cnv-workflow. We might consider just a simple modification of the genotyping step (e.g., keeping all ROHs longer than a hard threshold) to start, which would probably cover the most common use cases with minimal effort. Can use 100% HCC1143 in tumor-only mode as an initial test, but it would be good to collect other examples.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants