Refactor het genotyping code in ModelSegments. #3915

samuelklee · 2017-12-05T15:40:57Z

@davidbenjamin We might be able to share some code with contamination calculation, etc. Tangentially related, we also should unify the pileup-based tools at some point. Low priority, we can discuss after release.

davidbenjamin · 2017-12-05T16:06:28Z

@samuelklee These are good goals.

samuelklee · 2017-12-20T19:43:33Z

Making a note of #4001 here.

samuelklee · 2018-05-21T17:45:41Z

Perhaps add a mode to ModelSegments that takes in a VCF of genotyped hets for the case (rather than allelic counts for the case and, optionally, for the matched normal). This would move the responsibility for genotyping hets to an external tool, which could be better suited to handling cases like high-purity LOH with no matched normal (where the naive genotyping done by ModelSegments is inappropriate). This is somewhat related to @sooheelee's concerns in #4717.
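To make the VCF-input idea concrete, here is a minimal sketch of pulling het allelic counts out of an externally genotyped VCF. It assumes plain-text, tab-delimited VCF lines with the standard `GT` and `AD` FORMAT keys; the function name `het_allelic_counts` and the `sample_index` parameter are hypothetical and not part of ModelSegments.

```python
def het_allelic_counts(vcf_lines, sample_index=0):
    """Yield (contig, position, ref_count, alt_count) for biallelic het sites.

    Assumes each record carries GT and AD in its FORMAT column.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 + sample_index:
            continue  # no genotype column for the requested sample
        contig, pos, _id, _ref, alt = fields[:5]
        if "," in alt:
            continue  # skip multiallelic sites in this sketch
        keys = fields[8].split(":")
        values = fields[9 + sample_index].split(":")
        sample = dict(zip(keys, values))
        genotype = sample.get("GT", "").replace("|", "/")
        if genotype not in ("0/1", "1/0"):
            continue  # keep only heterozygous calls
        depths = sample.get("AD", "").split(",")
        if len(depths) != 2 or not all(d.isdigit() for d in depths):
            continue  # missing or malformed allelic depths
        yield contig, int(pos), int(depths[0]), int(depths[1])


example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tCASE",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:AD\t0/1:12,9",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT:AD\t1/1:0,20",
    "chr1\t300\t.\tG\tA,T\t50\tPASS\t.\tGT:AD\t1/2:0,8,7",
]
het_sites = list(het_allelic_counts(example_vcf))
```

With this shape, the genotyping decision lives entirely in whatever tool produced the VCF, and the downstream model only consumes the ref/alt counts at the sites that tool called het.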

samuelklee · 2018-08-10T19:57:30Z

After looking more at the data from an hg38 NovaSeq run by @kcibul, I think a better strategy would be to use the normal allelic counts as a prior on whether a site is homozygous, rather than hard filtering on these sites and pulling down the corresponding counts from the tumor (a strategy held over from GetHetCoverage/AllelicCapSeg). The main reason is that the normal will typically be sequenced at lower coverage (~30x), so hard filtering on the normal will cause us to miss obvious hets in the tumor (~80x).
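As a rough illustration of the "normal as a prior" idea, the sketch below computes a het posterior from the normal counts and then uses it as the prior when genotyping the tumor. The likelihood model is an assumption made for this example, not the ModelSegments implementation: a homozygous site has alt fraction `error_rate` or `1 - error_rate` with equal weight, and a het site has an alt fraction that is uniform on [0, 1], which gives a marginal likelihood of 1 / (n + 1). All function names and default values are hypothetical.

```python
import math


def log_binom_pmf(k, n, p):
    """Log binomial probability of k successes in n trials."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log1p(-p)


def posterior_het(ref, alt, prior_het=0.5, error_rate=0.05):
    """Posterior probability that a site is het given its ref/alt counts."""
    n = ref + alt
    # Hom likelihood: equal mixture of hom-ref and hom-alt (assumed model).
    hom_ref = log_binom_pmf(alt, n, error_rate)
    hom_alt = log_binom_pmf(alt, n, 1.0 - error_rate)
    m = max(hom_ref, hom_alt)
    log_hom = m + math.log(0.5 * math.exp(hom_ref - m) + 0.5 * math.exp(hom_alt - m))
    # Het likelihood: binomial with the alt fraction integrated uniformly over [0, 1].
    log_het = -math.log(n + 1)
    a = math.log(prior_het) + log_het
    b = math.log1p(-prior_het) + log_hom
    top = max(a, b)
    return math.exp(a - top) / (math.exp(a - top) + math.exp(b - top))


def tumor_het_posterior(normal_counts, tumor_counts, population_prior_het=0.5, error_rate=0.05):
    """Genotype the tumor site using the normal's het posterior as the prior."""
    prior = posterior_het(*normal_counts, prior_het=population_prior_het, error_rate=error_rate)
    prior = min(max(prior, 1e-12), 1.0 - 1e-12)  # keep the logs finite
    return posterior_het(*tumor_counts, prior_het=prior, error_rate=error_rate)


# A shallow normal that looks hom-ish (9 ref, 1 alt) with a clearly het tumor.
ambiguous_normal = tumor_het_posterior((9, 1), (45, 35))
# A normal and tumor that both look hom-ref.
consistent_hom = tumor_het_posterior((28, 2), (75, 5))
```

Under these assumptions, a site that a hard filter on the shallow normal might discard is still called het when the tumor evidence is strong, while a site that looks homozygous in both samples stays below 0.5.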

This is now relevant for two reasons: 1) it seems that we will want to run the filter with more stringent parameters, as higher base error rates are causing homs to leak past the filter, which in turn affects the fit of the allele-fraction model (which only attempts to model hets) by biasing normal segments towards unbalanced, and 2) we now want to run ModelSegments separately on the normal to allow for the filtering of germline events. So we want to be more stringent with low-coverage normals without affecting our high-coverage tumors.

For example, here's some hg38 NovaSeq FFPE WGS data from a ~40x normal:

Compare to an hg19 TCGA WGS ~40x normal:

The hom-ref tail in the first plot is much fatter and clearly leaks into the het cloud. Also curious is that the het cloud is far less binomial (or even beta-binomial---note also the absence of the tail extending to the origin).

I am still not sure why the incoming data looks different. There are several confounding factors: NovaSeq vs. HiSeq, hg38 vs. hg19, AF > 2% gnomAD sites vs. AF > 10% 1000G sites, FFPE vs. frozen, etc. I have not seen enough examples/combinations to be able to say which are the most important factors. Changing the genotyping/filtering strategy can get around this change in the data without a corresponding change in the allele-fraction model for now, but getting the data to look as good as possible upstream would be even better.

Another thought: it would be nice if the strategy were easily compatible with an eventual implementation of multi-sample segmentation, which would require that the same sites are used in both the tumor and the normal. We would want to strike a balance between maximizing the number of sites and including questionable sites from the normal.

Will add more details later. @davidbenjamin @LeeTL1220 @eitanbanks @sooheelee may be interested.

LeeTL1220 · 2018-08-22T13:20:44Z

@samuelklee Are the bins in the hist2D logarithmic? Could you post an updated plot with a colorbar?

FYI... @yfarjoun

samuelklee · 2018-08-22T13:37:46Z

Sure, here you go:

samuelklee · 2019-09-16T15:39:19Z

@bhanugandham @fleharty this issue touches upon our discussion of https://gatkforums.broadinstitute.org/gatk/discussion/24335/loh-detection-using-gatk4s-somatic-cnv-workflow. We might consider just a simple modification of the genotyping step (e.g., keeping all ROHs longer than a hard threshold) to start, which would probably cover the most common use cases with minimal effort. Can use 100% HCC1143 in tumor-only mode as an initial test, but it would be good to collect other examples.
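A minimal sketch of the "keep all ROHs longer than a hard threshold" modification might look like the following. It assumes sites are already genotyped as het or hom and sorted by contig and position; the function name `sites_to_keep` and the 1 Mb default for `min_roh_length` are placeholders chosen for illustration, not values taken from GATK.

```python
def sites_to_keep(sites, min_roh_length=1_000_000):
    """Return one boolean per site: True for hets and for hom sites in a long ROH.

    sites: list of (contig, position, is_het), sorted by contig then position.
    A run of homozygosity (ROH) is a maximal stretch of consecutive hom sites
    on one contig; it is kept when it spans at least min_roh_length bases.
    """
    keep = [is_het for _, _, is_het in sites]
    run_start = None
    for i in range(len(sites) + 1):
        is_hom = i < len(sites) and not sites[i][2]
        extends_run = is_hom and (run_start is None or sites[i][0] == sites[run_start][0])
        if extends_run:
            if run_start is None:
                run_start = i
            continue
        # The current run (if any) ends at site i - 1.
        if run_start is not None:
            span = sites[i - 1][1] - sites[run_start][1]
            if span >= min_roh_length:
                for j in range(run_start, i):
                    keep[j] = True
            run_start = None
        # A hom site on a new contig starts a fresh run.
        if is_hom:
            run_start = i
    return keep


example_sites = [
    ("chr1", 100, True),
    ("chr1", 1_000, False),
    ("chr1", 500_000, False),
    ("chr1", 2_000_000, False),
    ("chr1", 3_000_000, True),
    ("chr1", 3_100_000, False),
    ("chr1", 3_200_000, False),
    ("chr2", 10, False),
    ("chr2", 2_000_000, False),
]
kept = sites_to_keep(example_sites)
```

Short hom runs are still filtered as before, so only long stretches, which are the candidates for LOH in a tumor-only sample, are passed through to the allele-fraction model.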

samuelklee self-assigned this Dec 5, 2017

samuelklee added the Copy Number tools label Dec 5, 2017

samuelklee mentioned this issue Dec 5, 2017

Added code and WDL to complete ModelSegments CNV pipeline. #3913

Merged

samuelklee mentioned this issue Jun 18, 2018

Add VCF input and support for allelic counts from indels to ModelSegments pipeline. #4903

Open

samuelklee added the Somatic CNV label Jan 31, 2019

samuelklee removed their assignment Feb 1, 2019

samuelklee mentioned this issue Mar 12, 2020

Enabled multisample segmentation in ModelSegments. #6499

Merged
