
Overhaul of Mutect2 filtering #5688

Merged
merged 15 commits on Mar 11, 2019

Conversation

@davidbenjamin (Contributor) commented Feb 19, 2019

Closes #4893. Closes #5086. Closes #5684. Closes #4500. Makes #4933, #4958, and #5085 possible.

@takutosato Failing tests are superficial. You can begin reviewing.

This is a big PR:

  • Refactor of all M2 filtering. Each filter has its own class, and the filtering engine ties it all together.
  • Learn allele fraction clustering and somatic SNV and indel priors.
  • More probabilistic filters.
  • All filters have a common probabilistic threshold.
  • M2 determines threshold automatically.
  • Rewrite of all M2 documentation.
  • Several filters, including strand bias and normal artifact, learn their own parameters.

@LeeTL1220 M2 validations look really, really good.

@meganshand Once this goes in, mitochondria best practices will need to be tweaked again. We can merge the dangling-tails homoplasmic fix before merging this.

@codecov-io commented Feb 19, 2019

Codecov Report

Merging #5688 into master will decrease coverage by 51.14%.
The diff coverage is 86.938%.

@@               Coverage Diff               @@
##              master     #5688       +/-   ##
===============================================
- Coverage     86.999%   35.859%   -51.14%     
+ Complexity     31873     17519    -14354     
===============================================
  Files           1942      1975       +33     
  Lines         146789    147184      +395     
  Branches       16216     16228       +12     
===============================================
- Hits          127705     52779    -74926     
- Misses         13174     89595    +76421     
+ Partials        5910      4810     -1100
Impacted Files Coverage Δ Complexity Δ
...titute/hellbender/tools/walkers/GenotypeGVCFs.java 89.352% <ø> (-0.971%) 93 <0> (-1)
...lkers/ReferenceConfidenceVariantContextMerger.java 90.544% <ø> (-4.298%) 108 <0> (-2)
...stitute/hellbender/tools/walkers/CombineGVCFs.java 94.479% <ø> (ø) 70 <0> (ø) ⬇️
...ellbender/tools/exome/FilterByOrientationBias.java 83.019% <ø> (ø) 14 <0> (ø) ⬇️
.../walkers/contamination/CalculateContamination.java 96.552% <ø> (ø) 10 <0> (ø) ⬇️
...ender/tools/walkers/annotator/InbreedingCoeff.java 72.414% <ø> (-13.793%) 7 <0> (-3)
...der/tools/walkers/CombineGVCFsIntegrationTest.java 1.262% <ø> (-88.675%) 2 <0> (-36)
...s/walkers/validation/CalculateMixingFractions.java 78.431% <ø> (ø) 15 <0> (ø) ⬇️
...s/walkers/validation/MergeMutect2CallsWithMC3.java 81.25% <ø> (ø) 13 <0> (ø) ⬇️
...tools/walkers/mutect/SomaticLikelihoodsEngine.java 86.667% <ø> (-2.222%) 17 <0> (-1)
... and 1382 more

@takutosato (Contributor) left a comment

I've reviewed everything but the ThresholdCalculator (and the WDLs and tests). Just wanted to give you the review comments I have thus far so you don't have to wait until I'm done done.

\usepackage{listings}
\lstset{basicstyle=\ttfamily,
showstringspaces=false,
commentstyle=\color{red},
@takutosato (Contributor):

This is a matter of taste, but I think red is a little distracting. Gray looked good when I tried it. I'm sure other colors would work too.

@takutosato (Contributor):

Also, I think captions look better at the bottom. Adding captionpos=b here will do it, if you choose to move them.

@davidbenjamin (Author):

done x 2 -- looks much better
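For reference, a sketch of the listings setup with the two suggested changes applied (gray comments and bottom captions; the exact preamble in the PR may differ):

```latex
\usepackage{listings}
\usepackage{xcolor}
\lstset{basicstyle=\ttfamily,
        showstringspaces=false,
        commentstyle=\color{gray},  % gray comments, per review
        captionpos=b}               % captions below listings
```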

[-I tumor2.bam -I tumor3.bam . . .] \
# Mutect2 may input matched normals from the same individual
[-I normal1.bam -I normal2.bam . . .] \
# For most purposes Mutect2 should be supplied with gnomad
@takutosato (Contributor):

gnomad -> gnomAD

@davidbenjamin (Author):

done

where $\psi$ is the digamma function and $N$ is the number of reads. To obtain $q(\vz)$ and $q(\vf)$ we iterate Equations \ref{z_mean_field} and \ref{f_mean_field} until convergence. A very reasonable initialization is to set $\bar{z}_{ra} = 1$ if $a$ is the most likely allele for read $r$, 0 otherwise. Having obtained the mean field of $\vz$, we would like to plug it into Eq \ref{evidence}. We can't do this directly, of course, because Eq \ref{evidence} says nothing about our mean field factorization. Rather, we need the variational approximation (Bishop's Eq 10.3) to the model evidence, which is
We want to marginalize the latent variables to obtain the evidence $P(\mathbb{R} | \mathbb{A})$, which we make tractable via a mean-field approximation $P(\mathbb{R}, \vz, \vf | \mathbb{A}) \approx q(\vz) q(\vf)$, which is exact in two limits. First, if there are many reads, each allele is associated with many reads and therefore the Law of Large Numbers causes $\vf$ and $\vz$ to become uncorrelated. Second, if the allele assignments of reads are obvious $\vz_r$ is effectively determinate, hence uncorrelated with $\vf$. In the variational Bayesian mean-field formalism we have
\begin{align}
q(\vf) \propto& E_{q(\vz)} \left[ P(\mathbb{R}, \vz, \vf | \mathbb{A}) \right] \propto {\rm Dir}(\vf | \valpha + \sum_r \bar{\vz}_r) \equiv {\rm Dir}(\vf | \vbeta), \quad \vbeta = \valpha + \sum_r \bar{\vz}_r \label{z_mean_field} \\
@takutosato (Contributor):

Because expectation and log don't commute, I think the first of these proportional-to relationships is not true, i.e. q(f) ~ E_z [P(X,Z,f)] isn't the same as ln q(f) ~ E_z [ln P(X,Z,f)] (aka Bishop 10.9).

@davidbenjamin (Author):

fixed -- fortunately the equations to the right were correct
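For reference, the corrected mean-field relationship (Bishop Eq. 10.9) takes the expectation of the log rather than of the probability itself:

```latex
\ln q(\vf) = E_{q(\vz)} \left[ \ln P(\mathbb{R}, \vz, \vf | \mathbb{A}) \right] + \textrm{const},
```

which, after exponentiating, still yields the Dirichlet posterior ${\rm Dir}(\vf | \vbeta)$ with $\vbeta = \valpha + \sum_r \bar{\vz}_r$ on the right-hand side above.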

\end{align}
Before we proceed, let's introduce some notation. First, from Eq \ref{qf} the posterior $q(\vf)$ is
are easily obtained from the categorical distribution $q(\vz)$ and the Dirichlet distribution $q(\vf)$\footnote{Note that we didn't \textit{impose} this in any way. It simply falls out of the mean field equations.} We initialize $\bar{z}_{ra} = 1$ if $a$ is the most likely allele for read $r$, 0 otherwise and iterate Equations \ref{z_mean_field} and \ref{f_mean_field} until convergence. Having obtained the mean fields of $q(\vz)$ and $q(\vf)$, we use the variational approximation (Bishop's Eq 10.3) to the model evidence:
@takutosato (Contributor):

$q(\vz)$ -> $q(\vz_r)$

@takutosato (Contributor):

I think

@takutosato (Contributor):

And missing a period before footnote

@davidbenjamin (Author):

This one is correct as long as you overload \vz in that without a subscript it's the set of all \vz_r. How do you feel about it?

@davidbenjamin (Author):

fixed the missing period

@takutosato (Contributor):

Yup \vz as the set of all \vz_r sounds good to me

\begin{equation}
E_{\rm Dir(\vf | \vomega)} \left[ \ln f_a \right] = \psi(\omega_a) - \psi(\sum_{a^\prime} \omega_{a^\prime}) \equiv h_a(\vomega).
P({\rm error}) = 1 - (1 - {\rm max~artifact~error~prob})(1 - {\rm max~non-somatic~prob})(1 - {\rm sequencing~error~prob}).
@takutosato (Contributor):

"non-somatic" looks like "non minus somatic"

@davidbenjamin (Author):

fixed, and now I know that \textrm is better than \rm and doesn't need those tildes.
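The combined error probability above can be sketched as a small helper (hypothetical name): a call is an error unless it escapes every independent error mode, so the survival probabilities multiply.

```java
// Sketch of the combined error probability from the M2 docs above.
class ErrorProbSketch {
    // maxArtifactProb: largest posterior artifact probability over all artifact filters
    // maxNonSomaticProb: largest probability of a non-somatic origin
    // seqErrorProb: probability of sequencing error
    static double errorProbability(final double maxArtifactProb,
                                   final double maxNonSomaticProb,
                                   final double seqErrorProb) {
        // error unless the call survives every independent error mode
        return 1 - (1 - maxArtifactProb) * (1 - maxNonSomaticProb) * (1 - seqErrorProb);
    }
}
```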

.filter(eStep -> eStep.getArtifactProbability() > 0.1).collect(Collectors.toList());
final double totalArtifacts = potentialArtifacts.stream().mapToDouble(EStep::getArtifactProbability).sum();
final double totalNonArtifacts = eSteps.stream().mapToDouble(e -> 1 - e.getArtifactProbability()).sum();
strandArtifactPrior = (totalArtifacts + ARTIFACT_PSEUDOCOUNT) / (totalNonArtifacts + NON_ARTIFACT_PSEUDOCOUNT);
@takutosato (Contributor):

According to my calculation the denominator should be the total count N = totalArtifacts + totalNonArtifacts + alpha, instead of the effective count of non-artifacts.

@davidbenjamin (Author):

True, and this fixes a bug in this branch that showed up in mitochondria.
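A hypothetical sketch of the corrected M step: the learned prior is the effective artifact count over the *total* effective count, where the bug was dividing by the non-artifact count alone (names and pseudocount values here are illustrative, not the PR's).

```java
import java.util.List;

// Illustrative sketch of learning a strand-artifact prior from E-step posteriors.
class StrandPriorSketch {
    static final double ARTIFACT_PSEUDOCOUNT = 1.0;
    static final double NON_ARTIFACT_PSEUDOCOUNT = 1.0;

    // artifactProbabilities: per-site posterior artifact probabilities from the E step
    static double learnPrior(final List<Double> artifactProbabilities) {
        final double totalArtifacts = artifactProbabilities.stream().mapToDouble(p -> p).sum();
        final double totalNonArtifacts = artifactProbabilities.stream().mapToDouble(p -> 1 - p).sum();
        // denominator is the total effective count, not just the non-artifact count
        return (totalArtifacts + ARTIFACT_PSEUDOCOUNT)
                / (totalArtifacts + totalNonArtifacts + ARTIFACT_PSEUDOCOUNT + NON_ARTIFACT_PSEUDOCOUNT);
    }
}
```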


@Override
protected void learnParameters() {
final List<EStep> potentialArtifacts = eSteps.stream()
@takutosato (Contributor):

Why filter the list by prob > 0.1? I'm sure it wouldn't hurt, but it also seems unnecessary.

@davidbenjamin (Author):

Partly for speed in the case of large VCFs, and partly to be robust to a bad initialization, where we don't want a lot of questionable strand artifacts to cause us to learn bad parameters.

@takutosato (Contributor):

I see, thanks.

\end{align}
where ${\rm D}_j$ denotes the $j$th existing Dirichlet Process cluster, ${\rm Dirichlet}_{\rm new}$ denotes a newly-created Dirichlet cluster, $N_j$ is the number of variants assigned to cluster $j$, and $N_D$ is the total number of variants assigned to Dirichlet clusters.
@takutosato (Contributor):

${\rm D}_j$ -> ${\rm Dirichlet}_j$ to be consistent with the equations

@davidbenjamin (Author):

fixed

P({\rm Sequencing~error}) \propto& 1 - \pi_{\rm real} \\
P({\rm High~AF}) \propto& \pi_{\rm real} \pi_H \\
P({\rm Background}) \propto& \pi_{\rm real} \pi_B \\
P({\rm Dirichlet}_j) \propto& \pi_{\rm real} \pi_D \frac{ N_j}{N_D + \alpha} \\
@takutosato (Contributor):

Maybe add the superscript -i or equivalent on N_j and N_D to signify the fact that you call popDatum() before each Gibbs update.

@takutosato (Contributor):

And I think we can replace \propto with =

@davidbenjamin (Author):

done x 2
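The Chinese Restaurant Process assignment weights from the equations above can be sketched as follows (hypothetical names; in the PR the counts N_j already exclude the datum being reassigned via popDatum()): an existing Dirichlet cluster gets weight proportional to its count, a new cluster gets weight proportional to the concentration parameter.

```java
import java.util.Arrays;

// Illustrative sketch of the Gibbs assignment weights for Dirichlet clusters.
class CrpWeightsSketch {
    // countsPerCluster: N_j with the current datum removed; alpha: CRP concentration
    static double[] clusterWeights(final int[] countsPerCluster, final double alpha,
                                   final double piReal, final double piD) {
        final int totalCount = Arrays.stream(countsPerCluster).sum();  // N_D
        final double[] weights = new double[countsPerCluster.length + 1];
        for (int j = 0; j < countsPerCluster.length; j++) {
            // P(Dirichlet_j) = pi_real * pi_D * N_j / (N_D + alpha)
            weights[j] = piReal * piD * countsPerCluster[j] / (totalCount + alpha);
        }
        // a newly-created cluster gets the remaining alpha / (N_D + alpha) share
        weights[countsPerCluster.length] = piReal * piD * alpha / (totalCount + alpha);
        return weights;
    }
}
```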

pruneEmptyClusters();

final List<List<Datum>> dataByCluster = clusters.stream().map(c -> new ArrayList<Datum>()).collect(Collectors.toList());
for (final MutableInt datumIndex = new MutableInt(0); datumIndex.getValue() < clusterAssignments.size(); datumIndex.increment()) {
@takutosato (Contributor):

I would use an int instead of MutableInt here

@davidbenjamin (Author):

There's a lambda (so a plain int counter can't be mutated inside it), but I rewrote it with an IndexRange.

@takutosato (Contributor):

I see, that's inconvenient, but sounds good.
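A plain-Java equivalent of the rewrite (the PR uses GATK's IndexRange utility; this illustrative version streams over indices instead of carrying a MutableInt counter through the loop):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

// Illustrative sketch: group each datum into the cluster it is assigned to.
class GroupByClusterSketch {
    static <T> List<List<T>> groupByCluster(final List<T> data,
                                            final List<Integer> clusterAssignments,
                                            final int numClusters) {
        final List<List<T>> dataByCluster = new ArrayList<>();
        for (int c = 0; c < numClusters; c++) {
            dataByCluster.add(new ArrayList<>());
        }
        // stream over indices; no mutable counter needed inside the lambda
        IntStream.range(0, clusterAssignments.size())
                .forEach(n -> dataByCluster.get(clusterAssignments.get(n)).add(data.get(n)));
        return dataByCluster;
    }
}
```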

@takutosato (Contributor) left a comment
Finished review. A couple more comments.

IntStream.range(OFFSET, clusters.size()).boxed()
.sorted(Comparator.comparingDouble(c -> -log10CRPWeight(c)))
.forEach(c -> result.add(ImmutablePair.of("Binomial cluster " + c,
String.format("weight = %.4f, %s", Math.pow(10, log10CRPWeight(c)), clusters.get(c).toString()))));
@takutosato (Contributor):

log10CRPWeight(c) + log10SparseClustersWeight might be more informative than just log10CRPWeight(c), but I can see the benefit of the conditional probability too. I'm just bringing this up in case it was unintended.

@davidbenjamin (Author):

I intended the conditional but it was a dilemma between the two options.

});

// filter if there is no alt evidence in the forward or reverse strand
return Math.min(altForwardCount.getValue(), altReverseCount.getValue()) >= minReadsOnEachStrand;
@takutosato (Contributor):

I think you want to replace >= with <.

@davidbenjamin (Author):

fixed. That would have been quite a present for Maddy and Mark.
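The corrected check can be sketched as a standalone predicate (hypothetical name; in the PR the counts come from MutableInt accumulators): the call fails the filter when either strand has fewer alt reads than the minimum, which is what the original `>=` inverted.

```java
// Illustrative sketch of the corrected strand-evidence predicate.
class StrandEvidenceSketch {
    // returns true when the variant should be filtered for lacking alt
    // evidence on one of the strands
    static boolean failsStrandEvidence(final int altForwardCount,
                                       final int altReverseCount,
                                       final int minReadsOnEachStrand) {
        return Math.min(altForwardCount, altReverseCount) < minReadsOnEachStrand;
    }
}
```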

@davidbenjamin (Author):

@takutosato Back to you!

@takutosato (Contributor):

Looks good! @davidbenjamin

3 participants