Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added code and WDL to complete ModelSegments CNV pipeline. #3913

Merged
merged 11 commits into from
Dec 14, 2017

Conversation

samuelklee
Copy link
Contributor

@samuelklee samuelklee commented Dec 5, 2017

Some notes on individual commits:

Updated CallCopyRatioSegments and PreprocessIntervals; reorganized copynumber packages.
-For motivation of changes in CallCopyRatioSegments, see #3825.
-I added the ability to turn off binning in PreprocessIntervals by specifying bin_length = 0.
-I removed the separation between coverage and allelic packages to make the package structure a bit simpler.
-@MartonKN should review, since he wrote PreprocessIntervals and is updating the caller.

Added segmentation classes and tests for ModelSegments CNV pipeline.
-I added implementations of copy-ratio, allele-fraction, and "multidimensional" (joint) segmentation. All implementations are pretty boilerplate; they simply partition by contig and then call out to KernelSegmenter. Note that there is some logic in multidimensional segmentation that only uses the first het in each copy-ratio interval and if any are available, and imputes the alt-allele fraction to 0.5 if not.
-Makes sense for @mbabadi to review this, since he reviewed the KernelSegmenter PR.

Added modeling classes and tests for ModelSegments CNV pipeline.
-Most of this code is copied from the old MCMC code. However, I've done some overall code cleanup and refactoring, especially to remove some overextraction of methods in the allele-fraction likelihoods (see #2860). I also added downsampling and scaling of likelihoods to cut down on runtime. Tests have been simplified and rewritten to use simulated data.
-@LeeTL1220 do you think you could take a look?

Added ModelSegments CLI.
-Mostly control flow to handle optional inputs and validation, but there is some ugly and not well documented code that essentially does the GetHetCoverage step. We'll refactor later, I filed #3915.
-@asmirnov239 can review. This is lower priority than the gCNV VCF writing.

Deleted gCNV WDL and Cromwell tests.
-Trivial to review.

Added WDL and Cromwell tests for ModelSegments CNV pipeline.
-This includes the cost optimizations from @meganshand and @jsotobroad (sorry guys, I wasn't sure how to track your contributions while fixing up commits!) I also added tests for both GC/no-GC pair workflows.
-@MartonKN should review to gain familiarity with the WDL. Note that this WDL has already been through many revisions from @meganshand, @jsotobroad, and @LeeTL1220, so hopefully there shouldn't be too much for you to find serious fault with.

Note that I punted on adding MultidimensionalKernelSegmenterUnitTest and ModelSegmentsIntegrationTest. Filed #3916.

Closes #2858. (FINALLY!)
Closes #3825.
Closes #3661.

@codecov-io
Copy link

codecov-io commented Dec 5, 2017

Codecov Report

Merging #3913 into master will decrease coverage by 0.302%.
The diff coverage is 63.402%.

@@               Coverage Diff               @@
##              master     #3913       +/-   ##
===============================================
- Coverage     79.476%   79.174%   -0.302%     
- Complexity     17286     17452      +166     
===============================================
  Files           1131      1155       +24     
  Lines          61514     62744     +1230     
  Branches        9738      9816       +78     
===============================================
+ Hits           48889     49677      +788     
- Misses          8852      9245      +393     
- Partials        3773      3822       +49
Impacted Files Coverage Δ Complexity Δ
...er/tools/copynumber/formats/records/CopyRatio.java 69.565% <ø> (ø) 8 <0> (?)
...hellbender/tools/genome/SparkGenomeReadCounts.java 89.844% <ø> (ø) 20 <0> (ø) ⬇️
...tools/copynumber/formats/records/AllelicCount.java 76.667% <ø> (ø) 16 <0> (?)
...ools/copynumber/CreateReadCountPanelOfNormals.java 86.957% <ø> (ø) 7 <0> (ø) ⬇️
...hellbender/tools/copynumber/DenoiseReadCounts.java 88.889% <ø> (ø) 9 <0> (ø) ⬇️
.../copynumber/formats/records/AnnotatedInterval.java 52.632% <ø> (ø) 7 <0> (?)
.../formats/collections/ModeledSegmentCollection.java 69.767% <ø> (ø) 3 <0> (?)
...ols/copynumber/formats/records/ModeledSegment.java 93.75% <ø> (ø) 9 <0> (?)
...umber/formats/collections/CopyRatioCollection.java 100% <ø> (ø) 8 <0> (?)
.../tools/copynumber/denoising/SVDDenoisingUtils.java 79.204% <ø> (ø) 43 <0> (?)
... and 89 more

Copy link
Collaborator

@asmirnov239 asmirnov239 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samuelklee done with the review of the ModelSegments CLI. Feel free to not incorporate the suggested changes or make tickets for them.

import java.util.stream.Collectors;

/**
* @author Samuel Lee &lt;[email protected]&gt;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc is missing, were you planning to add it later in your documentation pr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup.

}

logger.info("Modeling available denoised copy ratios and heterozygous allelic counts...");
//initial MCMC model fitting performed by MultidimensionalModeller constructor
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could provide a link here to MultidimensionalModeller

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do this in non-Javadoc comments?

}

private void validateArguments() {
Utils.nonNull(outputPrefix);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also check presence of outputDir too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will also double check that this is done in corresponding tools.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, and done.

//(for use by CallCopyRatioSegments, if copy ratios are available)
final MultidimensionalSegmentCollection multidimensionalSegments;
if (inputDenoisedCopyRatiosFile != null && inputAllelicCountsFile == null) {
readDenoisedCopyRatios();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do prefer more functional code and I feel like this and other methods calls below (such as readDenoisedCopyRatios and readAndFilterAllelicCounts) should not work through side effects

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, the control flow could probably be improved here and in the gCNV CLIs. Filed #3951.


//filter on total count in matched normal
logger.info(String.format("Filtering allelic counts in matched normal with total count less than %d...", minTotalAlleleCount));
final AllelicCountCollection filteredNormalAllelicCounts = new AllelicCountCollection(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could extract this and other similar filtering code to a new function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, already filed #3915.


@Argument(
doc = "Threshold on z-score of non-log2 copy ratio used for calling segments. " +
"If non-log2 copy ratio z-score is above this threshold for a copy-neutral segment, " +
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I accidentally copied and pasted a few lines from above here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@MartonKN
Copy link
Contributor

Done with my code review @samuelklee!

@samuelklee
Copy link
Contributor Author

@LeeTL1220, @mbabadi is out on vacation. Would you mind reviewing the commit I assigned to him? Thanks!

Copy link
Contributor

@LeeTL1220 LeeTL1220 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was really easy for me to follow. Thanks for that. I have one javadoc change that is not major.

elif [[ $RUN_CNV_SOMATIC_LEGACY_WDL == true ]]; then
echo "Running legacy CNV somatic workflows";
bash scripts/cnv_cromwell_tests/somatic_legacy/run_cnv_somatic_workflows.sh;
elif [[ $RUN_CNV_GERMLINE_WDL == true ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we no longer running germline tests? Is it because we are going to roll out the new workflow and don't want the support burden?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also cleaned up in the germline PRs. The new workflows are run there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, gCNV Spark also got blown away.

set -e
GATK_JAR=${default="/root/gatk.jar" gatk4_jar_override}

java -Xmx${default="8" mem}g -jar $GATK_JAR CollectFragmentCounts \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did these default memory values come from @jsotobroad ? This is fairly clunky.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think he made the changes to all tasks. Perhaps just the most memory intensive ones. Made a note in #3940 to clean these up.


runtime {
docker: "${gatk_docker}"
memory: select_first([mem, 5]) + " GB"
memory: machine_mem + " MB"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks more like what I was expecting for memory runtime attributes.


The reference used must be the same between PoN and case samples.

- ``CNVSomaticPanelWorkflow.gatk_docker`` -- GATK Docker image (e.g., ``broadinstitute/gatk:x.beta.x``).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have an example that is not in beta as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Made a note to change this in the gCNV WDL readme too.

@@ -0,0 +1,13 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example values would be better here. Otherwise, the template is not very useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are just placeholders in the M2 JSON templates ("$intervals"). See the "Please note that python templates..." in the M2 WDL readme. I think this was copied over from the legacy CNV readme, from which it was subsequently deleted.

This is essentially the output of wdltool inputs, and I think that having the parameter type is actually more useful than the python template strings. But in my opinion, these templates are only slightly useful to begin with and they're one more thing that we have to keep up to date. I'm actually fine with removing them altogether. @davidbenjamin

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But in my opinion, these templates are only slightly useful to begin with and they're one more thing that we have to keep up to date. I'm actually fine with removing them altogether.

I don't think they're useful, either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will delete them.

@@ -101,7 +102,7 @@ install:
else
./gradlew assemble;
./gradlew installDist;
if [[ $RUN_CNV_SOMATIC_LEGACY_WDL == true || $RUN_M2_WDL == true || $RUN_CNV_GERMLINE_WDL == true ]]; then
if [[ $RUN_CNV_SOMATIC_LEGACY_WDL == true || $RUN_CNV_SOMATIC_WDL == true || $RUN_CNV_GERMLINE_WDL == true || $RUN_M2_WDL == true ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're still going to run the legacy WDL? Haven't you removed the code in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is cleaned up in a later PR.

Collectors.toList()));
}

public MultidimensionalSegmentCollection findSegmentation(final int maxNumChangepointsPerChromosome,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing javadoc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, added some here and to the other KernelSegmenters.

Copy link
Contributor

@MartonKN MartonKN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with my review @samuelklee

@@ -66,11 +67,11 @@
public static final String PADDING_SHORT_NAME = "P";

@Argument(
doc = "Length (in bp) of the bins.",
doc = "Length (in bp) of the bins. If zero, no binning will be performed.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double space in front of 'If'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what our policy is on double spaces between sentences. @sooheelee @vdauwera?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving them for now, then.

@@ -120,6 +121,9 @@ private static IntervalList padAndMergeIntervals(final List<SimpleInterval> inpu
}

private static IntervalList generateBins(final IntervalList preparedIntervalList, final int binLength, final SAMSequenceDictionary sequenceDictionary) {
if (binLength == 0) {
return IntervalList.copyOf(preparedIntervalList);
}
final IntervalList bins = new IntervalList(sequenceDictionary);
for (final Interval interval : preparedIntervalList) {
for (int binStart = interval.getStart(); binStart <= interval.getEnd(); binStart += binLength) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good!

@Argument(
doc = "Threshold on z-score of non-log2 copy ratio used for calling segments. " +
"If non-log2 copy ratio z-score is above this threshold for a copy-neutral segment, " +
"it is considered an outlier and not used in the calculation of the length-weighted standard deviation " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should say 'length-weighted mean and standard deviation used for calling the segment.'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, done.

@@ -161,7 +161,7 @@ task CreatePanelOfNormals {
--truncatePercentileThreshold 0.1 \
--noQC ${default="false" no_qc} \
--output ${pon_entity_id}.pon
}
>>>

runtime {
docker: "${gatk_docker}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read the code through, looks good.

@samuelklee
Copy link
Contributor Author

Thanks all, addressed comments and rebased. Will merge after tests pass unless there are any objections.

@samuelklee
Copy link
Contributor Author

Oops, pretty sure that I screwed up the rebase and that tests will fail. Will fix later tonight.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants