-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added code and WDL to complete ModelSegments CNV pipeline. #3913
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3913 +/- ##
===============================================
- Coverage 79.476% 79.174% -0.302%
- Complexity 17286 17452 +166
===============================================
Files 1131 1155 +24
Lines 61514 62744 +1230
Branches 9738 9816 +78
===============================================
+ Hits 48889 49677 +788
- Misses 8852 9245 +393
- Partials 3773 3822 +49
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@samuelklee done with the review of the ModelSegments CLI. Feel free to not incorporate the suggested changes or make tickets for them.
import java.util.stream.Collectors; | ||
|
||
/** | ||
* @author Samuel Lee <[email protected]> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc is missing, were you planning to add it later in your documentation pr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup.
} | ||
|
||
logger.info("Modeling available denoised copy ratios and heterozygous allelic counts..."); | ||
//initial MCMC model fitting performed by MultidimensionalModeller constructor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could provide a link here to MultidimensionalModeller
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do this in non-Javadoc comments?
} | ||
|
||
private void validateArguments() { | ||
Utils.nonNull(outputPrefix); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also check presence of outputDir
too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, will also double check that this is done in corresponding tools.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is, and done.
//(for use by CallCopyRatioSegments, if copy ratios are available) | ||
final MultidimensionalSegmentCollection multidimensionalSegments; | ||
if (inputDenoisedCopyRatiosFile != null && inputAllelicCountsFile == null) { | ||
readDenoisedCopyRatios(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do prefer more functional code and I feel like this and other methods calls below (such as readDenoisedCopyRatios
and readAndFilterAllelicCounts
) should not work through side effects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, the control flow could probably be improved here and in the gCNV CLIs. Filed #3951.
|
||
//filter on total count in matched normal | ||
logger.info(String.format("Filtering allelic counts in matched normal with total count less than %d...", minTotalAlleleCount)); | ||
final AllelicCountCollection filteredNormalAllelicCounts = new AllelicCountCollection( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could extract this and other similar filtering code to a new function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, already filed #3915.
|
||
@Argument( | ||
doc = "Threshold on z-score of non-log2 copy ratio used for calling segments. " + | ||
"If non-log2 copy ratio z-score is above this threshold for a copy-neutral segment, " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I accidentally copied and pasted a few lines from above here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Done with my code review @samuelklee! |
@LeeTL1220, @mbabadi is out on vacation. Would you mind reviewing the commit I assigned to him? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was really easy for me to follow. Thanks for that. I have one javadoc change that is not major.
elif [[ $RUN_CNV_SOMATIC_LEGACY_WDL == true ]]; then | ||
echo "Running legacy CNV somatic workflows"; | ||
bash scripts/cnv_cromwell_tests/somatic_legacy/run_cnv_somatic_workflows.sh; | ||
elif [[ $RUN_CNV_GERMLINE_WDL == true ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we no longer running germline tests? Is it because we are going to roll out the new workflow and don't want the support burden?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also cleaned up in the germline PRs. The new workflows are run there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, gCNV Spark also got blown away.
set -e | ||
GATK_JAR=${default="/root/gatk.jar" gatk4_jar_override} | ||
|
||
java -Xmx${default="8" mem}g -jar $GATK_JAR CollectFragmentCounts \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did these default memory values come from @jsotobroad ? This is fairly clunky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think he made the changes to all tasks. Perhaps just the most memory intensive ones. Made a note in #3940 to clean these up.
|
||
runtime { | ||
docker: "${gatk_docker}" | ||
memory: select_first([mem, 5]) + " GB" | ||
memory: machine_mem + " MB" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks more like what I was expecting for memory runtime attributes.
scripts/cnv_wdl/somatic/README.md
Outdated
|
||
The reference used must be the same between PoN and case samples. | ||
|
||
- ``CNVSomaticPanelWorkflow.gatk_docker`` -- GATK Docker image (e.g., ``broadinstitute/gatk:x.beta.x``). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have an example that is not in beta as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Made a note to change this in the gCNV WDL readme too.
@@ -0,0 +1,13 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example values would be better here. Otherwise, the template is not very useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are just placeholders in the M2 JSON templates ("$intervals"). See the "Please note that python templates..." in the M2 WDL readme. I think this was copied over from the legacy CNV readme, from which it was subsequently deleted.
This is essentially the output of wdltool inputs, and I think that having the parameter type is actually more useful than the python template strings. But in my opinion, these templates are only slightly useful to begin with and they're one more thing that we have to keep up to date. I'm actually fine with removing them altogether. @davidbenjamin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But in my opinion, these templates are only slightly useful to begin with and they're one more thing that we have to keep up to date. I'm actually fine with removing them altogether.
I don't think they're useful, either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, will delete them.
@@ -101,7 +102,7 @@ install: | |||
else | |||
./gradlew assemble; | |||
./gradlew installDist; | |||
if [[ $RUN_CNV_SOMATIC_LEGACY_WDL == true || $RUN_M2_WDL == true || $RUN_CNV_GERMLINE_WDL == true ]]; then | |||
if [[ $RUN_CNV_SOMATIC_LEGACY_WDL == true || $RUN_CNV_SOMATIC_WDL == true || $RUN_CNV_GERMLINE_WDL == true || $RUN_M2_WDL == true ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're still going to run the legacy WDL? Haven't you removed the code in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is cleaned up in a later PR.
Collectors.toList())); | ||
} | ||
|
||
public MultidimensionalSegmentCollection findSegmentation(final int maxNumChangepointsPerChromosome, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing javadoc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, added some here and to the other KernelSegmenters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with my review @samuelklee
@@ -66,11 +67,11 @@ | |||
public static final String PADDING_SHORT_NAME = "P"; | |||
|
|||
@Argument( | |||
doc = "Length (in bp) of the bins.", | |||
doc = "Length (in bp) of the bins. If zero, no binning will be performed.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double space in front of 'If'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what our policy is on double spaces between sentences. @sooheelee @vdauwera?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving them for now, then.
@@ -120,6 +121,9 @@ private static IntervalList padAndMergeIntervals(final List<SimpleInterval> inpu | |||
} | |||
|
|||
private static IntervalList generateBins(final IntervalList preparedIntervalList, final int binLength, final SAMSequenceDictionary sequenceDictionary) { | |||
if (binLength == 0) { | |||
return IntervalList.copyOf(preparedIntervalList); | |||
} | |||
final IntervalList bins = new IntervalList(sequenceDictionary); | |||
for (final Interval interval : preparedIntervalList) { | |||
for (int binStart = interval.getStart(); binStart <= interval.getEnd(); binStart += binLength) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good!
@Argument( | ||
doc = "Threshold on z-score of non-log2 copy ratio used for calling segments. " + | ||
"If non-log2 copy ratio z-score is above this threshold for a copy-neutral segment, " + | ||
"it is considered an outlier and not used in the calculation of the length-weighted standard deviation " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should say 'length-weighted mean and standard deviation used for calling the segment.'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, done.
@@ -161,7 +161,7 @@ task CreatePanelOfNormals { | |||
--truncatePercentileThreshold 0.1 \ | |||
--noQC ${default="false" no_qc} \ | |||
--output ${pon_entity_id}.pon | |||
} | |||
>>> | |||
|
|||
runtime { | |||
docker: "${gatk_docker}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Read the code through, looks good.
7992f64
to
6771284
Compare
Thanks all, addressed comments and rebased. Will merge after tests pass unless there are any objections. |
Oops, pretty sure that I screwed up the rebase and that tests will fail. Will fix later tonight. |
…pynumber packages.
797ce64
to
05500d1
Compare
Some notes on individual commits:
Updated CallCopyRatioSegments and PreprocessIntervals; reorganized copynumber packages.
-For motivation of changes in CallCopyRatioSegments, see #3825.
-I added the ability to turn off binning in PreprocessIntervals by specifying bin_length = 0.
-I removed the separation between coverage and allelic packages to make the package structure a bit simpler.
-@MartonKN should review, since he wrote PreprocessIntervals and is updating the caller.
Added segmentation classes and tests for ModelSegments CNV pipeline.
-I added implementations of copy-ratio, allele-fraction, and "multidimensional" (joint) segmentation. All implementations are pretty boilerplate; they simply partition by contig and then call out to KernelSegmenter. Note that there is some logic in multidimensional segmentation that only uses the first het in each copy-ratio interval and if any are available, and imputes the alt-allele fraction to 0.5 if not.
-Makes sense for @mbabadi to review this, since he reviewed the KernelSegmenter PR.
Added modeling classes and tests for ModelSegments CNV pipeline.
-Most of this code is copied from the old MCMC code. However, I've done some overall code cleanup and refactoring, especially to remove some overextraction of methods in the allele-fraction likelihoods (see #2860). I also added downsampling and scaling of likelihoods to cut down on runtime. Tests have been simplified and rewritten to use simulated data.
-@LeeTL1220 do you think you could take a look?
Added ModelSegments CLI.
-Mostly control flow to handle optional inputs and validation, but there is some ugly and not well documented code that essentially does the GetHetCoverage step. We'll refactor later, I filed #3915.
-@asmirnov239 can review. This is lower priority than the gCNV VCF writing.
Deleted gCNV WDL and Cromwell tests.
-Trivial to review.
Added WDL and Cromwell tests for ModelSegments CNV pipeline.
-This includes the cost optimizations from @meganshand and @jsotobroad (sorry guys, I wasn't sure how to track your contributions while fixing up commits!) I also added tests for both GC/no-GC pair workflows.
-@MartonKN should review to gain familiarity with the WDL. Note that this WDL has already been through many revisions from @meganshand, @jsotobroad, and @LeeTL1220, so hopefully there shouldn't be too much for you to find serious fault with.
Note that I punted on adding MultidimensionalKernelSegmenterUnitTest and ModelSegmentsIntegrationTest. Filed #3916.
Closes #2858. (FINALLY!)
Closes #3825.
Closes #3661.