-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enabled multisample segmentation in ModelSegments. #6499
Conversation
047ec14
to
b0239f1
Compare
Per comments on the Slack channel, this can also be used as a component in germline tagging for matched tumor-normal pairs. For the same series above: Here is the 100% normal run through ModelSegments, which yields 36 segments: Joint segmentation of the 100% normal and the 100% tumor yields 241 segments (up from 130 for the 100% tumor alone, as above). Using this joint segmentation for subsequent ModelSegments runs: For the 100% normal, this yields 88 segments (up from 36): For the 100% tumor, this yields 166 segments (up from 130): I haven't performed detailed validations, but some spot checking suggests that this actually mitigate oversegmentation while still increasing sensitivity to shared events. For example, there is a small 13-bin deletion in chr19 that is found when running the 100% normal alone, but gets broken up into two adjacent deletions when running the 100% tumor alone (probably just due to statistical noise in the copy ratios). When running jointly, the deletion does not get broken up. However, as discussed over Slack, we should probably run some scenarios with simulated data to check behavior---for example, how robust is the joint segmentation to some of the samples being noisy/oversegmented? There are lots of options for restructuring the workflow. We could potentially modify ModelSegments to take in the denoised copy ratios from the normal, when available, and add modeling of the normal and germline tagging to that tool. Or we could break things up into separate tools. @fleharty any opinions? Note that another benefit of using this joint segmentation for germline tagging is that common breakpoints will be shared. This obviates the need for a lot of the idiosyncratic code (in the experimental postprocessing tools) that deals with reconciling segmentations and combining breakpoints. In general, I think such code is extremely prone to off-by-one errors and should be avoided, if possible. See #5450 for a reminder of some of the remaining issues with that workflow. |
I'll also note that now may be a good time to address #5626. Right now we are throwing away a decent fraction of the allele-fraction data during segmentation, since we only use the first site in each copy-ratio bin. (We do use all the sites overlapping with the copy-ratio bins during modeling, though.) |
354121a
to
c2d34ae
Compare
c2d34ae
to
6f78819
Compare
Travis reported job failures from build 29804
|
Are we totally committed to this name? Rather than chop elbows and ankles into smaller pieces, I would have gone with |
Yeah, the goofy name is just a placeholder, as is the separate tool itself. Unless there are any objections (@fleharty @mwalker174?) or I run into any unforeseen snafus or parameter ambiguities along the way, I am going to roll the multisample-segmentation functionality into ModelSegments. We can toggle this functionality by passing multiple
This will perform both het genotyping and joint segmentation, but will yield a Picard interval-list Users can use this joint segmentation in their own downstream tools, but we can also allow ModelSegments to ingest it via in a new
Each scatter of ModelSegments will run as before, aside from skipping the segmentation step in favor of using the joint segmentation. We will repeat the het-genotyping step, but this is cheap and it's probably better to repeat it to make sure filtering is applied consistently. It would also require more changes to the command line to specify where to output the hets for each sample during multisample segmentation and to skip genotyping in each scatter, if we were to go that route. There are many possible combinations of inputs that need to be tested, but the same is already true of the current ModelSegments. Furthermore, there are slight wrinkles when running in tumor-only mode (i.e., when I think this structure sets us up nicely to accommodate germline tagging/filtering in the near future. We can still pass the Picard interval list containing the joint segmentation to the scatter for the normal, but can instead subsequently pass the *.modelBegin.seg result from the normal to the tumors. This modeled-segment file will have breakpoints identical to those from the joint segmentation (as opposed to the *.modelFinal.seg result, since that undergoes segment smoothing/merging), but will also contain the segment-level posteriors necessary for performing germline filtering. We will just need to add code to toggle on the type of |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
07f4778
to
02a5c27
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
a4fc769
to
5316a6f
Compare
This comment has been minimized.
This comment has been minimized.
5316a6f
to
28a6d21
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Added tests for the multisample segmenter and some unit tests for the genotyping code. @fleharty @mwalker174 this is ready for review! |
0a5bf73
to
15a4a8d
Compare
15a4a8d
to
21bc423
Compare
21bc423
to
1dba22a
Compare
@fleharty @mwalker174 Is this a PR you guys are interested in getting into an upcoming GATK release? |
@fleharty @mwalker174 can we close this out before it gets too stale? |
@fleharty @mwalker174 any ETA on this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@samuelklee
Several pretty trivial comments. Sorry for the extreme delay.
src/main/java/org/broadinstitute/hellbender/tools/copynumber/ModelSegments.java
Outdated
Show resolved
Hide resolved
.../broadinstitute/hellbender/tools/copynumber/arguments/CopyNumberArgumentValidationUtils.java
Outdated
Show resolved
Hide resolved
.../broadinstitute/hellbender/tools/copynumber/arguments/CopyNumberArgumentValidationUtils.java
Outdated
Show resolved
Hide resolved
.../broadinstitute/hellbender/tools/copynumber/arguments/CopyNumberArgumentValidationUtils.java
Show resolved
Hide resolved
.../broadinstitute/hellbender/tools/copynumber/arguments/CopyNumberArgumentValidationUtils.java
Show resolved
Hide resolved
...ents-wes-tumor-2-allelic-counts-SM-74P4M-v1-chr20-downsampled.deduplicated.allelicCounts.tsv
Show resolved
Hide resolved
...adinstitute/hellbender/tools/copynumber/arguments/SomaticSegmentationArgumentCollection.java
Show resolved
Hide resolved
...org/broadinstitute/hellbender/tools/copynumber/formats/metadata/SimpleLocatableMetadata.java
Show resolved
Hide resolved
...ute/hellbender/tools/copynumber/utils/genotyping/NaiveHeterozygousPileupGenotypingUtils.java
Outdated
Show resolved
Hide resolved
...ute/hellbender/tools/copynumber/segmentation/MultisampleMultidimensionalKernelSegmenter.java
Outdated
Show resolved
Hide resolved
…egmenter, removed unnecessary segment classes, and changed bounds of SimpleIntervalCollection and AnnotatedIntervalCollection.
1dba22a
to
b62fa3d
Compare
Finally getting around to this @fleharty, apologies! Decided to punt on most of your requests, but hopefully my reasoning is acceptable. Perhaps we can also get it in by release and @takutosato can use this version of ModelSegments in his CRSP evaluations (and maybe even verify there are no changes from the previous version in typical, single-sample copy-ratio + allele-fraction use---since from that perspective all code changes should just be a refactor---if he has time). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merge away! Nice to get this in finally!
Also extracted some argument collections and genotyping code (see #3915), fixed up some documentation, and did some refactoring to the Segmenter classes.
This is just a first implementation for evaluation and feedback. There is some redundant (but cheap) computation performed in the genotyping step and both the genotyping and segmentation steps are not optimized for memory use. However, since requirements are not onerous (probably around ~10GB memory and <10 minutes for ~10 typical WGS samples), it might not be worth fixing up at the expense of extra code.
Likewise, this implementation requires all inputs be available. We could relax this to allow optional dimensions of input (i.e., copy ratios or allele counts) and/or case-only mode (as in ModelSegments), at the expense of extra control-flow code.
One could also perform segmentation with an external tool and pass it to ModelSegments, as long as it is properly formatted.
Closes #2924.