Enabled multisample segmentation in ModelSegments. #6499

samuelklee · 2020-03-12T17:53:00Z

Also extracted some argument collections and genotyping code (see #3915), fixed up some documentation, and did some refactoring to the Segmenter classes.

This is just a first implementation for evaluation and feedback. There is some redundant (but cheap) computation performed in the genotyping step and both the genotyping and segmentation steps are not optimized for memory use. However, since requirements are not onerous (probably around ~10GB memory and <10 minutes for ~10 typical WGS samples), it might not be worth fixing up at the expense of extra code.

Likewise, this implementation requires all inputs be available. We could relax this to allow optional dimensions of input (i.e., copy ratios or allele counts) and/or case-only mode (as in ModelSegments), at the expense of extra control-flow code.

One could also perform segmentation with an external tool and pass it to ModelSegments, as long as it is properly formatted.

Closes #2924.

samuelklee · 2020-03-12T20:00:33Z

Here's an example of a potential workflow. We will use the one of the in silico purity series (mixed at the read/allele counts level) I created for the TH evaluation:

Here is the 100% tumor run through ModelSegments, which yields 130 segments:

Here is the 75% tumor + 25% normal mixture run through ModelSegments, which yields 83 segments:

Here is the 25% tumor + 75% normal mixture run through ModelSegments, which yields 50 segments:

We can compare against the new workflow, in which we run SegmentJointSamples to jointly segment on the two mixtures. This yields a joint-sample segmentation with 162 segments, which can be passed as an additional input to individual ModelSegments runs on the two mixtures. It is used as the initial segmentation for both runs, after which the usual modelling and smoothing steps are performed.

For the 75% tumor + 25% normal mixture, this yields 122 segments (up from 83):

For the 25% tumor + 75% normal mixture, this yields 105 segments (up from 50):

One could imagine that smoothing could be disabled (so that all samples retain the common segmentation after modeling) or made more aggressive (so that private events don't get inadvertently introduced into other samples due to noise, perhaps), depending on the use case.

It looks like the joint segmentation allows some additional events to be resolved, although I haven't done any rigorous evaluations. We could probably cook up some evaluations using simulated toy data or in silico mixtures, but there's really no reason why this shouldn't work decently well, especially if the kernel-segmentation method works well on a single sample for your data.

It would also be interesting to understand at which point changing segmentation parameters on a single sample can no longer yield the same performance as joint segmentation on a fixed number of samples; however, this is probably a function of various S/N ratios, and it might not be easy to characterize this behavior outside of toy data. The segmentation parameter space is big enough to make this unwieldy even for toy data, too.

Perhaps we can get some feedback from test users---not only on performance, but also on the structure of the new workflow. It might also be worth gauging whether a new WDL is warranted.

Otherwise, we just need to add some unit tests for correctness of the multisample-segmentation backend class, integration tests for plumbing of the new tool, and perhaps address some of the issues mentioned above. Then I'd say this is good to go.

samuelklee · 2020-03-13T15:05:02Z

Per comments on the Slack channel, this can also be used as a component in germline tagging for matched tumor-normal pairs. For the same series above:

Here is the 100% normal run through ModelSegments, which yields 36 segments:

Joint segmentation of the 100% normal and the 100% tumor yields 241 segments (up from 130 for the 100% tumor alone, as above). Using this joint segmentation for subsequent ModelSegments runs:

For the 100% normal, this yields 88 segments (up from 36):

For the 100% tumor, this yields 166 segments (up from 130):

I haven't performed detailed validations, but some spot checking suggests that this actually mitigate oversegmentation while still increasing sensitivity to shared events. For example, there is a small 13-bin deletion in chr19 that is found when running the 100% normal alone, but gets broken up into two adjacent deletions when running the 100% tumor alone (probably just due to statistical noise in the copy ratios). When running jointly, the deletion does not get broken up. However, as discussed over Slack, we should probably run some scenarios with simulated data to check behavior---for example, how robust is the joint segmentation to some of the samples being noisy/oversegmented?

There are lots of options for restructuring the workflow. We could potentially modify ModelSegments to take in the denoised copy ratios from the normal, when available, and add modeling of the normal and germline tagging to that tool. Or we could break things up into separate tools. @fleharty any opinions?

Note that another benefit of using this joint segmentation for germline tagging is that common breakpoints will be shared. This obviates the need for a lot of the idiosyncratic code (in the experimental postprocessing tools) that deals with reconciling segmentations and combining breakpoints. In general, I think such code is extremely prone to off-by-one errors and should be avoided, if possible. See #5450 for a reminder of some of the remaining issues with that workflow.

samuelklee · 2020-03-13T15:12:17Z

I'll also note that now may be a good time to address #5626. Right now we are throwing away a decent fraction of the allele-fraction data during segmentation, since we only use the first site in each copy-ratio bin. (We do use all the sites overlapping with the copy-ratio bins during modeling, though.)

gatk-bot · 2020-03-30T18:23:54Z

Travis reported job failures from build 29794
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
cloud	openjdk8	29794.1	logs

gatk-bot · 2020-03-30T19:40:51Z

Travis reported job failures from build 29804
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
R	openjdk8	29804.6	logs
integration	oraclejdk8	29804.12	logs
python	openjdk8	29804.5	logs
cloud	openjdk8	29804.1	logs
integration	openjdk11	29804.13	logs
cloud	openjdk11	29804.15	logs
integration	openjdk8	29804.2	logs
variantcalling	openjdk8	29804.4	logs

gatk-bot · 2020-04-01T01:06:58Z

Travis reported job failures from build 29830
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
integration	oraclejdk8	29830.12	logs
integration	openjdk11	29830.13	logs
cloud	openjdk8	29830.1	logs
cloud	openjdk11	29830.15	logs

ldgauthier · 2020-04-01T13:54:55Z

Are we totally committed to this name? Rather than chop elbows and ankles into smaller pieces, I would have gone with JointlySegmentSamples or MultisampleSegmentation.

samuelklee · 2020-04-01T15:21:06Z

Yeah, the goofy name is just a placeholder, as is the separate tool itself. Unless there are any objections (@fleharty @mwalker174?) or I run into any unforeseen snafus or parameter ambiguities along the way, I am going to roll the multisample-segmentation functionality into ModelSegments.

We can toggle this functionality by passing multiple --denoised-copy-ratios and --allelic-counts arguments, e.g.:

gatk ModelSegments 
  --normal-allelic-counts normal.allelicCounts.tsv (this is only used for het genotyping)
  --denoised-copy-ratios normal.denoisedCR.tsv
  --denoised-copy-ratios tumor-1.denoisedCR.tsv
  ...
  --denoised-copy-ratios tumor-N.denoisedCR.tsv
  --allelic-counts normal.allelicCounts.tsv
  --allelic-counts tumor-1.allelicCounts.tsv
  ...
  --allelic-counts tumor-N.allelicCounts.tsv \
  -O .
  --output-prefix joint-segmentation

This will perform both het genotyping and joint segmentation, but will yield a Picard interval-list joint-segmentation.interval_list as its sole output. (Although we could proceed to perform MCMC model inference on each sample in series, we'll stop at segmentation to enforce the scattering of inference across samples, which will be quicker.) We can also allow for copy-ratio-only and allelic-count-only modes.

Users can use this joint segmentation in their own downstream tools, but we can also allow ModelSegments to ingest it via in a new --segments argument:

gatk ModelSegments
  --normal-allelic-counts normal.allelicCounts.tsv (equivalently, we could omit this and adjust minimum-total-allele-count-case, as is done in the WDL)
  --allelic-counts normal.allelicCounts.tsv
  --denoised-copy-ratios normal.denoisedCR.tsv
  --segments joint-segmentation.interval_list
  -O .
  --output-prefix normal

gatk ModelSegments
  --normal-allelic-counts normal.allelicCounts.tsv
  --allelic-counts tumor-1.allelicCounts.tsv
  --denoised-copy-ratios tumor-1.denoisedCR.tsv
  --segments joint-segmentation.interval_list
  -O .
  --output-prefix tumor-1

  ...

Each scatter of ModelSegments will run as before, aside from skipping the segmentation step in favor of using the joint segmentation. We will repeat the het-genotyping step, but this is cheap and it's probably better to repeat it to make sure filtering is applied consistently. It would also require more changes to the command line to specify where to output the hets for each sample during multisample segmentation and to skip genotyping in each scatter, if we were to go that route.

There are many possible combinations of inputs that need to be tested, but the same is already true of the current ModelSegments. Furthermore, there are slight wrinkles when running in tumor-only mode (i.e., when --normal-allelic-counts are not available). Because each sample is genotyped indiviudally, each may yield a different set of hets (in contrast to genotyping in matched-normal mode, in which the normal determines the set of hets used in all samples). We will thus have to take the intersection of these hets before performing multisample segmentation. Unfortunately, we will not be able to re-perform this intersection in each scatter, since we will no longer have access to the hets from the other samples. However, we will ultimately intersect the hets from each sample with the joint segmentation before modeling, which may be a rough proxy for the intersection of hets from all samples. As always, tumor-only mode may yield suboptimal results in certain scenarios, e.g., high purity CNLOH. I think I'm OK with just documenting these wrinkles, rather than working too hard to iron them out.

I think this structure sets us up nicely to accommodate germline tagging/filtering in the near future. We can still pass the Picard interval list containing the joint segmentation to the scatter for the normal, but can instead subsequently pass the *.modelBegin.seg result from the normal to the tumors. This modeled-segment file will have breakpoints identical to those from the joint segmentation (as opposed to the *.modelFinal.seg result, since that undergoes segment smoothing/merging), but will also contain the segment-level posteriors necessary for performing germline filtering.

We will just need to add code to toggle on the type of --segments input (Picard interval list or modeled segments), add filtering arguments and code, perhaps output an additional seg file showing filter status for the *.modelBegin.seg segments, and modify the segment merging/smoothing code to properly account for filter status (so we don't incorrectly impute across filtered segments, which is currently done by the unsupported code, for whatever reason). I think this should clock in at well under ~5k lines of code, which is the count for the current unsupported code (see #5450 (comment)).

samuelklee · 2020-04-15T19:41:37Z

Added tests for the multisample segmenter and some unit tests for the genotyping code. @fleharty @mwalker174 this is ready for review!

droazen · 2020-06-17T15:35:22Z

@fleharty @mwalker174 Is this a PR you guys are interested in getting into an upcoming GATK release?

samuelklee · 2020-09-21T17:25:47Z

@fleharty @mwalker174 can we close this out before it gets too stale?

samuelklee · 2020-11-13T15:12:43Z

@fleharty @mwalker174 any ETA on this?

fleharty

@samuelklee
Several pretty trivial comments. Sorry for the extreme delay.

src/main/java/org/broadinstitute/hellbender/tools/copynumber/ModelSegments.java

.../broadinstitute/hellbender/tools/copynumber/arguments/CopyNumberArgumentValidationUtils.java

...ents-wes-tumor-2-allelic-counts-SM-74P4M-v1-chr20-downsampled.deduplicated.allelicCounts.tsv

...adinstitute/hellbender/tools/copynumber/arguments/SomaticSegmentationArgumentCollection.java

...org/broadinstitute/hellbender/tools/copynumber/formats/metadata/SimpleLocatableMetadata.java

...ute/hellbender/tools/copynumber/utils/genotyping/NaiveHeterozygousPileupGenotypingUtils.java

...ute/hellbender/tools/copynumber/segmentation/MultisampleMultidimensionalKernelSegmenter.java

…egmenter, removed unnecessary segment classes, and changed bounds of SimpleIntervalCollection and AnnotatedIntervalCollection.

samuelklee · 2021-06-09T15:35:32Z

Finally getting around to this @fleharty, apologies! Decided to punt on most of your requests, but hopefully my reasoning is acceptable. Perhaps we can also get it in by release and @takutosato can use this version of ModelSegments in his CRSP evaluations (and maybe even verify there are no changes from the previous version in typical, single-sample copy-ratio + allele-fraction use---since from that perspective all code changes should just be a refactor---if he has time).

fleharty

Merge away! Nice to get this in finally!

samuelklee force-pushed the sl_multisample_segmentation branch from 047ec14 to b0239f1 Compare March 12, 2020 20:01

samuelklee force-pushed the sl_multisample_segmentation branch 3 times, most recently from 354121a to c2d34ae Compare March 30, 2020 18:10

samuelklee force-pushed the sl_multisample_segmentation branch from c2d34ae to 6f78819 Compare March 30, 2020 19:20

This comment has been minimized.

Sign in to view

samuelklee force-pushed the sl_multisample_segmentation branch from 07f4778 to 02a5c27 Compare April 3, 2020 17:22

This comment has been minimized.

Sign in to view

samuelklee force-pushed the sl_multisample_segmentation branch 7 times, most recently from a4fc769 to 5316a6f Compare April 9, 2020 18:32

This comment has been minimized.

Sign in to view

samuelklee force-pushed the sl_multisample_segmentation branch from 5316a6f to 28a6d21 Compare April 9, 2020 19:03

This comment has been minimized.

Sign in to view

samuelklee force-pushed the sl_multisample_segmentation branch 3 times, most recently from 0a5bf73 to 15a4a8d Compare April 15, 2020 19:48

samuelklee requested review from mwalker174 and fleharty April 15, 2020 19:55

samuelklee force-pushed the sl_multisample_segmentation branch from 15a4a8d to 21bc423 Compare April 16, 2020 15:38

samuelklee assigned fleharty and mwalker174 Apr 16, 2020

samuelklee force-pushed the sl_multisample_segmentation branch from 21bc423 to 1dba22a Compare April 16, 2020 15:51

fleharty requested changes Feb 11, 2021

View reviewed changes

samuelklee added 7 commits June 9, 2021 10:56

Fixed minor documentation and miscellaneous issues.

89944c8

Extracted somatic CNV argument collections.

38612e3

Extracted genotyping utility method and added unit tests.

3b75111

Refactored KernelSegmenters to use MultisampleMultidimensionalKernelS…

5419e2e

…egmenter, removed unnecessary segment classes, and changed bounds of SimpleIntervalCollection and AnnotatedIntervalCollection.

Added test resources for ModelSegmentsIntegrationTest.

6ab8b74

Added utility methods and minor fixes for validation.

98cd553

Enabled multisample segmentation in ModelSegments.

b62fa3d

samuelklee force-pushed the sl_multisample_segmentation branch from 1dba22a to b62fa3d Compare June 9, 2021 14:57

samuelklee added 2 commits June 9, 2021 11:00

Optimized imports in copynumber packages using explicit imports.

12b003b

Addressed review comments.

eedf857

fleharty approved these changes Jun 11, 2021

View reviewed changes

samuelklee merged commit 1156be7 into master Jun 11, 2021

samuelklee deleted the sl_multisample_segmentation branch June 11, 2021 19:25

samuelklee mentioned this pull request Aug 17, 2021

Refactoring of ModelSegments Segmenter and Modeller backend classes. #5625

Closed

samuelklee mentioned this pull request Jan 27, 2022

Added numerical-stability tests and updated test data for all ModelSegments single-sample and multiple-sample modes. #7652

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enabled multisample segmentation in ModelSegments. #6499

Enabled multisample segmentation in ModelSegments. #6499

samuelklee commented Mar 12, 2020 •

edited

Loading

samuelklee commented Mar 12, 2020 •

edited

Loading

samuelklee commented Mar 13, 2020

samuelklee commented Mar 13, 2020 •

edited

Loading

gatk-bot commented Mar 30, 2020

gatk-bot commented Mar 30, 2020 •

edited

Loading

gatk-bot commented Apr 1, 2020 •

edited

Loading

ldgauthier commented Apr 1, 2020

samuelklee commented Apr 1, 2020 •

edited

Loading

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

samuelklee commented Apr 15, 2020

droazen commented Jun 17, 2020

samuelklee commented Sep 21, 2020

samuelklee commented Nov 13, 2020

fleharty left a comment

samuelklee commented Jun 9, 2021 •

edited

Loading

fleharty left a comment

Enabled multisample segmentation in ModelSegments. #6499

Enabled multisample segmentation in ModelSegments. #6499

Conversation

samuelklee commented Mar 12, 2020 • edited Loading

samuelklee commented Mar 12, 2020 • edited Loading

samuelklee commented Mar 13, 2020

samuelklee commented Mar 13, 2020 • edited Loading

gatk-bot commented Mar 30, 2020

gatk-bot commented Mar 30, 2020 • edited Loading

gatk-bot commented Apr 1, 2020 • edited Loading

ldgauthier commented Apr 1, 2020

samuelklee commented Apr 1, 2020 • edited Loading

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

samuelklee commented Apr 15, 2020

droazen commented Jun 17, 2020

samuelklee commented Sep 21, 2020

samuelklee commented Nov 13, 2020

fleharty left a comment

Choose a reason for hiding this comment

samuelklee commented Jun 9, 2021 • edited Loading

fleharty left a comment

Choose a reason for hiding this comment

samuelklee commented Mar 12, 2020 •

edited

Loading

samuelklee commented Mar 12, 2020 •

edited

Loading

samuelklee commented Mar 13, 2020 •

edited

Loading

gatk-bot commented Mar 30, 2020 •

edited

Loading

gatk-bot commented Apr 1, 2020 •

edited

Loading

samuelklee commented Apr 1, 2020 •

edited

Loading

samuelklee commented Jun 9, 2021 •

edited

Loading