Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enabled multisample segmentation in ModelSegments. #6499

Merged
merged 9 commits into from
Jun 11, 2021

Conversation

samuelklee
Copy link
Contributor

@samuelklee samuelklee commented Mar 12, 2020

Also extracted some argument collections and genotyping code (see #3915), fixed up some documentation, and did some refactoring to the Segmenter classes.

This is just a first implementation for evaluation and feedback. There is some redundant (but cheap) computation performed in the genotyping step and both the genotyping and segmentation steps are not optimized for memory use. However, since requirements are not onerous (probably around ~10GB memory and <10 minutes for ~10 typical WGS samples), it might not be worth fixing up at the expense of extra code.

Likewise, this implementation requires all inputs be available. We could relax this to allow optional dimensions of input (i.e., copy ratios or allele counts) and/or case-only mode (as in ModelSegments), at the expense of extra control-flow code.

One could also perform segmentation with an external tool and pass it to ModelSegments, as long as it is properly formatted.

Closes #2924.

@samuelklee
Copy link
Contributor Author

samuelklee commented Mar 12, 2020

Here's an example of a potential workflow. We will use the one of the in silico purity series (mixed at the read/allele counts level) I created for the TH evaluation:

Here is the 100% tumor run through ModelSegments, which yields 130 segments:
T modeled

Here is the 75% tumor + 25% normal mixture run through ModelSegments, which yields 83 segments:
N-25-T-75 modeled

Here is the 25% tumor + 75% normal mixture run through ModelSegments, which yields 50 segments:
N-75-T-25 modeled

We can compare against the new workflow, in which we run SegmentJointSamples to jointly segment on the two mixtures. This yields a joint-sample segmentation with 162 segments, which can be passed as an additional input to individual ModelSegments runs on the two mixtures. It is used as the initial segmentation for both runs, after which the usual modelling and smoothing steps are performed.

For the 75% tumor + 25% normal mixture, this yields 122 segments (up from 83):
N-25-T-75-SJS modeled

For the 25% tumor + 75% normal mixture, this yields 105 segments (up from 50):
N-75-T-25-SJS modeled

One could imagine that smoothing could be disabled (so that all samples retain the common segmentation after modeling) or made more aggressive (so that private events don't get inadvertently introduced into other samples due to noise, perhaps), depending on the use case.

It looks like the joint segmentation allows some additional events to be resolved, although I haven't done any rigorous evaluations. We could probably cook up some evaluations using simulated toy data or in silico mixtures, but there's really no reason why this shouldn't work decently well, especially if the kernel-segmentation method works well on a single sample for your data.

It would also be interesting to understand at which point changing segmentation parameters on a single sample can no longer yield the same performance as joint segmentation on a fixed number of samples; however, this is probably a function of various S/N ratios, and it might not be easy to characterize this behavior outside of toy data. The segmentation parameter space is big enough to make this unwieldy even for toy data, too.

Perhaps we can get some feedback from test users---not only on performance, but also on the structure of the new workflow. It might also be worth gauging whether a new WDL is warranted.

Otherwise, we just need to add some unit tests for correctness of the multisample-segmentation backend class, integration tests for plumbing of the new tool, and perhaps address some of the issues mentioned above. Then I'd say this is good to go.

@samuelklee
Copy link
Contributor Author

Per comments on the Slack channel, this can also be used as a component in germline tagging for matched tumor-normal pairs. For the same series above:

Here is the 100% normal run through ModelSegments, which yields 36 segments:
N modeled

Joint segmentation of the 100% normal and the 100% tumor yields 241 segments (up from 130 for the 100% tumor alone, as above). Using this joint segmentation for subsequent ModelSegments runs:

For the 100% normal, this yields 88 segments (up from 36):
N-SJS modeled

For the 100% tumor, this yields 166 segments (up from 130):
T-SJS modeled

I haven't performed detailed validations, but some spot checking suggests that this actually mitigate oversegmentation while still increasing sensitivity to shared events. For example, there is a small 13-bin deletion in chr19 that is found when running the 100% normal alone, but gets broken up into two adjacent deletions when running the 100% tumor alone (probably just due to statistical noise in the copy ratios). When running jointly, the deletion does not get broken up. However, as discussed over Slack, we should probably run some scenarios with simulated data to check behavior---for example, how robust is the joint segmentation to some of the samples being noisy/oversegmented?

There are lots of options for restructuring the workflow. We could potentially modify ModelSegments to take in the denoised copy ratios from the normal, when available, and add modeling of the normal and germline tagging to that tool. Or we could break things up into separate tools. @fleharty any opinions?

Note that another benefit of using this joint segmentation for germline tagging is that common breakpoints will be shared. This obviates the need for a lot of the idiosyncratic code (in the experimental postprocessing tools) that deals with reconciling segmentations and combining breakpoints. In general, I think such code is extremely prone to off-by-one errors and should be avoided, if possible. See #5450 for a reminder of some of the remaining issues with that workflow.

@samuelklee
Copy link
Contributor Author

samuelklee commented Mar 13, 2020

I'll also note that now may be a good time to address #5626. Right now we are throwing away a decent fraction of the allele-fraction data during segmentation, since we only use the first site in each copy-ratio bin. (We do use all the sites overlapping with the copy-ratio bins during modeling, though.)

@samuelklee samuelklee force-pushed the sl_multisample_segmentation branch 3 times, most recently from 354121a to c2d34ae Compare March 30, 2020 18:10
@gatk-bot
Copy link

Travis reported job failures from build 29794
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 29794.1 logs

@gatk-bot
Copy link

gatk-bot commented Mar 30, 2020

Travis reported job failures from build 29804
Failures in the following jobs:

Test Type JDK Job ID Logs
R openjdk8 29804.6 logs
integration oraclejdk8 29804.12 logs
python openjdk8 29804.5 logs
cloud openjdk8 29804.1 logs
integration openjdk11 29804.13 logs
cloud openjdk11 29804.15 logs
integration openjdk8 29804.2 logs
variantcalling openjdk8 29804.4 logs

@gatk-bot
Copy link

gatk-bot commented Apr 1, 2020

Travis reported job failures from build 29830
Failures in the following jobs:

Test Type JDK Job ID Logs
integration oraclejdk8 29830.12 logs
integration openjdk11 29830.13 logs
cloud openjdk8 29830.1 logs
cloud openjdk11 29830.15 logs

@ldgauthier
Copy link
Contributor

Are we totally committed to this name? Rather than chop elbows and ankles into smaller pieces, I would have gone with JointlySegmentSamples or MultisampleSegmentation.

@samuelklee
Copy link
Contributor Author

samuelklee commented Apr 1, 2020

Yeah, the goofy name is just a placeholder, as is the separate tool itself. Unless there are any objections (@fleharty @mwalker174?) or I run into any unforeseen snafus or parameter ambiguities along the way, I am going to roll the multisample-segmentation functionality into ModelSegments.

We can toggle this functionality by passing multiple --denoised-copy-ratios and --allelic-counts arguments, e.g.:

gatk ModelSegments 
  --normal-allelic-counts normal.allelicCounts.tsv (this is only used for het genotyping)
  --denoised-copy-ratios normal.denoisedCR.tsv
  --denoised-copy-ratios tumor-1.denoisedCR.tsv
  ...
  --denoised-copy-ratios tumor-N.denoisedCR.tsv
  --allelic-counts normal.allelicCounts.tsv
  --allelic-counts tumor-1.allelicCounts.tsv
  ...
  --allelic-counts tumor-N.allelicCounts.tsv \
  -O .
  --output-prefix joint-segmentation

This will perform both het genotyping and joint segmentation, but will yield a Picard interval-list joint-segmentation.interval_list as its sole output. (Although we could proceed to perform MCMC model inference on each sample in series, we'll stop at segmentation to enforce the scattering of inference across samples, which will be quicker.) We can also allow for copy-ratio-only and allelic-count-only modes.

Users can use this joint segmentation in their own downstream tools, but we can also allow ModelSegments to ingest it via in a new --segments argument:

gatk ModelSegments
  --normal-allelic-counts normal.allelicCounts.tsv (equivalently, we could omit this and adjust minimum-total-allele-count-case, as is done in the WDL)
  --allelic-counts normal.allelicCounts.tsv
  --denoised-copy-ratios normal.denoisedCR.tsv
  --segments joint-segmentation.interval_list
  -O .
  --output-prefix normal

gatk ModelSegments
  --normal-allelic-counts normal.allelicCounts.tsv
  --allelic-counts tumor-1.allelicCounts.tsv
  --denoised-copy-ratios tumor-1.denoisedCR.tsv
  --segments joint-segmentation.interval_list
  -O .
  --output-prefix tumor-1

  ...

Each scatter of ModelSegments will run as before, aside from skipping the segmentation step in favor of using the joint segmentation. We will repeat the het-genotyping step, but this is cheap and it's probably better to repeat it to make sure filtering is applied consistently. It would also require more changes to the command line to specify where to output the hets for each sample during multisample segmentation and to skip genotyping in each scatter, if we were to go that route.

There are many possible combinations of inputs that need to be tested, but the same is already true of the current ModelSegments. Furthermore, there are slight wrinkles when running in tumor-only mode (i.e., when --normal-allelic-counts are not available). Because each sample is genotyped indiviudally, each may yield a different set of hets (in contrast to genotyping in matched-normal mode, in which the normal determines the set of hets used in all samples). We will thus have to take the intersection of these hets before performing multisample segmentation. Unfortunately, we will not be able to re-perform this intersection in each scatter, since we will no longer have access to the hets from the other samples. However, we will ultimately intersect the hets from each sample with the joint segmentation before modeling, which may be a rough proxy for the intersection of hets from all samples. As always, tumor-only mode may yield suboptimal results in certain scenarios, e.g., high purity CNLOH. I think I'm OK with just documenting these wrinkles, rather than working too hard to iron them out.

I think this structure sets us up nicely to accommodate germline tagging/filtering in the near future. We can still pass the Picard interval list containing the joint segmentation to the scatter for the normal, but can instead subsequently pass the *.modelBegin.seg result from the normal to the tumors. This modeled-segment file will have breakpoints identical to those from the joint segmentation (as opposed to the *.modelFinal.seg result, since that undergoes segment smoothing/merging), but will also contain the segment-level posteriors necessary for performing germline filtering.

We will just need to add code to toggle on the type of --segments input (Picard interval list or modeled segments), add filtering arguments and code, perhaps output an additional seg file showing filter status for the *.modelBegin.seg segments, and modify the segment merging/smoothing code to properly account for filter status (so we don't incorrectly impute across filtered segments, which is currently done by the unsupported code, for whatever reason). I think this should clock in at well under ~5k lines of code, which is the count for the current unsupported code (see #5450 (comment)).

@gatk-bot

This comment has been minimized.

@gatk-bot

This comment has been minimized.

@gatk-bot

This comment has been minimized.

@gatk-bot

This comment has been minimized.

@samuelklee samuelklee force-pushed the sl_multisample_segmentation branch 7 times, most recently from a4fc769 to 5316a6f Compare April 9, 2020 18:32
@gatk-bot

This comment has been minimized.

@gatk-bot

This comment has been minimized.

@gatk-bot

This comment has been minimized.

@gatk-bot

This comment has been minimized.

@samuelklee
Copy link
Contributor Author

Added tests for the multisample segmenter and some unit tests for the genotyping code. @fleharty @mwalker174 this is ready for review!

@samuelklee samuelklee force-pushed the sl_multisample_segmentation branch 3 times, most recently from 0a5bf73 to 15a4a8d Compare April 15, 2020 19:48
@droazen
Copy link
Contributor

droazen commented Jun 17, 2020

@fleharty @mwalker174 Is this a PR you guys are interested in getting into an upcoming GATK release?

@samuelklee
Copy link
Contributor Author

@fleharty @mwalker174 can we close this out before it gets too stale?

@samuelklee
Copy link
Contributor Author

@fleharty @mwalker174 any ETA on this?

Copy link
Contributor

@fleharty fleharty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samuelklee
Several pretty trivial comments. Sorry for the extreme delay.

@samuelklee samuelklee force-pushed the sl_multisample_segmentation branch from 1dba22a to b62fa3d Compare June 9, 2021 14:57
@samuelklee
Copy link
Contributor Author

samuelklee commented Jun 9, 2021

Finally getting around to this @fleharty, apologies! Decided to punt on most of your requests, but hopefully my reasoning is acceptable. Perhaps we can also get it in by release and @takutosato can use this version of ModelSegments in his CRSP evaluations (and maybe even verify there are no changes from the previous version in typical, single-sample copy-ratio + allele-fraction use---since from that perspective all code changes should just be a refactor---if he has time).

Copy link
Contributor

@fleharty fleharty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge away! Nice to get this in finally!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multi-sample CNV segmentation
6 participants