Make and evaluate simple changes to ReCapSegCaller. #3825

Closed
samuelklee opened this issue Nov 13, 2017 · 1 comment · Fixed by #3913

samuelklee commented Nov 13, 2017

Some preliminary evaluation of the new ModelSegments pipeline on CRSP samples has revealed some weaknesses in the ReCapSeg caller (which is simply ported from the old pipeline).

I think there are a lot of confusing things going on:

  1. For determining copy-neutral segments, all segments with log2 mean below some threshold are used (rather than those with absolute log2 mean below the threshold). There is a comment that this is done to "mimic the python code" but I have no idea why this would be sensible, since it includes all deletions.
  2. There is some confusion arising from inconsistent use of z-score and t-statistic. Standard deviation, rather than standard error, is used for calling; i.e., a "called segment" is one whose mean log2 copy ratio has a z-score above some threshold with respect to the standard deviation of the log2 copy ratios of intervals that fall within copy-neutral segments (note also that these intervals have already been filtered by z-score to remove outliers). That is, any segment with a mean that falls sufficiently within the fuzziness of the caterpillar is not called.
  3. However, even calling using standard error is probably not what we want. This would simply be asking the question: given a population of copy-neutral intervals with a mean and standard deviation, does any non-copy-neutral segment contain intervals with a mean significantly different from the population mean? We've already answered this question during segmentation!
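
To make points 1 and 2 concrete, here is a minimal Python sketch. It is not the ReCapSegCaller code; the function names, the threshold value, and the example numbers are all hypothetical and only illustrate the two distinctions (signed vs. absolute log2 threshold, and standard deviation vs. standard error in the denominator).

```python
import math


def is_copy_neutral_signed(log2_mean, threshold):
    # Behaviour described in point 1: only an upper bound is applied,
    # so every deletion (negative log2 mean) counts as copy neutral.
    return log2_mean < threshold


def is_copy_neutral_absolute(log2_mean, threshold):
    # Alternative: bound the absolute log2 mean, so deletions are excluded.
    return abs(log2_mean) < threshold


def z_score_sd(segment_mean, neutral_mean, neutral_sd):
    # Distance of the segment mean from the copy-neutral mean,
    # in units of the interval-level standard deviation.
    return (segment_mean - neutral_mean) / neutral_sd


def z_score_se(segment_mean, neutral_mean, neutral_sd, n_intervals):
    # Same distance in units of the standard error of a mean
    # taken over n_intervals intervals.
    return (segment_mean - neutral_mean) / (neutral_sd / math.sqrt(n_intervals))
```

With a hypothetical threshold of 0.1, a deletion with log2 mean -1.0 passes the signed test but fails the absolute one. Likewise, a 100-interval segment with mean 0.2 against a copy-neutral population with mean 0 and standard deviation 0.4 has a z-score of 0.5 under the standard-deviation definition but about 5 under the standard-error definition.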

I think what we want to do instead is ask questions about the population of segment-level copy-ratio estimates, weighted by length.
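
One way to write down "weighted by length", as an assumption rather than a settled definition: let segment $i$ have copy-ratio estimate $c_i$ and length $L_i$ (whether length means bases or number of intervals is left open here). A length-weighted mean and standard deviation over a set $S$ of segments could then be

$$
\mu_w = \frac{\sum_{i \in S} L_i \, c_i}{\sum_{i \in S} L_i},
\qquad
\sigma_w = \sqrt{\frac{\sum_{i \in S} L_i \, (c_i - \mu_w)^2}{\sum_{i \in S} L_i}},
$$

and a segment $j$ would be scored by $z_j = (c_j - \mu_w) / \sigma_w$. This uses the simple (biased) weighted variance; an unbiased correction is possible but is not assumed here.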

@samuelklee

Wrote a SimpleCopyRatioCaller that is still relatively naive, but I think a bit more sensible than ReCapSegCaller. It does the following:

  1. use the non-log2 mean copy ratio to determine copy-neutral segments (those within 1 +/- x, where x is an exposed parameter),
  2. weight segments by length for determining the mean and standard deviation of the non-log2 copy ratio in copy-neutral segments,
  3. filter outlier copy-neutral segments by non-log2 copy ratio z-score,
  4. use the filtered copy-neutral segments to determine a length-weighted mean and standard deviation,
  5. call remaining segments using z-score based on this mean and standard deviation.
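
The five steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual SimpleCopyRatioCaller: the function names, the default parameter values, the `'+'`/`'-'`/`'0'` call labels, and the use of a simple weighted population standard deviation are all hypothetical choices made for the example.

```python
import math


def weighted_mean_sd(values, weights):
    """Length-weighted mean and (population) standard deviation."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(variance)


def z_score(value, mean, sd):
    """z-score that degrades gracefully when the spread is zero."""
    if sd == 0.0:
        if value == mean:
            return 0.0
        return math.copysign(math.inf, value - mean)
    return (value - mean) / sd


def call_segments(segments, neutral_width=0.1, outlier_z=2.0, calling_z=2.0):
    """segments: list of (mean_copy_ratio, length) pairs, non-log2.

    Returns one call per segment: '+', '-' or '0'.
    All default thresholds are hypothetical.
    """
    # Step 1: copy-neutral candidates are those within 1 +/- neutral_width.
    neutral = [i for i, (cr, _) in enumerate(segments)
               if abs(cr - 1.0) <= neutral_width]
    if not neutral:
        raise ValueError("no copy-neutral segments found")

    def stats(indices):
        return weighted_mean_sd([segments[i][0] for i in indices],
                                [segments[i][1] for i in indices])

    # Step 2: length-weighted mean and standard deviation of the candidates.
    mean, sd = stats(neutral)
    # Step 3: drop candidates that are z-score outliers.
    kept = [i for i in neutral
            if abs(z_score(segments[i][0], mean, sd)) <= outlier_z]
    if not kept:  # only possible for very small outlier_z; fall back
        kept = neutral
    # Step 4: recompute the weighted statistics on the filtered set.
    mean, sd = stats(kept)
    # Step 5: filtered neutral segments are '0'; the rest are called by z-score.
    kept_set = set(kept)
    calls = []
    for i, (cr, _) in enumerate(segments):
        z = z_score(cr, mean, sd)
        if i in kept_set or abs(z) <= calling_z:
            calls.append("0")
        else:
            calls.append("+" if z > 0 else "-")
    return calls
```

For example, `call_segments([(1.0, 100), (1.02, 200), (0.98, 150), (1.5, 50), (0.5, 80)])` treats the first three segments as copy neutral and calls the last two as an amplification and a deletion, respectively.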

@MartonKN take note of these changes! I am sure that your caller will still do much better, especially given the allele-fraction data.
