Make and evaluate simple changes to ReCapSegCaller. #3825

samuelklee · 2017-11-13T22:03:35Z

Some preliminary evaluation of the new ModelSegments pipeline on CRSP samples has revealed some weaknesses in the ReCapSeg caller, which was simply ported from the old pipeline.

I think there are a lot of confusing things going on:

For determining copy-neutral segments, all segments with a log2 mean below some threshold are used (rather than all segments with an absolute log2 mean below it). There is a comment that this is done to "mimic the python code", but I have no idea why this would be sensible, since it treats all deletions as copy-neutral.
There is some confusion arising from inconsistent use of z-score and T-statistic. Standard deviation, rather than standard error, is used for calling. That is, a "called segment" is one whose mean log2 copy ratio has a z-score above some threshold with respect to the standard deviation of the log2 copy ratios of intervals that fall within copy-neutral segments (note also that these intervals have already been filtered by z-score to remove outliers). In other words, any segment with a mean that falls sufficiently within the fuzziness of the caterpillar is not called.
However, even calling using standard error is probably not what we want. This would simply be asking the question: given a population of copy-neutral intervals with a mean and standard deviation, does any non-copy-neutral segment contain intervals with a mean significantly different from that of the population? We've already answered this question during segmentation!
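
To make the distinction between the two statistics concrete, here is a sketch in notation that is my own and does not appear in the code. Assume $\mu_0$ and $\sigma_0$ are the mean and standard deviation of the log2 copy ratios of the (outlier-filtered) intervals in copy-neutral segments, and that segment $s$ contains $n_s$ intervals with mean log2 copy ratio $\bar{x}_s$. Calling on standard deviation versus standard error then compares these two quantities to a threshold:

```latex
z_s^{\mathrm{SD}} = \frac{\bar{x}_s - \mu_0}{\sigma_0},
\qquad
z_s^{\mathrm{SE}} = \frac{\bar{x}_s - \mu_0}{\sigma_0 / \sqrt{n_s}}
                  = \sqrt{n_s}\, z_s^{\mathrm{SD}}
```

The first does not depend on segment length at all, while the second grows as $\sqrt{n_s}$, so for long segments it mostly restates that the segment mean differs from the copy-neutral mean.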

I think what we want to do instead is ask questions about the population of segment-level copy-ratio estimates, weighted by length.

samuelklee · 2017-11-15T13:22:48Z

Wrote a SimpleCopyRatioCaller that is still relatively naive, but I think a bit more sensible than ReCapSegCaller. It does the following:

use the non-log2 mean copy ratio to determine copy-neutral segments (those within 1 +/- x, where x is an exposed parameter),
weight segments by length for determining the mean and standard deviation of the non-log2 copy ratio in copy-neutral segments,
filter outlier copy-neutral segments by non-log2 copy ratio z-score,
use the filtered copy-neutral segments to determine a length-weighted mean and standard deviation,
call remaining segments using z-score based on this mean and standard deviation.
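
The steps above can be sketched roughly as follows. This is a minimal NumPy illustration, not the actual SimpleCopyRatioCaller code: the function names, the parameter names, and all default values are hypothetical, and "remaining segments" is assumed to mean every segment that did not survive the copy-neutral filtering.

```python
import numpy as np


def weighted_mean_std(values, weights):
    """Length-weighted mean and (population) standard deviation."""
    mu = np.average(values, weights=weights)
    var = np.average((values - mu) ** 2, weights=weights)
    return mu, np.sqrt(var)


def call_segments(mean_cr, lengths, neutral_halfwidth=0.1,
                  outlier_z=2.0, calling_z=2.0):
    """Return '+', '-', or '0' for each segment.

    mean_cr: non-log2 mean copy ratio per segment.
    lengths: segment lengths, used as weights.
    All thresholds are illustrative defaults, not values from the tool.
    """
    mean_cr = np.asarray(mean_cr, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    # Step 1: candidate copy-neutral segments lie within 1 +/- x.
    neutral = np.abs(mean_cr - 1.0) <= neutral_halfwidth
    if not neutral.any():
        raise ValueError("no copy-neutral segments found")

    # Step 2: length-weighted statistics of the candidates.
    mu, sigma = weighted_mean_std(mean_cr[neutral], lengths[neutral])

    # Step 3: drop candidates that are z-score outliers.
    keep = neutral & (np.abs(mean_cr - mu) <= outlier_z * sigma)

    # Step 4: recompute the statistics on the filtered candidates.
    mu, sigma = weighted_mean_std(mean_cr[keep], lengths[keep])

    # Step 5: call the remaining segments by z-score.
    calls = np.full(mean_cr.shape, "0", dtype="<U1")
    rest = ~keep
    calls[rest & (mean_cr > mu + calling_z * sigma)] = "+"
    calls[rest & (mean_cr < mu - calling_z * sigma)] = "-"
    return calls
```

For example, `call_segments([1.0, 1.02, 0.98, 1.5, 0.5, 1.05], [100, 200, 150, 50, 80, 10])` treats the three long segments near 1 as copy-neutral. Note that in this sketch a short segment that starts out as a copy-neutral candidate but is filtered as an outlier goes back into the pool of segments to be called.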

@MartonKN take note of these changes! I am sure that your caller will still do much better, especially given the allele-fraction data.

samuelklee self-assigned this Nov 13, 2017

samuelklee added the Copy Number tools label Nov 13, 2017

samuelklee mentioned this issue Nov 13, 2017

CNV TODOs before release. #3826

Closed

24 tasks

samuelklee mentioned this issue Dec 5, 2017

Added code and WDL to complete ModelSegments CNV pipeline. #3913

Merged

samuelklee closed this as completed in #3913 Dec 14, 2017
