You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Some preliminary evaluation of the new ModelSegments pipeline on CRSP samples has revealed some weaknesses of the ReCapSeg caller (which is simply ported from the old pipeline) to me.
I think there are a lot of confusing things going on:
For determining copy-neutral segments, all segments with log2 mean below some threshold are used (rather than absolute log2). There is a comment that this is done to "mimic the python code" but I have no idea why this would be sensible, since it includes all deletions.
There is some confusion arising from inconsistent use of z-score and T-statistic. Standard deviation, rather than standard error, is used for calling; i.e., a "called segment" is one that has a mean log2 copy ratio that has a z-score above some threshold with respect to the standard deviation of the log2 copy ratios of intervals that fall within copy-neutral segments (note also that these intervals have already been filtered by z-score to remove outliers). That is, any segment with a mean that falls sufficiently within the fuzziness of the caterpillar is not called.
However, even calling using standard error is probably not what we want. This would simply be asking the question: given a population of copy-neutral intervals with a mean and standard deviation, does any non-copy-neutral segment contain intervals with a mean significantly different than the population? We've already answered this question during segmentation!
I think what we want to do instead is ask questions about the population of segment-level copy-ratio estimates, weighted by length.
The text was updated successfully, but these errors were encountered:
Some preliminary evaluation of the new ModelSegments pipeline on CRSP samples has revealed some weaknesses of the ReCapSeg caller (which is simply ported from the old pipeline) to me.
I think there are a lot of confusing things going on:
I think what we want to do instead is ask questions about the population of segment-level copy-ratio estimates, weighted by length.
The text was updated successfully, but these errors were encountered: