Questions about drop_outlier_features #237

Closed
bethac07 opened this issue Oct 31, 2022 · 11 comments · Fixed by #238
Labels
question Further information is requested

Comments

@bethac07
Member

I was looking at some code @fefossa is writing and came across drop_outlier_features, which was added in #62 but somehow escaped my notice (it isn't turned on in the profiling template, for example). This is potentially super useful to us, but I was wondering about the default value (15). I know I'm asking about decisions made in December 2019, a far simpler time, but do you recall how this value was chosen? It doesn't appear at first pass to be documented, and it seems to me to be WAY too low - we routinely see features we believe are real with absolute values above that cutoff. Fernanda is going to check some profiles with and without that feature-reduction step added to get a sense of how much effect it is having on her data, but it is possibly significant.

I know the parameter is configurable, so we can override it if we'd like, but in general I assume we want defaults to be broadly applicable across the vast majority of use cases, and I don't know that this one is.
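
For context, here is a minimal sketch of the call under discussion. It assumes pycytominer's feature_select exposes the operation list and outlier_cutoff keyword used elsewhere in this thread; the input path is a placeholder.

```python
# Minimal sketch (not the actual profiling-template code): run feature selection
# with the drop_outliers operation and its outlier_cutoff parameter.
import pandas as pd
from pycytominer import feature_select

profiles = pd.read_csv("plate_normalized.csv.gz")  # placeholder path

selected = feature_select(
    profiles,
    operation=[
        "variance_threshold",
        "correlation_threshold",
        "drop_na_columns",
        "blocklist",
        "drop_outliers",  # the step in question
    ],
    outlier_cutoff=15,  # current default; features with absolute values beyond this get dropped
)
print(profiles.shape, "->", selected.shape)
```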

@bethac07 bethac07 added the question Further information is requested label Oct 31, 2022
@gwaybio
Member

gwaybio commented Oct 31, 2022

I don't recall our rationale for selecting defaults, but it might have been influenced by pooled cell painting. See #61, which references a private data repo

@bethac07
Member Author

Yes, based on the link I suspected that might have been it. It sounds like you have no particular attachment to the SPECIFIC number, then, and would be OK with a PR proposing to change the default value? If so, I'll check with the team to figure out whether there's a reasonably straightforward way to propose an evidence-based default; otherwise I'm just going to say "50" or "100".

@gwaybio
Member

gwaybio commented Oct 31, 2022

Yes, definitely no attachment. @niranjchandrasekaran do you have any memories or thoughts here?

@niranjchandrasekaran
Member

Pasting what I said in the internal slack thread

We had a discussion about this in the last profiling check-in. Yu was wondering whether to use this for feature selection. She found that using the default value of 15 resulted in dropping an additional 500 features, which is a lot (~5900 -> ~650 without drop_outlier -> ~150 with drop_outlier). Changing this threshold is likely the right thing to do.

I think the default should be higher, but I don't know how much higher it should be, without looking at some datasets.

@fefossa

fefossa commented Nov 1, 2022

When I added drop_outliers to the default operations, i.e. operation = ['correlation_threshold', 'variance_threshold', 'drop_na_columns', 'blocklist', 'drop_outliers'], it removed 440 more features than without drop_outliers. With outlier_cutoff = 50 it removes an additional 66 features, and with outlier_cutoff = 100 it removes an additional 23 features.
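
A rough sketch of how a comparison like this could be run, assuming the same feature_select call sketched earlier in the thread (the input path is a placeholder):

```python
# Count surviving feature columns at several outlier_cutoff values.
import pandas as pd
from pycytominer import feature_select

profiles = pd.read_csv("plate_normalized.csv.gz")  # placeholder path

ops = ["correlation_threshold", "variance_threshold", "drop_na_columns", "blocklist", "drop_outliers"]

for cutoff in (15, 50, 100):
    selected = feature_select(profiles, operation=ops, outlier_cutoff=cutoff)
    print(f"outlier_cutoff={cutoff}: {selected.shape[1]} columns remain")
```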

@bethac07
Member Author

bethac07 commented Nov 1, 2022

We did some initial internal testing on both arrayed and pooled data, and the preliminary results in both are that this default throws out far too much data - we're still digging, but it seems a value in the hundreds is more appropriate. We will make a PR when we dig more, but I wanted to let you know ASAP for your own data.

@yhan8

yhan8 commented Nov 2, 2022

For my A549 four-compound replicate plates, when I perform feature selection with operation=['variance_threshold','correlation_threshold','drop_na_columns','blocklist'], my feature count drops from 5800+ to 650+; however, if I also add 'drop_outliers', it drops to 130+.

@bethac07
Member Author

bethac07 commented Nov 3, 2022

So I ran some Jupyter notebooks on the JUMP pilot data - here are the top-level results. I have a lot more in the notebooks and the files exported from them, if folks want a deeper dive.
These are the distributions of the values from the bottom/top ~2 wells (left) or 1 well (right) for 314 plates of JUMP pilot data - feature selected both after whole-plate normalization and after negcon normalization.
[Figure: FeatureDistributions_feature_selected]

Note that the X axes aren't forced here - if you log-norm the Y axis in addition to the X, you do have trailing tails all the way out to sometimes 10^17, so this feature is undeniably a good idea; it just doesn't have the right default. Normalized negcon is always higher here than normalized, which is good and sensible.
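
A sketch of how a distribution like this could be plotted with log-scaled axes; the dataframe and column name are assumptions mirroring the labels below, and the input path is a placeholder:

```python
# Histogram of absolute extreme feature values with log-scaled axes,
# so the long tail out toward ~1e17 stays visible.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

extremes = pd.read_csv("feature_extremes_across_plates.csv")  # placeholder path
values = extremes["fs_normalized_lowest"].abs()

plt.hist(values, bins=np.logspace(0, 18, 60))
plt.xscale("log")
plt.yscale("log")
plt.xlabel("absolute feature value")
plt.ylabel("count")
plt.savefig("feature_distribution_log.png")
```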

Based on the rough breakdowns below, I think a setting of about 100 is reasonable (it likely keeps 98%-99% of features after selection), though to be cautious I might feel even more confident in the middle hundreds (~300 or ~500).

Percentiles of the absolute value - 2 wells

| Metric | 5% | 25% | 50% | 75% | 95% | 99% | 99.9% | 99.99% |
|---|---|---|---|---|---|---|---|---|
| fs_normalized_lowest | 1.8973 | 2.9956 | 4.8945 | 8.81885 | 22.149 | 59.17057999999967 | 1533.5250000001804 | 2.1023871399963683e+17 |
| fs_normalized_negcon_lowest | 2.043825 | 3.665 | 6.8949 | 13.777 | 37.5 | 98.76700000000069 | 1925.2745000005932 | 1.490386849996308e+17 |
| fs_normalized_highest | 2.2709 | 3.601 | 5.7874 | 9.9307 | 25.109 | 63.60740000000013 | 280.5766400000674 | 1766.8402799989447 |
| fs_normalized_negcon_highest | 2.4576 | 4.4742 | 8.153 | 15.464 | 42.238099999999974 | 108.98809999999997 | 513.4387600001251 | 4630.637019997404 |

Percentiles of the absolute value - 1 well

| Metric | 5% | 25% | 50% | 75% | 95% | 99% | 99.9% | 99.99% |
|---|---|---|---|---|---|---|---|---|
| fs_normalized_lowest | 1.9537 | 3.1439 | 5.2232 | 9.562325 | 24.755899999999965 | 67.611 | 2050.4 | 2.5136003199997357e+17 |
| fs_normalized_negcon_lowest | 2.0962 | 3.8432 | 7.3208 | 14.811 | 41.49919999999999 | 112.29440000000002 | 2551.0712000007275 | 2.3829724799993216e+17 |
| fs_normalized_highest | 2.3002 | 3.7878 | 6.20575 | 10.87 | 28.283949999999983 | 75.08323999999976 | 329.9385800000548 | 2486.130199998 |
| fs_normalized_negcon_highest | 2.4907 | 4.689 | 8.6801 | 16.775 | 47.257 | 128.66440000000003 | 623.33792000002 | 7922.325919994525 |
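
For anyone who wants to reproduce numbers like these, a sketch of the percentile computation; the dataframe of per-feature extreme values across plates and its column names are assumptions based on the labels above:

```python
# Percentiles of the absolute value of per-feature extreme values across plates.
import numpy as np
import pandas as pd

extremes = pd.read_csv("feature_extremes_across_plates.csv")  # placeholder path

pcts = [5, 25, 50, 75, 95, 99, 99.9, 99.99]
for col in [
    "fs_normalized_lowest",
    "fs_normalized_negcon_lowest",
    "fs_normalized_highest",
    "fs_normalized_negcon_highest",
]:
    values = np.percentile(extremes[col].abs(), pcts)
    print(col, dict(zip(pcts, np.round(values, 4))))
```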


FeatureDistribution_percentiles.pdf
FeatureDistribution-top_only.pdf

@bethac07
Member Author

bethac07 commented Nov 3, 2022

I made a linked PR - after showing the graphs to the image analysts and talking with them about their experience running e.g. Morpheus before, this seemed like a good compromise number. I would rather go high and default to removing only the most extreme data than potentially clipping off the most interesting real biology. Folks who want it tighter can always override it to be tighter, but this is where we feel a good default lies.

@gwaybio
Member

gwaybio commented Nov 3, 2022

Sounds good to me! I will approve the PR once it passes tests

bethac07 added a commit that referenced this issue Nov 7, 2022
Change drop_outlier default to 500