Questions about drop_outlier_features #237

Closed
bethac07 opened this issue Oct 31, 2022 · 11 comments · Fixed by #238
Labels
question Further information is requested

Comments

@bethac07
Member

I was looking at some code @fefossa is writing and came across drop_outlier_features, which was added in #62 but somehow escaped my notice (it isn't turned on in the profiling template, for example). This is potentially super useful to us, but I was wondering about the default value (15). I know I'm asking about decisions made in December 2019, a far simpler time, but do you recall how this value was chosen? It doesn't appear at first pass to be documented, and it seems to me to be WAY too low - we routinely see features we believe are real with absolute values above that cutoff. Fernanda is going to check some profiles with and without that feature-reduction step added to get a sense of how much effect it is having on her data, but it is possibly significant.

I know the parameter is configurable, so we can override it if we'd like, but in general I assume we want defaults to be broadly applicable across the vast majority of use cases, and I don't know that this one is.
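
For context, here is a minimal sketch of the call under discussion. It assumes pycytominer's feature_select exposes the operation list and outlier_cutoff keyword used elsewhere in this thread; the input path is a placeholder.

```python
# Minimal sketch (not the actual profiling-template code): run feature selection
# with the drop_outliers operation and its outlier_cutoff parameter.
import pandas as pd
from pycytominer import feature_select

profiles = pd.read_csv("plate_normalized.csv.gz")  # placeholder path

selected = feature_select(
    profiles,
    operation=[
        "variance_threshold",
        "correlation_threshold",
        "drop_na_columns",
        "blocklist",
        "drop_outliers",  # the step in question
    ],
    outlier_cutoff=15,  # current default; features with absolute values beyond this get dropped
)
print(profiles.shape, "->", selected.shape)
```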

@bethac07 bethac07 added the question Further information is requested label Oct 31, 2022
@gwaybio
Member

gwaybio commented Oct 31, 2022

I don't recall our rationale for selecting defaults, but it might have been influenced by pooled cell painting. See #61, which references a private data repo

@bethac07
Member Author

Yes, based on the link I suspected that might have been it. It sounds like you have no particular attachment to the SPECIFIC number, then, and would be OK with a PR proposing to change the default value? If so, I'll check with the team to figure out whether there's a reasonably straightforward way to propose an evidence-based default; otherwise I'm just going to say "50" or "100".

@gwaybio
Member

gwaybio commented Oct 31, 2022

Yes, definitely no attachment. @niranjchandrasekaran do you have any memories or thoughts here?

@niranjchandrasekaran
Member

Pasting what I said in the internal slack thread

We had a discussion about this in the last profiling check-in. Yu was wondering whether to use this for feature selection. She found that using the default value of 15 resulted in dropping an additional 500 features, which is a lot (~5900 -> ~650 without drop_outlier -> ~150 with drop_outlier). Changing this threshold is likely the right thing to do.

I think the default should be higher, but I don't know how much higher it should be, without looking at some datasets.

@fefossa

fefossa commented Nov 1, 2022

When I added drop_outliers to the default operations, i.e. operation = ['correlation_threshold', 'variance_threshold', 'drop_na_columns', 'blocklist', 'drop_outliers'], it removed 440 more features than without drop_outliers. With outlier_cutoff = 50 it removes an additional 66 features, and with outlier_cutoff = 100 it removes an additional 23 features.
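
A rough sketch of how a comparison like this could be run, assuming the same feature_select call sketched earlier in the thread (the input path is a placeholder):

```python
# Count surviving feature columns at several outlier_cutoff values.
import pandas as pd
from pycytominer import feature_select

profiles = pd.read_csv("plate_normalized.csv.gz")  # placeholder path

ops = ["correlation_threshold", "variance_threshold", "drop_na_columns", "blocklist", "drop_outliers"]

for cutoff in (15, 50, 100):
    selected = feature_select(profiles, operation=ops, outlier_cutoff=cutoff)
    print(f"outlier_cutoff={cutoff}: {selected.shape[1]} columns remain")
```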

@bethac07
Member Author

bethac07 commented Nov 1, 2022

We did some initial internal testing on both arrayed and pooled data, and the preliminary results in both are that this default throws out far too much data - we're still digging, but it seems a value in the hundreds is more appropriate. We will make a PR when we dig more, but I wanted to let you know ASAP for your own data.

@yhan8

yhan8 commented Nov 2, 2022

For my A549 four-compound replicate plates, when I perform feature selection with operation=['variance_threshold','correlation_threshold','drop_na_columns','blocklist'], my feature count drops from 5800+ to 650+; however, if I also add 'drop_outliers', it drops to 130+.

@bethac07
Member Author

bethac07 commented Nov 3, 2022

So I ran some Jupyter notebooks on the JUMP pilot data - here are the top-level results. I have a lot more in the notebooks and the files exported from them, if folks want a deeper dive.
These are the distributions of the values from the bottom/top ~2 wells (left) or 1 well (right) for 314 plates of JUMP pilot data - feature selected both after whole-plate normalization and after negcon normalization.
[Figure: FeatureDistributions_feature_selected]

Note that the X axes aren't forced here - if you log-norm the Y axis in addition to the X, you do have trailing tails all the way out to sometimes 10^17, so this feature is undeniably a good idea; it just doesn't have the right default. Normalized negcon is always higher here than normalized, which is good and sensible.
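
A sketch of how a distribution like this could be plotted with log-scaled axes; the dataframe and column name are assumptions mirroring the labels below, and the input path is a placeholder:

```python
# Histogram of absolute extreme feature values with log-scaled axes,
# so the long tail out toward ~1e17 stays visible.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

extremes = pd.read_csv("feature_extremes_across_plates.csv")  # placeholder path
values = extremes["fs_normalized_lowest"].abs()

plt.hist(values, bins=np.logspace(0, 18, 60))
plt.xscale("log")
plt.yscale("log")
plt.xlabel("absolute feature value")
plt.ylabel("count")
plt.savefig("feature_distribution_log.png")
```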

Based on the rough breakdowns below, I think a setting of about 100 is reasonable (it likely keeps 98%-99% of features after selection), though to be cautious I might feel even more confident in the middle hundreds (~300 or ~500).

Percentiles of the absolute value - 2 wells

| Metric | 5% | 25% | 50% | 75% | 95% | 99% | 99.9% | 99.99% |
|---|---|---|---|---|---|---|---|---|
| fs_normalized_lowest | 1.8973 | 2.9956 | 4.8945 | 8.81885 | 22.149 | 59.17057999999967 | 1533.5250000001804 | 2.1023871399963683e+17 |
| fs_normalized_negcon_lowest | 2.043825 | 3.665 | 6.8949 | 13.777 | 37.5 | 98.76700000000069 | 1925.2745000005932 | 1.490386849996308e+17 |
| fs_normalized_highest | 2.2709 | 3.601 | 5.7874 | 9.9307 | 25.109 | 63.60740000000013 | 280.5766400000674 | 1766.8402799989447 |
| fs_normalized_negcon_highest | 2.4576 | 4.4742 | 8.153 | 15.464 | 42.238099999999974 | 108.98809999999997 | 513.4387600001251 | 4630.637019997404 |

Percentiles of the absolute value - 1 well

| Metric | 5% | 25% | 50% | 75% | 95% | 99% | 99.9% | 99.99% |
|---|---|---|---|---|---|---|---|---|
| fs_normalized_lowest | 1.9537 | 3.1439 | 5.2232 | 9.562325 | 24.755899999999965 | 67.611 | 2050.4 | 2.5136003199997357e+17 |
| fs_normalized_negcon_lowest | 2.0962 | 3.8432 | 7.3208 | 14.811 | 41.49919999999999 | 112.29440000000002 | 2551.0712000007275 | 2.3829724799993216e+17 |
| fs_normalized_highest | 2.3002 | 3.7878 | 6.20575 | 10.87 | 28.283949999999983 | 75.08323999999976 | 329.9385800000548 | 2486.130199998 |
| fs_normalized_negcon_highest | 2.4907 | 4.689 | 8.6801 | 16.775 | 47.257 | 128.66440000000003 | 623.33792000002 | 7922.325919994525 |
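
For anyone who wants to reproduce numbers like these, a sketch of the percentile computation; the dataframe of per-feature extreme values across plates and its column names are assumptions based on the labels above:

```python
# Percentiles of the absolute value of per-feature extreme values across plates.
import numpy as np
import pandas as pd

extremes = pd.read_csv("feature_extremes_across_plates.csv")  # placeholder path

pcts = [5, 25, 50, 75, 95, 99, 99.9, 99.99]
for col in [
    "fs_normalized_lowest",
    "fs_normalized_negcon_lowest",
    "fs_normalized_highest",
    "fs_normalized_negcon_highest",
]:
    values = np.percentile(extremes[col].abs(), pcts)
    print(col, dict(zip(pcts, np.round(values, 4))))
```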


FeatureDistribution_percentiles.pdf
FeatureDistribution-top_only.pdf

@bethac07
Member Author

bethac07 commented Nov 3, 2022

I made a linked PR - after showing the graphs to the image analysts and talking with them about their experience running e.g. Morpheus before, this seemed like a good compromise number. I would rather go high and default to removing only the most extreme data than potentially clipping off the most interesting real biology. Folks who want it tighter can always override it to be tighter, but this is where we feel a good default lies.

@gwaybio
Member

gwaybio commented Nov 3, 2022

Sounds good to me! I will approve the PR once it passes tests

bethac07 added a commit that referenced this issue Nov 7, 2022
Change drop_outlier default to 500