Questions about drop_outlier_features #237
I don't recall our rationale for selecting defaults, but it might have been influenced by pooled Cell Painting. See #61, which references a private data repo.
Yes, based on the link I suspected that might have been it. It sounds like, then, that you have no particular attachment to the SPECIFIC number and might be OK with a PR proposing to change the default value? If so, I'll check with the team to figure out whether there's a reasonably straightforward way to propose an evidence-based default; otherwise I'm just going to say "50" or "100".
Yes, definitely no attachment. @niranjchandrasekaran do you have any memories or thoughts here?
Pasting what I said in the internal Slack thread:
I think the default should be higher, but I don't know how much higher it should be without looking at some datasets.
When I added drop_outliers to the operations using the default […]
We did some initial internal testing in both arrayed and pooled data, and prelim results in both are that this default throws out far too much data; we're still digging, but it seems a value in the hundreds is more appropriate. Will make a PR when we dig more, but wanted to let you know ASAP for your own data.
For my A549 four-compound replicate plates, to perform feature selection, if I use […]
So I ran some Jupyter notebooks on the JUMP pilot data; here are the top-level results. I have a lot more in the notebooks and the files exported from them, if folks want a deeper dive. Note that the X axes aren't forced here: if you log-norm the Y axis in addition to the X, you do have trailing tails all the way out to sometimes 10^17. So this feature is undeniably good, just not with the current default. Normalized negcon here is always higher than normalized, which is good and sensible. Based on the rough breakdowns below, I think a setting of about 100 is reasonable (it likely keeps 98%-99% of features after selection), though to be cautious I might feel even more confident in the middle hundreds (~300 or 500).
[Percentile tables omitted: "Percentiles - 2 wells" and "Percentiles - 1 well"]
Attachment: FeatureDistribution_percentiles.pdf
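To make the percentile breakdowns above concrete, here is a rough sketch (not the actual notebooks) of how one might relate candidate cutoff values to the fraction of features retained. It assumes this is pycytominer, that drop_outliers flags a feature when any of its values exceed the cutoff in absolute terms, and that infer_cp_features is available; the file name and candidate values are illustrative.

```python
import numpy as np
import pandas as pd
from pycytominer.cyto_utils import infer_cp_features

# Illustrative input: a normalized (or negcon-normalized) per-well profile table
normalized_df = pd.read_csv("plate_normalized.csv.gz")
features = infer_cp_features(normalized_df)

# Largest absolute value each feature reaches across wells; this is the quantity
# the cutoff is compared against (my reading of the drop_outliers operation)
max_abs = normalized_df[features].abs().max()

# Fraction of features that would survive at each candidate cutoff
for cutoff in [15, 50, 100, 300, 500]:
    kept = (max_abs <= cutoff).mean()
    print(f"outlier_cutoff={cutoff}: {kept:.1%} of features kept")

# Inverting the question: what cutoff keeps ~99% of features?
print("99th percentile of per-feature max |value|:", np.percentile(max_abs, 99))
```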
Here are some related issues/functions that we should peek into later:
[linked issues/functions omitted]
I made a linked PR. Showing the graphs to the image analysts and talking with them about their experience running, e.g., Morpheus before, this seemed like a good compromise number. I would rather go high and default to removing only the most extreme data than potentially clipping off the most interesting real biology. Folks who want it tighter can always override it, but this is where we feel a good default lies.
Sounds good to me! I will approve the PR once it passes tests.
I was looking at some code @fefossa is writing, and came across drop_outlier_features, which was added in #62 but somehow escaped my notice (it isn't turned on in the profiling template, for example). This is potentially super useful to us, but I was wondering about the default value (15). I know I'm asking about decisions made in December 2019, a far simpler time, but do you recall how this value was chosen? It doesn't appear at first pass to be documented, and it seems to me to be WAY too low: we routinely see features we believe in above that absolute value. Fernanda is going to check some profiles with and without that feature-reduction step added, to get a sense of how much effect it is having in her data, but it is possibly significant. I know the parameter is configurable, so we can override it if we'd like, but in general I assume we want defaults to be broadly applicable for the vast majority of use cases, and I don't know that this one is.
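Since the parameter is configurable, here is a minimal sketch of overriding the default when running feature selection. It assumes this is pycytominer's feature_select and that the relevant keyword is outlier_cutoff; the file name, operations list, and the value 100 are illustrative choices from this thread, not endorsed defaults.

```python
import pandas as pd
from pycytominer import feature_select

# Illustrative input: a normalized per-well profile table
normalized_df = pd.read_csv("plate_normalized.csv.gz")

# Include drop_outliers among the operations, overriding the old default of 15.
# outlier_cutoff is assumed to be the keyword that controls this operation.
selected_df = feature_select(
    profiles=normalized_df,
    operation=["variance_threshold", "correlation_threshold", "drop_outliers"],
    outlier_cutoff=100,  # value floated in this thread; tune for your own data
)
```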