
[WIP] ICA #941

Open · wants to merge 5 commits into main
Conversation

ivirshup (Member)

Fixes #767

This is a work-in-progress PR adding ICA as a dimensionality reduction method. Some points:

This is faster and works with larger data than the sklearn version, entirely due to the whitening step. sklearn uses np.linalg.svd for whitening, which is slow (though exact) and raises errors about 32-bit LAPACK on large datasets, since we use 32-bit floats. I've swapped that for the ARPACK SVD. I may try to upstream this in the future, but there are a number of open PRs about ICA that I'd like to wait on for a bit: scikit-learn/scikit-learn#11860, scikit-learn/scikit-learn#13056.
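Roughly, the swap looks like this (a minimal sketch assuming dense, centered float32 input; the helper name and exact scaling are illustrative, not the PR's actual code):

```python
import numpy as np
from scipy.sparse.linalg import svds  # ARPACK-backed truncated SVD

def _whiten_arpack(X, n_components):
    """Whiten X with a truncated ARPACK SVD instead of a full np.linalg.svd.

    Sketch only: computes just n_components singular triplets, avoiding the
    full (slow, 32-bit-LAPACK-limited) decomposition.
    """
    X = X - X.mean(axis=0)                       # center each variable
    U, s, Vt = svds(X, k=n_components)           # truncated SVD via ARPACK
    order = np.argsort(s)[::-1]                  # svds returns ascending order
    s, Vt = s[order], Vt[order]
    K = (Vt / s[:, None]) * np.sqrt(X.shape[0])  # whitening matrix
    return X @ K.T, K                            # whitened data (unit variance) and K
```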

As a benchmark, I was able to compute a 40-component ICA on 50k cells (Tabula Muris) with 7.5k highly variable genes in about a minute (59.3 s) on my laptop.

As a comparison (on a smaller dataset, 10k PBMCs), here are two pair grid plots showing cell embeddings on ten ICA components and on the top ten PCA components.

PCA

[image: pair grid of the top ten PCA components]

ICA

[image: pair grid of ten ICA components]
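For reference, a pair grid like the PCA one above can be produced along these lines (a hypothetical reproduction using a stand-in public dataset, not the exact code or data behind the figures):

```python
import pandas as pd
import scanpy as sc
import seaborn as sns

adata = sc.datasets.pbmc3k_processed()  # stand-in; the figures above used 10k PBMCs
sc.pp.pca(adata, n_comps=10)
df = pd.DataFrame(adata.obsm["X_pca"], columns=[f"PC{i + 1}" for i in range(10)])
sns.pairplot(df, plot_kws={"s": 2})  # scatter every pair of components
```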

Things left to do:

  • Look into numerical stability
  • Figure out if I should be scaling the whitening matrix differently
  • More in-depth comparison of results with the sklearn-based ICA
  • Documentation
  • Share _choose_obs_rep with sc.metrics PR

Once this is done, I'd like to also add sklearn's NMF.
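For context, a wrapper around sklearn's NMF could look roughly like this (purely a sketch; the dataset and the obsm/varm key names are illustrative assumptions, not part of this PR):

```python
import scanpy as sc
from sklearn.decomposition import NMF

# NMF requires non-negative input, so run it on counts rather than scaled data.
adata = sc.datasets.pbmc3k()
nmf = NMF(n_components=20, init="nndsvda", random_state=0)
adata.obsm["X_nmf"] = nmf.fit_transform(adata.X)   # cell embeddings
adata.varm["nmf_components"] = nmf.components_.T   # gene loadings
```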

LuckyMD (Contributor) commented Nov 30, 2019

A quick, naive question: What is the advantage of ICA over PCA? Components 1, 3, 5, and 7 of the ICA grid plot don't look particularly independent to me...

ivirshup (Member, Author) commented Dec 1, 2019

I'll give a brief hand-wavy explanation now, before checking with someone who knows more about it whether my in-depth understanding is correct.

PCA finds a set of linearly independent variables which form a new basis for the data. ICA finds N (user-defined) discrete, maximally independent signals in the data. They won't form a basis for the input data, and the results can vary a lot with the number of components you try to find. However, each of the signals is discrete and made up of a sparser set of variables, which I think makes them more interpretable. I'd relate this to how the PCA components become a single blob while the ICA components keep separating clusters.
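As a toy illustration of that distinction (my own sketch, not from this PR): mix two independent non-Gaussian sources, then compare what the two methods recover. PCA returns orthogonal directions ordered by variance, while FastICA recovers the original sources up to sign and order.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
S = rng.laplace(size=(2000, 2))          # two independent, non-Gaussian sources
A = np.array([[1.0, 0.5], [0.5, 2.0]])   # mixing matrix
X = S @ A.T                              # observed mixtures

X_pca = PCA(n_components=2).fit_transform(X)                      # orthogonal, variance-ordered
X_ica = FastICA(n_components=2, random_state=0).fit_transform(X)  # ~ recovers S
```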

For example, in the components you point out, I would agree 1, 5, and 7 look to be the same (note: I may have underspecified N here). However, component 3 is picking up a signal which is largely collinear with those, except for one cluster. To me, that says the difference in variable loadings between component 3 and the others is worth investigating.

LuckyMD (Contributor) commented Dec 1, 2019

Thanks for the explanation. But what do you mean by "discrete" here?

And so you're saying 1, 5, and 7 being given as solutions to ICA is non-optimal. I guess that's just local optima that are found. It feels strange to generally say that ICA is better as higher dimensions still separate out clusters, while at lower dimensions there is redundant information compared to PCA.

ivirshup (Member, Author) commented Dec 2, 2019

I think saying discrete was redundant with independent, in that each component should correspond to a signal in the data.

And so you're saying 1, 5, and 7 being given as solutions to ICA is non-optimal.

I'm not sure how to interpret it. I know that if I run an analysis on the same dataset with 20 components I get more independent ones. My impression is the "failure modes" of linear decompositions like this are not well characterized.

It feels strange to generally say that ICA is better as higher dimensions

I probably wouldn't say this. I think there are different use cases, and ICA components may be easier to interpret than PCA components. I was also just at a talk by Elana Fertig (who knows much more about this kind of thing than I do) where one of the takeaways was "different decompositions for different use-cases".

I think I'll still use PCA for clustering and generating UMAPs.

while at lower dimensions there is redundant information compared to PCA.

I'd note that there is no order to ICA components.
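A common workaround (a sketch of a standard heuristic, not something implemented in this PR) is to impose an order post hoc, e.g. by each component's signal variance:

```python
import numpy as np

def order_ica_components(sources):
    """Sort ICA sources (n_obs x n_components) by variance, descending."""
    order = np.argsort(sources.var(axis=0))[::-1]
    return sources[:, order], order
```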

LuckyMD (Contributor) commented Dec 2, 2019

ICA components may be easier to interpret than PCA components.

I find this difficult to generalize. In PCA I know exactly how to interpret a component based on its rank (and/or variance contribution). In ICA I would have to run some sort of enrichment every time to interpret the same thing. Surely ICA must have spurious components as well (even if only due to non-optimal solutions).

I was also just at a talk by Elana Fertig (who knows much more about this kind of thing than I do) where one of the takeaways was "different decompositions for different use-cases".

That's not really a useful takeaway for me. That would say I should try as many decompositions as possible to see when I get a good result. So then I have to rely on my subjective assessment of "good", based on what I expect the data to show. This is especially difficult if there is some stochasticity in the output of the decompositions. I assume there is some sort of random seed for ICA, given that there is no inherent ordering to the components?
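For reference, sklearn's FastICA does take a random_state: the unmixing matrix is randomly initialized, so different seeds can return the same components in a different order and with flipped signs. A minimal illustration (my own sketch, not from this thread):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
X = rng.laplace(size=(1000, 5)) @ rng.normal(size=(5, 5))  # mixed signals
comps_a = FastICA(n_components=3, random_state=0).fit_transform(X)
comps_b = FastICA(n_components=3, random_state=1).fit_transform(X)
# comps_a and comps_b span the same components, but order/sign may differ.
```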

I'd note that there is no order to ICA components.

Good point

ivirshup (Member, Author) commented Dec 2, 2019

In PCA I know exactly how to interpret a component based on its rank (and/or variance contribution).

Ah, I meant more specifically that it may be easier to biologically interpret an ICA.

That would say I should try as many decompositions as possible to see when I get a good result.

I'm a little unsure of your meaning here. Do you mean decompositions as in decomposition techniques? If so, I don't think this is the right conclusion. I think it means: probably PCA for clustering, probably NMF for finding gene modules. I would also suspect that something which finds sparser variable loadings, like ICA or NMF, could be more robust for cross-dataset classification.

If you mean: if the results are unstable, how do we know which to trust – I did ask that question. I think it's the usual: have a validation dataset, maybe some ensemble/robustness method, or do some sort of enrichment. It's an open question, but so is a lot of our analysis pipeline.

LuckyMD (Contributor) commented Dec 2, 2019

I was referring to both the instability and what I understood as non-robustness across datasets. But it seems a "use case" here means an analytical step, rather than a particular dataset to be analysed. That makes it a lot better: it means there is work to be done, but a general best-practice conclusion would be reachable.

In that case it's only the per-dataset instability of the algorithm that is the issue. And when you're doing exploratory analysis on a new dataset, you don't typically have a validation dataset, which makes this pretty challenging for end users of the method. Enrichment could be a way forward, I guess... though I'm not the biggest fan of using enrichment results as a measure of success; they still require quite a bit of interpretation.
