
[WIP] ICA #941

Open · wants to merge 5 commits into main
Conversation

ivirshup (Member)

Fixes #767

This is a work-in-progress PR adding ICA as a dimensionality reduction method. Some points:

This is faster and works with larger data than the sklearn version, entirely due to the whitening step. sklearn uses np.linalg.svd for whitening, which is slow (though exact) and raises errors about 32-bit LAPACK on large datasets, since we use 32-bit floats. I've swapped that for the ARPACK SVD. I may try to upstream this in the future, but there are a number of open PRs about ICA that I'd like to wait on for a bit: scikit-learn/scikit-learn#11860, scikit-learn/scikit-learn#13056.
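Roughly, the swap looks like this (a minimal sketch assuming dense, centered float32 input; the helper name and exact scaling are illustrative, not the PR's actual code):

```python
import numpy as np
from scipy.sparse.linalg import svds  # ARPACK-backed truncated SVD

def _whiten_arpack(X, n_components):
    """Whiten X with a truncated ARPACK SVD instead of a full np.linalg.svd.

    Sketch only: computes just n_components singular triplets, avoiding the
    full (slow, 32-bit-LAPACK-limited) decomposition.
    """
    X = X - X.mean(axis=0)                       # center each variable
    U, s, Vt = svds(X, k=n_components)           # truncated SVD via ARPACK
    order = np.argsort(s)[::-1]                  # svds returns ascending order
    s, Vt = s[order], Vt[order]
    K = (Vt / s[:, None]) * np.sqrt(X.shape[0])  # whitening matrix
    return X @ K.T, K                            # whitened data (unit variance) and K
```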

As a benchmark, I was able to compute a 40-component ICA on 50k cells (Tabula Muris) with 7.5k highly variable genes in about a minute (59.3 s) on my laptop.

As a comparison (on a smaller dataset, 10k PBMCs), here are two pair grid plots showing cell embeddings on ten ICA components and on the top ten PCA components.

PCA

[image: pair grid of the top ten PCA components]

ICA

[image: pair grid of ten ICA components]
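For reference, a pair grid like the PCA one above can be produced along these lines (a hypothetical reproduction using a stand-in public dataset, not the exact code or data behind the figures):

```python
import pandas as pd
import scanpy as sc
import seaborn as sns

adata = sc.datasets.pbmc3k_processed()  # stand-in; the figures above used 10k PBMCs
sc.pp.pca(adata, n_comps=10)
df = pd.DataFrame(adata.obsm["X_pca"], columns=[f"PC{i + 1}" for i in range(10)])
sns.pairplot(df, plot_kws={"s": 2})  # scatter every pair of components
```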

Things left to do:

  • Look into numerical stability
  • Figure out if I should be scaling the whitening matrix differently
  • More in-depth comparison of results with the sklearn-based ICA
  • Documentation
  • Share _choose_obs_rep with sc.metrics PR

Once this is done, I'd like to also add sklearn's NMF.
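For context, a wrapper around sklearn's NMF could look roughly like this (purely a sketch; the dataset and the obsm/varm key names are illustrative assumptions, not part of this PR):

```python
import scanpy as sc
from sklearn.decomposition import NMF

# NMF requires non-negative input, so run it on counts rather than scaled data.
adata = sc.datasets.pbmc3k()
nmf = NMF(n_components=20, init="nndsvda", random_state=0)
adata.obsm["X_nmf"] = nmf.fit_transform(adata.X)   # cell embeddings
adata.varm["nmf_components"] = nmf.components_.T   # gene loadings
```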

LuckyMD (Contributor) commented Nov 30, 2019

A quick, naive question: What is the advantage of ICA over PCA? Components 1, 3, 5, and 7 of the ICA grid plot don't look particularly independent to me...

ivirshup (Member, Author) commented Dec 1, 2019

I'll give a brief hand-wavy explanation now, before checking with someone who knows more about it whether my in-depth understanding is correct.

PCA finds a set of linearly independent variables which form a new basis for the data. ICA finds N (user-defined) discrete, maximally independent signals in the data. They won't form a basis for the input data, and the results can vary a lot with the number of components you try to find. However, each of the signals is discrete and made up of a sparser set of variables, which I think makes them more interpretable. I'd relate this to how the PCA components become a single blob while the ICA components keep separating clusters.
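As a toy illustration of that distinction (my own sketch, not from this PR): mix two independent non-Gaussian sources, then compare what the two methods recover. PCA returns orthogonal directions ordered by variance, while FastICA recovers the original sources up to sign and order.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
S = rng.laplace(size=(2000, 2))          # two independent, non-Gaussian sources
A = np.array([[1.0, 0.5], [0.5, 2.0]])   # mixing matrix
X = S @ A.T                              # observed mixtures

X_pca = PCA(n_components=2).fit_transform(X)                      # orthogonal, variance-ordered
X_ica = FastICA(n_components=2, random_state=0).fit_transform(X)  # ~ recovers S
```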

For example, in the components you point out, I would agree 1, 5, and 7 look to be the same (note: I may have underspecified N here). However, component 3 is picking up a signal which is largely collinear with those, except for one cluster. To me, that says the difference in variable loadings between component 3 and the others is worth investigating.

LuckyMD (Contributor) commented Dec 1, 2019

Thanks for the explanation. But what do you mean by "discrete" here?

And so you're saying 1, 5, and 7 being given as solutions to ICA is non-optimal. I guess that's just local optima that are found. It feels strange to generally say that ICA is better as higher dimensions still separate out clusters, while at lower dimensions there is redundant information compared to PCA.

ivirshup (Member, Author) commented Dec 2, 2019

I think saying discrete was redundant with independent, in that each component should correspond to a signal in the data.

And so you're saying 1, 5, and 7 being given as solutions to ICA is non-optimal.

I'm not sure how to interpret it. I know that if I run an analysis on the same dataset with 20 components I get more independent ones. My impression is the "failure modes" of linear decompositions like this are not well characterized.

It feels strange to generally say that ICA is better as higher dimensions

I probably wouldn't say this. I think there are different use cases, and ICA components may be easier to interpret than PCA components. I was also just at a talk by Elana Fertig (who knows much more about this kind of thing than I do) where one of the takeaways was "different decompositions for different use-cases".

I think I'll still use PCA for clustering and generating UMAPs.

while at lower dimensions there is redundant information compared to PCA.

I'd note that there is no order to ICA components.
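A common workaround (a sketch of a standard heuristic, not something implemented in this PR) is to impose an order post hoc, e.g. by each component's signal variance:

```python
import numpy as np

def order_ica_components(sources):
    """Sort ICA sources (n_obs x n_components) by variance, descending."""
    order = np.argsort(sources.var(axis=0))[::-1]
    return sources[:, order], order
```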

LuckyMD (Contributor) commented Dec 2, 2019

ICA components may be easier to interpret than PCA components.

I find this difficult to generalize. In PCA I know exactly how to interpret a component based on its rank (and/or variance contribution). In ICA I would have to run some sort of enrichment every time to interpret the same thing. Surely ICA must have spurious components as well (even if only due to non-optimal solutions).

I was also just at a talk by Elana Fertig (who knows much more about this kind of thing than I do) where one of the takeaways was "different decompositions for different use-cases".

That's not really a useful takeaway for me. That would say I should try as many decompositions as possible to see when I get a good result. So then I have to rely on my subjective assessment of "good", based on what I expect the data to show. This is especially difficult if there is some stochasticity in the output of the decompositions. I assume there is some sort of random seed for ICA, given that there is no inherent ordering to the components?
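For reference, sklearn's FastICA does take a random_state: the unmixing matrix is randomly initialized, so different seeds can return the same components in a different order and with flipped signs. A minimal illustration (my own sketch, not from this thread):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
X = rng.laplace(size=(1000, 5)) @ rng.normal(size=(5, 5))  # mixed signals
comps_a = FastICA(n_components=3, random_state=0).fit_transform(X)
comps_b = FastICA(n_components=3, random_state=1).fit_transform(X)
# comps_a and comps_b span the same components, but order/sign may differ.
```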

I'd note that there is no order to ICA components.

Good point

ivirshup (Member, Author) commented Dec 2, 2019

In PCA I know exactly how to interpret a component based on its rank (and/or variance contribution).

Ah, I meant more specifically that it may be easier to biologically interpret an ICA.

That would say I should try as many decompositions as possible to see when I get a good result.

I'm a little unsure of your meaning here. Do you mean decompositions as in decomposition techniques? If so, I don't think this is the right conclusion. I think it means: probably PCA for clustering, probably NMF for finding gene modules. I would also suspect that something which finds sparser variable loadings, like ICA or NMF, could be more robust for cross-dataset classification.

If you mean: if the results are unstable, how do we know which to trust – I did ask that question. I think it's the usual: have a validation dataset, maybe some ensemble/robustness method, or do some sort of enrichment. It's an open question, but so is a lot of our analysis pipeline.

LuckyMD (Contributor) commented Dec 2, 2019

I was referring to both the instability and what I understood as non-robustness across datasets. But it seems a "use case" here means an analytical step, rather than a particular dataset to be analysed. That makes it a lot better: it means there is work to be done, but a general best-practice conclusion would be reachable.

In that case it's only the per-dataset instability of the algorithm that is the issue. And when you're doing exploratory analysis on a new dataset, you don't typically have a validation dataset, which makes this pretty challenging for end users of the method. Enrichment could be a way forward, I guess... though I'm not the biggest fan of using enrichment results as a measure of success; they still require quite a bit of interpretation.
