
Will scanpy implement ICA? #767

Open
chris-rands opened this issue Aug 5, 2019 · 10 comments · May be fixed by #941
Comments

@chris-rands (Contributor) commented Aug 5, 2019

Are there plans to include independent component analysis (ICA) in Scanpy? This would be useful for me. I know that it is already implemented in Seurat. Thank you

@ivirshup (Member) commented Aug 7, 2019

Could you tell us about how you've found it useful, or point us towards some literature on it being used? I think this would be easy enough to implement, but when I tried a naive implementation the results weren't that compelling. It's very possible I should've played around with the parameters more. @flying-sheep, do you have thoughts on this?

Here's a simple implementation:

def ica(adata, n_components, inplace=True, **kwargs):
    """Run FastICA on adata.X, storing cell loadings in .obsm and components in .varm."""
    from sklearn.decomposition import FastICA
    from scipy.sparse import issparse

    ica_transformer = FastICA(n_components=n_components, **kwargs)
    # FastICA requires a dense array, so densify sparse expression matrices first
    X = adata.X.toarray() if issparse(adata.X) else adata.X
    X_ica = ica_transformer.fit_transform(X)
    if inplace:
        adata.obsm["X_ica"] = X_ica
        adata.varm["ICs"] = ica_transformer.components_.T
    else:
        return ica_transformer
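A sketch of what the in-place branch of that helper does, exercised on synthetic data (the `SimpleNamespace` stand-in and all parameter values here are illustrative, not part of any scanpy API):

```python
import numpy as np
from types import SimpleNamespace
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
# Hypothetical stand-in for an AnnData object: an X matrix (cells x genes)
# plus obsm/varm dicts, just to exercise the helper without anndata installed.
adata = SimpleNamespace(X=rng.standard_normal((200, 50)), obsm={}, varm={})

ica_transformer = FastICA(n_components=10, max_iter=1000, random_state=0)
adata.obsm["X_ica"] = ica_transformer.fit_transform(adata.X)   # cell loadings
adata.varm["ICs"] = ica_transformer.components_.T              # gene components
```

With a real AnnData object the calls are the same; only the container differs.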

@Fougere87

Hi !
To answer about how it is useful: we are using ICA in our lab on a dataset of more than 100k cells with a lot of complexity, and the main advantage of ICA over PCA is that it helps us detect small populations of cells. Because these small populations do not account for much variance in the dataset, when we used a threshold on PCs we discarded the very components that would have allowed us to separate them.
Another advantage is that we no longer rely on a selection of "highly variable genes"; instead we use all genes expressed in more than 100 cells for the whole analysis. Doing the same with PCA gave us quite poor results.

We used Seurat's implementation, and I tried FastICA from sklearn once but couldn't obtain similar results. I have not looked thoroughly into Seurat's code, though.

Hope it helps !
Best

@chris-rands (Contributor, Author)

Thank you both. In terms of literature, see this paper (PMID: 30096299), in particular Figure S2.

@ivirshup (Member) commented Aug 8, 2019

I looked through the R fastICA package, and I think one of the main differences was the default tolerance. I've changed that, and the results seem a bit better. I don't think I have a great reference point to evaluate it, though. Any chance you could provide a vignette of ICA being used with single-cell data, so we can see if the Python version can recapitulate the results?
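For reference, tolerance and iteration count are explicit keyword arguments in sklearn's FastICA, so they are easy to experiment with. A minimal sketch on a synthetic mixing problem (the `tol`/`max_iter` values here are illustrative, not the R defaults, which I haven't verified):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
S = rng.laplace(size=(1000, 3))   # independent non-Gaussian sources
A = rng.standard_normal((3, 3))   # mixing matrix
X = S @ A.T                       # observed mixtures

# Tightening tol and raising max_iter changes when the fixed-point
# iteration declares convergence; 1e-6 / 2000 are illustrative values.
ica = FastICA(n_components=3, tol=1e-6, max_iter=2000, random_state=0)
S_est = ica.fit_transform(X)      # recovered sources, up to sign/permutation
```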

@fidelram (Collaborator)

@chris-rands Any update on this? Does the code from @ivirshup give you any meaningful results?

@gokceneraslan (Collaborator) commented Aug 30, 2019

BTW, there is a nice "faster ICA" algorithm now, called Picard. It's implemented as a Python package here:

https://pierreablin.github.io/picard/generated/picard.picard.html#picard.picard

It might be worthwhile to try this too.

@ivirshup (Member)

@gokceneraslan, got around to giving Picard a shot. For the pbmc3k dataset, it's a little slower than sklearn's FastICA, though sklearn gives non-convergence warnings more often. Have you tried it before?

@chris-rands (Contributor, Author)

Thanks for the feedback, and sorry for the delay. The code snippet using sklearn.decomposition.FastICA gave me quite similar results to PCA for my data in terms of the downstream UMAP visualisations (when I simply embedded 50 components in the neighbourhood graph). One difference is that the ICA was slower to compute. I am not confident creating a vignette, as I'm unclear what the 'correct' results should look like. I have not tried the Picard implementation yet.

@Fougere87

Hi, I tried the snippet with both FastICA and Picard, and with more than 30,000 cells the whitening step cannot be completed. This seems to be due to a LAPACK limitation:
ValueError: Too large work array required -- computation cannot be performed with standard 32-bit LAPACK.
I don't know how to get around this.
Best,
Chloé

@ivirshup (Member)

Sorry for the late response; I completely forgot to post my response here.

@Fougere87, did that whitening issue occur with picard as well? I saw that with sklearn. I think we could get around that by whitening ourselves with ARPACK.
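A sketch of that workaround: whiten the data ourselves with ARPACK (via a truncated SVD), then run FastICA with `whiten=False`, so the dense full-rank SVD that trips the 32-bit LAPACK work-array limit is never built. The matrix sizes and component count here are illustrative:

```python
import numpy as np
from scipy.sparse.linalg import svds
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 500))   # stand-in for adata.X (cells x genes)

n_components = 30
Xc = X - X.mean(axis=0)                # center each gene
U, s, Vt = svds(Xc, k=n_components)    # ARPACK-based truncated SVD
order = np.argsort(s)[::-1]            # svds returns singular values ascending
U, s = U[:, order], s[order]

# Orthonormal U scaled to unit variance gives whitened data directly.
X_white = U * np.sqrt(X.shape[0])

ica = FastICA(n_components=n_components, whiten=False, max_iter=1000, random_state=0)
X_ica = ica.fit_transform(X_white)     # cell loadings, cells x components
```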

Picard and sklearn look pretty similar to me in a quick comparison. Below are the top 16 of 30 components (ranked by Geary's C, an autocorrelation statistic on the connectivity graph), shown as cell loadings on the pbmc3k dataset. The UMAP and connectivity matrix here were computed on top of a PCA, which I should maybe do differently. However, I think the results are similar enough that it's probably not of consequence.

sklearn FastICA
[image: sklearn_ica]

Picard ICA
[image: picard_ica2]

@ivirshup ivirshup linked a pull request Nov 30, 2019 that will close this issue