Different feature dim numbers after PCA in example script? #4

Open
Y-SHI-MxLucid opened this issue Nov 8, 2021 · 1 comment
@Y-SHI-MxLucid

Hi there,

I have read the preprint; very nice work.

I am trying to run the example script in the project, and I noticed something about the inputs to MultiMAP.Integration:

adata = MultiMAP.Integration([rna, atac_genes], ['X_pca', 'X_lsi'])

rna.obsm['X_pca'] has shape (4382, 50), while atac_genes.obsm['X_lsi'] has shape (3166, 49). atac_genes.obsm['X_lsi'] is the output of MultiMAP.TFIDF_LSI() in __init__.py, which in turn calls tfidf() in matrix.py:

MultiMAP.TFIDF_LSI(atac_peaks)
atac_genes.obsm['X_lsi'] = atac_peaks.obsm['X_lsi'].copy()

I then checked matrix.py, and I think the 49 dimensions might be due to the first column of the sklearn.decomposition.TruncatedSVD() output being discarded?

# n_components passed to here is 50
def tfidf(X, n_components, binarize=True, random_state=0):
    from sklearn.feature_extraction.text import TfidfTransformer
    sc_count = np.copy(X)
    if binarize:
        sc_count = np.where(sc_count < 1, sc_count, 1)
    tfidf = TfidfTransformer(norm='l2', sublinear_tf=True)
    normed_count = tfidf.fit_transform(sc_count)
    lsi = sklearn.decomposition.TruncatedSVD(n_components=n_components, random_state=random_state)
    lsi_r = lsi.fit_transform(normed_count)
    # The first component is dropped here, so n_components - 1 columns are returned
    X_lsi = lsi_r[:, 1:]
    return X_lsi
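To check that the shape follows from the slicing alone, here is a minimal sketch on random toy data. It uses only scikit-learn and NumPy, not MultiMAP itself. The helper name lsi_shapes and the toy matrix sizes are my own assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer


def lsi_shapes(n_cells=200, n_peaks=500, n_components=50, random_state=0):
    """Hypothetical helper: mimic the TF-IDF + SVD steps on a toy binary matrix."""
    rng = np.random.default_rng(random_state)
    # Toy cells-by-peaks matrix, already binary (about 10% of entries are 1)
    X = (rng.random((n_cells, n_peaks)) < 0.1).astype(float)
    normed = TfidfTransformer(norm='l2', sublinear_tf=True).fit_transform(X)
    lsi = TruncatedSVD(n_components=n_components, random_state=random_state)
    lsi_r = lsi.fit_transform(normed)
    # Same slicing as in tfidf(): drop the first component
    return lsi_r.shape, lsi_r[:, 1:].shape


full_shape, trimmed_shape = lsi_shapes()
print(full_shape, trimmed_shape)  # (200, 50) (200, 49)
```

So asking for 50 components and slicing with [:, 1:] always leaves 49 columns. Asking for n_components + 1 would leave 50, if matching dimensions are actually needed.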

Is column 0 discarded in order to remove the first component, which is usually strongly correlated with sequencing depth? If so, the two inputs of MultiMAP.Integration() have 50 and 49 dimensions respectively. The function still runs normally and returns a result with shape (7548, 2), but is it okay to do this? From reading the preprint, my impression was that the two datasets to be integrated should have the same number of dimensions after reduction, because distances between points in different datasets need to be calculated. Please correct me if my understanding is wrong.
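For what it is worth, here is a rough sketch of how I would check the depth explanation, again on simulated data rather than the real ATAC matrix. The per-cell detection probabilities and matrix sizes are arbitrary assumptions, and the result on real data may differ.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer

rng = np.random.default_rng(0)
n_cells, n_peaks = 300, 500

# Assumption: each toy cell has its own detection probability, so depth varies per cell
p_open = rng.uniform(0.02, 0.3, size=n_cells)
X = (rng.random((n_cells, n_peaks)) < p_open[:, None]).astype(float)
depth = X.sum(axis=1)  # number of detected peaks per cell

normed = TfidfTransformer(norm='l2', sublinear_tf=True).fit_transform(X)
lsi_r = TruncatedSVD(n_components=10, random_state=0).fit_transform(normed)

# Pearson correlation of the first two LSI components with depth
r_first = np.corrcoef(lsi_r[:, 0], depth)[0, 1]
r_second = np.corrcoef(lsi_r[:, 1], depth)[0, 1]
print(f"component 0 vs depth: {r_first:.2f}")
print(f"component 1 vs depth: {r_second:.2f}")
```

If the absolute correlation for component 0 is high and much lower for the later components, that would be consistent with the first column being dropped because of depth. It would not tell me whether mixing 50 and 49 dimensions is intended.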

@ktpolanski
Contributor

@mikasarkinjain
