Python and R versions of package giving different results #151

Open
shaln opened this issue Aug 26, 2024 · 1 comment
Labels
question User question: anything that's not obviously a bug

Comments

shaln commented Aug 26, 2024

Hi, I wanted to first and foremost express my appreciation for this fantastic package!

I am trying to apply the TF activity inference functions on my 10X scRNA-seq data. For reference, my dataset has 4 conditions and each condition has 2 timepoints (total 10 samples).

Prior to decoupler, I had processed the scRNA-seq data using Seurat v5 following the SCTransform v2 tutorial, and applied decoupler to Seurat objects obtained after running the PrepSCTFindMarkers function in Seurat.

For my first run of decoupler, I tried running the R version on a Seurat object consisting of only one of the conditions with 2 timepoints. For clarity, let's call this Condition 1, which has D7 and D9. This Seurat object was processed the same way as described above. I obtained the heatmap in Figure A below, showing the top 3 TFs.

Since the Python implementation is more memory efficient and supports downstream pseudobulk analysis, which is what I want to do for the Seurat object containing all samples and conditions, I decided to try the Python version. I ran it on the exact same Seurat object for Condition 1 that I had used for the R version (converted to .h5ad format using the sceasy package). I was surprised to see the Python and R versions giving different results for the same dataset: as shown in the heatmaps below, the top 3 TFs for each cluster differ between the Python (Figure B) and R (Figure A) versions of decoupler.

[Image: Picture 1, heatmaps from the R (Figure A) and Python (Figure B) versions of decoupler]

Is this a bug or expected? I'm not quite sure what could be causing the discrepancy, apart from something in the Seurat object to anndata conversion process or differences under the hood between the R and Python versions. I'd appreciate it if you could help clarify this :) Thanks!

shaln added the "question" label on Aug 26, 2024
@PauBadiaM
Member

Hi @shaln,

Sorry for the delayed reply, I was on holiday. The discrepancy comes from how the vignettes present the results; the activity values themselves should be the same (I am assuming that you are comparing sc vs sc, and not sc vs pseudobulk). In the R vignette we plot the 25 most variable TFs at the activity level, while in the Python version we plot the most active TFs per cluster label. If you look at OLIG2, you can see that it has the same distribution of values in both plots. Personally I prefer the Python one since you get a clearer picture per cluster (in the R one, the variable TFs may be over-represented in a subset of clusters).
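To illustrate how the two selection strategies can pick different TFs from identical activity values, here is a minimal sketch using pandas on made-up data. It does not call decoupler; the TF names, values, and helper functions are hypothetical, and a plain per-cluster mean is used as a stand-in for whatever ranking statistic the vignette actually applies.

```python
import pandas as pd

# Toy per-cell TF activity matrix (cells x TFs); all values are made up.
acts = pd.DataFrame(
    {
        "TF_A": [2.0, 2.1, 1.9, 2.0, 2.1, 1.9],    # high everywhere, barely varies
        "TF_B": [3.0, 3.2, -3.0, -3.1, 0.0, 0.1],  # varies strongly across clusters
        "TF_C": [0.0, 0.1, 0.0, 0.1, 1.0, 1.2],    # mildly specific to cluster c2
    },
    index=[f"cell{i}" for i in range(6)],
)
clusters = pd.Series(["c0", "c0", "c1", "c1", "c2", "c2"], index=acts.index)


def top_variable_tfs(acts, n):
    """Variance-based selection: the n TFs with the highest variance across all cells."""
    return acts.var(axis=0).nlargest(n).index.tolist()


def top_active_tfs_per_cluster(acts, clusters, n):
    """Per-cluster selection: the n TFs with the highest mean activity in each cluster."""
    means = acts.groupby(clusters).mean()
    return {c: row.nlargest(n).index.tolist() for c, row in means.iterrows()}


print(top_variable_tfs(acts, 2))
print(top_active_tfs_per_cluster(acts, clusters, 1))
```

In this toy case the variance-based selection drops TF_A, which is the most active TF in two of the three clusters, even though both selections read from the same activity matrix. Comparing the values of a TF that appears in both plots is therefore a more direct check than comparing which TFs are shown.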
Another source of variability could be that some genes are removed during the conversion. This changes the background used by the enrichment methods, so results could change slightly.
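One way to check whether the conversion dropped genes is to compare the gene names on both sides. The sketch below assumes you have exported the gene names from the Seurat object (for example its row names) and from the converted AnnData object (for example `adata.var_names`) as plain lists; the helper name and the toy gene lists are hypothetical.

```python
def compare_gene_sets(genes_r, genes_py):
    """Report which genes are shared and which exist on only one side."""
    r, py = set(genes_r), set(genes_py)
    return {
        "shared": sorted(r & py),
        "only_in_r": sorted(r - py),
        "only_in_python": sorted(py - r),
    }


# Toy gene lists standing in for the real exports (made up for illustration).
genes_r = ["OLIG2", "SOX2", "PAX6", "GAPDH"]
genes_py = ["OLIG2", "SOX2", "GAPDH"]

report = compare_gene_sets(genes_r, genes_py)
print(report["only_in_r"])       # genes lost during the conversion
print(report["only_in_python"])  # genes that appear only after the conversion
```

If both "only" lists are empty, the two runs saw the same genes and the background should not be the cause of any difference.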

Hope this is helpful! Let me know if you have further questions.


4 participants
@shaln @PauBadiaM and others