Python and R versions of package giving different results #151

shaln · 2024-08-26T15:01:29Z

Hi, I wanted to first and foremost express my appreciation for this fantastic package!

I am trying to apply the TF activity inference functions on my 10X scRNA-seq data. For reference, my dataset has 4 conditions and each condition has 2 timepoints (total 10 samples).

Prior to decoupler, I had processed the scRNA-seq data using Seurat v5 following the SCTransform v2 tutorial, and applied decoupler to Seurat objects obtained after running the PrepSCTFindMarkers function in Seurat.

For my first run of decoupler, I tried running the R version on a Seurat object consisting of only one of the conditions with 2 timepoints. For clarity, let's call this Condition 1, which has D7 and D9. This Seurat object was processed the same way as described above. I obtained the heatmap of the top 3 TFs shown in Figure A below.

Since the Python implementation is more memory efficient and offers functions for running pseudobulk analysis downstream, which is what I want to do for the Seurat object containing all samples and conditions, I decided to try the Python version. I ran it on the exact same Seurat object for Condition 1 that I had used for the R version (converted to .h5ad format using the sceasy package). I was surprised to see the Python and R versions giving different results for the same dataset: as shown in the heatmaps below, the top 3 TFs for each cluster differ between the Python (Figure B) and R (Figure A) versions of decoupler.

Is this a bug or expected? I'm not quite sure what could be causing the discrepancy, apart from something in the Seurat-to-AnnData conversion or differences under the hood between the R and Python versions. I'd appreciate it if you could help clarify this :) Thanks!


PauBadiaM · 2024-09-10T08:39:31Z

Hi @shaln,

Sorry for the delayed reply, I was on holiday. The discrepancies you observe come from how the vignettes present the results; the activity values themselves should be the same (I am assuming that you are comparing sc vs sc, and not sc vs pseudobulk). In the R vignette we plot the top 25 most variable TFs at the activity level, while in the Python version we plot the most active TFs per cluster label. If you look at OLIG2, you can see that it has the same distribution of values in both plots. Personally I prefer the Python one since you get a clearer picture per cluster (in the R one, the variable TFs may be over-represented in a subset of clusters).
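To illustrate how the two selection rules can pick different TFs from identical activity values, here is a minimal sketch in plain Python. It is not the vignette code: the toy scores, the cluster labels, and the helper names (`cluster_means`, `top_variable`, `top_per_cluster`) are all hypothetical, and the plain per-cluster mean stands in for whatever ranking statistic each vignette actually uses.

```python
from statistics import mean, stdev

# Hypothetical toy data: per-cell TF activity scores and their cluster labels.
activities = {
    "OLIG2": [2.0, 2.2, 0.1, 0.0, -1.9, -2.1],
    "SOX2": [3.0, 3.1, 2.9, 3.0, 3.1, 2.9],
    "MYC": [0.5, 0.4, 1.5, 1.6, 0.2, 0.3],
}
clusters = ["c0", "c0", "c1", "c1", "c2", "c2"]


def cluster_means(activities, clusters):
    """Return {tf: {cluster: mean activity of that TF in that cluster}}."""
    out = {}
    for tf, scores in activities.items():
        grouped = {}
        for label, score in zip(clusters, scores):
            grouped.setdefault(label, []).append(score)
        out[tf] = {label: mean(vals) for label, vals in grouped.items()}
    return out


def top_variable(means, n):
    """Rule 1: rank TFs by the spread (std. dev.) of their cluster means."""
    spread = {tf: stdev(list(per_cluster.values())) for tf, per_cluster in means.items()}
    return sorted(spread, key=spread.get, reverse=True)[:n]


def top_per_cluster(means, n):
    """Rule 2: within each cluster, rank TFs by mean activity in that cluster."""
    labels = sorted({label for per_cluster in means.values() for label in per_cluster})
    return {
        label: sorted(means, key=lambda tf: means[tf][label], reverse=True)[:n]
        for label in labels
    }


means = cluster_means(activities, clusters)
print("most variable:", top_variable(means, 1))
print("most active per cluster:", top_per_cluster(means, 1))
```

In this toy case the variability rule selects OLIG2, while the per-cluster rule selects SOX2 in every cluster, even though both rules read the same activity values.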
Another source of variability could be that some genes are removed during the conversion. This affects the background of the enrichment methods, so results could change slightly.
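One way to check whether the conversion dropped genes is to compare the gene names before and after it. The sketch below assumes you have exported both name lists yourself (for example from `rownames()` of the Seurat object and from `adata.var_names` of the converted file). The function name `conversion_report`, the toy gene lists, and the toy network are hypothetical and only illustrate the comparison.

```python
def conversion_report(genes_before, genes_after, network_targets):
    """Summarise which genes disappeared and which TFs lost targets.

    genes_before / genes_after: iterables of gene names (assumed exported by the user).
    network_targets: {tf: [target genes]} (a toy stand-in for a prior-knowledge network).
    """
    before, after = set(genes_before), set(genes_after)
    dropped = before - after
    affected = {}
    for tf, targets in network_targets.items():
        lost = sorted(set(targets) & dropped)
        if lost:
            affected[tf] = lost
    return {
        "n_before": len(before),
        "n_after": len(after),
        "dropped": sorted(dropped),
        "affected_targets": affected,
    }


# Hypothetical toy inputs.
seurat_genes = ["OLIG2", "SOX2", "MYC", "GAPDH", "ACTB"]
h5ad_genes = ["OLIG2", "SOX2", "MYC", "ACTB"]
toy_network = {"MYC": ["GAPDH", "ACTB"], "SOX2": ["OLIG2"]}

report = conversion_report(seurat_genes, h5ad_genes, toy_network)
print(report)
```

If `dropped` is empty, the conversion kept every gene and this explanation can be ruled out; otherwise `affected_targets` lists the TFs whose target sets shrank.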

Hope this is helpful! Let me know if you have further questions.

shaln added the "question" label (User question: anything that's not obviously a bug) · Aug 26, 2024
