scRNA-seq: how to create the Ranked list? #50

asmariyaz23 · 2019-07-24T10:07:03Z

Hello,

I have some clusters with differentially expresses genes generated using Seurat. I also have a highly curated gene list and I wish to perform enrichment analysis using these on each cluster picking the top X genes. I understand the differentially expressed genes in each cluster can be input to your algorithm as a pathway. But I am confused what the ranked list should be? Is it the curated gene list? If yes, could you explain the values to be input into this list?

Thank you,
Asma

ToledoEM · 2019-07-24T10:26:59Z

Asma,

I would recommend you to check the documentation about metric in the Broad implementation of GSEA https://software.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html

For many clusters the comparisons to do the metric have to depend of your dataset and your clusters (cell types, cell states). Also in my opinion negative enrichment in in a cell type with scRNAseq data is kinda of meaningless, because this is most likely to be driven by another cell type in the dataset (beside the logical exercise to explain it).

konsolerr · 2019-07-24T11:50:35Z

Hi @asmariyaz23 ,

In the paper https://www.ahajournals.org/doi/pdf/10.1161/CIRCRESAHA.118.312804 (figure 7) I was checking the consistency of pathway enrichment: bulk RNA-seq vs scRNA-seq. But in my case, differential expression was done one cluster vs other (and not one cluster vs all the other, as it seems in your case) and compared to the same populations in bulk RNA-seq.

I tried several techniques to accomplish this, but the most robust was to just use avg_logFC to rank genes. Other approach would be to use -sign(avg_logFC) * log10(p.adj). We once had a conversation with @assaron about which score to use, but we couldn't decide which one is better. I still use plain avg_logFC.

My code was similar to this:

# whole is a Seurat object, I was comparing cluster 4 vs cluster 1
clustering <- whole@ident
cellsIn <- names(clustering[clustering == 4])
cellsOut <- names(clustering[clustering == 1])
markers <- FindMarkers(whole, cellsIn, cellsOut, logfc.threshold = 0, min.pct = 0, test.use="MAST")

ranks <- markers$avg_logFC
names(ranks) <- rownames(markers)

# allPathways is predefined list of signatures
fgseaRes <- fgsea(pathways = allPathways, 
                  stats = ranks,
                  minSize=10,
                  maxSize=500,
                  nperm=1000000)

Important thing in this snippet is to set thresholds to 0 to not artificially exclude genes from the analysis. Running FindMarkers like this (with 0 thresholds) is the closest you can get to classical DE in bulk RNA-seq (however, genes that are not expressed in both groups will be excluded anyway, and this is good).

In case of one vs all other clusters, I must agree with @ToledoEM, negative enrichment won't tell you much information about this particular cluster in most of the cases, but for positive enrichment you can use snippet above.

Cheers,
Konstantin

asmariyaz23 · 2019-07-24T13:27:57Z

Thank you for your replies. I am a newbie to analysis so excuse my naive questions. I have a curated a disease related gene list based on mutations (purely literature based), and in parallel I ran Seurat on single cell data (of a region related to this disease). Now I want to see if these clusters are enriched for the curated gene list. I have tried the fisher exact test but the odds ratio doesn't look correct, which led me in finding a package which does gene enrichment. My idea was to supply these clusters of DE genes as pathway list to fgsea, but I don't know how I should fit the curated gene list into this equation. Is this something I can do with this package?

assaron · 2020-12-16T11:17:24Z

Reopening, as this question arises regularly

Mr-Byun · 2023-05-22T10:02:30Z

Unless you are using -sign(avg_logFC) * log10(p.adj)
Instead of
markers <- FindMarkers(whole, cellsIn, cellsOut, logfc.threshold = 0, min.pct = 0, test.use="MAST")
Use
markers <- FoldChange(...)
Then sort by fold change.

Close-your-eyes · 2023-08-22T08:50:17Z

Why not use the formula for signal-to-noise across single cell transcriptomes from both groups?

assaron · 2023-08-22T15:20:58Z

@Close-your-eyes I guess you can use that as well if it works for you. Although I would say that logFC can be estimated more directly by the differential expression method.

amkhasawneh · 2023-10-10T07:59:37Z

Hello.

What I'm doing is actually comparing post-treatment samples to pre-treatment. I'm separating positive and negative differentially-expressed genes (DEGs) and running fgsea on each. For the DEGs with positive log2FC, I take the GO terms with positive NES.
If I want to find out the enriched GO terms in the latter group, should I just take the DEGs that have a negative log2FC, and then flip the ranks' sign to positive? Or should I just keep the negative ranks, but look at the negative enrichment after the GO analysis?

Thanks a lot.

tmontserrat · 2024-06-28T06:27:47Z

Hi @asmariyaz23 ,

In the paper https://www.ahajournals.org/doi/pdf/10.1161/CIRCRESAHA.118.312804 (figure 7) I was checking the consistency of pathway enrichment: bulk RNA-seq vs scRNA-seq. But in my case, differential expression was done one cluster vs other (and not one cluster vs all the other, as it seems in your case) and compared to the same populations in bulk RNA-seq.

I tried several techniques to accomplish this, but the most robust was to just use avg_logFC to rank genes. Other approach would be to use -sign(avg_logFC) * log10(p.adj). We once had a conversation with @assaron about which score to use, but we couldn't decide which one is better. I still use plain avg_logFC.

My code was similar to this:
# whole is a Seurat object, I was comparing cluster 4 vs cluster 1
clustering <- whole@ident
cellsIn <- names(clustering[clustering == 4])
cellsOut <- names(clustering[clustering == 1])
markers <- FindMarkers(whole, cellsIn, cellsOut, logfc.threshold = 0, min.pct = 0, test.use="MAST")

ranks <- markers$avg_logFC
names(ranks) <- rownames(markers)

# allPathways is predefined list of signatures
fgseaRes <- fgsea(pathways = allPathways, 
                  stats = ranks,
                  minSize=10,
                  maxSize=500,
                  nperm=1000000)
Important thing in this snippet is to set thresholds to 0 to not artificially exclude genes from the analysis. Running FindMarkers like this (with 0 thresholds) is the closest you can get to classical DE in bulk RNA-seq (however, genes that are not expressed in both groups will be excluded anyway, and this is good).

In case of one vs all other clusters, I must agree with @ToledoEM, negative enrichment won't tell you much information about this particular cluster in most of the cases, but for positive enrichment you can use snippet above.

Cheers, Konstantin

Dear all,

By doing markers <- FindMarkers(whole, cellsIn, cellsOut, logfc.threshold = 0, min.pct = 0, test.use="MAST") you will leave all genes not expressed in any cell of both clusters, and I think this is not what we want.

Don't you think it would be better to put some min.pct, for example the 0.01 (the default in Seurat5) or 0.1 to remove from the analysis genes not expressed in any of the two groups of cells? As far as I know, we should remove these genes before using MAST.

Moreover, if we use markers <- FindMarkers(whole, cellsIn, cellsOut, logfc.threshold = 0, min.pct = 0, test.use="MAST") all genes with 0 counts in both groups will tie in the ranking. And one more thing, I've detected some strange behaviour in Seurat5 for the log2FC of non-expressed genes:

> fchanges <- FoldChange(seurat_t, ident.1 = "cluster1", ident.2 = "cluster2")
> head(fchanges)
        avg_log2FC pct.1 pct.2
Xkr4    -2.6394103 0.000 0.000
Gm37381 -2.6394103 0.000 0.000
Mrpl15   1.0763732 0.235 0.077
Lypla1   0.7276294 0.185 0.077
Tcea1   -0.1477601 0.383 0.308
Rgs20   -2.6394103 0.000 0.000

As you can see, for Xkr4, for example, it seems to me that the avg_log2FC is wrong, since there is no expression in any of the groups:

> sum(seurat_t_subset[["RNA"]]$data["Xkr4", ])
[1] 0

For the fgsea, if we use the FindMarkers strategy and then the -sign(avg_logFC) * log10(p.adj) method to rank the genes, then I suppose the effect will be minimum since log10(1) is zero, and the whole value will turn to zero:

> markers <- FindMarkers(
        seurat_t, ident.1 = "cluster1", ident.2 = "cluster2",
        test.use = "MAST",
        logfc.threshold = 0, min.pct = 0
    )
> markers["Xkr4", ]
   p_val avg_log2FC pct.1 pct.2 p_val_adj
Xkr4    1    -2.6394103            0    0        1

Sirin24 · 2024-09-07T11:26:59Z

any updates please on how to rank genes from Seurat's findmarkers() on disease compared to control? Thank you.

ToledoEM · 2024-09-07T18:23:45Z

https://satijalab.org/seurat/reference/findmarkers

Sirin24 · 2024-09-07T18:26:42Z

@ToledoEM thank you. I meant to ask about how to rank, if the log2fc is sufficient to rank them without taking into consideration or pvalue or adj_pvalue as seurat has changed the way findmarkers() compute DEGS and now it will return more genes than before.

assaron · 2024-09-09T20:32:24Z

@Sirin24 I think @tmontserrat suggestion to add filter for min.pct is a good idea. I've seen other reports about Seurat returning non-zero avgLogFC for not expressed genes, which definitely messes up with GSEA analysis. Using -sign(avg_logFC) * log10(p.adj) also seems reasonable, probably leading to a bit different accents on what pathways are ranked higher, but that should be fine.

Sirin24 · 2024-09-09T21:33:20Z

Thanks @assaron, I gave " -sign(avg_logFC) * log10(p.adj)" a try and it led to Infinite values which would be problematic I think. May I should try to modify pct... by pct you mean pct.1 only?


          X         p_val avg_log2FC pct.1 pct.2     p_val_adj      ranking_metric
1   ENSG00000229807  0.000000e+00  0.7280135 0.638 0.295  0.000000e+00            Inf
2   ENSG00000067057  0.000000e+00  1.1452056 0.713 0.406  0.000000e+00            Inf
3   ENSG00000172987  0.000000e+00  1.4056596 0.800 0.507  0.000000e+00            Inf

assaron · 2024-09-09T22:13:20Z

@Sirin24

May I should try to modify pct... by pct you mean pct.1 only?

No, you should be able to use min.pct=0 in FindMarkers, which will keep only the genes that are expressed in either of the clusters.

I gave " -sign(avg_logFC) * log10(p.adj)" a try and it led to Infinite values which would be problematic I think

Yes, it's not great, as the ranking can't differentiate these genes. Technically though, it should be OK, fgsea should replace the infinite values with a finite ones, producing a warning.

Sirin24 · 2024-10-02T00:27:42Z

@assaron unfortunately I do not think fgsea can handle infinite values.

msigdbr_7.5.1 fgsea_1.30.0

fgseaRes <- fgsea(pathways = pathways, 
+                            stats    = gene_list,
+                            nperm    =10000,
+                            minSize  = 15,
+                            maxSize  = 500)
Error in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  : 
  Not all stats values are finite numbers
In addition: Warning message:
In fgsea(pathways = pathways, stats = gene_list, nperm = 10000,  :
  You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument in the fgsea function call.

Flu09 · 2024-10-02T18:17:53Z

I added min and max values for those Infinite values and only 1 pathway passed with adj p value 0.05 (I ranked my gene list by -sign(avg_logFC) * log10(p.adj) ) I am not sure if this approach is working after findmarkers() with MAST.

assaron closed this as completed Sep 27, 2019

assaron mentioned this issue Apr 10, 2020

Ranking the genes from Seurat DE output file #66

Closed

assaron changed the title ~~How to create the Ranked list?~~ scRNA-seq: how to create the Ranked list? Dec 16, 2020

assaron reopened this Dec 16, 2020

assaron added the question label Dec 16, 2020

assaron mentioned this issue Dec 16, 2020

Which testing method is the best? #85

Closed

assaron mentioned this issue Oct 15, 2021

fgsea gives many highly sign. gene sets for comparison where I expect a few #107

Closed

atakanekiz mentioned this issue Feb 11, 2023

Question: Understanding discrepant results when using different ranking approaches and algorithmic parameters #128

Open

dekan-aleksandr mentioned this issue Mar 18, 2023

Which genes should I use? #131

Closed

assaron mentioned this issue Jun 1, 2023

Insignificant p-values #135

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scRNA-seq: how to create the Ranked list? #50

scRNA-seq: how to create the Ranked list? #50

asmariyaz23 commented Jul 24, 2019

ToledoEM commented Jul 24, 2019

konsolerr commented Jul 24, 2019 •

edited

Loading

asmariyaz23 commented Jul 24, 2019

assaron commented Dec 16, 2020

Mr-Byun commented May 22, 2023

Close-your-eyes commented Aug 22, 2023

assaron commented Aug 22, 2023

amkhasawneh commented Oct 10, 2023

tmontserrat commented Jun 28, 2024 •

edited

Loading

Sirin24 commented Sep 7, 2024

ToledoEM commented Sep 7, 2024

Sirin24 commented Sep 7, 2024

assaron commented Sep 9, 2024

Sirin24 commented Sep 9, 2024 •

edited

Loading

assaron commented Sep 9, 2024

Sirin24 commented Oct 2, 2024 •

edited

Loading

Flu09 commented Oct 2, 2024 •

edited

Loading

scRNA-seq: how to create the Ranked list? #50

scRNA-seq: how to create the Ranked list? #50

Comments

asmariyaz23 commented Jul 24, 2019

ToledoEM commented Jul 24, 2019

konsolerr commented Jul 24, 2019 • edited Loading

asmariyaz23 commented Jul 24, 2019

assaron commented Dec 16, 2020

Mr-Byun commented May 22, 2023

Close-your-eyes commented Aug 22, 2023

assaron commented Aug 22, 2023

amkhasawneh commented Oct 10, 2023

tmontserrat commented Jun 28, 2024 • edited Loading

Sirin24 commented Sep 7, 2024

ToledoEM commented Sep 7, 2024

Sirin24 commented Sep 7, 2024

assaron commented Sep 9, 2024

Sirin24 commented Sep 9, 2024 • edited Loading

assaron commented Sep 9, 2024

Sirin24 commented Oct 2, 2024 • edited Loading

Flu09 commented Oct 2, 2024 • edited Loading

konsolerr commented Jul 24, 2019 •

edited

Loading

tmontserrat commented Jun 28, 2024 •

edited

Loading

Sirin24 commented Sep 9, 2024 •

edited

Loading

Sirin24 commented Oct 2, 2024 •

edited

Loading

Flu09 commented Oct 2, 2024 •

edited

Loading