Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scRNA-seq: how to create the Ranked list? #50

Open
asmariyaz23 opened this issue Jul 24, 2019 · 17 comments
Open

scRNA-seq: how to create the Ranked list? #50

asmariyaz23 opened this issue Jul 24, 2019 · 17 comments
Labels

Comments

@asmariyaz23
Copy link

Hello,

I have some clusters with differentially expresses genes generated using Seurat. I also have a highly curated gene list and I wish to perform enrichment analysis using these on each cluster picking the top X genes. I understand the differentially expressed genes in each cluster can be input to your algorithm as a pathway. But I am confused what the ranked list should be? Is it the curated gene list? If yes, could you explain the values to be input into this list?

Thank you,
Asma

@ToledoEM
Copy link
Contributor

Asma,

I would recommend you to check the documentation about metric in the Broad implementation of GSEA https://software.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html

For many clusters the comparisons to do the metric have to depend of your dataset and your clusters (cell types, cell states). Also in my opinion negative enrichment in in a cell type with scRNAseq data is kinda of meaningless, because this is most likely to be driven by another cell type in the dataset (beside the logical exercise to explain it).

@konsolerr
Copy link

konsolerr commented Jul 24, 2019

Hi @asmariyaz23 ,

In the paper https://www.ahajournals.org/doi/pdf/10.1161/CIRCRESAHA.118.312804 (figure 7) I was checking the consistency of pathway enrichment: bulk RNA-seq vs scRNA-seq. But in my case, differential expression was done one cluster vs other (and not one cluster vs all the other, as it seems in your case) and compared to the same populations in bulk RNA-seq.

image

I tried several techniques to accomplish this, but the most robust was to just use avg_logFC to rank genes. Other approach would be to use -sign(avg_logFC) * log10(p.adj). We once had a conversation with @assaron about which score to use, but we couldn't decide which one is better. I still use plain avg_logFC.

My code was similar to this:

# whole is a Seurat object, I was comparing cluster 4 vs cluster 1
clustering <- whole@ident
cellsIn <- names(clustering[clustering == 4])
cellsOut <- names(clustering[clustering == 1])
markers <- FindMarkers(whole, cellsIn, cellsOut, logfc.threshold = 0, min.pct = 0, test.use="MAST")

ranks <- markers$avg_logFC
names(ranks) <- rownames(markers)

# allPathways is predefined list of signatures
fgseaRes <- fgsea(pathways = allPathways, 
                  stats = ranks,
                  minSize=10,
                  maxSize=500,
                  nperm=1000000)

Important thing in this snippet is to set thresholds to 0 to not artificially exclude genes from the analysis. Running FindMarkers like this (with 0 thresholds) is the closest you can get to classical DE in bulk RNA-seq (however, genes that are not expressed in both groups will be excluded anyway, and this is good).

In case of one vs all other clusters, I must agree with @ToledoEM, negative enrichment won't tell you much information about this particular cluster in most of the cases, but for positive enrichment you can use snippet above.

Cheers,
Konstantin

@asmariyaz23
Copy link
Author

Thank you for your replies. I am a newbie to analysis so excuse my naive questions. I have a curated a disease related gene list based on mutations (purely literature based), and in parallel I ran Seurat on single cell data (of a region related to this disease). Now I want to see if these clusters are enriched for the curated gene list. I have tried the fisher exact test but the odds ratio doesn't look correct, which led me in finding a package which does gene enrichment. My idea was to supply these clusters of DE genes as pathway list to fgsea, but I don't know how I should fit the curated gene list into this equation. Is this something I can do with this package?

@assaron assaron closed this as completed Sep 27, 2019
@assaron assaron changed the title How to create the Ranked list? scRNA-seq: how to create the Ranked list? Dec 16, 2020
@assaron
Copy link
Member

assaron commented Dec 16, 2020

Reopening, as this question arises regularly

@Mr-Byun
Copy link

Mr-Byun commented May 22, 2023

Unless you are using -sign(avg_logFC) * log10(p.adj)
Instead of
markers <- FindMarkers(whole, cellsIn, cellsOut, logfc.threshold = 0, min.pct = 0, test.use="MAST")
Use
markers <- FoldChange(...)
Then sort by fold change.

@Close-your-eyes
Copy link

Why not use the formula for signal-to-noise across single cell transcriptomes from both groups?

@assaron
Copy link
Member

assaron commented Aug 22, 2023

@Close-your-eyes I guess you can use that as well if it works for you. Although I would say that logFC can be estimated more directly by the differential expression method.

@amkhasawneh
Copy link

Hello.

What I'm doing is actually comparing post-treatment samples to pre-treatment. I'm separating positive and negative differentially-expressed genes (DEGs) and running fgsea on each. For the DEGs with positive log2FC, I take the GO terms with positive NES.
If I want to find out the enriched GO terms in the latter group, should I just take the DEGs that have a negative log2FC, and then flip the ranks' sign to positive? Or should I just keep the negative ranks, but look at the negative enrichment after the GO analysis?

Thanks a lot.

@tmontserrat
Copy link

tmontserrat commented Jun 28, 2024

Hi @asmariyaz23 ,

In the paper https://www.ahajournals.org/doi/pdf/10.1161/CIRCRESAHA.118.312804 (figure 7) I was checking the consistency of pathway enrichment: bulk RNA-seq vs scRNA-seq. But in my case, differential expression was done one cluster vs other (and not one cluster vs all the other, as it seems in your case) and compared to the same populations in bulk RNA-seq.

image

I tried several techniques to accomplish this, but the most robust was to just use avg_logFC to rank genes. Other approach would be to use -sign(avg_logFC) * log10(p.adj). We once had a conversation with @assaron about which score to use, but we couldn't decide which one is better. I still use plain avg_logFC.

My code was similar to this:

# whole is a Seurat object, I was comparing cluster 4 vs cluster 1
clustering <- whole@ident
cellsIn <- names(clustering[clustering == 4])
cellsOut <- names(clustering[clustering == 1])
markers <- FindMarkers(whole, cellsIn, cellsOut, logfc.threshold = 0, min.pct = 0, test.use="MAST")

ranks <- markers$avg_logFC
names(ranks) <- rownames(markers)

# allPathways is predefined list of signatures
fgseaRes <- fgsea(pathways = allPathways, 
                  stats = ranks,
                  minSize=10,
                  maxSize=500,
                  nperm=1000000)

Important thing in this snippet is to set thresholds to 0 to not artificially exclude genes from the analysis. Running FindMarkers like this (with 0 thresholds) is the closest you can get to classical DE in bulk RNA-seq (however, genes that are not expressed in both groups will be excluded anyway, and this is good).

In case of one vs all other clusters, I must agree with @ToledoEM, negative enrichment won't tell you much information about this particular cluster in most of the cases, but for positive enrichment you can use snippet above.

Cheers, Konstantin

Dear all,

By doing markers <- FindMarkers(whole, cellsIn, cellsOut, logfc.threshold = 0, min.pct = 0, test.use="MAST") you will leave all genes not expressed in any cell of both clusters, and I think this is not what we want.

Don't you think it would be better to put some min.pct, for example the 0.01 (the default in Seurat5) or 0.1 to remove from the analysis genes not expressed in any of the two groups of cells? As far as I know, we should remove these genes before using MAST.

Moreover, if we use markers <- FindMarkers(whole, cellsIn, cellsOut, logfc.threshold = 0, min.pct = 0, test.use="MAST") all genes with 0 counts in both groups will tie in the ranking. And one more thing, I've detected some strange behaviour in Seurat5 for the log2FC of non-expressed genes:

> fchanges <- FoldChange(seurat_t, ident.1 = "cluster1", ident.2 = "cluster2")
> head(fchanges)
        avg_log2FC pct.1 pct.2
Xkr4    -2.6394103 0.000 0.000
Gm37381 -2.6394103 0.000 0.000
Mrpl15   1.0763732 0.235 0.077
Lypla1   0.7276294 0.185 0.077
Tcea1   -0.1477601 0.383 0.308
Rgs20   -2.6394103 0.000 0.000

As you can see, for Xkr4, for example, it seems to me that the avg_log2FC is wrong, since there is no expression in any of the groups:

> sum(seurat_t_subset[["RNA"]]$data["Xkr4", ])
[1] 0

For the fgsea, if we use the FindMarkers strategy and then the -sign(avg_logFC) * log10(p.adj) method to rank the genes, then I suppose the effect will be minimum since log10(1) is zero, and the whole value will turn to zero:

> markers <- FindMarkers(
        seurat_t, ident.1 = "cluster1", ident.2 = "cluster2",
        test.use = "MAST",
        logfc.threshold = 0, min.pct = 0
    )
> markers["Xkr4", ]
   p_val avg_log2FC pct.1 pct.2 p_val_adj
Xkr4    1    -2.6394103            0    0        1

@Sirin24
Copy link

Sirin24 commented Sep 7, 2024

any updates please on how to rank genes from Seurat's findmarkers() on disease compared to control? Thank you.

@ToledoEM
Copy link
Contributor

ToledoEM commented Sep 7, 2024

@Sirin24
Copy link

Sirin24 commented Sep 7, 2024

@ToledoEM thank you. I meant to ask about how to rank, if the log2fc is sufficient to rank them without taking into consideration or pvalue or adj_pvalue as seurat has changed the way findmarkers() compute DEGS and now it will return more genes than before.

@assaron
Copy link
Member

assaron commented Sep 9, 2024

@Sirin24 I think @tmontserrat suggestion to add filter for min.pct is a good idea. I've seen other reports about Seurat returning non-zero avgLogFC for not expressed genes, which definitely messes up with GSEA analysis. Using -sign(avg_logFC) * log10(p.adj) also seems reasonable, probably leading to a bit different accents on what pathways are ranked higher, but that should be fine.

@Sirin24
Copy link

Sirin24 commented Sep 9, 2024

Thanks @assaron, I gave " -sign(avg_logFC) * log10(p.adj)" a try and it led to Infinite values which would be problematic I think. May I should try to modify pct... by pct you mean pct.1 only?


          X         p_val avg_log2FC pct.1 pct.2     p_val_adj      ranking_metric
1   ENSG00000229807  0.000000e+00  0.7280135 0.638 0.295  0.000000e+00            Inf
2   ENSG00000067057  0.000000e+00  1.1452056 0.713 0.406  0.000000e+00            Inf
3   ENSG00000172987  0.000000e+00  1.4056596 0.800 0.507  0.000000e+00            Inf

@assaron
Copy link
Member

assaron commented Sep 9, 2024

@Sirin24

May I should try to modify pct... by pct you mean pct.1 only?

No, you should be able to use min.pct=0 in FindMarkers, which will keep only the genes that are expressed in either of the clusters.

I gave " -sign(avg_logFC) * log10(p.adj)" a try and it led to Infinite values which would be problematic I think

Yes, it's not great, as the ranking can't differentiate these genes. Technically though, it should be OK, fgsea should replace the infinite values with a finite ones, producing a warning.

@Sirin24
Copy link

Sirin24 commented Oct 2, 2024

@assaron unfortunately I do not think fgsea can handle infinite values.

msigdbr_7.5.1 fgsea_1.30.0

fgseaRes <- fgsea(pathways = pathways, 
+                            stats    = gene_list,
+                            nperm    =10000,
+                            minSize  = 15,
+                            maxSize  = 500)
Error in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  : 
  Not all stats values are finite numbers
In addition: Warning message:
In fgsea(pathways = pathways, stats = gene_list, nperm = 10000,  :
  You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument in the fgsea function call.

@Flu09
Copy link

Flu09 commented Oct 2, 2024

I added min and max values for those Infinite values and only 1 pathway passed with adj p value 0.05 (I ranked my gene list by -sign(avg_logFC) * log10(p.adj) ) I am not sure if this approach is working after findmarkers() with MAST.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

10 participants