Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cell-type Annotation with STdeconvolve outputs #36

Closed
inhyeoklee opened this issue Apr 14, 2023 · 14 comments
Closed

Cell-type Annotation with STdeconvolve outputs #36

inhyeoklee opened this issue Apr 14, 2023 · 14 comments

Comments

@inhyeoklee
Copy link

Hi,

Thank you very much for this amazing tool!
After running STdeconvolve on human femoral & carotid plaque Visium data, I'm thinking of doing cell-type annotation with Azimuth. Do you have any suggestions on how to implement Azimuth on STdeconvolve outputs?

Thanks,
Daniel Lee

@bmill3r
Copy link
Collaborator

bmill3r commented Apr 14, 2023

Hi @inhyeoklee,

Thanks for using STdeconvolve! I haven't used Azimuth enough to give that much advice on how to implement it, but I believe it's primary purpose is reference-based annotation of single cell RNA-seq datasets. If you're trying to use it to annotate the deconvolved cell types then theoretically you could use the beta matrix of the deconvolved cell type transcriptional profiles, but these are the profiles of the cell types and not individual cells. My concern is that using this as input and treating the cell types as individual "cells" as input may not be a sufficient dataset for annotation. Alternatively, you could apply Azimuth to the individual spots but this wouldn't be appropriate given that the spots are assumed to be multi-cellular. We do provide tutorials to give some examples of how users might go about annotating the deconvolved cell types: Annotating deconvolved cell-types.

Hope this helps a little,
Brendan

@inhyeoklee
Copy link
Author

Thank you Brendan! That actually really helped my understanding of the STdeconvolve algorithm. Would that be why we use results$beta instead of results$theta (which shows expression levels) because the reference-free deconvolution inherently comes with the hypothesis that a certain cell type would have distinct, and yet uniform gene expressions (which are not necessarily true in single-cell rna-seq data)?

Also, if I'm not exactly sure which cell types I have (unlike the 5 predefined tissue layer labels in the tutorial), can I instead use the entire human cell type - top canonical marker genes reference table available on PanglaoDB?

@bmill3r
Copy link
Collaborator

bmill3r commented Apr 15, 2023

Hi @inhyeoklee,

STdeconvolve outputs two matrices. The first is the beta matrix, which is a matrix of the deconvolved cell type transcriptional profiles. Basically it is a cell type x gene matrix that contains the probability of a deconvolved cell type expressing a given gene. The second is the theta matrix, which is the proportion of each cell type in each of the spots or capture locations.

STdeconvolve is built on latent Dirichlet allocation, which is looking for clusters of genes that are co-expressed together across the capture locations, so it works best when cell types have distinct transcriptional profiles, as in certain genes are specifically expressed together in different cell types. If cell types have very similar transcriptional profiles, then it will be difficult to identify clusters of genes that are co-expressed specifically together and accurately deconvolve the contributions of the different cell types.

In terms of identifying the deconvolved cell types, you can certainly explore their deconvolved transcriptional profiles and see if the top expressed genes are marker genes for known cell types annotated in databases such as PanglaoDB. A more statistical approach, which we tried to demonstrate, is looking for enrichment of cell type marker genes using gene set enrichment analysis. But this is just one suggestion - ultimately it will be up to users to decide how they would like to annotate the deconvolved cell types.

Hope this helps!
Brendan

@inhyeoklee
Copy link
Author

Sorry about the confusion on beta and theta matrices. And thank you Brendan - it makes sense to use gene set enrichment analysis to label cell types rather than identifying them by comparing what top marker genes both the ground truth table and my deconvolved cell types have. For gene set enrichment analysis, would it be any beneficial to use the ground truth table of all the human cell types instead of just a couple of cell types expected?

Best,
Daniel

@inhyeoklee inhyeoklee changed the title Azimuth Cell-type Annotation with STdeconvolve outputs Cell-type Annotation with STdeconvolve outputs Apr 17, 2023
@bmill3r
Copy link
Collaborator

bmill3r commented Apr 17, 2023

Hi @inhyeoklee

Because the GSEA function in STdeconvolve assesses enrichment of each cell type gene set iteratively I don't see any reason why you couldn't use all human cell types. However, from an interpretation standpoint, some of the cell type hits may be cell types you would not expect to be in the tissue you are analyzing. So some pre-filtering of the cell type gene sets down to those you would expect to be in your tissue sample is probably a good idea.

Hope this helps,
Brendan

@inhyeoklee
Copy link
Author

inhyeoklee commented Apr 17, 2023

Thank you for clarifying the GSEA inputs! I've managed to run the GSEA, but I'm a little confused about its results (shared below). Could you please give any pointers on how I might obtain better classification results?

Here's the prediction results I've got:
Screenshot 2023-04-17 at 4 01 59 PM

And, here's the steps I took:

  1. I renamed the Ensembl IDs in the results$beta into gene symbols.

Create a mapping between Ensembl IDs and gene symbols using the biomaRt package

ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
ensembl_gene_mapping <- getBM(attributes = c("ensembl_gene_id", "external_gene_name"), mart = ensembl)

Write a function to convert Ensembl IDs to gene symbols using the mapping

convert_ensembl_to_symbol <- function(ensembl_id, mapping) {
gene_symbol <- mapping$external_gene_name[match(ensembl_id, mapping$ensembl_gene_id)]
return(gene_symbol)
}

Convert the Ensembl IDs in the colnames(results$beta) into gene symbols

colnames(results$beta) <- sapply(colnames(results$beta), convert_ensembl_to_symbol, mapping = ensembl_gene_mapping)

  1. Here's the reference gene set list I prepared:
    gset <- list(
    Adipocytes = c("METRNL", "MYO1C", "METRN", "MMP2", "NQO1", "COL6A2", "SYAP1", "CRNDE", "CAVIN1", "MT2A"),
    B_cells = c("CD79A", "CD74", "HLA-DRA", "CD79B", "CD52", "IGHM", "IGKC", "CXCR4", "CD69", "BCL11A"),
    B_cells_memory = c("CD80", "CCR6", "CLCA3P", "CXCR5", "FCRL2", "GUSBP11", "IFNA10", "IGLL3P", "NMBR", "PTPRCAP"),
    B_cells_naive = c("MS4A1", "CD79B", "IGHM", "IGKC", "LTB", "CD37", "CD52", "CD74", "LINC00926", "TNFRSF13C"),
    Basal_cells = c("HEBP2", "KRT17", "PLP2", "ACADVL", "S100A14", "ITGA2", "ANXA11", "KRT5", "KRT15", "LAMB3"),
    Basophils = c("IL3RA", "CCR3", "ENPP3", "ITGB7", "IL4", "HTR1B", "GRM6", "CEBPA", "E2F8", "MBOAT1"),
    Cardiac_stem_and_precursor_cells = c("MYH6", "FUT4", "TBX4", "TBX18", "NKX2-5", "KIT", "ADGRL2", "GATA4", "ABCG2", "KITLG"),
    Cardiomyocytes = c("NPPA", "MYL2", "RBM20", "RBM24", "MYH7B", "TCAP", "RYR2", "MYL3", "MYH6", "JPH2"),
    Dendritic_cells = c("HLA-DPA1", "HLA-DPB1", "HLA-DRA", "HLA-B", "HLA-A", "FTL", "HLA-DRB1", "HLA-C", "CTSS", "AIF1"),
    Endothelial_cells = c("GNG11", "CLDN5", "S100A10", "TM4SF1", "ID3", "EGFL7", "IGFBP7", "SPARC", "ECSCR", "ENG"),
    Eosinophils = c("FUT4", "CCR3", "CD244", "PTGDR2", "SIGLEC8", "PGLYRP1", "EPX", "PRG3", "IL5RA", "RNASE3"),
    Epithelial_cells = c("TSPAN1", "TSPAN8", "ELF3", "PHGR1", "PIGR", "EPCAM", "CEACAM5", "LGALS4", "AGR2", "GUCA2A"),
    Fibroblasts = c("DCN", "VIM", "LUM", "SPARC", "C1S", "COL6A2", "S100A4", "COL1A2", "IGFBP6", "MMP2"),
    Gamma_delta_T_cells = c("GNLY", "NKG7", "CCL5", "CST7", "CTSW", "GZMB", "GZMA", "CD7", "KLRD1", "HOPX"),
    Macrophages = c("TYROBP", "CYBB", "AIF1", "S100A4", "MS4A7", "MAF", "TREM2", "SLC7A7", "GPNMB", "CLEC7A"),
    Mast_cells = c("TPSAB1", "TPSB2", "CPA3", "HPGDS", "RGS13", "RGS1", "MS4A2", "VWA5A", "KIT", "HDC"),
    Megakaryocytes = c("MPL", "GP5", "CXCR1", "CXCR2", "TLR9", "ALOX12", "IL21R", "GP1BA", "ITGB3", "TREML1"),
    Mesothelial_cells = c("RSPO1", "MSLN", "PKHD1L1", "MUC16", "CALB2", "MEDAG", "SULF1", "COBLL1", "WT1", "MIR22HG"),
    Monocytes = c("IFIT1", "OAS1", "IFIT3", "MX1", "PLSCR1", "TNFSF10", "PTPRC", "PSAP", "S100A9", "HLA-DRA"),
    Myeloid_derived_suppressor_cells = c("FUT4", "CEACAM8", "CCR2", "CD80", "CXCR1", "ADGRE1", "CD1D", "ITGAM", "ICAM1", "IL4R"),
    Myofibroblasts = c("TAGLN", "GFAP", "DES", "TNS1", "ACTA2", "CDH11", "PALLD", "MYL9", "CALD1"),
    NK_cells = c("NKG7", "GZMA", "CCL5", "CST7", "GNLY", "S100A4", "IL32", "CTSW", "KLRD1", "IL2RG"),
    Natural_killer_T_cells = c("IL12RB2", "IL12RB1", "TNFRSF8", "TBX21", "STYK1", "NR4A1", "IL17RA", "SLAMF7", "S1PR1", "ZBTB16"),
    Neutrophils = c("ELANE", "PRTN3", "AZU1", "CTSG", "MPO", "SERPINB1", "S100A4", "LYZ", "S100A8", "S100A9"),
    Nuocytes = c("CRLF2", "IL17RB", "ARG1", "IL5", "ICOS", "BCL11B", "RORA", "IL1RL1", "IL13", "GATA3"),
    Plasma_cells = c("IGKC", "SSR4", "MZB1", "CD74", "IGHA1", "IGLC2", "JCHAIN", "TNFRSF17", "CYCS", "CD79A"),
    Plasmacytoid_dendritic_cells = c("LILRA4", "IRF8", "IRF7", "JCHAIN", "GZMB", "SERPINF1", "ITM2C", "MZB1", "BCL11A", "TCF4"),
    Platelets = c("PF4", "PPBP", "CAVIN2", "CCL5", "NCOA4", "GNG11", "SRGN", "ARHGDIB", "GPX1", "HLA-A"),
    Red_pulp_macrophages = c("TREML4", "CCR3", "SPIC", "SIGLEC1", "ADGRE1", "MERTK", "SLC11A1", "VCAM1", "NR1H3", "CD86"),
    Smooth_muscle_cells = c("MYH11", "MYLK", "MYL9", "ACTA2", "SOD3", "LMOD1", "BGN", "PLN", "NOTCH3", "CNN1"),
    Stromal_cells = c("ICAM3", "B4GALNT1", "TLR1", "SNED1", "TLR4", "GDF10", "TLR2", "MADCAM1", "KIT", "FAP"),
    T_cells = c("IL32", "TRAC", "CD3D", "TRBC2", "CD52", "S100A4", "LTB", "CD3E", "CD2", "CXCR4"),
    T_cytotoxic_cells = c("GZMB", "CD2", "TRAC", "CD5", "CD69", "CD28", "CD8A", "CD27"),
    T_follicular_helper_cells = c("CXCR5", "TNFSF4", "SLAMF1", "BCL6", "PDCD1", "IL6R", "ICOS", "P2RX7", "CD84", "CD40LG"),
    T_helper_cells = c("IL2", "TBX21", "CCR5", "IL4", "IL10", "CCR8", "CCR3", "PTGDR2", "HAVCR1", "IL17A"),
    T_memory_cells = c("CD3D", "TRBC2", "LTB", "PTPRC", "LCK", "TRAC", "CD3G", "CD3E", "CD7", "CD2"),
    T_regulatory_cells = c("FOXP3", "IZUMO1R", "CNGB1", "IL10", "IKZF2", "CCR4", "ENTPD1", "NT5E", "CTLA4", "FOLR1"),
    Vascular_smooth_muscle_cells = c("TBX18", "SEMA3D", "MYH11", "ACTA2", "WT1", "PDGFRB")
    )

  2. Then, I ran the GSEA using these codes:
    celltype_annotations <- annotateCellTypesGSEA(beta = results$beta, gset = gset, qval = 0.05)
    celltype_annotations$predictions

By the way, thank you very much for providing guidance on this. I just started my bioinformatics career last month, and it really means a lot to me that you make time and effort to help out researchers using your tool.

@bmill3r
Copy link
Collaborator

bmill3r commented Apr 19, 2023

Hi @inhyeoklee

No problem - happy to help!

To answer your questions about the GSEA results, I'll direct you to a similar question/response in another issue.

Let me know if you still have any questions,
Brendan

@inhyeoklee
Copy link
Author

@bmill3r Thanks to your help, I have beautifully deconvolved (and annotated!!) clusters now. I'm nearing to the closure of this thread, but if you don't mind, could you please give any advice/suggestion on the next steps I'm taking. For the downstream analysis (e.g. DEG) for these outputs, what packages or any workflow would you recommend? I normally use Scanpy, Scanorama on Python, so I'm relatively less familiar with Seurat and other downstream analysis on R.

@bmill3r
Copy link
Collaborator

bmill3r commented Apr 20, 2023

Hi @inhyeoklee

Congrats on the annotated cell types and glad everything worked out for you! In terms of other analyses it depends on what types of biological questions you are interested in answering. If you're interested in the differentially expressed genes in each of the deconvolved cell types, you could follow a similar analyses we did in this Analysis of 10X Visium data where we get the differentially expressed genes for each deconvolved cell-type transcriptional profile.

Hope this helps,
Brendan

@inhyeoklee
Copy link
Author

inhyeoklee commented Apr 21, 2023

Thank you! For the DEG analysis, I want to generate violin plots of marker genes for clusters (cell types) across different samples, but I was wondering 1) how I should go about in terms of their integration. For example, if you have three classical monocyte clusters in one sample, and four classical monocyte clusters in another sample, should I manually compare their transcriptional profiles or do you have any suggestions?

  1. Also, are gene expression values inside results$beta raw or normalized?

  2. Are the enrichment scores provided by the annotateCellTypesGSEA function normalized for the gene set size?

Thank you,
Daniel

@bmill3r
Copy link
Collaborator

bmill3r commented Apr 22, 2023

Hi @inhyeoklee

If I understand your first question correctly, I think you're wondering if you can combine the classical monocyte clusters in a given sample together into one single monocyte cluster. You certainly can and one strategy to do this could be summing the transcriptional profiles and normalizing to 1 then scale to 1000 for better interpretability, for example.

Recall that the transcriptional profiles are probabilities of a cell type expressing those genes so in that sense they are already normalized to sum to 1 (then scaled by some factor if desired).

For the GSEA performed in annotateCellTypesGSEA(), I would encourage you to check out the paper Subramanian, Tamayo, et al. to understand how the enrichment score is calculated.

Hope this helps,
Brendan

@inhyeoklee
Copy link
Author

Thank you very much Brendan! :)

@Umaarasu
Copy link

@inhyeoklee @bmill3r I have been trying to annotate using annotateCellTypesGSEA(). The issue I am having is that the top genes from topGenes(beta)...they do not actually give different cluster names as I expect. For example I have 8 topics and I see that there are genes that could define them as 8 different cell types..So i make a gset and run the command annotateCellTypesGSEA(). But I get a result like this.

celltype_annotations$predictions
1 2 3 4 5 6
"Macrophage" "Monocyte" "Progenitor_cell" "Fibroblast" "Fibroblast" "Progenitor_cell"
7 8
"Fibroblast" "B_cell"

As you can see some cell types are repeated. Could someone help me solve this. Thanks!

@bmill3r
Copy link
Collaborator

bmill3r commented Jun 30, 2024

Hi @Umaarasu,

During GSEA, it's possible that different topics could be assigned to the same cell type. Additionally, each topic could be significantly enriched with gene sets of different cell types but the returned prediction is the cell type gene set that ranked the highest based on a given set of criteria (you can see how this is done by looking at the annotateCellTypesGSEA() function itself). You can explore the GSEA statistics that are returned and choose to assign a cell type gene set annotation differently, which might yield a different final cell type annotation.

Hope this helps,
Brendan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants