This repository contains the data curation, processing, and analysis scripts used by the article "SOAR elucidates disease mechanisms and empowers drug discovery through spatial transcriptomics" [bioRxiv preprint] | [Website].
To query the Gene Expression Omnibus (GEO) for potential human and mouse spatial transcriptomics datasets, please run the Python script using different keywords.
python3 geo-query.py
The retrieved GDS list with annotated meta-information will be stored in ./<%Y%m%d>/all.csv.
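The `<%Y%m%d>` placeholder in the output path looks like strftime notation, i.e. a folder named after the date of the query (for example 20240105). The sketch below shows how such a path could be built; the function name `dated_output_path` is hypothetical and is not taken from geo-query.py, which may construct the path differently.

```python
from datetime import date
from pathlib import Path


def dated_output_path(day: date, base: Path = Path(".")) -> Path:
    """Build <base>/<YYYYMMDD>/all.csv for the given day (illustrative helper)."""
    return base / day.strftime("%Y%m%d") / "all.csv"


# Example: where a query run today would be expected to write its results,
# assuming the folder name is the run date.
print(dated_output_path(date.today()))
```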
The data processing scripts are available under data_processing/. The scripts automatically perform spot and gene quality control, data transformation, normalization, and dimensionality reduction.
10x Visium data in standard format can be processed using process_visium_standard.R. The script assumes that the directory contains one of the following:
- (a) filtered_feature_bc_matrix.h5 or MEX files and (b) the image data in a subdirectory called spatial
- A Visium Seurat object with data@images properly added
Please note that 10x Visium data with only counts and coordinates and no spatial/ folder should be processed using the non-Visium scripts.
Other types of spatial transcriptomics data transformed into a standard format (counts.csv and coordinates.csv; please see the guidelines below) can be processed using process_non_visium_standard.R. The script can also be used on Visium data for which no h5 + spatial data are provided for public download.
counts.csv
- Comma-delimited
- Header: gene,<spot ID 1>,...,<spot ID n>
- Each row = one gene
- Gene symbols should be used (not Ensembl IDs, etc.)
coordinates.csv
- Comma-delimited
- Header: barcode,row,col
- The example file has more columns but only these three columns are required
- Use the spot coordinates (row, col) instead of pixel coordinates (imagerow, imagecol in the example file) if possible
- Each row = one spot
The barcode column of coordinates.csv should be exactly the same as the counts.csv header (after removing "gene"), i.e. the spot IDs should match.
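Before running process_non_visium_standard.R, it may help to confirm that the two files agree on spot IDs. The following is a minimal sketch of such a check using only the Python standard library; the function name `check_spot_ids` and the example spot IDs are hypothetical and not part of this repository. Whether the order of spot IDs must also match is not stated above, so the sketch reports it separately rather than treating a different order as an error.

```python
import csv
import io


def check_spot_ids(counts_file, coordinates_file):
    """Compare counts.csv header spot IDs with the coordinates.csv barcode column.

    Returns (missing_in_coordinates, missing_in_counts, same_order).
    """
    counts_header = next(csv.reader(counts_file))
    if not counts_header or counts_header[0] != "gene":
        raise ValueError("counts.csv header must start with 'gene'")
    count_ids = counts_header[1:]

    reader = csv.DictReader(coordinates_file)
    missing_columns = {"barcode", "row", "col"} - set(reader.fieldnames or [])
    if missing_columns:
        raise ValueError(f"coordinates.csv lacks columns: {sorted(missing_columns)}")
    coord_ids = [record["barcode"] for record in reader]

    missing_in_coordinates = sorted(set(count_ids) - set(coord_ids))
    missing_in_counts = sorted(set(coord_ids) - set(count_ids))
    return missing_in_coordinates, missing_in_counts, count_ids == coord_ids


# Tiny in-memory example with made-up spot IDs; real files would be opened
# with open("counts.csv", newline="") and open("coordinates.csv", newline="").
counts = io.StringIO("gene,s1,s2,s3\nGfap,0,3,1\nSnap25,5,0,2\n")
coordinates = io.StringIO("barcode,row,col\ns1,0,0\ns2,0,1\ns4,1,0\n")
print(check_spot_ids(counts, coordinates))
```

Two empty lists and `True` in the returned tuple indicate that the spot IDs match exactly, including their order.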
The overall flow of data analysis is as below.
- Perform spatial clustering
- Perform whole-tissue spatial variability analysis
- Check if the spatial transcriptomics technology is at single-cell level
- If so (e.g. MERFISH), perform cell type annotation
- If not (e.g. 10x Visium), perform cell type deconvolution
- Perform cell-type-specific spatial variability analysis
- Perform cell-cell interaction analysis
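The only branch in the flow above is the cell typing step, which depends on whether the technology is at single-cell level. As a rough illustration, the ordering can be written as a small function; `analysis_steps` is a hypothetical name used only for this sketch and does not correspond to a script in the repository.

```python
from typing import List


def analysis_steps(single_cell_level: bool) -> List[str]:
    """Return the analysis steps in order for one dataset (illustrative only)."""
    # Single-cell-level technologies (e.g. MERFISH) are annotated directly;
    # multi-cell capture locations (e.g. 10x Visium) are deconvolved instead.
    cell_typing = (
        "cell type annotation" if single_cell_level else "cell type deconvolution"
    )
    return [
        "spatial clustering",
        "whole-tissue spatial variability analysis",
        cell_typing,
        "cell-type-specific spatial variability analysis",
        "cell-cell interaction analysis",
    ]


for step in analysis_steps(single_cell_level=False):
    print(step)
```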
The scripts for performing spatial clustering are stored in data_analysis/spatial_clustering/. Please refer to the README for the details.
In the data_analysis/cell_typing/ folder, scripts are available for performing cellular deconvolution (spatial transcriptomics technologies with multiple cells per capture location, e.g. 10x Visium) and cell type annotation (single-cell-level spatial transcriptomics technologies, e.g. MERFISH).
To identify scRNA-seq references for cell typing, users may use the GEO query helper script. ref_data_processing_example.R is an example script for processing the downloaded scRNA-seq data. Please note that cell quality control needs to be performed case-by-case, i.e. the thresholds should be chosen manually based on the QC plots.
annotation_example.R is an example script for performing cell type annotation on single-cell-level spatial transcriptomics datasets (e.g. MERFISH) using scRNA-seq reference datasets and SingleR.
Heuristics-guided cell type annotation for brain datasets
runBrainCellTypeAnnotation-CluHeu.R
- Usage: ./runBrainCellTypeAnnotation-CluHeu.R > runBrainCellTypeAnnotation-CluHeu.log
- Description
- This script automatically annotates the cell types of brain Visium datasets using a cluster-based approach guided by some heuristics.
- Note that:
- This script requires processed mouse and human scRNA-seq references as the input, and the file paths are currently hard-wired:
/share/fsmresfiles/SpatialT/ref/Brain/Adult/aibs_human_ctx_smart-seq
aibs_human_ctx_smart-seq_neuronal.RDS
aibs_human_ctx_smart-seq_non_neuronal.RDS
supp.RData
/share/fsmresfiles/SpatialT/ref/Brain/Adult/aibs_mouse_ctx-hpf_10x
aibs_mouse_ctx-hpf_10x_neuronal.RDS
aibs_mouse_ctx-hpf_10x_non_neuronal.RDS
supp.RData
- This script also reads a table listing the DSID, species, and technology (brain_DSID_list.txt) and loops over its rows. Line 52 uses a hard-wired path to this file.
- The annotations follow the Common Cell Type Nomenclature (CCN). seurat_object[["cell_type_annotation"]] contains the annotated subclasses, and seurat_object[["cell_type_annotation_class"]] contains the annotated classes (i.e., glutamatergic, GABAergic, or non_neuronal).
Analysis scripts are available for the cell type deconvolution of spatial transcriptomics datasets using scRNA-seq reference datasets and BayesPrism.
Steps for running deconvolution
- Sample script for processing scRNA-seq reference: process_reference_example.R
- Deconvolution script: quest_deconvolution_jobarray.R
- Script for preparing deconvolution results for subsequent analysis (only run this after the deconvolution is complete): create_input_files.R
The scripts for the spatial variability (SV) analysis of spatial transcriptomics data are stored in data_analysis/spatial_variability/.
To perform whole-tissue SV analysis, use the script quest_SpatialDE_jobarray.py. To run the analysis, use the command python quest_SpatialDE_jobarray.py $sample_directory.
To perform cell-type-specific SV analysis, use the script quest_SpatialDE_ct_specific.py. To run the analysis, use the command python quest_SpatialDE_ct_specific.py $sample_directory $cell_type.
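Because the cell-type-specific script takes one cell type per invocation, analysing a sample typically means one command per cell type. The sketch below only builds and prints those commands so they can be reviewed or submitted as separate jobs; the helper name `ct_specific_commands`, the sample directory, and the cell type labels are hypothetical. Passing each command as an argument list (or quoting it with `shlex.join`) keeps labels that contain spaces or slashes intact.

```python
import shlex
from typing import List


def ct_specific_commands(
    sample_directory: str, cell_types: List[str], python: str = "python"
) -> List[List[str]]:
    """Build one quest_SpatialDE_ct_specific.py command per cell type."""
    return [
        [python, "quest_SpatialDE_ct_specific.py", sample_directory, cell_type]
        for cell_type in cell_types
    ]


# Hypothetical sample directory and cell type labels, for illustration only.
commands = ct_specific_commands("samples/example_sample", ["Astro", "L2/3 IT"])
for command in commands:
    # shlex.join quotes arguments where needed so the line is shell-safe.
    print(shlex.join(command))
```

Each argument list can be passed to `subprocess.run(command, check=True)` to execute it directly.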
To perform neighborhood-based cell-cell interaction analysis, use the script adj-analysis.R. To run the analysis, use the command ./adj-analysis.R $sample_directory.
To perform distance-based cell-cell interaction analysis, run the bash script cci-analysis-COMMOT-DGE.sh, which calls the different analysis scripts in the pipeline.
Scripts for drug discovery analysis are stored under data_analysis/drug_discovery.
The four types of analysis are:
- Differential gene expression analysis. Scripts for deconvoluted and annotated samples.
- Protein-protein interaction (PPI) network for spatially variable, differentially expressed (DE-SV) genes by cell type. Script for generating PPI network.
- CMAP L1000 drug enrichment (compounds with top overall positive/negative enrichment score on DE-SV gene sets of a cell type). Script for CMAP drug enrichment analysis.
- CMAP L1000 drug perturbation (top gene targets perturbed by the top positively/negatively enriched compounds). Script for CMAP drug perturbation analysis (contained in the same script as above).