GitHub - LANL-Bioinformatics/scRNA-seq

scRNA-seq

Introduction

Workflow

This workflow is automated using a Nextflow workflow described in sc_pipeline.nf.

Requirements

Running sc_pipeline.nf requires an environment with Nextflow installed, in addition to container management software such as Singularity or Charliecloud (see Containers). This workflow was developed and tested using Nextflow v24.04.4.

Containers

All necessary scripts, third-party software, and dependencies are included in a Docker image hosted at https://hub.docker.com/repository/docker/apwat/10x_sc/. The Dockerfile and Conda environment file used to create the image are archived in this repository. Nextflow will attempt to pull and convert this image to one of the runtimes below, and then run its processes in containers.

Singularity

The default container software used to run processes in sc_pipeline.nf. Setting the environment variable NXF_SINGULARITY_CACHEDIR will control where images are downloaded.

Tested with SingularityCE 3.9.5 and up -- older versions of Singularity may be unable to properly set up containers.

Charliecloud

Use not recommended until nextflow-io/nextflow#5300 is included in a Nextflow release.

To instead use Charliecloud as a container runtime, comment out the singularity scope in nextflow.config, and uncomment the charliecloud scope.

As of Nextflow v24.04.4, workflows with multiple processes running in the same Charliecloud container will attempt multiple pulls of the same image, which fail. To resolve this, download the image ahead of time:

export CH_IMAGE_STORAGE=/path/to/image_storage/
export NXF_CHARLIECLOUD_CACHEDIR=/path/to/image_storage/
ch-image pull apwat/10x_sc:0.6

-and then run the workflow as usual.

Tested with Charliecloud 0.37.

Other Container Runtimes

See the Nextflow documentation for containers.

Configuration Files

In addition to data, sc_pipeline.nf expects two configuration files in JSON format:

A file listing samples and associated metadata. For an example, see Example_Input.json. Locations for the data files must be given as absolute paths.
A file setting workflow parameters. For an example, see Example_Input_Settings.json. All possible parameters are listed in the params scope of nextflow.config. Most parameters are optional except for sampleInfo, which must be the path to the sample description file described in 1. The use of absolute paths is also recommended here, as relative paths will be interpreted relative to the execution directory.

Running the Workflow

To run the workflow:

nextflow run /path/to/sc_pipeline.nf [-with-report] -params-file /path/to/Input_Settings.json

Input Descriptions

Command	Description
sampleInfo	Full path to input file that describes the files and thier conditions (ex: reference_files/Example_Input.json)
clusterResolution	The leiden clustering resolution parameter (float, default: 0.4)
minCellsPerGene	Used to filtered on the individual samples, removes genes that do not exist in at least a number of cells (int, default: 3)
minGenesPerCell	Used to filtered on the individual samples, removes cells that do not have in at least a number of genes (int, default: 200)
mitochondrialContentMax	The maximum mitochondrial content allowed as an int, 20 means all cells with less than 20% mitochondrial content are kept (int, default: 20)
removeMitochondrialGenes	Removes mitochondrial genes (boolean, default: False)
removeRibosomalGenes	Removes ribosomal genes (boolean, default: False)
hvgNumTopGenes	Scanpy calculated which genes are the most highly variables and selects the top number of specified, used downstream for PCA (int, default: 5000)
pcaNComps	Number of principal components selected in the principal component analysis (PCA) (int, default: 20)
neighborsNumber	Number of neighbors used when Scanpy calculated the nearest neighbors distance matrix and a neighborhood of graph observations (int, default: 30)
neighborsNpcs	Number of princpal components used when Scanpy calculated the nearest neighbors distance matrix and a neighborhood of graph observations (int, default: 20)
percentile	The gene expression value at this percentile is taken, all cells with an expression level greater than it are counted as having high expression for that marker gene (float, default: 90)
pAdj	Adjusted p-value used in the exploratory analysis (float, default: 0.05)
log2FC	Log2 fold change used in the exploratory analysis (float, default: 0.5)
heatmapNumSigGenes	The maximum number of signifant top genes used for the heatmap in the differential gene expression analysis(int, default: 50)
minCellsPerGroup	The number of cell required per treatment group inorder to run differential gene expression (int, default: 100)
nTopTermEnrichPlot	The number of top genes used in the enrichment plot (int, default: 5)
dotplotCutoff	The theshold for the FDR cutoff used in the enrichment dotplot (float, default: 0.25)
controlName	The name of the control sample group (string, default: "Control")
enrichTerms	An array of the datasets you would like to run gene set enrichment on, a full list of the options can be found below(string array, default: ["GO_Biological_Process_2023","GO_Molecular_Function_2023"])
outputFolder	Name of output folder (str, default: example_output)
cellType	A object with key value pairs that contain a cell type and a comma seperated string of marker genes(object key value, default: {"CD4 T cells": "CD4", "CD8 T cells": "CD8A, CD8B", "NK cells": "FCGR3A,TROBP", "B cells": "MS4A1,CD19,CD74,CD79A,IGHM", "Plasma cells": "JCHAIN,MZB1,IGHG1", "Proliferating lymphocytes: "MKI67,CD3G,FCGR3A", "Monocytes": CD14,FCGR3A,LYZ", "cDCs": "HLA-DQA1,SLC38A1", "pDCs": "BST2,MAP3K2,TRADD", "Platelets": "PF4,PPBP,ITGA2B", "Erythrocytes": HBA2,HBA1,HBB"} )

Full list of enrichment genesets from GSEApy

ARCHS4_Cell-lines
ARCHS4_IDG_Coexp
ARCHS4_Kinases_Coexp
ARCHS4_TFs_Coexp
ARCHS4_Tissues
Achilles_fitness_decrease, Achilles_fitness_increase
Aging_Perturbations_from_GEO_down, Aging_Perturbations_from_GEO_up
Allen_Brain_Atlas_10x_scRNA_2021, Allen_Brain_Atlas_down, Allen_Brain_Atlas_up
Azimuth_2023
Azimuth_Cell_Types_2021
BioCarta_2013, BioCarta_2015, BioCarta_2016
BioPlanet_2019
BioPlex_2017
CCLE_Proteomics_2020
CORUM
COVID-19_Related_Gene_Sets, COVID-19_Related_Gene_Sets_2021
Cancer_Cell_Line_Encyclopedia
CellMarker_2024
CellMarker_Augmented_2021
ChEA_2013, ChEA_2015, ChEA_2016, ChEA_2022
Chromosome_Location, Chromosome_Location_hg19
ClinVar_2019
DGIdb_Drug_Targets_2024
DSigDB
Data_Acquisition_Method_Most_Popular_Genes
DepMap_CRISPR_GeneDependency_CellLines_2023
DepMap_WG_CRISPR_Screens_Broad_CellLines_2019
DepMap_WG_CRISPR_Screens_Sanger_CellLines_2019
Descartes_Cell_Types_and_Tissue_2021
Diabetes_Perturbations_GEO_2022
DisGeNET
Disease_Perturbations_from_GEO_down, Disease_Perturbations_from_GEO_up
Disease_Signatures_from_GEO_down_2014, Disease_Signatures_from_GEO_up_2014
DrugMatrix
Drug_Perturbations_from_GEO_2014, Drug_Perturbations_from_GEO_down, Drug_Perturbations_from_GEO_up
ENCODE_Histone_Modifications_2013, ENCODE_Histone_Modifications_2015
ENCODE_TF_ChIP-seq_2014, ENCODE_TF_ChIP-seq_2015
ENCODE_and_ChEA_Consensus_TFs_from_ChIP-X
ESCAPE
Elsevier_Pathway_Collection
Enrichr_Libraries_Most_Popular_Genes
Enrichr_Submissions_TF-Gene_Coocurrence
Enrichr_Users_Contributed_Lists_2020
Epigenomics_Roadmap_HM_ChIP-seq
FANTOM6_lncRNA_KD_DEGs
GO_Biological_Process_2013, GO_Biological_Process_2015, GO_Biological_Process_2017, GO_Biological_Process_2017b, GO_Biological_Process_2018, GO_Biological_Process_2021, GO_Biological_Process_2023
GO_Cellular_Component_2013, GO_Cellular_Component_2015, GO_Cellular_Component_2017, GO_Cellular_Component_2017b, GO_Cellular_Component_2018, GO_Cellular_Component_2021, GO_Cellular_Component_2023
GO_Molecular_Function_2013, GO_Molecular_Function_2015, GO_Molecular_Function_2017, GO_Molecular_Function_2017b, GO_Molecular_Function_2018, GO_Molecular_Function_2021, GO_Molecular_Function_2023
GTEx_Aging_Signatures_2021
GTEx_Tissue_Expression_Down, GTEx_Tissue_Expression_Up
GTEx_Tissues_V8_2023
GWAS_Catalog_2019, GWAS_Catalog_2023
GeDiPNet_2023
GeneSigDB
Gene_Perturbations_from_GEO_down, Gene_Perturbations_from_GEO_up
Genes_Associated_with_NIH_Grants
Genome_Browser_PWMs
GlyGen_Glycosylated_Proteins_2022
HDSigDB_Human_2021
HDSigDB_Mouse_2021
HMDB_Metabolites
HMS_LINCS_KinomeScan
HomoloGene
HuBMAP_ASCT_plus_B_augmented_w_RNAseq_Coexpression
HuBMAP_ASCTplusB_augmented_2022
HumanCyc_2015, HumanCyc_2016
Human_Gene_Atlas
Human_Phenotype_Ontology
IDG_Drug_Targets_2022
InterPro_Domains_2019
Jensen_COMPARTMENTS
Jensen_DISEASES
Jensen_TISSUES
KEA_2013, KEA_2015
KEGG_2013, KEGG_2015, KEGG_2016'
KEGG_2019_Human, KEGG_2021_Human
KEGG_2019_Mouse
KOMP2_Mouse_Phenotypes_2022
Kinase_Perturbations_from_GEO_down, Kinase_Perturbations_from_GEO_up
L1000_Kinase_and_GPCR_Perturbations_down, L1000_Kinase_and_GPCR_Perturbations_up
LINCS_L1000_CRISPR_KO_Consensus_Sigs
LINCS_L1000_Chem_Pert_Consensus_Sigs
LINCS_L1000_Chem_Pert_down, LINCS_L1000_Chem_Pert_up
LINCS_L1000_Ligand_Perturbations_down, LINCS_L1000_Ligand_Perturbations_up
Ligand_Perturbations_from_GEO_down, Ligand_Perturbations_from_GEO_up
MAGMA_Drugs_and_Diseases
MAGNET_2023
MCF7_Perturbations_from_GEO_down, MCF7_Perturbations_from_GEO_up
MGI_Mammalian_Phenotype_2013, MGI_Mammalian_Phenotype_2017, MGI_Mammalian_Phenotype_Level_3, MGI_Mammalian_Phenotype_Level_4, MGI_Mammalian_Phenotype_Level_4_2019, MGI_Mammalian_Phenotype_Level_4_2021, MGI_Mammalian_Phenotype_Level_4_2024
MSigDB_Computational
MSigDB_Hallmark_2020
MSigDB_Oncogenic_Signatures
Metabolomics_Workbench_Metabolites_2022
Microbe_Perturbations_from_GEO_down, Microbe_Perturbations_from_GEO_up
MoTrPAC_2023
Mouse_Gene_Atlas
NCI-60_Cancer_Cell_Lines
NCI-Nature_2016
NIH_Funded_PIs_2017_AutoRIF_ARCHS4_Predictions, NIH_Funded_PIs_2017_GeneRIF_ARCHS4_Predictions, NIH_Funded_PIs_2017_Human_AutoRIF, NIH_Funded_PIs_2017_Human_GeneRIF
NURSA_Human_Endogenous_Complexome
OMIM_Disease
OMIM_Expanded
Old_CMAP_down, Old_CMAP_up
Orphanet_Augmented_2021
PFOCR_Pathways, PFOCR_Pathways_2023
PPI_Hub_Proteins
PanglaoDB_Augmented_2021
Panther_2015, Panther_2016
Pfam_Domains_2019
Pfam_InterPro_Domains
PheWeb_2019
PhenGenI_Association_2021
Phosphatase_Substrates_from_DEPOD
ProteomicsDB_2020
Proteomics_Drug_Atlas_2023
RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO
RNAseq_Automatic_GEO_Signatures_Human_Down, RNAseq_Automatic_GEO_Signatures_Human_Up
RNAseq_Automatic_GEO_Signatures_Mouse_Down, RNAseq_Automatic_GEO_Signatures_Mouse_Up
Rare_Diseases_AutoRIF_ARCHS4_Predictions, Rare_Diseases_AutoRIF_Gene_Lists, Rare_Diseases_GeneRIF_ARCHS4_Prediction, Rare_Diseases_GeneRIF_Gene_Lists
Reactome_2013, Reactome_2015, Reactome_2016, Reactome_2022
Rummagene_kinases
Rummagene_signatures
Rummagene_transcription_factors
SILAC_Phosphoproteomics
SubCell_BarCode
SynGO_2022, SynGO_2024
SysMyo_Muscle_Gene_Sets
TF-LOF_Expression_from_GEO
TF_Perturbations_Followed_by_Expression
TG_GATES_2020
TRANSFAC_and_JASPAR_PWMs
TRRUST_Transcription_Factors_2019
Table_Mining_of_CRISPR_Studies
Tabula_Muris
Tabula_Sapiens
TargetScan_microRNA, TargetScan_microRNA_2017
The_Kinase_Library_2023
Tissue_Protein_Expression_from_Human_Proteome_Map
Tissue_Protein_Expression_from_ProteomicsDB
Transcription_Factor_PPIs
UK_Biobank_GWAS_v1
Virus-Host_PPI_P-HIPSTer_2020
VirusMINT
Virus_Perturbations_from_GEO_down, Virus_Perturbations_from_GEO_up
WikiPathway_2021_Human, WikiPathway_2023_Human
WikiPathways_2013, WikiPathways_2015, WikiPathways_2016
WikiPathways_2019_Human
WikiPathways_2019_Mouse
dbGaP
huMAP
lncHUB_lncRNA_Co-Expression
miRTarBase_2017

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
preprocessing_postprocessing		preprocessing_postprocessing
reference_files		reference_files
scripts		scripts
testing		testing
workflow		workflow
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
environment_v3.yml		environment_v3.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

scRNA-seq

Introduction

Workflow

Requirements

Containers

Singularity

Charliecloud

Other Container Runtimes

Configuration Files

Running the Workflow

Input Descriptions

Full list of enrichment genesets from GSEApy

About

Releases

Packages

Contributors 2

Languages

LANL-Bioinformatics/scRNA-seq

Folders and files

Latest commit

History

Repository files navigation

scRNA-seq

Introduction

Workflow

Requirements

Containers

Singularity

Charliecloud

Other Container Runtimes

Configuration Files

Running the Workflow

Input Descriptions

Full list of enrichment genesets from GSEApy

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages