This workflow is automated using a Nextflow workflow described in sc_pipeline.nf
.
Running sc_pipeline.nf
requires an environment with Nextflow installed, in addition to container management software such as Singularity or Charliecloud (see Containers). This workflow was developed and tested using Nextflow v24.04.4.
All necessary scripts, third-party software, and dependencies are included in a Docker image hosted at https://hub.docker.com/repository/docker/apwat/10x_sc/. The Dockerfile and Conda environment file used to create the image are archived in this repository. Nextflow will attempt to pull and convert this image to one of the runtimes below, and then run its processes in containers.
The default container software used to run processes in sc_pipeline.nf
.
Setting the environment variable NXF_SINGULARITY_CACHEDIR
will control where images are downloaded.
Tested with SingularityCE 3.9.5 and up -- older versions of Singularity may be unable to properly set up containers.
Use not recommended until nextflow-io/nextflow#5300 is included in a Nextflow release.
To instead use Charliecloud as a container runtime, comment out the singularity
scope in nextflow.config
, and uncomment the charliecloud
scope.
As of Nextflow v24.04.4, workflows with multiple processes running in the same Charliecloud container will attempt multiple pulls of the same image, which fail. To resolve this, download the image ahead of time:
export CH_IMAGE_STORAGE=/path/to/image_storage/
export NXF_CHARLIECLOUD_CACHEDIR=/path/to/image_storage/
ch-image pull apwat/10x_sc:0.6
-and then run the workflow as usual.
Tested with Charliecloud 0.37.
See the Nextflow documentation for containers.
In addition to data, sc_pipeline.nf
expects two configuration files in JSON format:
- A file listing samples and associated metadata. For an example, see Example_Input.json. Locations for the data files must be given as absolute paths.
- A file setting workflow parameters. For an example, see Example_Input_Settings.json. All possible parameters are listed in the
params
scope of nextflow.config. Most parameters are optional except forsampleInfo
, which must be the path to the sample description file described in1
. The use of absolute paths is also recommended here, as relative paths will be interpreted relative to the execution directory.
To run the workflow:
nextflow run /path/to/sc_pipeline.nf [-with-report] -params-file /path/to/Input_Settings.json
Command | Description |
---|---|
sampleInfo | Full path to input file that describes the files and thier conditions (ex: reference_files/Example_Input.json) |
clusterResolution | The leiden clustering resolution parameter (float, default: 0.4) |
minCellsPerGene | Used to filtered on the individual samples, removes genes that do not exist in at least a number of cells (int, default: 3) |
minGenesPerCell | Used to filtered on the individual samples, removes cells that do not have in at least a number of genes (int, default: 200) |
mitochondrialContentMax | The maximum mitochondrial content allowed as an int, 20 means all cells with less than 20% mitochondrial content are kept (int, default: 20) |
removeMitochondrialGenes | Removes mitochondrial genes (boolean, default: False) |
removeRibosomalGenes | Removes ribosomal genes (boolean, default: False) |
hvgNumTopGenes | Scanpy calculated which genes are the most highly variables and selects the top number of specified, used downstream for PCA (int, default: 5000) |
pcaNComps | Number of principal components selected in the principal component analysis (PCA) (int, default: 20) |
neighborsNumber | Number of neighbors used when Scanpy calculated the nearest neighbors distance matrix and a neighborhood of graph observations (int, default: 30) |
neighborsNpcs | Number of princpal components used when Scanpy calculated the nearest neighbors distance matrix and a neighborhood of graph observations (int, default: 20) |
percentile | The gene expression value at this percentile is taken, all cells with an expression level greater than it are counted as having high expression for that marker gene (float, default: 90) |
pAdj | Adjusted p-value used in the exploratory analysis (float, default: 0.05) |
log2FC | Log2 fold change used in the exploratory analysis (float, default: 0.5) |
heatmapNumSigGenes | The maximum number of signifant top genes used for the heatmap in the differential gene expression analysis(int, default: 50) |
minCellsPerGroup | The number of cell required per treatment group inorder to run differential gene expression (int, default: 100) |
nTopTermEnrichPlot | The number of top genes used in the enrichment plot (int, default: 5) |
dotplotCutoff | The theshold for the FDR cutoff used in the enrichment dotplot (float, default: 0.25) |
controlName | The name of the control sample group (string, default: "Control") |
enrichTerms | An array of the datasets you would like to run gene set enrichment on, a full list of the options can be found below(string array, default: ["GO_Biological_Process_2023","GO_Molecular_Function_2023"]) |
outputFolder | Name of output folder (str, default: example_output) |
cellType | A object with key value pairs that contain a cell type and a comma seperated string of marker genes(object key value, default: {"CD4 T cells": "CD4", "CD8 T cells": "CD8A, CD8B", "NK cells": "FCGR3A,TROBP", "B cells": "MS4A1,CD19,CD74,CD79A,IGHM", "Plasma cells": "JCHAIN,MZB1,IGHG1", "Proliferating lymphocytes: "MKI67,CD3G,FCGR3A", "Monocytes": CD14,FCGR3A,LYZ", "cDCs": "HLA-DQA1,SLC38A1", "pDCs": "BST2,MAP3K2,TRADD", "Platelets": "PF4,PPBP,ITGA2B", "Erythrocytes": HBA2,HBA1,HBB"} ) |
- ARCHS4_Cell-lines
- ARCHS4_IDG_Coexp
- ARCHS4_Kinases_Coexp
- ARCHS4_TFs_Coexp
- ARCHS4_Tissues
- Achilles_fitness_decrease, Achilles_fitness_increase
- Aging_Perturbations_from_GEO_down, Aging_Perturbations_from_GEO_up
- Allen_Brain_Atlas_10x_scRNA_2021, Allen_Brain_Atlas_down, Allen_Brain_Atlas_up
- Azimuth_2023
- Azimuth_Cell_Types_2021
- BioCarta_2013, BioCarta_2015, BioCarta_2016
- BioPlanet_2019
- BioPlex_2017
- CCLE_Proteomics_2020
- CORUM
- COVID-19_Related_Gene_Sets, COVID-19_Related_Gene_Sets_2021
- Cancer_Cell_Line_Encyclopedia
- CellMarker_2024
- CellMarker_Augmented_2021
- ChEA_2013, ChEA_2015, ChEA_2016, ChEA_2022
- Chromosome_Location, Chromosome_Location_hg19
- ClinVar_2019
- DGIdb_Drug_Targets_2024
- DSigDB
- Data_Acquisition_Method_Most_Popular_Genes
- DepMap_CRISPR_GeneDependency_CellLines_2023
- DepMap_WG_CRISPR_Screens_Broad_CellLines_2019
- DepMap_WG_CRISPR_Screens_Sanger_CellLines_2019
- Descartes_Cell_Types_and_Tissue_2021
- Diabetes_Perturbations_GEO_2022
- DisGeNET
- Disease_Perturbations_from_GEO_down, Disease_Perturbations_from_GEO_up
- Disease_Signatures_from_GEO_down_2014, Disease_Signatures_from_GEO_up_2014
- DrugMatrix
- Drug_Perturbations_from_GEO_2014, Drug_Perturbations_from_GEO_down, Drug_Perturbations_from_GEO_up
- ENCODE_Histone_Modifications_2013, ENCODE_Histone_Modifications_2015
- ENCODE_TF_ChIP-seq_2014, ENCODE_TF_ChIP-seq_2015
- ENCODE_and_ChEA_Consensus_TFs_from_ChIP-X
- ESCAPE
- Elsevier_Pathway_Collection
- Enrichr_Libraries_Most_Popular_Genes
- Enrichr_Submissions_TF-Gene_Coocurrence
- Enrichr_Users_Contributed_Lists_2020
- Epigenomics_Roadmap_HM_ChIP-seq
- FANTOM6_lncRNA_KD_DEGs
- GO_Biological_Process_2013, GO_Biological_Process_2015, GO_Biological_Process_2017, GO_Biological_Process_2017b, GO_Biological_Process_2018, GO_Biological_Process_2021, GO_Biological_Process_2023
- GO_Cellular_Component_2013, GO_Cellular_Component_2015, GO_Cellular_Component_2017, GO_Cellular_Component_2017b, GO_Cellular_Component_2018, GO_Cellular_Component_2021, GO_Cellular_Component_2023
- GO_Molecular_Function_2013, GO_Molecular_Function_2015, GO_Molecular_Function_2017, GO_Molecular_Function_2017b, GO_Molecular_Function_2018, GO_Molecular_Function_2021, GO_Molecular_Function_2023
- GTEx_Aging_Signatures_2021
- GTEx_Tissue_Expression_Down, GTEx_Tissue_Expression_Up
- GTEx_Tissues_V8_2023
- GWAS_Catalog_2019, GWAS_Catalog_2023
- GeDiPNet_2023
- GeneSigDB
- Gene_Perturbations_from_GEO_down, Gene_Perturbations_from_GEO_up
- Genes_Associated_with_NIH_Grants
- Genome_Browser_PWMs
- GlyGen_Glycosylated_Proteins_2022
- HDSigDB_Human_2021
- HDSigDB_Mouse_2021
- HMDB_Metabolites
- HMS_LINCS_KinomeScan
- HomoloGene
- HuBMAP_ASCT_plus_B_augmented_w_RNAseq_Coexpression
- HuBMAP_ASCTplusB_augmented_2022
- HumanCyc_2015, HumanCyc_2016
- Human_Gene_Atlas
- Human_Phenotype_Ontology
- IDG_Drug_Targets_2022
- InterPro_Domains_2019
- Jensen_COMPARTMENTS
- Jensen_DISEASES
- Jensen_TISSUES
- KEA_2013, KEA_2015
- KEGG_2013, KEGG_2015, KEGG_2016'
- KEGG_2019_Human, KEGG_2021_Human
- KEGG_2019_Mouse
- KOMP2_Mouse_Phenotypes_2022
- Kinase_Perturbations_from_GEO_down, Kinase_Perturbations_from_GEO_up
- L1000_Kinase_and_GPCR_Perturbations_down, L1000_Kinase_and_GPCR_Perturbations_up
- LINCS_L1000_CRISPR_KO_Consensus_Sigs
- LINCS_L1000_Chem_Pert_Consensus_Sigs
- LINCS_L1000_Chem_Pert_down, LINCS_L1000_Chem_Pert_up
- LINCS_L1000_Ligand_Perturbations_down, LINCS_L1000_Ligand_Perturbations_up
- Ligand_Perturbations_from_GEO_down, Ligand_Perturbations_from_GEO_up
- MAGMA_Drugs_and_Diseases
- MAGNET_2023
- MCF7_Perturbations_from_GEO_down, MCF7_Perturbations_from_GEO_up
- MGI_Mammalian_Phenotype_2013, MGI_Mammalian_Phenotype_2017, MGI_Mammalian_Phenotype_Level_3, MGI_Mammalian_Phenotype_Level_4, MGI_Mammalian_Phenotype_Level_4_2019, MGI_Mammalian_Phenotype_Level_4_2021, MGI_Mammalian_Phenotype_Level_4_2024
- MSigDB_Computational
- MSigDB_Hallmark_2020
- MSigDB_Oncogenic_Signatures
- Metabolomics_Workbench_Metabolites_2022
- Microbe_Perturbations_from_GEO_down, Microbe_Perturbations_from_GEO_up
- MoTrPAC_2023
- Mouse_Gene_Atlas
- NCI-60_Cancer_Cell_Lines
- NCI-Nature_2016
- NIH_Funded_PIs_2017_AutoRIF_ARCHS4_Predictions, NIH_Funded_PIs_2017_GeneRIF_ARCHS4_Predictions, NIH_Funded_PIs_2017_Human_AutoRIF, NIH_Funded_PIs_2017_Human_GeneRIF
- NURSA_Human_Endogenous_Complexome
- OMIM_Disease
- OMIM_Expanded
- Old_CMAP_down, Old_CMAP_up
- Orphanet_Augmented_2021
- PFOCR_Pathways, PFOCR_Pathways_2023
- PPI_Hub_Proteins
- PanglaoDB_Augmented_2021
- Panther_2015, Panther_2016
- Pfam_Domains_2019
- Pfam_InterPro_Domains
- PheWeb_2019
- PhenGenI_Association_2021
- Phosphatase_Substrates_from_DEPOD
- ProteomicsDB_2020
- Proteomics_Drug_Atlas_2023
- RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO
- RNAseq_Automatic_GEO_Signatures_Human_Down, RNAseq_Automatic_GEO_Signatures_Human_Up
- RNAseq_Automatic_GEO_Signatures_Mouse_Down, RNAseq_Automatic_GEO_Signatures_Mouse_Up
- Rare_Diseases_AutoRIF_ARCHS4_Predictions, Rare_Diseases_AutoRIF_Gene_Lists, Rare_Diseases_GeneRIF_ARCHS4_Prediction, Rare_Diseases_GeneRIF_Gene_Lists
- Reactome_2013, Reactome_2015, Reactome_2016, Reactome_2022
- Rummagene_kinases
- Rummagene_signatures
- Rummagene_transcription_factors
- SILAC_Phosphoproteomics
- SubCell_BarCode
- SynGO_2022, SynGO_2024
- SysMyo_Muscle_Gene_Sets
- TF-LOF_Expression_from_GEO
- TF_Perturbations_Followed_by_Expression
- TG_GATES_2020
- TRANSFAC_and_JASPAR_PWMs
- TRRUST_Transcription_Factors_2019
- Table_Mining_of_CRISPR_Studies
- Tabula_Muris
- Tabula_Sapiens
- TargetScan_microRNA, TargetScan_microRNA_2017
- The_Kinase_Library_2023
- Tissue_Protein_Expression_from_Human_Proteome_Map
- Tissue_Protein_Expression_from_ProteomicsDB
- Transcription_Factor_PPIs
- UK_Biobank_GWAS_v1
- Virus-Host_PPI_P-HIPSTer_2020
- VirusMINT
- Virus_Perturbations_from_GEO_down, Virus_Perturbations_from_GEO_up
- WikiPathway_2021_Human, WikiPathway_2023_Human
- WikiPathways_2013, WikiPathways_2015, WikiPathways_2016
- WikiPathways_2019_Human
- WikiPathways_2019_Mouse
- dbGaP
- huMAP
- lncHUB_lncRNA_Co-Expression
- miRTarBase_2017