hpc_seq_process

Scripts for processing sequencing data on HPC clusters

Author: Brian Ward
Email: [email protected]
Github: etnite
License: GPLv3

Description

This repository is a collection of scripts for processing sequencing data (currently just DNA) on high-performance computing (HPC) clusters. It is currently geared towards clusters using SLURM job schedulers.

Parallel execution

Many scripts in this repository end with "_parallel.sh". These scripts are intended to work on files corresponding to a single sample. They may be used with a dispatcher script, which coordinates parallel or sequential execution for multiple samples. These scripts are located in code/parallel_dispatch. Generally, on clusters arrayer.sh is the preferred script for this purpose.

All of the dispatching scripts simply supply integers within a range starting at one. This range of integers is then used to subset lines out of a text file, listing each sample (the path to this file is defined in whatever script is being controlled).

Execution on local machines

The code in this repository can be run outside of a HPC setting. In this use case, parallelizer.sh or looper.sh would be used for parallel (or sequential) dispatch. Lines within scripts which start with "module" must be commented out. These lines also show the dependencies for each script. Outside of an HPC setting, it is usually easiest to use Bioconda environments (https://bioconda.github.io/user/install.html) to handle dependencies.

Typical Workflow - Short Variant Identification

This repository is primarily set up to use the samtools/bcftools variant calling pipeline. Steps below would typically be run using a dispatcher script for parallel execution. Note that for samtools/bcftools variant calling, read groups (e.g. different flowcell/lane combinations) can be concatenated into single fastq files for each sample. However, for modern GATK workflows, it is more common to work with separate files for individual read groups.

For short variant identification, a common pre-processing workflow would be:

Run code/fastq_filt_trim/bbduk_filt_trim_[PE/SE]_parallel.sh to clean fastq reads for paired-end or single-end reads, respectively
Run code/alignment/bowtie2_[PE/SE]_align_parallel.sh to align reads for paired end or single end reads using bowtie2 and create single sample bam alignment files
(optional) run code/bam_manipulations/merge_bams_parallel.sh to merge and filter BAM files (e.g. to create one filtered BAM file per sample)
(optional) - for some applications, e.g. practical haplotype graph creation, convert BAM files into single-sample gVCFs using code/gvcf_creation/create_single_samp_gvcf_parallel.sh

For variant calling, the following steps may then be performed after bam file generation:

Run code/call_variants/mpileup_call_parallel.sh - this script can run in parallel on multiple regions of the genome, for instance those created with code/make_regions_file.py
Run code/call_variants/concat_sort_vcfs.sh to concatenate individual region VCFs/BCFs into one file
(optional) Run code/call_variants/rename_annotate_split_vcf.sh to rename samples, predict variant transcriptional effects using a genome annotation file, and create subset VCF files consisting of just indels and just SNPs.

Name		Name	Last commit message	Last commit date
Latest commit History 146 Commits
code		code
sample_lists/v1_hapmap_bioproj		sample_lists/v1_hapmap_bioproj
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
TruSeq_paired_adapters.fa		TruSeq_paired_adapters.fa
bbduk_38.84_Illumina_adapters.fa		bbduk_38.84_Illumina_adapters.fa

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

hpc_seq_process

Description

Parallel execution

Execution on local machines

Typical Workflow - Short Variant Identification

About

Releases

Packages

Languages

License

etnite/hpc_seq_process

Folders and files

Latest commit

History

Repository files navigation

hpc_seq_process

Description

Parallel execution

Execution on local machines

Typical Workflow - Short Variant Identification

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages