
Week 3 - Variant calling workflow

Learning objectives

At the end of this workshop, students will know how to:

  • Set up conda environments
  • Assess read quality using FastQC
  • Trim and filter reads using Trimmomatic
  • Align reads to a reference genome using BWA
  • Call variants using bcftools
  • Write a bash script for the workflow
  • Use Slurm to run jobs on the server
  • Write an sbatch file
  • Run jobs in parallel

Use conda environment for projects

It is important to use conda to build a separate environment for each project, because different projects require different software with different versions. A conda environment stores all of the software a project depends on, so it does not get mixed up with other projects.
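You can check which environments exist on your account at any time with:

conda env list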

Install miniconda

Miniconda is already installed on Dayhoff, so we don't have to install it. If you would like to install it on your own machine, look up the Miniconda installation instructions online.

Create a conda environment for your project

Use the following code to create a new conda environment for your first genomics project.

conda create --name intro-to-linux

Type yes to proceed when prompted with questions.

After you have successfully created the environment, activate it with:

conda activate intro-to-linux

After that, you will see your environment name in parentheses at the beginning of the prompt. This means you are now working inside the (intro-to-linux) conda environment.
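For example, the prompt might look something like this (the username and hostname shown here are just placeholders):

(intro-to-linux) u1234567@dayhoff:~$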

Now, you can start to install software.

Data Wrangling and Processing for Genomics

Required Software

  • FastQC
  • Trimmomatic
  • BWA
  • SAMtools
  • BCFtools
Install them from the bioconda channel:

conda install -c bioconda fastqc
conda install -c bioconda trimmomatic
conda install -c bioconda bwa
conda install -c bioconda samtools
conda install -c bioconda bcftools 

Validate that the software installed successfully

We can test whether a program installed successfully by calling its help document or simply invoking the program:

fastqc -h
trimmomatic -h 
bwa # bwa is not set up with a -h option; running it with no arguments prints the help document 
samtools # samtools is likewise not set up with a -h option; running it with no arguments prints the help document 
bcftools -h 

Additional installations that may be needed when errors occur

conda install -c conda-forge openjdk # when using fastqc 
conda install openssl=1.0 # when using bcftools

The Data

The data we are going to use is part of a long-term evolution experiment led by Richard Lenski.

The experiment was designed to assess adaptation in E. coli. A population was propagated for more than 40,000 generations in a glucose-limited minimal medium (in most conditions glucose is the best carbon source for E. coli, providing faster growth than other sugars). This medium was supplemented with citrate, which E. coli cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points revealed that a spontaneous citrate-using variant (Cit+) appeared between 31,000 and 31,500 generations, causing an increase in population size and diversity. In addition, this experiment showed hypermutability in certain regions. Hypermutability is important and can help accelerate adaptation to novel environments, but it can also be selected against in well-adapted populations.

Assessing Read Quality

Bioinformatic Workflows

When working with high-throughput sequencing data, the raw reads you get off of the sequencer will need to pass through a number of different tools in order to generate your final desired output.

An example of the workflow we will be using for our variant calling analysis is provided below with a brief description of each step.

[Figure: example variant calling workflow]

  1. Quality control - Assessing quality using FastQC
  2. Quality control - Trimming and/or filtering reads (if necessary)
  3. Align reads to reference genome
  4. Perform post-alignment clean-up
  5. Variant calling

These workflows in bioinformatics adopt a plug-and-play approach in that the output of one tool can be easily used as input to another tool without any extensive configuration. Having standards for data formats is what makes this feasible. Standards ensure that data is stored in a way that is generally accepted and agreed upon within the community. The tools that are used to analyze data at different stages of the workflow are therefore built under the assumption that the data will be provided in a specific format.

Downloading the Data

We are studying a population of Escherichia coli (designated Ara-3), which were propagated for more than 50,000 generations in a glucose-limited minimal medium. We will be working with three samples from this experiment, one from 5,000 generations, one from 15,000 generations, and one from 50,000 generations. The population changed substantially during the course of the experiment, and we will be exploring how with our variant calling workflow.

The -p option for mkdir allows it to create the new directory even if one of the parent directories does not already exist. It also suppresses the error if the directory already exists, without overwriting that directory.

The data are paired-end, so we have two files for each sample. The files are large and take some time to download, so I have downloaded them already; you can copy them to your folder using the commands below.

mkdir -p ~/intro_to_linux/data/untrimmed_fastq/
cd ~/intro_to_linux/data/untrimmed_fastq/ 
cp /home/u1133824/intro_to_linux/data/untrimmed_fastq/*.gz .

The data comes in a compressed format, which is why there is a .gz at the end of the file names. This makes it faster to transfer, and allows it to take up less space on our computer. Let’s unzip one of the files so that we can look at the fastq format.

gunzip SRR2584863_1.fastq.gz

Quality Control

In this step, we will assess the quality of our sequence reads.

How to Read the FASTQ Format File

  • Line 1 - Always begins with an '@', followed by information about the read.
  • Line 2 - The actual DNA sequence.
  • Line 3 - Always begins with a '+', sometimes followed by the same information as Line 1.
  • Line 4 - A string of characters representing the quality scores; it must have the same number of characters as Line 2.

Here are the quality value characters in left-to-right increasing order of quality. ! is the lowest quality and ~ is the highest.

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ 

Note: Different sequencing machines use different encoding systems, so sometimes a # may not mean a poor-quality base call.
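With the common Phred+33 encoding (which, as we will see later, is what our files use), a character's quality score is its ASCII value minus 33. A minimal bash sketch to decode a single quality character, assuming Phred+33:

q_char='C'
ascii=$(printf '%d' "'$q_char")   # ASCII value of the character (67 for 'C')
echo $(( ascii - 33 ))            # Phred+33 quality score: 34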

We can view the first complete read in one of the files in our dataset by using head to look at the first four lines.

head -n 4 SRR2584863_1.fastq

The output will look like this:

@SRR2584863.1 HWI-ST957:244:H73TDADXX:1:1101:4712:2181/1
TTCACATCCTGACCATTCAGTTGAGCAAAATAGTTCTTCAGTGCCTGTTTAACCGAGTCACGCAGGGGTTTTTGGGTTACCTGATCCTGAGAGTTAACGGTAGAAACGGTCAGTACGTCAGAATTTACGCGTTGTTCGAACATAGTTCTG
+
CCCFFFFFGHHHHJIJJJJIJJJIIJJJJIIIJJGFIIIJEDDFEGGJIFHHJIJJDECCGGEGIIJFHFFFACD:BBBDDACCCCAA@@CA@C>C3>@5(8&>C:9?8+89<4(:83825C(:A#########################

Exercise: What is the last read in the SRR2584863_1.fastq file? How confident are you in this read?
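Hint: the tail command works just like head, but shows the last lines of a file instead of the first:

tail -n 4 SRR2584863_1.fastq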

Assessing Quality using FastQC

In real life, you will not be assessing the quality of your reads by visually inspecting your FASTQ files. Rather, you will be using a software program to assess read quality and filter out poor quality reads. We will first use a program called FastQC to visualize the quality of our reads. Later in our workflow, we will use another program to filter out poor quality reads.

FastQC has a number of features which can give you a quick impression of any problems your data may have, so you can take these issues into consideration before moving forward with your analyses. Rather than looking at quality scores for each individual read, FastQC looks at quality collectively across all reads within a sample. The image below shows one FastQC-generated plot that indicates a very high quality sample:

[Figure: FastQC per-base quality plot for a high-quality sample]

The x-axis displays the base position in the read, and the y-axis shows quality scores. In this example, the sample contains reads that are 40 bp long. This is much shorter than the reads we are working with in our workflow. For each position, there is a box-and-whisker plot showing the distribution of quality scores for all reads at that position. The horizontal red line indicates the median quality score and the yellow box shows the 1st to 3rd quartile range. This means that 50% of reads have a quality score that falls within the range of the yellow box at that position. The whiskers show the absolute range, which covers the lowest (0th quartile) to highest (4th quartile) values.

For each position in this sample, the quality values do not drop much lower than 32. This is a high quality score. The plot background is also color-coded to identify good (green), acceptable (yellow), and bad (red) quality scores.

Now let’s take a look at a quality plot on the other end of the spectrum.

[Figure: FastQC per-base quality plot for a low-quality sample]

Here, we see positions within the read in which the boxes span a much wider range. Also, quality scores drop quite low into the “bad” range, particularly on the tail end of the reads. The FastQC tool produces several other diagnostic plots to assess sample quality, in addition to the one plotted above.

Running FastQC

We will now assess the quality of the reads that we downloaded. First, make sure you are still in the untrimmed_fastq directory.

cd ~/intro_to_linux/data/untrimmed_fastq/

Let's run FastQC on all of the FASTQ files in this directory:

fastqc *.fastq*

For each input FASTQ file, FastQC has created a .zip file and a .html file. The .zip file extension indicates that this is actually a compressed set of multiple output files. We will be working with these output files soon. The .html file is a stable webpage displaying the summary report for each of our samples.

We want to keep our data files and our results files separate, so we will move these output files into a new directory within our results/ directory.

mkdir -p ~/intro_to_linux/results/fastqc_untrimmed_reads
mv *.zip ~/intro_to_linux/results/fastqc_untrimmed_reads/
mv *.html ~/intro_to_linux/results/fastqc_untrimmed_reads/

Now we can navigate into this results directory and do some closer inspection of our output files.

cd ~/intro_to_linux/results/fastqc_untrimmed_reads/ 

Viewing the FastQC Results

To view the HTML files, we first need to download them to our local computer, because we cannot open HTML files from within our remote Linux session.

To transfer files from remote computers to local computers, we can use scp command.

First let's create a new directory on our local computer to store the HTML files. Open a new terminal on your local computer, and then run the code below:

mkdir -p /mnt/c/Users/u_id/Desktop/fastqc_html

Now we can download the HTML files from the remote computer using scp.

scp [email protected]:~/intro_to_linux/results/fastqc_untrimmed_reads/*.html /mnt/c/Users/u_id/Desktop/fastqc_html 

Note: if you're using a macOS machine, it is likely that a no matches found error will be displayed. The reason for this is that the wildcard * is not correctly interpreted by the default zsh shell. To fix this problem, the wildcard needs to be escaped with a \:

scp [email protected]:~/intro_to_linux/results/fastqc_untrimmed_reads/\*.html /mnt/c/Users/u_id/Desktop/fastqc_html 

Alternatively, you can also change the shell to bash by running chsh -s /bin/bash.

You should see a status output like this after running the scp command:

SRR2584863_1_fastqc.html                      100%  626KB   4.0MB/s   00:00    
SRR2584863_2_fastqc.html                      100%  632KB   4.8MB/s   00:00     
SRR2584866_1_fastqc.html                      100%  631KB   5.2MB/s   00:00     
SRR2584866_2_fastqc.html                      100%  626KB   4.5MB/s   00:00     
SRR2589044_1_fastqc.html                      100%  626KB   5.1MB/s   00:00     
SRR2589044_2_fastqc.html                      100%  627KB   5.5MB/s   00:00

Now we can go to our new directory and open the 6 HTML files.

Decoding Other FastQC Outputs

We have now looked at quite a few "Per base sequence quality" FastQC graphs, but there are nine other graphs that we have not talked about yet. Please see the FastQC documentation for detailed explanation of each graph.

Working with the FastQC Text Output

Now let's look at the other output files. Go back to the terminal that connected to Dayhoff and make sure you are in the results directory.

cd ~/intro_to_linux/results/fastqc_untrimmed_reads/
ls 

Output:

SRR2584863_1_fastqc.html  SRR2584866_1_fastqc.html  SRR2589044_1_fastqc.html
SRR2584863_1_fastqc.zip   SRR2584866_1_fastqc.zip   SRR2589044_1_fastqc.zip
SRR2584863_2_fastqc.html  SRR2584866_2_fastqc.html  SRR2589044_2_fastqc.html
SRR2584863_2_fastqc.zip   SRR2584866_2_fastqc.zip   SRR2589044_2_fastqc.zip 

The .zip files contain multiple different types of output files for a single input FASTQ file. Let's extract them first to see what these files are. We will use the unzip command; since unzip cannot take multiple input files at once, we'll use a for loop to iterate through the list.

for filename in *.zip
do 
    unzip $filename 
done 

The unzip program is decompressing the .zip files and creating a new directory (with subdirectories) for each of our samples, to store all of the different output that is produced by FastQC. There are a lot of files here. The one we are going to focus on is the summary.txt file.

Let's see what files are present within one of these output directories.

ls SRR2584863_1_fastqc/

Use less to preview the summary.txt file for this sample:

less SRR2584863_1_fastqc/summary.txt 

You will get something that looks like this:

PASS    Basic Statistics        SRR2584863_1.fastq
PASS    Per base sequence quality       SRR2584863_1.fastq
PASS    Per tile sequence quality       SRR2584863_1.fastq
PASS    Per sequence quality scores     SRR2584863_1.fastq
WARN    Per base sequence content       SRR2584863_1.fastq
WARN    Per sequence GC content SRR2584863_1.fastq
PASS    Per base N content      SRR2584863_1.fastq
PASS    Sequence Length Distribution    SRR2584863_1.fastq
PASS    Sequence Duplication Levels     SRR2584863_1.fastq
PASS    Overrepresented sequences       SRR2584863_1.fastq
WARN    Adapter Content SRR2584863_1.fastq

The summary gives us a list of tests that FastQC ran, and tells us whether the sample passed (PASS), failed (FAIL), or is borderline (WARN) for each test.

Documenting the Work

We can make a record of the results we obtained for all of the samples by concatenating all of the summary.txt files into a single file using the cat command. We will name the combined file fastqc_summaries.txt and put it in the ~/intro_to_linux/docs directory.

mkdir ~/intro_to_linux/docs
cat */summary.txt > ~/intro_to_linux/docs/fastqc_summaries.txt 

Exercise: Which samples failed at least one of FastQC's quality tests? Which tests did those samples fail?

We can get the list of all failed tests using grep:

cd ~/intro_to_linux/docs
grep FAIL fastqc_summaries.txt 

Trimming and Filtering

Cleaning Reads

Previously, we checked the quality of our samples using a tool called FastQC. We looked at graphs that showed the quality of each base in the samples, and found out which samples failed certain quality checks. Just because some samples failed some checks, it doesn't mean we should get rid of them. It's normal for some checks to fail and it might not be a problem for what we want to do with the samples. For our next step, we will remove some of the low quality sequences to lower the chance of getting false results due to errors in the sequencing.

We will use a program called Trimmomatic to filter poor quality reads and trim poor quality bases from our samples.

Trimmomatic Options

Trimmomatic has a variety of options to trim your reads. If we run the following command, we can see some of our options.

trimmomatic

Which will give you the following output:

Usage:
       PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...
   or:
       SE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] <inputFile> <outputFile> <trimmer1>...
   or:
       -version

To use it, you first need to specify whether you're using paired end (PE) or single end (SE) reads. Then, you can specify optional settings called flags, such as the number of processors to use. These flags can give you more control over the command, but they're not required. After the flags, you need to specify the input and output files. For paired end mode, you need to provide two input files and two output files. For single end mode, you need to provide one input file, optional settings, and one output file. The order in which you specify these arguments is important.

  • <inputFile1> - Input reads to be trimmed. Typically the file name will contain an _1 or _R1 in the name.
  • <inputFile2> - Input reads to be trimmed. Typically the file name will contain an _2 or _R2 in the name.
  • <outputFile1P> - Output file that contains surviving pairs from the _1 file.
  • <outputFile1U> - Output file that contains orphaned reads from the _1 file.
  • <outputFile2P> - Output file that contains surviving pairs from the _2 file.
  • <outputFile2U> - Output file that contains orphaned reads from the _2 file.

The last thing trimmomatic expects to see is the trimming parameters:

  • ILLUMINACLIP - Perform adapter removal.
  • SLIDINGWINDOW - Perform sliding window trimming, cutting once the average quality within the window falls below a threshold.
  • LEADING - Cut bases off the start of a read, if below a threshold quality.
  • TRAILING - Cut bases off the end of a read, if below a threshold quality.
  • CROP - Cut the read to a specified length.
  • HEADCROP - Cut the specified number of bases from the start of the read.
  • MINLEN - Drop an entire read if it is below a specified length.
  • TOPHRED33 - Convert quality scores to Phred-33.
  • TOPHRED64 - Convert quality scores to Phred-64.

We will only use a few of these options and trimming steps in our analysis, but it is important to understand the steps you use to clean your data. For more information, see the Trimmomatic manual.

A complete Trimmomatic command example will look like the following code:

trimmomatic PE -threads 4 SRR_1056_1.fastq SRR_1056_2.fastq \
            SRR_1056_1.trimmed.fastq SRR_1056_1un.trimmed.fastq \
            SRR_1056_2.trimmed.fastq SRR_1056_2un.trimmed.fastq \
            ILLUMINACLIP:SRR_adapters.fa SLIDINGWINDOW:4:20 

Note: When writing a very long command, we can use \ to split it across multiple lines so it is easier to read.

Running Trimmomatic

Let's start running Trimmomatic on our data. First, let's navigate to the untrimmed_fastq folder.

cd ~/intro_to_linux/data/untrimmed_fastq

Our data are paired-end, as we can see from the file names. The FastQC reports showed that Nextera adapters are present in our samples. The adapter sequences were included in the Trimmomatic installation, so we can either copy them from the installation path to our folder or use them directly from the installation folder via the full path.

For now, we will copy the adapter sequence to our untrimmed_fastq folder to use it.

cp ~/.conda/pkgs/trimmomatic-0.39-hdfd78af_2/share/trimmomatic-0.39-2/adapters/NexteraPE-PE.fa .

Please Note: the file path might be different depending on the Trimmomatic version you have installed. You can look in the ~/.conda/pkgs/ directory to check.

The Trimmomatic command we are going to run is shown below; it may take a few minutes to finish:

trimmomatic PE -threads 2 SRR2589044_1.fastq.gz SRR2589044_2.fastq.gz \
                SRR2589044_1.trim.fastq.gz SRR2589044_1un.trim.fastq.gz \
                SRR2589044_2.trim.fastq.gz SRR2589044_2un.trim.fastq.gz \
                SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15

We are using the option SLIDINGWINDOW:4:20, which means the sliding window is 4 bases long and bases are trimmed from the read once the average quality within the window drops below 20. The option MINLEN:25 means that, after the sliding window step, any read whose total length is shorter than 25 bases is removed. Additionally, we use the option ILLUMINACLIP:NexteraPE-PE.fa:2:40:15, which tells the software how to handle sequences that match the adapter sequences; we won't cover how it works in this course and simply use it as a default setting. For more information, see the Trimmomatic manual.

The correct output should look like this:

TrimmomaticPE: Started with arguments:
 SRR2589044_1.fastq.gz SRR2589044_2.fastq.gz SRR2589044_1.trim.fastq.gz SRR2589044_1un.trim.fastq.gz SRR2589044_2.trim.fastq.gz SRR2589044_2un.trim.fastq.gz SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
Using PrefixPair: 'AGATGTGTATAAGAGACAG' and 'AGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTGACGCTGCCGACGA'
ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 1107090 Both Surviving: 885220 (79.96%) Forward Only Surviving: 216472 (19.55%) Reverse Only Surviving: 2850 (0.26%) Dropped: 2548 (0.23%)
TrimmomaticPE: Completed successfully 

Exercise: What percentage of reads did we remove from our sample? What percentage of reads did we keep for both pairs?

From the results, you may have noticed that Trimmomatic automatically detected the quality encoding of our data, but it is always good to double-check that it is correct.

We can check our output files by using ls SRR2589044*, the correct output should look like this:

SRR2589044_1.fastq.gz       SRR2589044_1un.trim.fastq.gz  SRR2589044_2.trim.fastq.gz
SRR2589044_1.trim.fastq.gz  SRR2589044_2.fastq.gz         SRR2589044_2un.trim.fastq.gz

The output files are also FASTQ files, but they should be smaller than the original files since we have removed some reads. We can confirm this by running ls -lh SRR2589044*. The correct output should look like this:

-rw-rw-r-- 1 u1122333 domain users 124M Jan 15  2016 SRR2589044_1.fastq.gz
-rw-rw-r-- 1 u1122333 domain users  94M Jan 24 15:22 SRR2589044_1.trim.fastq.gz
-rw-rw-r-- 1 u1122333 domain users  18M Jan 24 15:22 SRR2589044_1un.trim.fastq.gz
-rw-rw-r-- 1 u1122333 domain users 128M Jan 15  2016 SRR2589044_2.fastq.gz
-rw-rw-r-- 1 u1122333 domain users  91M Jan 24 15:22 SRR2589044_2.trim.fastq.gz
-rw-rw-r-- 1 u1122333 domain users 271K Jan 24 15:22 SRR2589044_2un.trim.fastq.gz

Now, we have successfully run Trimmomatic on one of our files. However, Trimmomatic can only work on one sample at a time, so we have to use a for loop to iterate the command over our files.

We unzipped one of the files earlier; let's zip it again so all of our files have consistent names.

gzip SRR2584863_1.fastq 

The for loop to run Trimmomatic on all samples:

for infile in *_1.fastq.gz 
do 
    base=$(basename ${infile} _1.fastq.gz)
    trimmomatic PE ${infile} ${base}_2.fastq.gz \
                    ${base}_1.trim.fastq.gz ${base}_1un.trim.fastq.gz \
                    ${base}_2.trim.fastq.gz ${base}_2un.trim.fastq.gz \
                    SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15
done 

Exercise: What does the command basename do?
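Hint: basename removes the directory path from a file name and, if a suffix is given as a second argument, strips that suffix as well. For example:

basename ~/intro_to_linux/data/untrimmed_fastq/SRR2589044_1.fastq.gz _1.fastq.gz   # prints SRR2589044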

It may take more than a few minutes to run. Once it finishes, take a look at your folder; you should see output files for each of our samples.

Exercise: We trimmed our fastq files with Nextera adapters, but there are other adapters that are commonly used. What other adapter files came with Trimmomatic?
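Hint: you can list the adapter files that ship with Trimmomatic; as noted earlier, the exact path may differ depending on your Trimmomatic version:

ls ~/.conda/pkgs/trimmomatic-*/share/trimmomatic-*/adapters/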

Now, let's move all the trimmed files to a new folder trimmed_fastq.

mkdir ~/intro_to_linux/data/trimmed_fastq
cd ~/intro_to_linux/data/untrimmed_fastq
mv *.trim.* ../trimmed_fastq
cd ../trimmed_fastq
ls 

Variant Calling Workflow

The data we are working with come from a long-term evolution study of an E. coli population. Now that we have cleaned the data, we can perform variant calling to see how the population changed over time. We want to know how the population changed compared to the original strain, E. coli REL606, so we will align each of our samples to the E. coli REL606 reference genome to see what differences there are between them.

Alignment to a reference genome

[Figure: variant calling workflow]

We perform read alignment or mapping to determine where in the genome our reads originated from.

There are a number of tools to choose from, and some tools are better suited to particular NGS analyses. In this experiment, we will use the Burrows-Wheeler Aligner (BWA), which is software for mapping low-divergent sequences against a large reference genome.

The alignment process consists of two steps:

  1. Indexing the reference genome
  2. Aligning the reads to the reference genome.

Setting up

Download the reference genome of E. coli REL606:

mkdir -p ~/intro_to_linux/data/ref_genome
cd ~/intro_to_linux/data/ref_genome
curl -L -o ecoli_rel606.fasta.gz http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/985/GCA_000017985.1_ASM1798v1/GCA_000017985.1_ASM1798v1_genomic.fna.gz 
gunzip ecoli_rel606.fasta.gz

We will also need to download a set of trimmed FASTQ files. These are small subsets of the original trimmed files, which will let us run the variant calling workflow quickly.

mkdir ~/intro_to_linux/data/trimmed_fastq_small
cd ~/intro_to_linux/data/trimmed_fastq_small
curl -L -o sub.tar.gz https://ndownloader.figshare.com/files/14418248
tar xvf sub.tar.gz
ls 

The archive creates a new directory called sub that contains all of the files. To remove an unnecessary layer from our file paths, we can move all the files in the sub directory up to the trimmed_fastq_small directory.

ls 
mv sub/* .
ls

The output should look like this:

SRR2584863_1.trim.sub.fastq  SRR2584866_2.trim.sub.fastq  sub.tar.gz
SRR2584863_2.trim.sub.fastq  SRR2589044_1.trim.sub.fastq
SRR2584866_1.trim.sub.fastq  SRR2589044_2.trim.sub.fastq 

We will also need to create some new directories for our output files from the workflow. mkdir can take multiple arguments as input to create many new directories at once.

cd ~/intro_to_linux/results
mkdir -p sam bam bcf vcf

Index the reference genome

The first step of alignment is to index the reference genome, which allows the aligner to quickly find potential alignment sites for query sequences. Indexing only has to be run once per genome; however, if you work with a different genome or a different alignment tool, you will have to index again.

cd ~/intro_to_linux/data/ref_genome
bwa index ecoli_rel606.fasta 

The output on screen should look like this:

[bwa_index] Pack FASTA... 0.04 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 1.05 seconds elapse.
[bwa_index] Update BWT... 0.03 sec
[bwa_index] Pack forward-only FASTA... 0.02 sec
[bwa_index] Construct SA from BWT and Occ... 0.57 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index data/ref_genome/ecoli_rel606.fasta
[main] Real time: 1.765 sec; CPU: 1.715 sec 
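bwa index writes the index files alongside the FASTA file. You can confirm they were created; the .amb, .ann, .bwt, .pac and .sa files are the standard set produced by bwa index:

ls ecoli_rel606.fasta.*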

Align reads to reference genome

After indexing the reference genome, the next step is to map our reads against it. There are a few algorithms to choose from within BWA; for our sequence data we will use the BWA-MEM algorithm. BWA-MEM is good for Illumina, 454, Ion Torrent and Sanger reads of 70 bp or longer, as well as assembly contigs and BAC sequences.

An example of how bwa command works:

bwa mem ref_genome.fasta input_file_R1.fastq input_file_R2.fastq > output.sam 

  • mem specifies the algorithm we are using
  • ref_genome.fasta is our reference genome
  • input_file_R1.fastq and input_file_R2.fastq are our sequencing read files
  • output.sam is the output file from the alignment

There are many more options you can use for the alignment, please see the BWA manual page.

We are going to use only one sample, SRR2584866, from our dataset for the alignment. Later, we will iterate the process over all of our samples.

cd ~/intro_to_linux
bwa mem data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq data/trimmed_fastq_small/SRR2584866_2.trim.sub.fastq > results/sam/SRR2584866.aligned.sam 

The correct output should look like this:

[main] Version: 0.7.17-r1188
[main] CMD: bwa mem data/ref_genome/ecoli_rel606.fasta data/trimmed_fastq_small/SRR2584866_1.trim.sub.fastq data/trimmed_fastq_small/SRR2584866_2.trim.sub.fastq
[main] Real time: 10.141 sec; CPU: 10.094 sec

You can inspect the SAM file we got by using the head command.
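For example, from the ~/intro_to_linux directory:

head -n 5 results/sam/SRR2584866.aligned.sam

The lines beginning with @ are the header; the lines after are alignment records.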

SAM/BAM format

SAM stands for Sequence Alignment Map. It is a tab-delimited text file that stores information for each individual read and its alignment to the genome. Most often it is generated as a human readable version of its sister BAM format, which stores the same data in a compressed, indexed, binary form. BAM enables efficient random access of the data contained within the file.

A SAM file begins with a header, although the header is optional. The header is used to describe the source of the data, the reference genome, the method of alignment, etc. Following the header is the alignment section, where each line corresponds to the alignment information for a single read. Each alignment line has 11 mandatory fields for essential mapping information, and a variable number of other fields for aligner-specific information.

An example line of the SAM format is displayed below:

[Figure: an example SAM alignment line with its fields annotated]

Now, let's convert our SAM file to BAM format. We will use the samtools view command with the -S and -b options: -S means the input file is in SAM format, and -b means we want the output in BAM format.

cd ~/intro_to_linux
samtools view -S -b results/sam/SRR2584866.aligned.sam > results/bam/SRR2584866.aligned.bam 

Because BAM is a binary format, inspecting it directly with head prints unreadable characters; instead, we can pipe the output of samtools view through head to look at the records as readable text.
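For example:

samtools view results/bam/SRR2584866.aligned.bam | head -n 5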

Sort BAM file by coordinates

Next, we sort the BAM file using the sort command from samtools. The -o option tells the command where to write the output.

samtools sort -o results/bam/SRR2584866.aligned.sorted.bam results/bam/SRR2584866.aligned.bam 

SAM/BAM files can be sorted in multiple ways, such as by alignment location or by read name. It is important to know that different alignment tools output differently sorted SAM/BAM files, and different downstream tools require differently sorted SAM/BAM files as input.
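For instance, if a downstream tool required name-sorted input, the standard -n option of samtools sort would produce it. This is shown only as an illustration (the output file name here is just an example); our workflow uses the coordinate-sorted file:

# Sort by read name instead of by coordinates
samtools sort -n -o results/bam/SRR2584866.aligned.namesorted.bam results/bam/SRR2584866.aligned.bam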

We can use samtools to learn more about this sorted BAM file as well.

samtools flagstat results/bam/SRR2584866.aligned.sorted.bam

The output should look like this:

351169 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1169 + 0 supplementary
0 + 0 duplicates
351103 + 0 mapped (99.98% : N/A)
350000 + 0 paired in sequencing
175000 + 0 read1
175000 + 0 read2
346688 + 0 properly paired (99.05% : N/A)
349876 + 0 with itself and mate mapped
58 + 0 singletons (0.02% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Variant Calling

A variant call is a finding that there is a difference in a single letter of DNA in a sequence compared to the reference genome. The difference is called a Single Nucleotide Variant (SNV). When making a variant call, a tool is used to compare the sample's sequence to a reference genome and determine the frequency of the variant and the confidence in the call. There are many tools available for making variant calls.

Here, we will use bcftools. There are a few things we need to do before the actual calling step.

[Figure: variant calling workflow steps]

Step 1: Calculate the read coverage of positions in the genome

We will use bcftools with the mpileup command to generate a file with coverage information for every base.

bcftools mpileup -O b \
                -o results/bcf/SRR2584866_raw.bcf \
                -f data/ref_genome/ecoli_rel606.fasta \
                results/bam/SRR2584866.aligned.sorted.bam 

  • -O tells bcftools what type of output file you want
  • -o tells bcftools where to write the output file
  • -f tells bcftools the path to the reference genome
  • the last argument is the input file

Note: there might be an error saying:

bcftools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory

Try conda install openssl=1.0 to solve the problem.

Step 2: Detect the single nucleotide variants (SNVs)

We detect the SNVs using bcftools with the call command.

bcftools call --ploidy 1 \
                -m -v \
                -o results/vcf/SRR2584866_variants.vcf \
                results/bcf/SRR2584866_raw.bcf

  • --ploidy specifies the ploidy of the genome (E. coli is haploid, so we use 1)
  • -m allows for multiallelic and rare-variant calling
  • -v tells the program to output variant sites only (not every site in the genome)
  • -o specifies where to write the output file
  • the last argument is the path to the input file

Step 3: Filter and report the SNVs in variant calling format (VCF)

Filter out likely false-positive SNVs and produce the final output in VCF format using vcfutils.pl:

vcfutils.pl varFilter results/vcf/SRR2584866_variants.vcf > results/vcf/SRR2584866_final_variants.vcf 

Explore the VCF format

less -S results/vcf/SRR2584866_final_variants.vcf 

You will see information such as the file format, the version of bcftools used, the command line that was used, and other metadata at the beginning of the VCF file.

Following the header comes the variant information:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  results/bam/SRR2584866.aligned.sorted.bam
CP000819.1      1521    .       C       T       207     .       DP=9;VDB=0.993024;SGB=-0.662043;MQSB=0.974597;MQ0F=0;AC=1;AN=1;DP4=0,0,4,5;MQ=60        GT:PL   1:237,0
CP000819.1      1612    .       A       G       225     .       DP=13;VDB=0.52194;SGB=-0.676189;MQSB=0.950952;MQ0F=0;AC=1;AN=1;DP4=0,0,6,5;MQ=60        GT:PL   1:255,0
CP000819.1      9092    .       A       G       225     .       DP=14;VDB=0.717543;SGB=-0.670168;MQSB=0.916482;MQ0F=0;AC=1;AN=1;DP4=0,0,7,3;MQ=60       GT:PL   1:255,0
CP000819.1      9972    .       T       G       214     .       DP=10;VDB=0.022095;SGB=-0.670168;MQSB=1;MQ0F=0;AC=1;AN=1;DP4=0,0,2,8;MQ=60      GT:PL   1:244,0
CP000819.1      10563   .       G       A       225     .       DP=11;VDB=0.958658;SGB=-0.670168;MQSB=0.952347;MQ0F=0;AC=1;AN=1;DP4=0,0,5,5;MQ=60       GT:PL   1:255,0
CP000819.1      22257   .       C       T       127     .       DP=5;VDB=0.0765947;SGB=-0.590765;MQSB=1;MQ0F=0;AC=1;AN=1;DP4=0,0,2,3;MQ=60      GT:PL   1:157,0
CP000819.1      38971   .       A       G       225     .       DP=14;VDB=0.872139;SGB=-0.680642;MQSB=1;MQ0F=0;AC=1;AN=1;DP4=0,0,4,8;MQ=60      GT:PL   1:255,0
CP000819.1      42306   .       A       G       225     .       DP=15;VDB=0.969686;SGB=-0.686358;MQSB=1;MQ0F=0;AC=1;AN=1;DP4=0,0,5,9;MQ=60      GT:PL   1:255,0
CP000819.1      45277   .       A       G       225     .       DP=15;VDB=0.470998;SGB=-0.680642;MQSB=0.95494;MQ0F=0;AC=1;AN=1;DP4=0,0,7,5;MQ=60        GT:PL   1:255,0
CP000819.1      56613   .       C       G       183     .       DP=12;VDB=0.879703;SGB=-0.676189;MQSB=1;MQ0F=0;AC=1;AN=1;DP4=0,0,8,3;MQ=60      GT:PL   1:213,0

  • CHROM: contig location where the variation occurs
  • POS: position within the contig where the variation occurs
  • ID: a . until we add annotation information
  • REF: reference genotype (forward strand)
  • ALT: sample genotype (forward strand)
  • QUAL: Phred-scaled probability that the observed variant exists at this site (higher is better)
  • FILTER: a . if no quality filters have been applied, PASS if a filter is passed, or the name of the filters this variant failed

Ideally, the information in the QUAL column would be enough for us to filter out bad variant calls. But in reality, there are other metrics we need to consider.
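Read depth (DP) is one such metric. A minimal sketch of depth-based filtering with bcftools (the DP>10 threshold here is purely illustrative):

# Count the variants supported by more than 10 reads
bcftools view -i 'INFO/DP>10' results/vcf/SRR2584866_final_variants.vcf | grep -v "^#" | wc -l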

The last two columns contain the genotypes and can be tricky to decode:

  • FORMAT: lists in order the metrics presented in the final column
  • results/bam/SRR2584866.aligned.sorted.bam: lists the values associated with those metrics, in order

For our file, the metrics presented are GT:PL.

  • GT: the genotype for the sample at this location. For a diploid, the GT field indicates the two alleles carried by the sample, 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele etc. 0/0 means homozygous reference, 0/1 means heterozygous, and 1/1 means homozygous for the alternate allele.
  • PL: the likelihood for the given genotypes

The Broad Institute's VCF Guide is an excellent place to learn more about the VCF file format.

Exercise: Use the grep and wc commands to determine how many variants are in the VCF file.

grep -v "#" results/vcf/SRR2584866_final_variants.vcf | wc -l 

The correct output should be 766; there are 766 variants in this file.

References