# Dada2 {#dada}
```{r,warning=FALSE,message=FALSE}
library(tidyverse)
library(dada2)
library(gridExtra)
library(DECIPHER)
library(ape)
library(phangorn)
library(phyloseq)
```
First, we need to call a few QIIME scripts from the command line, so we'll save their paths as R variables.
```{r}
validate_mapping_file <- '/data/sw1/anaconda3/envs/qiime1/bin/validate_mapping_file.py'
split_libraries_fastq <- '/data/sw1/anaconda3/envs/qiime1/bin/split_libraries_fastq.py'
split_sequence_file_on_sample_ids <- '/data/sw1/anaconda3/envs/qiime1/bin/split_sequence_file_on_sample_ids.py'
```
## FASTQ Prep
We're going to use the same data as we did with QIIME. Dada2 requires an individual fastq file for each sample, so we'll split our single fastq file using a few QIIME commands.
```{r}
data_dir <- 'data/data_moving_pictures'
MAP <- read_delim(file.path(data_dir,'map.tsv'),'\t')
```
Note that now we are forcing the split libraries command to also return a demultiplexed fastq file. We're also going to add several arguments to ensure that QIIME does *no* filtering; we want to handle filtering with dada2 instead.
```{r}
system2(split_libraries_fastq,args=c('-o',file.path(data_dir,'fastq_out_3'),
'-i',file.path(data_dir,'forward_reads.fastq.gz'),
'-b',file.path(data_dir,'barcodes.fastq.gz'),
'-m',file.path(data_dir,'map.tsv'),
'-r','999',
'-n','999',
'-q','0',
'-p','0.0001',
'--store_demultiplexed_fastq'))
```
Next, we'll split this fastq file into separate files, one per sample:
```{r}
system2(split_sequence_file_on_sample_ids,args=c('-i',file.path(data_dir,'fastq_out_3','seqs.fastq'),
'-o',file.path(data_dir,'fastq_out_3','sequences'),
'--file_type','fastq'))
```
## OTU Picking
We're now going to run through the dada2 workflow. Dada2 is a *reference-free* method, so this is analogous to the de novo OTU picking we would have done with QIIME. Still, we can cluster the resulting sequences into OTUs against a reference database; hence, we can compare our results.
Dada2 separates true biological variation from Illumina sequencing error by explicitly modeling that error. Briefly, everything is based on an error model (a Poisson distribution). A given read is designated the *sample* sequence of a partition and is compared to all other reads. The model calculates the probability that those reads were generated from the sample sequence given the error model -- that is, the probability that they resulted from independent sequencing errors governed by the transition probabilities and quality scores. If a set of reads is too abundant to be explained by the error model, it is split off into its own partition. The game is to keep partitioning the reads until each partition is consistent with the error model, allowing us to separate true biological variation from variation due solely to sequencing error. At that point, the abundance of reads within each partition can be tallied, giving us an abundance table. We can then compare the sequence associated with each partition to a reference database to assign taxonomy.
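To make the partitioning test concrete, here is a minimal sketch of the abundance p-value idea with made-up numbers (purely illustrative, not dada2's internal code): given the error-model probability that a read derived from the sample sequence would be misread into a particular candidate sequence, the expected abundance of that candidate is Poisson, and a candidate whose observed abundance is too large to be explained by that expectation gets its own partition.
```{r}
# Hypothetical numbers, purely illustrative
n_reads <- 5000            # reads currently assigned to the partition
p_err   <- 1e-4            # error-model probability of producing the candidate sequence
lambda  <- n_reads * p_err # expected abundance of the candidate under error alone
abund   <- 12              # observed abundance of the candidate sequence

# Tail probability of seeing an abundance at least this large from error alone;
# a tiny value is evidence the candidate is real biological variation (new partition)
ppois(abund - 1, lambda, lower.tail=FALSE)
```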
```{r}
fqs <- list.files(file.path(data_dir,'fastq_out_3','sequences'),full.names=TRUE)
```
The first thing we'll do is plot the quality scores as a function of base position. If you recall the definition of Q from the QIIME tutorial, this should make sense to you, and it should also help you appreciate the quality-filtering parameters we set earlier.
```{r}
sample_idx <- sample(length(fqs),5)
for (i in sample_idx) print(plotQualityProfile(fqs[i]))
```
Per the Holmes group, Illumina datasets often have errors in the first 10 base positions. Also, given the plots above, there seems to be a drop in quality towards the end of each read. Hence, we'll trim the first 10 bases and truncate each read at position 130, keeping bases 11 through 130.
```{r}
fqs_filt <- gsub('sequences','filtered',fqs)
dir.create(gsub('(filtered).*','\\1',fqs_filt[1]),showWarnings=FALSE)
for (i in seq_along(fqs)){
fastqFilter(fqs[i],fqs_filt[i],
trimLeft=10, truncLen=130,
maxN=0, maxEE=2, truncQ=2,
compress=TRUE)
}
```
The following command performs dereplication, returning a set of unique sequences and abundances from our set of fastq files.
```{r}
derep <- derepFastq(fqs_filt)
names(derep) <- sapply(strsplit(basename(fqs_filt), "_"), `[`, 1)
```
Dada2's error model is parameterized by a 16 x 41 matrix of transition probabilities (each of the 16 possible base-to-base transitions at each of the 41 quality scores). If these values are unknown, we can simply estimate them from the data. However, estimating the error rates to parameterize the model is costly, so it's recommended to do this on a subset of the data and then use those parameter estimates for the complete dataset:
```{r}
dd_err <- dada(derep[1:10], err=NULL, selfConsist=TRUE)
```
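If you want to see the 16 x 41 structure directly, the learned error rates are stored in the fitted object; the `err_out` matrix below is the same one we pass to `dada()` in a later step, and its dimensions should correspond to 16 transition types by 41 quality scores:
```{r}
# rows: base transitions (A->A, A->C, ..., T->T); columns: quality scores 0-40
dim(dd_err[[1]]$err_out)
```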
We can visualize the error estimates. This shows the frequency of each base transition as a function of quality score.
```{r}
plotErrors(dd_err,err_in=TRUE,nominalQ=TRUE)
```
We'll now fit the full error model, using the estimated error rates from our subset of data. This step can be parallelized using the multithread argument. We're also going to pool across samples, since this improves the detection of variants that are rare in a specific sample but less rare overall, at the cost of extra computation.
Note that we can pass many arguments into this function to control the error model. As briefly described above, the game is to partition the sequences until each partition is consistent with having been generated solely by Illumina and amplification error, and not by biological variation. A p-value is calculated to test the sequences that would form new partitions, with p-values significant at a given threshold leading to new partitions.
Now, say we had a problem where rare sequence variants were of most interest. It would then make sense to use a less conservative significance threshold, so that more sequences are called significant, creating more partitions and hence detecting more rare variants. We can do this by raising the **OMEGA_A** threshold from its default value of $1\times10^{-40}$. We'll fit a second model with the threshold set to $1\times10^{-20}$.
```{r}
dd <- dada(derep, err=dd_err[[1]]$err_out, pool=TRUE)
dd_rare <- dada(derep, err=dd_err[[1]]$err_out, pool=TRUE, OMEGA_A=1e-20)
```
Finally, we'll make our sequence table.
```{r}
seqtab_all <- makeSequenceTable(dd)
seqtab_all_rare <- makeSequenceTable(dd_rare)
```
And then we'll remove chimeras. This function compares sequences against one another and removes sequences that could have been generated by joining pieces of two more abundant sequences.
```{r}
seqtab <- removeBimeraDenovo(seqtab_all)
seqtab_rare <- removeBimeraDenovo(seqtab_all_rare)
```
First, note the difference in dimensions; there are more sequences in the rare table:
```{r}
ncol(seqtab)
ncol(seqtab_rare)
```
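Beyond the number of unique sequences, it's also worth checking what fraction of total reads survived chimera removal; a quick sketch:
```{r}
# proportion of reads retained after bimera removal
sum(seqtab)/sum(seqtab_all)
sum(seqtab_rare)/sum(seqtab_all_rare)
```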
Also note that although we now have a sequence abundance table similar in form to the OTU table we generated in QIIME, our 'taxonomic variants' are unique sequences and *not* OTUs:
```{r}
seqtab[1:5,1:5]
```
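The column names of this table are the denoised sequences themselves (here roughly 120 bp after trimming), which is easy to confirm:
```{r}
# lengths of the sequences used as column names
head(nchar(colnames(seqtab)))
```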
If we want to assign taxonomy from a reference database to these sequences, we can use the following command, which applies a naive Bayes classifier to compare our sequences to classified sequences in a training set. First, we'll use a Greengenes training set:
```{r,eval=FALSE}
ref_fasta <- 'data/data_stability/references/gg_13_8_train_set_97.fa.gz'
taxtab_gg <- assignTaxonomy(seqtab, refFasta=ref_fasta)
colnames(taxtab_gg) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
```
Now, instead, we can try a Silva training set. Note that this Silva training set does not give species-level assignments.
```{r,eval=FALSE}
ref_fasta <- 'data/data_stability/references/silva_nr_v123_train_set.fa.gz'
taxtab_silva <- assignTaxonomy(seqtab, refFasta=ref_fasta)
colnames(taxtab_silva) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")
```
If we want species-level assignments, we can do the following with `assignSpecies`, which uses exact matching against a species-level reference:
```{r,eval=FALSE}
ref_fasta <- 'data/data_stability/references/rdp_species_assignment_14.fa.gz'
sptab_silva <- assignSpecies(seqtab, refFasta=ref_fasta, allowMultiple=FALSE, verbose=TRUE)
```
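Alternatively, if we'd rather end up with a single table that has species appended to the genus-level Silva assignments, dada2 also provides `addSpecies`, which likewise uses exact matching; a minimal sketch, assuming the same species-level reference file as above:
```{r,eval=FALSE}
# append a Species column to the genus-level Silva table via exact matching
taxtab_silva_sp <- addSpecies(taxtab_silva, refFasta=ref_fasta, verbose=TRUE)
```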
Next, we might want to build a phylogenetic tree. First, we perform a multiple sequence alignment:
```{r,eval=FALSE}
seqs <- getSequences(seqtab)
names(seqs) <- seqs
alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA)
```
Then, we'll build a tree; specifically, a maximum-likelihood (GTR) tree starting from a neighbor-joining (NJ) tree.
```{r,eval=FALSE}
phang_align <- phyDat(as.matrix(alignment), type="DNA")
dm <- dist.ml(phang_align)
treeNJ <- NJ(dm) # Note, tip order != sequence order
fit <- pml(treeNJ, data=phang_align)
fit_gtr <- update(fit, k=4, inv=0.2)
fit_gtr <- optim.pml(fit_gtr, model="GTR", optInv=TRUE, optGamma=TRUE,
rearrangement = "stochastic", control = pml.control(trace = 0))
```
And finally, we can build our phyloseq object for our Greengenes table:
```{r,eval=FALSE}
TREE <- phy_tree(fit_gtr)
META <- as.data.frame(MAP)
rownames(META) <- META$`#SampleID`
OTU <- seqtab
rownames(OTU) <- gsub('\\.fastq','',rownames(OTU))
OTU <- OTU[rownames(META),]
OTU <- otu_table(OTU,taxa_are_rows=FALSE)
META <- sample_data(META)
TAXA <- taxtab_gg[colnames(OTU),]
TAXA <- tax_table(TAXA)
PS <- phyloseq(OTU,TAXA,META,TREE)
```
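With the phyloseq object assembled, a few quick accessor calls confirm that the components line up:
```{r,eval=FALSE}
PS
ntaxa(PS)              # number of sequence variants
nsamples(PS)           # number of samples
sample_variables(PS)   # metadata columns from the mapping file
```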
## Running Dada2 on Proteus
Now let's redo the analysis above, but on a cluster. I'll explain below how to run dada2 in two ways. The first involves simply using packages installed in a shared group folder. The second covers installing dada2 in a local folder. If you lack access to the shared folder, or if for some reason it no longer exists, use Method 2.
### Method 1: Using nameGrp Shared R Library
First, we'll create a bash script that runs the QIIME commands to do the demultiplexing and splitting; the dada2 analysis itself will be handled by a separate R script and submission script below. We'll assume that the moving pictures data is in your home directory. As before, the folder should contain three files: (1) the reads, (2) the barcodes, and (3) the mapping file.
We'll name the script **prep_sub.sh**:
```{r,eval=FALSE}
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -M [email protected]
#$ -l h_rt=01:00:00
#$ -P namePrj
#$ -l mem_free=12G
#$ -l h_vmem=16G
#$ -q all.q
. /etc/profile.d/modules.sh
module load shared
module load proteus
module load sge/univa
module load gcc/4.8.1
module load qiime/gcc/64/1.9.1
export ref_seqs=/mnt/HA/opt/qiime/gcc/64/1.9.1/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta
export ref_tax=/mnt/HA/opt/qiime/gcc/64/1.9.1/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt
data_dir=/home/user_name/dirname/dada2/moving_pictures
work_dir=/scratch/user_name/moving_pictures
mkdir -p $work_dir
cp $data_dir/* $work_dir
out_dir=$work_dir/fastq_out
seqs=$work_dir/forward_reads.fastq.gz
bc=$work_dir/barcodes.fastq.gz
map=$work_dir/map.tsv
split_libraries_fastq.py -o $out_dir -i $seqs -b $bc -m $map -r 999 -n 999 -q 0 -p 0.0001 --store_demultiplexed_fastq
split_sequence_file_on_sample_ids.py -i $out_dir/seqs.fastq -o $out_dir/sequences --file_type fastq
mv $out_dir/sequences $data_dir
rm -r $work_dir/*
exit 0
```
This gives us our demultiplexed, split sequences. Now, we'll make a dada2 R script that performs the actual analysis. We'll call this **dada.R**.
```{r, eval=FALSE}
require(dada2)
scratch_path <- '/scratch/user_name/moving_pictures'
ref_fasta <- file.path(scratch_path,'silva_nr_v123_train_set.fa.gz')
fq_dir <- file.path(scratch_path,'sequences')
fqs <- list.files(fq_dir,full.names=TRUE)
fqs_filt <- gsub('sequences','filtered',fqs)
dir.create(gsub('(filtered).*','\\1',fqs_filt[1]),showWarnings=FALSE)
for (i in seq_along(fqs)){
fastqFilter(fqs[i],fqs_filt[i],
trimLeft=10, truncLen=130,
maxN=0, maxEE=2, truncQ=2,
compress=TRUE)
}
derep <- derepFastq(fqs_filt)
names(derep) <- sapply(strsplit(basename(fqs_filt), "_"), `[`, 1)
dd_err <- dada(derep[1:10],err=NULL,selfConsist=TRUE, multithread=TRUE,VERBOSE=TRUE)
dd <- dada(derep, err=dd_err[[1]]$err_out, pool=TRUE, multithread=TRUE,VERBOSE=TRUE)
seqtab_all <- makeSequenceTable(dd)
seqtab <- removeBimeraDenovo(seqtab_all,tableMethod='pooled',verbose=TRUE, multithread=TRUE)
taxtab_silva <- assignTaxonomy(seqtab,refFasta=ref_fasta,verbose=TRUE)
colnames(taxtab_silva) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")
saveRDS(seqtab,file.path(scratch_path,'seqtab.rds'))
saveRDS(taxtab_silva,file.path(scratch_path,'taxtab.rds'))
```
Finally, we'll create the submission script. **Note the following change that allows you to use the packages in the shared folder.** In this submission script, immediately after you load your modules, you must add the line:
```{r,eval=FALSE}
export R_LIBS=/mnt/HA/groups/nameGrp/r_libs
```
This gives us the following submission script:
```{r, eval=FALSE}
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -M [email protected]
#$ -l h_rt=01:00:00
#$ -P namePrj
#$ -pe shm 16
#$ -l mem_free=12G
#$ -l h_vmem=16G
#$ -q all.q
. /etc/profile.d/modules.sh
module load shared
module load proteus
module load sge/univa
module load gcc/4.8.1
export R_LIBS=/mnt/HA/groups/nameGrp/r_libs
data_dir=/home/user_name/dirname/dada2
work_dir=/scratch/user_name/moving_pictures
mkdir -p $work_dir
cp -r $data_dir/moving_pictures/sequences $work_dir
cp $data_dir/dada.R $work_dir
cp ~/references/silva_nr_v123_train_set.fa.gz $work_dir
R CMD BATCH $work_dir/dada.R
mv $work_dir/*.Rout $data_dir
mv $work_dir/*.rds $data_dir
rm -rf $work_dir
exit 0
```
### Method 2: Creating a Local Library
We first need to make two scripts. One is an R script that installs our packages into a personal library folder. The other is a bash script that runs it, but first makes said folder and unloads any preloaded gcc modules that may cause conflicts during package installation.
First, make sure you're in your home directory. We'll now make the R package installer script. We'll call it **install_r_pkgs.R**.
```{r,eval=FALSE}
MYLIB <- Sys.getenv('R_LIBS_USER')
source('https://bioconductor.org/biocLite.R')
biocLite('dada2',lib=MYLIB)
```
Next, we'll make the bash script that runs this, which we'll call **run_install_r_pkgs.sh**:
```{r,eval=FALSE}
#!/bin/bash
module unload gcc
Rscript -e "dir.create(Sys.getenv('R_LIBS_USER'),showWarnings=FALSE,recursive=TRUE)"
R CMD BATCH install_r_pkgs.R
```
Finally, we'll run the bash script by entering the following at the command line (this will take a few minutes to run):
```{r,eval=FALSE}
chmod +x run_install_r_pkgs.sh
./run_install_r_pkgs.sh
```
You should now have an R folder in your home directory, and if you navigate through it, you should see a dada2 folder. To make sure the installation worked, in your home directory, type **R** to enter the R environment and then run **library(dada2)**. Assuming everything loads correctly, we can proceed to submitting a job.
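A minimal check, assuming the package installed into the default `R_LIBS_USER` location:
```{r,eval=FALSE}
# at the shell prompt, from your home directory:
# R
# then, inside R:
library(dada2)
packageVersion('dada2')
```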
**Note that in the submission script, you must remove the line where we changed the R_LIBS path**:
```{r,eval=FALSE}
export R_LIBS=/mnt/HA/groups/nameGrp/r_libs
```
which gives us the following submission script:
```{r, eval=FALSE}
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -M [email protected]
#$ -l h_rt=01:00:00
#$ -P namePrj
#$ -pe shm 16
#$ -l mem_free=12G
#$ -l h_vmem=16G
#$ -q all.q
. /etc/profile.d/modules.sh
module load shared
module load proteus
module load sge/univa
module load gcc/4.8.1
data_dir=/home/user_name/dirname/dada2
work_dir=/scratch/user_name/moving_pictures
mkdir -p $work_dir
cp -r $data_dir/moving_pictures/sequences $work_dir
cp $data_dir/dada.R $work_dir
cp ~/references/silva_nr_v123_train_set.fa.gz $work_dir
R CMD BATCH $work_dir/dada.R
mv $work_dir/*.Rout $data_dir
mv $work_dir/*.rds $data_dir
rm -rf $work_dir
exit 0
```