Binned quality scores and their effect on (non-decreasing) trans rates #1307
Hi Jake, I consider this the issue for binned quality scores, and will direct future related issues this way.
A follow-up to my original post: I ended up trying a couple of additional approaches/variants of a modified error-estimation function. Two generalizable solutions have been offered so far.

1) Altering the loess arguments (Guillem Salazar's solution in #938). They used log10 of the total counts for the weights and span = 2, while the original/standard function uses total counts for weights and the default value (0.75) for the span parameter. I was curious to see what the impact of varying just the weights argument was, as opposed to changing the span value at the same time (two things moving at the same time makes it hard to predict the effects of just one).

2) Enforcing monotonicity in the error rates (#791). It sounds like it worked for them, so maybe it would work for us too!

Trials
Summarizing what trials I wanted to test (for clarity):
- alter the loess weights and span, and enforce monotonicity
- only enforce monotonicity
- only alter the loess weights, and enforce monotonicity
@JacobRPrice really impressive work! Would be great to get @benjjneb's thoughts, though this argues that one could at least try different approaches with some follow-up QA to see what works best.
Hi Jake,

I sent an email briefly describing the data and shared a Google Drive link to the raw data as a tarball. Please let me know if you have any questions or I can provide clarification.

Outline of QC protocol:
- cutadapt was used to remove primers
- after removing the primers, we carried out QC using the following

I can send you my exact scripts if you need more detailed information.

Jake
@JacobRPrice @benjjneb I'm a bioinformatics analyst from the University of Maryland. I've experimented with error modeling on a NovaSeq dataset using the following loess function, which weights by totals and uses a degree of 1. I figured that a degree of 1 would force the underlying fits to be linear, thus "discouraging" the final model from changing direction. (The resulting models still changed direction, but oh well.) I didn't enforce monotonicity, but I think it would help to do so, to remove the dips in the models during learnErrors().
My understanding is that the "totals" here are highest for the three quality scores that actually appear in my NovaSeq reads: 11, 23, and 37. Allowing those three datapoints to dominate the loess fit seems to maintain an overall downward slope in the post-DADA fits, which seems desirable to me. I've also tried weighting by
Unfortunately, I'm not at liberty to share my own dataset, but like yourself, I am curious to find out if these parameters are generalizable to other datasets. Here are the results when I used @JacobRPrice's "Trial 1" model function on my dataset.
Hi @jonalim, Thanks for your comments. I am running some tests with Jake's data. I am wondering if you would mind sharing the "trans" matrix with me? It would be very helpful if I could run the same tests on both matrices. Thank you in advance.
I have the same issue, though I do not have binned quality scores. @JacobRPrice's "Trial 1" solution does not work for me.
@Y-Q-Si Certainly. Here's a link to an RDS and the script used to create it. The RDS contains a list of three lists, one for each error model function. Each internal list contains:
https://drive.google.com/drive/folders/11GThjrUKdcL_n64EbXdxf1Ww_gXqPE87?usp=sharing
@andzandz11
@andzandz11 - you are also trying to use 1e20 bp, which seems an unreasonably large number.

Overall, are there any solid recommendations yet? We've received word from our sequencing center that they are about to run a few hundred of our samples on a NovaSeq (instead of the usual MiSeq), so we are about to be in the same situation of fitting error models against binned quality scores.

Also, I first ran up against this issue with Illumina RTA3 not with amplicon data, but with variant calling and the need there to distinguish between low-frequency variants and sequencing error. I wonder if the folks developing those tools, such as LoFreq (@CSB5) or similar, have some clever ways to use the bins.
I'll second Jeff's request. I'd love to hear any recommendations, official or otherwise. We have several NovaSeq datasets that are currently tabled.
Hi All! Just rejoining this conversation after a long hiatus.

First, @JacobRPrice thank you for all the work you've been doing on this. It's extremely helpful. I will go ahead and try your three methods on a recent NovaSeq soils dataset and report back.

Second, I am now working with a new sequencing center that, due to pandemic-related issues, will be sequencing all amplicon data only on NovaSeq, so this challenge has recently become much more pressing on my end. The good news is that my group may actually have the resources to create a dataset with the same samples sequenced on HiSeq and resequenced on NovaSeq. @benjjneb are you still looking for a comparison dataset for this? If so, I can see if we can push that project forward. If not, is the comparison dataset you're currently using public?
Hi all, we're encountering some issues in two recent NovaSeq runs so will be jumping on this wagon too!

Just FYI @hhollandmoritz @JacobRPrice @benjjneb @Y-Q-Si: we have some 16S and 18S seawater microbial community mocks of known composition that are not denoising well from two recent NovaSeq runs, and they've been sequenced many times before (HiSeq, MiSeq, you name it). In these recent NovaSeq runs we're getting lots of 1-mismatches to the mocks that represent a good fraction of the total reads (up to something like 8-10% of the mock, which is a bit disturbing and not at all usual), and it seems likely that it's related to the issues discussed here.

@JacobRPrice I also agree this is a pretty big issue for qiime2 users - we normally use qiime2 and have had good luck with default DADA2 parameters before, but with our recent issues there's really no way to troubleshoot natively in qiime2. I am working with someone in the lab who is much more savvy with R than I am who will help implement the code, and we'll report back after trying the "combined" solution discussed above.

BTW, if there's interest in others getting the mock data to play with, I could ask Jed if he'd mind us sharing... In any case, thanks everyone for all the really useful info and suggestions!
So time to report back. Like @JacobRPrice I ran the three options, but I also added a 4th trial based on @jonalim's method. The samples are from arctic soils, so a pretty high-diversity environment. I'm not quite sure I understand the differences between the errors generated from the pre-learning step and from after dada2, but I'd be happy to provide those two if anyone thinks they'd be helpful.

Option 1: Alters the weights and span in loess, also enforce monotonicity

loessErrfun_mod1 <- function(trans) {
qq <- as.numeric(colnames(trans))
est <- matrix(0, nrow=0, ncol=length(qq))
for(nti in c("A","C","G","T")) {
for(ntj in c("A","C","G","T")) {
if(nti != ntj) {
errs <- trans[paste0(nti,"2",ntj),]
tot <- colSums(trans[paste0(nti,"2",c("A","C","G","T")),])
rlogp <- log10((errs+1)/tot) # 1 pseudocount for each err, but if tot=0 will give NA
rlogp[is.infinite(rlogp)] <- NA
df <- data.frame(q=qq, errs=errs, tot=tot, rlogp=rlogp)
# original
# ###! mod.lo <- loess(rlogp ~ q, df, weights=errs) ###!
# mod.lo <- loess(rlogp ~ q, df, weights=tot) ###!
# # mod.lo <- loess(rlogp ~ q, df)
# Guillem Salazar's solution
# https://github.com/benjjneb/dada2/issues/938
mod.lo <- loess(rlogp ~ q, df, weights = log10(tot),span = 2)
pred <- predict(mod.lo, qq)
maxrli <- max(which(!is.na(pred)))
minrli <- min(which(!is.na(pred)))
pred[seq_along(pred)>maxrli] <- pred[[maxrli]]
pred[seq_along(pred)<minrli] <- pred[[minrli]]
est <- rbind(est, 10^pred)
} # if(nti != ntj)
} # for(ntj in c("A","C","G","T"))
} # for(nti in c("A","C","G","T"))
# HACKY
MAX_ERROR_RATE <- 0.25
MIN_ERROR_RATE <- 1e-7
est[est>MAX_ERROR_RATE] <- MAX_ERROR_RATE
est[est<MIN_ERROR_RATE] <- MIN_ERROR_RATE
# enforce monotonicity
# https://github.com/benjjneb/dada2/issues/791
estorig <- est
est <- est %>%
data.frame() %>%
mutate_all(funs(case_when(. < X40 ~ X40,
. >= X40 ~ .))) %>% as.matrix()
rownames(est) <- rownames(estorig)
colnames(est) <- colnames(estorig)
# Expand the err matrix with the self-transition probs
err <- rbind(1-colSums(est[1:3,]), est[1:3,],
est[4,], 1-colSums(est[4:6,]), est[5:6,],
est[7:8,], 1-colSums(est[7:9,]), est[9,],
est[10:12,], 1-colSums(est[10:12,]))
rownames(err) <- paste0(rep(c("A","C","G","T"), each=4), "2", c("A","C","G","T"))
colnames(err) <- colnames(trans)
# Return
return(err)
}
# check what this looks like
errF_1 <- learnErrors(
filtFs,
multithread = TRUE,
nbases = 1e10,
errorEstimationFunction = loessErrfun_mod1,
verbose = TRUE
)
## 287310600 total bases in 1276936 reads from 7 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .......
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## selfConsist step 5
## selfConsist step 6
## selfConsist step 7
## selfConsist step 8
## selfConsist step 9
## selfConsist step 10
errR_1 <- learnErrors(
filtRs,
multithread = TRUE,
nbases = 1e10,
errorEstimationFunction = loessErrfun_mod1,
verbose = TRUE
)
## 280925920 total bases in 1276936 reads from 7 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .......
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## selfConsist step 5
## selfConsist step 6
## selfConsist step 7
## selfConsist step 8
## selfConsist step 9
## selfConsist step 10

Option 2: Only enforce monotonicity

loessErrfun_mod2 <- function(trans) {
qq <- as.numeric(colnames(trans))
est <- matrix(0, nrow=0, ncol=length(qq))
for(nti in c("A","C","G","T")) {
for(ntj in c("A","C","G","T")) {
if(nti != ntj) {
errs <- trans[paste0(nti,"2",ntj),]
tot <- colSums(trans[paste0(nti,"2",c("A","C","G","T")),])
rlogp <- log10((errs+1)/tot) # 1 pseudocount for each err, but if tot=0 will give NA
rlogp[is.infinite(rlogp)] <- NA
df <- data.frame(q=qq, errs=errs, tot=tot, rlogp=rlogp)
# original
# ###! mod.lo <- loess(rlogp ~ q, df, weights=errs) ###!
mod.lo <- loess(rlogp ~ q, df, weights=tot) ###!
# # mod.lo <- loess(rlogp ~ q, df)
# Guillem Salazar's solution
# https://github.com/benjjneb/dada2/issues/938
# mod.lo <- loess(rlogp ~ q, df, weights = log10(tot),span = 2)
pred <- predict(mod.lo, qq)
maxrli <- max(which(!is.na(pred)))
minrli <- min(which(!is.na(pred)))
pred[seq_along(pred)>maxrli] <- pred[[maxrli]]
pred[seq_along(pred)<minrli] <- pred[[minrli]]
est <- rbind(est, 10^pred)
} # if(nti != ntj)
} # for(ntj in c("A","C","G","T"))
} # for(nti in c("A","C","G","T"))
# HACKY
MAX_ERROR_RATE <- 0.25
MIN_ERROR_RATE <- 1e-7
est[est>MAX_ERROR_RATE] <- MAX_ERROR_RATE
est[est<MIN_ERROR_RATE] <- MIN_ERROR_RATE
# enforce monotonicity
# https://github.com/benjjneb/dada2/issues/791
estorig <- est
est <- est %>%
data.frame() %>%
mutate_all(funs(case_when(. < X40 ~ X40,
. >= X40 ~ .))) %>% as.matrix()
rownames(est) <- rownames(estorig)
colnames(est) <- colnames(estorig)
# Expand the err matrix with the self-transition probs
err <- rbind(1-colSums(est[1:3,]), est[1:3,],
est[4,], 1-colSums(est[4:6,]), est[5:6,],
est[7:8,], 1-colSums(est[7:9,]), est[9,],
est[10:12,], 1-colSums(est[10:12,]))
rownames(err) <- paste0(rep(c("A","C","G","T"), each=4), "2", c("A","C","G","T"))
colnames(err) <- colnames(trans)
# Return
return(err)
}
# check what this looks like
errF_2 <- learnErrors(
filtFs,
multithread = TRUE,
nbases = 1e10,
errorEstimationFunction = loessErrfun_mod2,
verbose = TRUE
)
## 287310600 total bases in 1276936 reads from 7 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .......
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## selfConsist step 5
## selfConsist step 6
## selfConsist step 7
## selfConsist step 8
## Convergence after 8 rounds.
errR_2 <- learnErrors(
filtRs,
multithread = TRUE,
nbases = 1e10,
errorEstimationFunction = loessErrfun_mod2,
verbose = TRUE
)
## 280925920 total bases in 1276936 reads from 7 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .......
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## selfConsist step 5
## selfConsist step 6
## selfConsist step 7
## Convergence after 7 rounds.

Option 3: Only alter loess weights and also enforce monotonicity

loessErrfun_mod3 <- function(trans) {
qq <- as.numeric(colnames(trans))
est <- matrix(0, nrow=0, ncol=length(qq))
for(nti in c("A","C","G","T")) {
for(ntj in c("A","C","G","T")) {
if(nti != ntj) {
errs <- trans[paste0(nti,"2",ntj),]
tot <- colSums(trans[paste0(nti,"2",c("A","C","G","T")),])
rlogp <- log10((errs+1)/tot) # 1 pseudocount for each err, but if tot=0 will give NA
rlogp[is.infinite(rlogp)] <- NA
df <- data.frame(q=qq, errs=errs, tot=tot, rlogp=rlogp)
# original
# ###! mod.lo <- loess(rlogp ~ q, df, weights=errs) ###!
# mod.lo <- loess(rlogp ~ q, df, weights=tot) ###!
# # mod.lo <- loess(rlogp ~ q, df)
# Guillem Salazar's solution
# https://github.com/benjjneb/dada2/issues/938
# mod.lo <- loess(rlogp ~ q, df, weights = log10(tot),span = 2)
# only change the weights
mod.lo <- loess(rlogp ~ q, df, weights = log10(tot))
pred <- predict(mod.lo, qq)
maxrli <- max(which(!is.na(pred)))
minrli <- min(which(!is.na(pred)))
pred[seq_along(pred)>maxrli] <- pred[[maxrli]]
pred[seq_along(pred)<minrli] <- pred[[minrli]]
est <- rbind(est, 10^pred)
} # if(nti != ntj)
} # for(ntj in c("A","C","G","T"))
} # for(nti in c("A","C","G","T"))
# HACKY
MAX_ERROR_RATE <- 0.25
MIN_ERROR_RATE <- 1e-7
est[est>MAX_ERROR_RATE] <- MAX_ERROR_RATE
est[est<MIN_ERROR_RATE] <- MIN_ERROR_RATE
# enforce monotonicity
# https://github.com/benjjneb/dada2/issues/791
estorig <- est
est <- est %>%
data.frame() %>%
mutate_all(funs(case_when(. < X40 ~ X40,
. >= X40 ~ .))) %>% as.matrix()
rownames(est) <- rownames(estorig)
colnames(est) <- colnames(estorig)
# Expand the err matrix with the self-transition probs
err <- rbind(1-colSums(est[1:3,]), est[1:3,],
est[4,], 1-colSums(est[4:6,]), est[5:6,],
est[7:8,], 1-colSums(est[7:9,]), est[9,],
est[10:12,], 1-colSums(est[10:12,]))
rownames(err) <- paste0(rep(c("A","C","G","T"), each=4), "2", c("A","C","G","T"))
colnames(err) <- colnames(trans)
# Return
return(err)
}
# check what this looks like
errF_3 <- learnErrors(
filtFs,
multithread = TRUE,
nbases = 1e10,
errorEstimationFunction = loessErrfun_mod3,
verbose = TRUE
)
## 287310600 total bases in 1276936 reads from 7 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .......
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## selfConsist step 5
## selfConsist step 6
## selfConsist step 7
## selfConsist step 8
## selfConsist step 9
## Convergence after 9 rounds.
# check what this looks like
errR_3 <- learnErrors(
filtRs,
multithread = TRUE,
nbases = 1e10,
errorEstimationFunction = loessErrfun_mod3,
verbose = TRUE
)
## 280925920 total bases in 1276936 reads from 7 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .......
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## selfConsist step 5
## selfConsist step 6
## selfConsist step 7
## selfConsist step 8
## selfConsist step 9
## selfConsist step 10

Option 4: Alter loess function arguments (weights, span, and degree), also enforce monotonicity

loessErrfun_mod4 <- function(trans) {
qq <- as.numeric(colnames(trans))
est <- matrix(0, nrow=0, ncol=length(qq))
for(nti in c("A","C","G","T")) {
for(ntj in c("A","C","G","T")) {
if(nti != ntj) {
errs <- trans[paste0(nti,"2",ntj),]
tot <- colSums(trans[paste0(nti,"2",c("A","C","G","T")),])
rlogp <- log10((errs+1)/tot) # 1 pseudocount for each err, but if tot=0 will give NA
rlogp[is.infinite(rlogp)] <- NA
df <- data.frame(q=qq, errs=errs, tot=tot, rlogp=rlogp)
# original
# ###! mod.lo <- loess(rlogp ~ q, df, weights=errs) ###!
# mod.lo <- loess(rlogp ~ q, df, weights=tot) ###!
# # mod.lo <- loess(rlogp ~ q, df)
# jonalim's solution
# https://github.com/benjjneb/dada2/issues/938
mod.lo <- loess(rlogp ~ q, df, weights = log10(tot),degree = 1, span = 0.95)
pred <- predict(mod.lo, qq)
maxrli <- max(which(!is.na(pred)))
minrli <- min(which(!is.na(pred)))
pred[seq_along(pred)>maxrli] <- pred[[maxrli]]
pred[seq_along(pred)<minrli] <- pred[[minrli]]
est <- rbind(est, 10^pred)
} # if(nti != ntj)
} # for(ntj in c("A","C","G","T"))
} # for(nti in c("A","C","G","T"))
# HACKY
MAX_ERROR_RATE <- 0.25
MIN_ERROR_RATE <- 1e-7
est[est>MAX_ERROR_RATE] <- MAX_ERROR_RATE
est[est<MIN_ERROR_RATE] <- MIN_ERROR_RATE
# enforce monotonicity
# https://github.com/benjjneb/dada2/issues/791
estorig <- est
est <- est %>%
data.frame() %>%
mutate_all(funs(case_when(. < X40 ~ X40,
. >= X40 ~ .))) %>% as.matrix()
rownames(est) <- rownames(estorig)
colnames(est) <- colnames(estorig)
# Expand the err matrix with the self-transition probs
err <- rbind(1-colSums(est[1:3,]), est[1:3,],
est[4,], 1-colSums(est[4:6,]), est[5:6,],
est[7:8,], 1-colSums(est[7:9,]), est[9,],
est[10:12,], 1-colSums(est[10:12,]))
rownames(err) <- paste0(rep(c("A","C","G","T"), each=4), "2", c("A","C","G","T"))
colnames(err) <- colnames(trans)
# Return
return(err)
}
# check what this looks like
errF_4 <- learnErrors(
filtFs,
multithread = TRUE,
nbases = 1e10,
errorEstimationFunction = loessErrfun_mod4,
verbose = TRUE
)
## 287310600 total bases in 1276936 reads from 7 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .......
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## selfConsist step 5
## selfConsist step 6
## selfConsist step 7
## selfConsist step 8
## selfConsist step 9
## selfConsist step 10
errR_4 <- learnErrors(
filtRs,
multithread = TRUE,
nbases = 1e10,
errorEstimationFunction = loessErrfun_mod4,
verbose = TRUE
)
## 280925920 total bases in 1276936 reads from 7 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .......
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## selfConsist step 5
## selfConsist step 6
## selfConsist step 7
## selfConsist step 8
## Convergence after 8 rounds.

Now for the side-by-side plots:

Forward Reads:
Option 1: Alters the weights and span in loess, also enforce monotonicity
Option 3: Only alter loess weights and also enforce monotonicity
Option 4: Alter loess function arguments (weights, span, and degree), also enforce monotonicity

Reverse Reads:
Option 1: Alters the weights and span in loess, also enforce monotonicity

Conclusions:

So far @jonalim's solution seems the best for this data. All methods are still not doing great with the A2G and T2C transitions, but there also just doesn't seem to be a lot of data for those transitions. I'd welcome anyone else's input on whether or not they think the Option 4 plot is acceptable or still not trustworthy enough to draw conclusions from. We're trying to develop a lab pipeline, and seeing as it looks like we'll be dealing with NovaSeq data going forward, my current recommendation for users of NovaSeq data is going to be to try out all four methods of error learning and choose the plots that look best for the data.
Yes, if you have mock community data from NovaSeq, I'd love to be able to look at it. @jcmcnch |
That's amazing work @hhollandmoritz! Yeah, we've set up a 'NovaSeq' fix in our Nextflow implementation that essentially does option 1, but I agree it's worth making this more flexible based on the above options.
Hi everybody,

An update from me and Liv (@livycy), who has done some testing of the improved error model from @JacobRPrice on our mock communities (16S and 18S; clone sequences of nearly-full-length rRNA) for which we know the true result and have been able to reliably get good denoising from q2-dada2 in previous runs (HiSeq, MiSeq, etc.).

TL;DR - while the error model looks better, we still see little change in the abundance of some artifactual sequences (defined here as 1-mismatches to the true sequence), in particular A=>G and T=>C transitions, which had very "flat" error model profiles.

Some more detail: For our mocks, we normally get < 0.5% of total sequences falling outside the exact expected mock sequences after denoising with DADA2 (consistent between q2-dada2 and the native R version), which we've been very happy with. In our past few NovaSeq runs, we're getting up to 10% of reads falling outside the true mocks, which is alarming and something we'd really like to resolve. For these sequences that differ from the reference, we know from BLASTing them that a significant fraction are 1 base away from the true sequences and therefore likely denoising artifacts. We see the same type of error dips and issues, so Liv ran the new error model from @JacobRPrice, with similar results.

Here's the original error model:

Then the improved one:

So everything looks nicer. No more dips, etc. However, it doesn't appreciably change the percentage of total ASV counts that differ from the known sequences (3.7% originally vs 3% with the new error model). I then looked more carefully at the 1-mismatch artifacts from the most abundant member of our 16S mock and did some quick counting by eye (looking at alignments in a viewer; I'm sure there's a smarter way using biopython or something). The locations of the variants are scattered across the ASVs, though they are definitely more abundant in the reverse read.
From this I can conclude a couple of things (so far this is specific to this particular member of the mock community which happens to be a SAR11):
When looking at the error profiles I do see why DADA2 might not be doing well with A=>G and T=>C transitions (the error profile is pretty flat), but the rest is quite mysterious to me... @benjjneb @hhollandmoritz @jonalim @Y-Q-Si - any suggestions on next steps we could take to try and resolve the poor performance on our mocks? Thanks,
Hi all, I just wanted to send a brief update on this. We dug into some of our previous datasets and found that while our two most recent NovaSeq runs had issues with DADA2 denoising (i.e. many 1-mismatches to the mock, suggesting denoising issues) which were not resolved by improving the error model as indicated above, a previous NovaSeq run had excellent performance with DADA2, and worked nearly as well as a HiSeq rapid run (which does not have binned quality scores). The reason why our two most recent runs have had such comparatively poor performance is mysterious, but it could be some property of the sequencing pools we've prepared and/or the ratio of amplicon to metagenome reads in those pools. We have noticed similar situations in the past where DADA2 failed for inexplicable reasons and we were not able to figure out why, even after consulting with @benjjneb.

So I would conclude from this that NovaSeq binned quality scores are not inherently a problem for DADA2, at least for our simple mock communities. FYI @cjfields

If anyone would like access to these mock community datasets for testing, I would be happy to provide them. @benjjneb would your group be interested in taking a closer look? This would include a number of mocks derived from the same exact sequencing libraries run on HiSeq and then on NovaSeq, both of which worked well in terms of denoising, and another NovaSeq run (in addition to the one we already shared) which had quite serious issues with the same mocks.

-Jesse
Thanks all for this resourceful thread! I, too, am working on a large dataset produced with NovaSeq, and I am about to test out these models. I am considering doing the testing on a randomly subsetted dataset (maybe 1/10) but don't know if this is sufficient to decide which model to use for the full dataset. Any comments?
@nejcstopno, IMHO testing on a 10% subset should give you a pretty good idea of what the outcome would be for the full dataset. As far as your choice of
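If it helps, here is a rough sketch of how such a random ~10% subset could be drawn at the sample level; the file path and pattern are hypothetical placeholders, and `loessErrfun_mod4` refers to the function posted earlier in this thread:

```r
library(dada2)

# Hypothetical paths - substitute your own filtered-read files
filtFs <- list.files("filtered", pattern = "_F_filt.fastq.gz", full.names = TRUE)

set.seed(100)  # make the subset reproducible
keep <- sort(sample(seq_along(filtFs), size = max(1, ceiling(length(filtFs) / 10))))

# Fit the candidate error model on just the subset of samples
errF_sub <- learnErrors(
  filtFs[keep],
  multithread = TRUE,
  errorEstimationFunction = loessErrfun_mod4,
  verbose = TRUE
)
plotErrors(errF_sub, nominalQ = TRUE)
```

Subsetting whole samples (rather than reads within samples) keeps per-sample quality profiles intact; note that learnErrors already stops reading in samples once it reaches its nbases target, which is its own built-in form of subsetting.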
I'm not sure how I got here, so I'm hesitant to comment, but I'm pretty sure setting span = 2 in loess isn't correct. My understanding is that span is the proportion of the data used for smoothing at each point. It's often called alpha and is 0.75 (75%) by default. Reducing the span (adjusting it closer to 0) will make the line "more wiggly" and less straight. There's a tutorial somewhere on optimizing the smoothing parameter of loess which could be a good place to gather information, as well as the geom_smooth() documentation for ggplot2, which is where I would have probably first heard of loess regression.
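To illustrate the span behavior discussed above, here's a small self-contained base-R sketch on toy data (not DADA2 output). One nuance: stats::loess does accept span > 1 without error - all points are used and the bandwidth is inflated, so the fit just gets even smoother, which may be why span = 2 "works" in the modified functions:

```r
# Toy data: a noisy downward trend, loosely mimicking rlogp ~ q
set.seed(1)
x <- 1:40
y <- -0.05 * x + rnorm(length(x), sd = 0.2)

fit_wiggly  <- loess(y ~ x, span = 0.25)  # small span: chases local noise
fit_default <- loess(y ~ x, span = 0.75)  # the default (75% of points)
fit_flat    <- loess(y ~ x, span = 2)     # span > 1: all points, inflated bandwidth

# Smaller spans leave smaller residuals because the curve follows the noise
c(sum(resid(fit_wiggly)^2), sum(resid(fit_default)^2), sum(resid(fit_flat)^2))
```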
Hello! I have analysed metagenomic data with dada2 in the past, but this time I have the same issue as you, probably due to binned quality scores, though on an Illumina MiSeq. I read through the whole discussion - thank you for the information provided.
@cjfields That's really interesting information. We see them quite frequently and haven't connected them to either low input or short amplicons (so far it's seemingly random, so I just assumed it was an artifact of the other samples that were sequenced on the same run). Your edit aligns with my interpretation of issue #1730, which is that the variable length messes with the error learning. So, in theory, fungal ITS amplicons, which commonly have variable length, could also pose this issue regardless of the sequencing method used.

However, in practice, I've not noticed any difference between error rate estimations from runs with lots of trimmed polyGs and runs with very few trimmed polyGs. I'm not clear if this is simply because the polyG reads are subsequently filtered out (they typically have error profiles that may lead to removal in the filter-and-trim step), or if it's because the error models are not as sensitive to the truncated sequences as they are to the binned quality scores. I only learned about the low-complexity filtering option while reading the issue a couple of days ago, so I don't know how good it is at catching them, but I'm eager to try. And if anyone wants a couple of NovaSeq samples with lots of polyGs to play with, I'm happy to share some. (ETA: we have had some particularly egregious examples lately - see attached.)
@hhollandmoritz very interesting! I agree re: ITS - those in particular are thorny for a number of reasons, including the potential for no overlap. Those are definitely worth a look; I'm wondering whether the merging step (which I think has the overhang trimming) might get rid of these.
@cjfields The merging step is also a possibility; however, if a sufficiently large number of Gs are left on the end of a sequence, I'd hypothesize that the merging would go fine and create artificial ASVs with lots of Gs in the center. That was part of our original reasoning behind removing them, once we finally figured out what the dips were from. (I certainly panicked the first time I saw a student with a plot like this.)
@hhollandmoritz thank you very much for offering to help. I have shared a few files for a trial run with you now. |
@pdhrati02, with @hhollandmoritz we may have found the problem (it works for my dataset) with learning error rates from NovaSeq data: it's necessary to load library(dplyr); packageVersion("dplyr") in addition to the dada2 package. I hope it will work for you too.
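For anyone hitting the same thing: the modified error functions above use %>%, mutate_all(), funs(), and case_when(), none of which are provided by dada2 itself, so the session needs dplyr attached before learnErrors() is called. Something like:

```r
library(dada2); packageVersion("dada2")
library(dplyr); packageVersion("dplyr")  # provides %>%, mutate_all(), case_when()
```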
Ah I see what you mean, that the polyG would be longer than the amplified fragment:
Probably a number of ways to handle that, but I agree the easiest is to simply remove it. |
Hi all. Figured I'd chime in re: the polyGs, since we have had to deal with those and have had lengthy talks with Illumina about this topic.
In general, I'm usually not too worried about things like those getting included in the outputs, because I typically expect them to fail at the merge step or, if they somehow get through that, to be removed when I discard ASVs with low-quality annotation (nothing below kingdom/domain, etc.). Maybe it's a bigger issue on the NextSeq, though.
@hhollandmoritz regarding those mid-read polyG dips, have you determined if those are likely on-target short inserts with readthrough because the target is short, or if those are off-target short inserts (i.e. garbage), or something else? That might inform you on how to further deal with them. Maybe throwing them out completely without trying to trim and save them is the best bet. |
@davised It depends a bit on the amplicon we're using. Certainly for ITS it's more often the case that they are on-target short inserts (given the ITS amplicon length variation in fungal libraries). Even for 16S, where amplicon length variation is not really an issue, the majority of them do appear to be on-target (just going by observations of sequence similarity to other reads around them - I haven't done any thorough quantification). However, we encounter the issue a lot less frequently with 16S, and I think throwing out polyGs entirely probably wouldn't cost many usable reads there.

For fungal ITS, the cutadapt filtering approach appears to be working pretty well, and I'm hesitant to throw them out completely given the amplicon length variation. Thanks for sharing the information from your discussions with Illumina - very helpful! Do you use a pre-filtering step to remove the 0 quality scores in light of what you've learned? (Since the filtN setting won't catch them?)
I'll preface with some somewhat important details first to better answer your question. I am part of the bioinformatics analyst group at the Center for Quantitative Life Sciences (CQLS) at Oregon State University. The CQLS has a core lab facility with a MiSeq and NextSeq 2000, and therefore I often help when something seems to be an error with a sequencing run or something seems out of whack with the data.

We were alerted to the polyG situation when a lab was doing some sort of amplicon screening for eukaryotic targets from environmental samples. They noticed an unusual number of polyGs in their outputs. I think it happened to be a combination of the short inserts reading into the no-call space and getting called as G, as well as the problem of Ns getting called as G instead of N. We walked through the data with Illumina and got word of those two references on Illumina's website during that discussion. So while that was an amplicon issue, it didn't relate directly to 16S analysis or to using dada2 at all.

In my own work, I've only done 16S or other metabarcoding analysis using data produced on the MiSeq. We have one trial run through on the NextSeq, and we came across this issue of the error learning step not working quite right there. When we did that run on the NextSeq, we had already upgraded our software (again, as far as I understand), so we did have reads with N getting filtered out as usual using the filtN setting, to the best of my knowledge. I didn't process the data directly, but I was in contact with the user who did the analysis, and they have used dada2 extensively in the past. They compared the NextSeq and a past MiSeq run of the same data, and while the outputs were not exactly the same, a beta diversity analysis of the two runs together indicated that the same samples did seem to cluster in e.g. NMDS and PCoA based on both Bray-Curtis and Jaccard.
We decided that going forward, we would feel confident comparing NextSeq data to other NextSeq data and MiSeq to past MiSeq, but weren't especially comfortable yet comparing between the two sequencing technologies. And as I said, I didn't touch those data, but I'd expect that the longer and/or shorter 16S reads were either off-target, targeting eukaryotes, or otherwise filtered out during the merge step, and therefore were not causing problems. I usually do a post-merge length filter with my 16S data (e.g. ~250-256 with EMP data), and that would remove any of those ASVs that did happen to merge on a polyG region, unless the insert was incredibly tiny and almost the entire sequence was polyG. I often also make a phylogenetic tree using a sina -> fasttree workflow, and that would also show any incredibly weird sequences that I would look into further and discard if the other filtering steps did not remove them.

Back to the ITS. When I ran ITS data in the past, I typically used itsxpress, which I think would be able to deal with the polyG without having to modify the workflow. Are you using itsxpress or something else to deal with the differences in length?
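For reference, the post-merge length filter mentioned above is a one-liner on the sequence table; the 250-256 window is specific to that primer set and should be adjusted for yours:

```r
# `seqtab` is the sample-by-ASV matrix from dada2::makeSequenceTable(); its
# column names are the ASV sequences, so off-length merges can be dropped directly:
table(nchar(getSequences(seqtab)))  # inspect the merged-length distribution
seqtab_len <- seqtab[, nchar(colnames(seqtab)) %in% 250:256, drop = FALSE]
```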
@davised Thanks for the information! In our lab we've had folks use itsxpress in the past, but it has depended on the student. A couple have run comparisons and not found much difference whether they used it or not, but that could just be habitat-specific, or perhaps because we're mostly using the Taylor et al. 2016 primers, which have less length variation. I can't recall if the student with the worst polyG issues in their fungal results was using ITS1/ITS2.

When we've used itsxpress, it hasn't totally filtered out the polyGs, so we typically still use Cutadapt to filter those out even after itsxpress. In the past I've used Sina and FastTree to remove odd ASVs, but typically I find that the Venn diagram of phylogenetically odd sequences and those declared "NA" at the phylum level is a circle, so I often do that instead and only make a tree if a sequence is particularly prominent or contributing strongly to differences among communities.
Hi all, I read through this thread as I am also having similar issues with the dada2 error model on NovaSeq data. Thanks @hhollandmoritz so much for such a thorough exploration into this! I tried the "Option 4" function (`loessErrfun_mod4`) with the error model, as suggested by @hhollandmoritz, but I got a similar error message to @pdhrati02 and @cmgautier. I ran:

`errF <- learnErrors(filtFs, multithread=TRUE, nbases = 1e10, …`

and got this error message: `Error rates could not be estimated (this is usually because of very few reads)`. Has anyone figured out what this means yet?
@nuorenarra I didn't really do anything other than package the hard work of @JacobRPrice, @jonalim and others into a lab pipeline. :) But glad you find it useful.

As for the getErrors "matrix is NULL" issue, @cmgautier and @pdhrati02 found that it is caused by not loading dplyr directly. Specifically, I ran both of their data without issue through our lab pipeline and shared it with them. @cmgautier then used my code and found that it worked after loading the dplyr package directly, and I believe that @pdhrati02's code also worked after that fix. I haven't looked deeply into the model 4 function to figure out which line is causing the error, but since @benjjneb has hinted that a more official solution is on its way, I haven't prioritized fixing it.
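For anyone hitting the same failure: the workaround described above amounts to attaching dplyr before calling the modified error function. A sketch, assuming `loessErrfun_mod4` (posted upthread) is already defined in your session and `filtFs` are your filtered forward reads:

```r
library(dada2)
library(dplyr)  # loessErrfun_mod4 uses dplyr verbs internally; without this you
                # hit the "matrix is NULL" / "Error rates could not be estimated" failure

errF <- learnErrors(
  filtFs,
  errorEstimationFunction = loessErrfun_mod4,  # modified function from this thread
  multithread = TRUE,
  verbose = TRUE
)
plotErrors(errF, nominalQ = TRUE)
```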
Same for us. We have an in-house fix using the above that works for our workflow when we have NovaSeq data, and it should also work for the NovaSeq X (fingers crossed).
We are also experiencing this problem (on NextSeq). In the meantime, is there a more official solution? When I read the posts above, it seems like many people are choosing option/model 4?
That does seem like the preferred model, though I don't think it's been noted as the "official" one. My impression is that the results could use some level of evaluation against a truth set (mock community). I should also note that it does currently require loading dplyr; it's mentioned above somewhere, but this ticket is long!
@benjjneb Dear Ben, thank you for the wonderful software; we use it a lot, it's awesome. I was wondering if there was an update on this issue/thread (Illumina with binned quality scores), as we also have the issue. Regards, Pete
Hi Andreas,

You're correct, `loessErrfun_mod4` has not been optimized for performance. In our lab, it's very typical for it to take hours on large datasets.

-Hannah
> On Wed, Aug 7, 2024 at 8:35 PM, @AndreasK wrote (#1307 (comment)): "`loessErrfun_mod4`: does this function impact performance negatively? I am running it on 1e8 bases, and it does not use multiple cores and it also did not finish after a couple of hours."
Yes, the same here. Also note we're using the same function for binned-quality PacBio Revio+Kinnex runs (#1892), which also takes a long time. EDIT: also note the comment there from @jonalim regarding whether that error model is the best: #1892 (comment)
Overview
We are encountering some issues in obtaining appropriate estimates of error rates; I believe these problems are coming from our new(-er) sequencing facility sending us fastq files containing binned quality scores. Searching the issues, it seems like this is not an uncommon situation, but I'm not sure that a single answer or solution has been offered yet.
My observations
Example of forward read quality profiles:
And the corresponding reverse read quality profiles:
Our initial attempts followed a typical pipeline. Instead of pre-learning error rates, we wanted to use the full dataset to avoid bias from which sample(s) were used. This particular dataset is fairly large and (what I'd consider) very deeply sequenced (60,000-300,000 reads per sample). In the fitted error rates, we saw the characteristic dips that are often caused by binned quality scores.
A2G is especially nasty!
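For concreteness, learning from (effectively) the whole dataset just means raising `nbases` well above the total bases in the run; a sketch, assuming `filtFs`/`filtRs` are the filtered fastq files from `filterAndTrim()`:

```r
library(dada2)

# The default stops after 1e8 bases; setting nbases very high makes
# learnErrors() consume every sample instead of a subset.
errF <- learnErrors(filtFs, nbases = 1e10, multithread = TRUE)
errR <- learnErrors(filtRs, nbases = 1e10, multithread = TRUE)

plotErrors(errF, nominalQ = TRUE)  # the binned-quality dips show up in this plot
```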
In issue #938, @hjruscheweyh laid out a near-identical problem to the one I'm having, with his colleague @GuillemSalazar graciously offering up the substitute function they used to address/correct this issue [link to comment containing function]. Their approach attempted to enforce monotonicity by changing the arguments of the `loess` function to have a span equal to 2 and weights equal to the log-transformed totals.

I then tested out their approach on our data, this time thinking that I should follow the tutorial's suggestion to pre-learn error rates. For both `learnErrors` and `dada` I passed their `loessErrfun_mod` to the `errorEstimationFunction` parameter. The results from `learnErrors` indicated that this might have done the trick!
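The core of that substitution, as described, is a change to the `loess()` call inside dada2's stock `loessErrfun`. A toy sketch of just that change, using a synthetic `df` standing in for one transition's per-quality-score observations (this is not the verbatim function from #938; see the linked comment for that):

```r
# q = quality score, tot = number of observations at that score,
# rlogp = log10(observed error frequency) -- all toy values here
set.seed(1)
df <- data.frame(
  q     = 2:40,
  tot   = round(10^runif(39, 2, 6)),
  rlogp = sort(runif(39, -4, -1), decreasing = TRUE)
)

mod.stock  <- loess(rlogp ~ q, df, weights = tot)                   # stock dada2 (span defaults to 0.75)
mod.binned <- loess(rlogp ~ q, df, weights = log10(tot), span = 2)  # modification from #938

fitted.binned <- predict(mod.binned, data.frame(q = 2:40))
```

Down-weighting via `log10(tot)` keeps the few heavily populated bins from dominating the fit, and the wide span smooths across the gaps between bins.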
But when I pulled the final error model from dadaFs I was disappointed to see that all of the fitted parameters seemed to have "flattened" (indicating less of a response in error frequency to the quality score), and furthermore the error frequencies for A2G actually increased with increasing quality scores, which should never happen in reality.
It's possible that these results came from pre-learning the error rates (and any bias that may result) or, perhaps more likely, from not allowing dada to selfConsist. To check this, I reran the same data a second time, passing TRUE to selfConsist.
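The second trial corresponds to something like the following sketch (assuming `errF` from `learnErrors()` and the `loessErrfun_mod` substitute function from #938 are defined):

```r
library(dada2)

dadaFs <- dada(
  filtFs,
  err = errF,                                 # pre-learned error model
  selfConsist = TRUE,                         # let dada() re-fit until convergence
  errorEstimationFunction = loessErrfun_mod,  # substitute function (assumed defined)
  multithread = TRUE
)

# Inspect the final fitted model, not just the learnErrors() output
plotErrors(dadaFs[[1]], nominalQ = TRUE)
```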
While the self consistency loop for dada() terminated before we saw convergence, the fluctuations were small enough that it is probably ok to proceed with the results (or at least check them out).
The error model obtained from `learnErrors` was identical to the previous trial (the same arguments/samples were being passed, so that is expected). Much to my dismay, the error model from `dada`, while not identical, was very similar, with the same increasing error frequency for A2G.
Illumina Customer Service on Binned Quality Scores
I spent some time on the phone with Illumina, and both of the techs I spoke with indicated that binned quality scores are here to stay and that the quality-score values that comprise each bin may vary with Illumina platform and software version. Neither was able to find any information about whether binning can be turned off by the sequencing facility itself.
That said, I'm not sure how prevalent the use of binned quality scores is across sequencing facilities; this particular dataset is the first time I've encountered them. I agree with @wangjiawen2013's comment in #971 that this has the potential to become more and more of an issue.
Questions (Finally, Jake, stop jabbering!)
How severe of a problem are models with increasing error frequencies?
Intuitively, this shouldn't be possible (or at least shouldn't be common). I haven't come across anyone in the issues plotting the final error models coming out of dada, so I don't know how unique our particular dataset is.
"Official" recommendations for handling binned quality data
I'd like to suggest, or more properly, encourage, that recommendations for how to handle binned quality data be established. @benjjneb has commented that they're waiting on the appropriate data needed to do this development.
Updating learnErrors() (and hence, dada())
Are there plans in place to add functionality to learnErrors enabling users to enforce monotonicity or otherwise alter its behavior if binned quality scores are present?
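To illustrate what such an option might do, here is a hypothetical post-hoc helper (not part of dada2; a proper fix would presumably enforce monotonicity inside the fitting loop, e.g. via isotonic regression):

```r
# err: a 16 x 41 error matrix as returned by getErrors(), with rows for the
# transitions (A2A, A2C, ...) and columns for quality scores 0..40.
# Hypothetical helper: force each row to be non-increasing in quality score.
enforce_monotonic <- function(err) {
  t(apply(err, 1, cummin))
}

# Toy check: a row that dips and then rises again gets clamped after the rise
toy <- matrix(c(0.10, 0.05, 0.02, 0.04, 0.01), nrow = 1)
enforce_monotonic(toy)  # 0.10 0.05 0.02 0.02 0.01
```

A cumulative minimum is the crudest possible approach (it only clamps, never redistributes), but it makes the intent of the feature request concrete.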
Communication with other developers/users
Lastly, the dada2 plugin for QIIME2 does not offer much in the way of intermediate output or the ability to validate results. If sequencing data with binned quality scores is truly an issue, unwary users may be getting bad results. This last item is more for investigators/labs who treat QIIME, dada2, and other pipelines as black boxes and are primarily concerned with the output. I understand the Callahan Lab isn't responsible for the development decisions of the QIIME team, but it may be worthwhile to communicate these challenges so that they can be addressed.
Other Issues regarding this question:
Included for others who may be interested as well.
#791 #938 #964 #1083 #1135 #1228
The text was updated successfully, but these errors were encountered: