Unusual quality profiles and error plots #1402

Closed
emilyvansyoc opened this issue Aug 30, 2021 · 2 comments

@emilyvansyoc

Hello,
I am running into issues that I've not seen before with 16S data and I'm hoping you have some guidance (not related to a bug or coding problem so feel free to close the issue).

This is commercial 16S data that was pre-filtered in an "in-house script" and I think something's going wrong with the Phred quality encoding that is throwing off the error learning algorithm.

Attached is a representative example of the quality profile, along with the forward and reverse error plots.

Do you have any idea what could cause banding like this in the quality profile plot and if it's related to the weird error learning plots? My biggest concern is that dada2 assigned 25,000+ ASVs to this dataset (animal fecal samples) which seems way off.

Thanks for your help!

reverse-error-plot.pdf
qualityprofile
forward-error-plot.pdf

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.5.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices
[6] utils datasets methods base

other attached packages:
[1] Biostrings_2.60.1 GenomeInfoDb_1.28.0
[3] XVector_0.32.0 IRanges_2.26.0
[5] S4Vectors_0.30.0 BiocGenerics_0.38.0
[7] forcats_0.5.1 stringr_1.4.0
[9] dplyr_1.0.7 purrr_0.3.4
[11] readr_1.4.0 tidyr_1.1.3
[13] tibble_3.1.2 ggplot2_3.3.4
[15] tidyverse_1.3.1 phyloseq_1.36.0
[17] dada2_1.20.0 Rcpp_1.0.6
[19] BiocManager_1.30.16

loaded via a namespace (and not attached):
[1] nlme_3.1-152 fs_1.5.0
[3] bitops_1.0-7 matrixStats_0.59.0
[5] lubridate_1.7.10 RColorBrewer_1.1-2
[7] httr_1.4.2 tools_4.1.1
[9] backports_1.2.1 utf8_1.2.1
[11] R6_2.5.0 vegan_2.5-7
[13] DBI_1.1.1 mgcv_1.8-36
[15] colorspace_2.0-1 permute_0.9-5
[17] rhdf5filters_1.4.0 ade4_1.7-17
[19] withr_2.4.2 tidyselect_1.1.1
[21] compiler_4.1.1 cli_2.5.0
[23] rvest_1.0.0 Biobase_2.52.0
[25] xml2_1.3.2 DelayedArray_0.18.0
[27] labeling_0.4.2 scales_1.1.1
[29] digest_0.6.27 Rsamtools_2.8.0
[31] jpeg_0.1-8.1 pkgconfig_2.0.3
[33] MatrixGenerics_1.4.0 dbplyr_2.1.1
[35] readxl_1.3.1 rlang_0.4.11
[37] rstudioapi_0.13 farver_2.1.0
[39] generics_0.1.0 hwriter_1.3.2
[41] jsonlite_1.7.2 BiocParallel_1.26.0
[43] RCurl_1.98-1.3 magrittr_2.0.1
[45] GenomeInfoDbData_1.2.6 biomformat_1.20.0
[47] Matrix_1.3-4 munsell_0.5.0
[49] Rhdf5lib_1.14.1 fansi_0.5.0
[51] ape_5.5 lifecycle_1.0.0
[53] stringi_1.6.2 MASS_7.3-54
[55] SummarizedExperiment_1.22.0 zlibbioc_1.38.0
[57] rhdf5_2.36.0 plyr_1.8.6
[59] grid_4.1.1 crayon_1.4.1
[61] lattice_0.20-44 haven_2.4.1
[63] splines_4.1.1 multtest_2.48.0
[65] hms_1.1.0 pillar_1.6.1
[67] igraph_1.2.6 GenomicRanges_1.44.0
[69] reshape2_1.4.4 codetools_0.2-18
[71] reprex_2.0.0 glue_1.4.2
[73] ShortRead_1.50.0 latticeExtra_0.6-29
[75] data.table_1.14.0 RcppParallel_5.1.4
[77] modelr_0.1.8 png_0.1-7
[79] vctrs_0.3.8 foreach_1.5.1
[81] cellranger_1.1.0 gtable_0.3.0
[83] assertthat_0.2.1 broom_0.7.8
[85] survival_3.2-11 iterators_1.0.13
[87] GenomicAlignments_1.28.0 cluster_2.1.2
[89] ellipsis_0.3.2

@benjjneb
Owner

This is binned quality score data, which is common on the high-throughput Illumina machines (e.g. NovaSeq). See this thread for discussion: #1307
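
One quick way to confirm binning is to count how many distinct quality values actually occur in a FASTQ file: binned data shows only a handful, where unbinned data shows dozens. The following is a minimal sketch in plain Python, independent of dada2. It assumes Phred+33 encoding and unwrapped four-line FASTQ records; the function name `quality_score_counts` and the toy reads are made up for illustration.

```python
from collections import Counter
from typing import Iterable


def quality_score_counts(fastq_lines: Iterable[str], offset: int = 33) -> Counter:
    """Count how often each Phred score occurs across all reads.

    Assumes four lines per record (header, sequence, '+', quality)
    and the given ASCII offset (33 for Phred+33).
    """
    counts: Counter = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 3:  # the fourth line of each record is the quality string
            counts.update(ord(ch) - offset for ch in line.strip())
    return counts


# Toy records, not real data from this issue.
example = [
    "@read1", "ACGTACGT", "+", "FFFF:::,",
    "@read2", "ACGTACGA", "+", "FF:F,,#F",
]

counts = quality_score_counts(example)
distinct_scores = sorted(counts)
print(distinct_scores)  # only a few distinct values suggests binned scores
```

For a real file, the same function can be fed the lines of an opened FASTQ (for example via `gzip.open(path, "rt")` for compressed files).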

> My biggest concern is that dada2 assigned 25,000+ ASVs to this dataset (animal fecal samples) which seems way off.

If that is from just 128k reads, then yes, that looks like a problem. Any potential issues with binned quality scores (probably) aren't causing that, though. Could there be unremoved primer/adapter bases on these reads? Or perhaps a non-typical library preparation strategy, such as heterogeneity spacers that introduce variation in the start position of the reads?
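
Both possibilities can be checked by looking for the amplification primer near the start of each read and tallying the offset at which it is found. If most reads contain the primer, it was not removed; if the offset varies between reads, heterogeneity spacers are likely. The sketch below is plain Python, independent of dada2, and uses exact matching with IUPAC degenerate bases only (no mismatches or indels). The primer shown is the common 515F sequence used purely as a placeholder, since the thread does not say which primers were used; `primer_offset`, `offset_profile` and the toy reads are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional

# IUPAC nucleotide codes mapped to the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_offset(read: str, primer: str, max_offset: int = 10) -> Optional[int]:
    """Return the first offset in 0..max_offset where the primer matches, else None."""
    for off in range(max_offset + 1):
        window = read[off:off + len(primer)]
        if len(window) < len(primer):
            break  # read too short to contain the primer at this offset
        if all(base in IUPAC[code] for base, code in zip(window, primer)):
            return off
    return None


def offset_profile(reads: Iterable[str], primer: str, max_offset: int = 10) -> Counter:
    """Tally primer start offsets across reads; the key None means 'primer not found'."""
    return Counter(primer_offset(r.upper(), primer, max_offset) for r in reads)


PRIMER = "GTGYCAGCMGCCGCGGTAA"  # placeholder (515F); substitute the actual primer

# Toy reads, not real data from this issue.
reads = [
    "GTGCCAGCAGCCGCGGTAATACGGAGG",    # primer at the very start
    "TGTGTCAGCCGCCGCGGTAATACGTAGG",   # primer after a 1-base spacer
    "ACGTGCCAGCAGCCGCGGTAATACGGAGG",  # primer after a 2-base spacer
    "TACGGAGGGTGCAAGCGTTAATCGGAAT",   # no primer present
]

profile = offset_profile(reads, PRIMER)
print(profile)
```

A profile dominated by `None` suggests the primers were already removed, a single dominant offset suggests a fixed-length primer or adapter still on the reads, and several well-populated offsets suggest variable-length spacers.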

@emilyvansyoc
Author

Thank you so much for your quick reply. Yes, it seems that we're experiencing similar issues with the binned quality scores. I will also explore your other suggestions (adapters, heterogeneity spacers) and reply here in this thread.
