Odd fits at learn errors step #964

Closed
gdunshea opened this issue Mar 2, 2020 · 4 comments


gdunshea commented Mar 2, 2020

error-fit-example.pdf
Hi,

I have an issue similar to one others have reported: because of our experimental design in the lab, our sequences are in random orientations, and I think this is affecting error learning.

I tried to fix the orientation problem with "DECIPHER::OrientNucleotides()". However, reading the fastq files with "readDNAStringSet()" ignores the quality scores, so re-writing the files to fastq after "DECIPHER::OrientNucleotides()" ends up with nonsense quality scores. I saw that in post #434 you mentioned running through the dada2 pipeline and then checking for reverse complements by hand, so thanks for that.
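For illustration, the quality-score problem above comes down to one detail: when a read is reverse-complemented, its quality string has to be reversed along with it so that each score still lines up with its base. Below is a minimal Python sketch of that idea. The function name `reverse_complement_record` is hypothetical and is not part of DECIPHER or dada2; it only shows the bookkeeping, not how either package works internally.

```python
from typing import Tuple

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_record(seq: str, qual: str) -> Tuple[str, str]:
    """Reverse-complement a read and reverse its quality string to match.

    Hypothetical helper for illustration only.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality string must have equal length")
    # Complement every base, then reverse; the qualities are only reversed.
    return seq.translate(_COMPLEMENT)[::-1], qual[::-1]


if __name__ == "__main__":
    rc_seq, rc_qual = reverse_complement_record("ACGTN", "EEA/<")
    print(rc_seq, rc_qual)
```

Only reads identified as being in the wrong orientation would be passed through such a function; reads already in the expected orientation are written out unchanged.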

I am actually a bit concerned with how my error plots look though. The estimated error rates seem to experience a pretty severe dog-leg up at higher quality scores, in some instances dipping below the line of expected error rates - and the estimated fit in this area is generally quite poor.

Have you seen this before, and do you have any suggestions for parameters I could tweak to address this?

Thanks for your time and your excellent package!

Edit: Added example figure of estimated error rates

Owner

benjjneb commented Mar 3, 2020

Do you know if this data has binned quality scores? And what machine was used to generate the data (e.g. MiSeq, NovaSeq...)?

Author

gdunshea commented Mar 3, 2020

Thanks for your quick reply :)

It's a NextSeq paired-end 2x150 run. But yes, now that you mention it, looking at the heatmap on the plot I posted above, it does appear that the quality scores are binned. Below is an example of the first sequence in one of the files:

@NB501850:81:HKM2LBGX5:1:11101:19604:1276 2:N:0:GTGGCC
CGTCGCTCCTACCGATTGAGTGATCCGGTGAATAATTCGGACTGCAGCAATGTTTGGATCCCGAACGTTGCAGCGGAAAGTTTAGTGAACCTTATCACTTAGAGGAAGGAGAAGTCGTAACA
+
EEEEEEEEEEEEEEEAE/EEEE/EEEEEEEAAEEEEEEEEEAEEEAEE<EEEEEEEE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEAEEEEAEEAEE
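One quick way to check for binning is to count the distinct quality values in the quality lines. The Python sketch below does this for an excerpt of the quality string above. It assumes Phred+33 encoding, and the function name `distinct_phred_scores` is hypothetical. Seeing only a handful of distinct values across many reads would suggest binned scores.

```python
def distinct_phred_scores(quality_lines, offset=33):
    """Return the sorted distinct Phred scores found in FASTQ quality lines.

    Assumes Phred+33 encoding by default (an assumption, not stated in the thread).
    """
    scores = set()
    for line in quality_lines:
        scores.update(ord(ch) - offset for ch in line.strip())
    return sorted(scores)


if __name__ == "__main__":
    # Excerpt (start) of the quality string from the record above.
    excerpt = "EEEEEEEEEEEEEEEAE/EEEE/EEEEEEEAAEEEEEEEEEAEEEAEE<EEEEEEEE"
    print(distinct_phred_scores([excerpt]))  # [14, 27, 32, 36]
```

On a real file, the quality lines of many reads would be passed in rather than a single excerpt.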

Owner

benjjneb commented Mar 9, 2020

See the suggestions in #938 (comment) and in the posts that follow it on how to deal with error-rate fitting when binned quality scores are present.

Author

gdunshea commented Mar 10, 2020

That's great, thanks very much, Ben.
