Bismark silently outputs incorrect results when UMIs are added using Illumina's bcl-convert #699

Open
lars-work-sund opened this issue Sep 11, 2024 · 2 comments

Comments

@lars-work-sund

lars-work-sund commented Sep 11, 2024

This is not really a bug, as the documentation clearly states how deduplicate_bismark expects UMIs to be handled, but it is an easy mistake to make.
As documented for deduplicate_bismark, Bismark expects UMIs of the form:
@A00001:001:HN2F7DRX1:1:1101:1452:1000 1:N:0:AATGACGC:CAAGAG
But if Illumina's bcl-convert is used with OverrideCycles to handle UMIs, the read ID looks like this:
@A00001:001:HN2F7DRX1:1:1101:1452:1000:CAAGAG 1:N:0:AATGACGC
In both examples the UMI is CAAGAG and the sample index is AATGACGC.
This means the sample index is used as the UMI, and no warning or error is emitted.

I propose running a pre-flight check to detect this scenario, and potentially to support the UMI location chosen by Illumina.
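
A pre-flight check along these lines could be sketched as follows. This is only an illustration, not Bismark code: the function name umi_location is hypothetical, and it assumes that a plain Illumina read name has seven colon-separated fields and that a plain comment has four, so an extra trailing field made of bases is taken to be a UMI. It does not handle the underscore separator that umi-tools uses.

```python
import re

# Assumption: a UMI (or index) consists only of bases, optionally two parts joined by '+'.
_BASES = re.compile(r"^[ACGTN]+(\+[ACGTN]+)?$")

def umi_location(header: str) -> str:
    """Guess where a FASTQ header carries its UMI.

    Returns "read_name", "comment_end", or "none". Hypothetical helper.
    """
    name, _, comment = header.lstrip("@").strip().partition(" ")
    name_fields = name.split(":")
    # Assumed layout: instrument:run:flowcell:lane:tile:x:y[:UMI]
    if len(name_fields) == 8 and _BASES.match(name_fields[-1]):
        return "read_name"
    comment_fields = comment.split(":") if comment else []
    # Assumed layout: read:filtered:control:index[:UMI]
    if len(comment_fields) == 5 and _BASES.match(comment_fields[-1]):
        return "comment_end"
    return "none"

print(umi_location("@A00001:001:HN2F7DRX1:1:1101:1452:1000 1:N:0:AATGACGC:CAAGAG"))
print(umi_location("@A00001:001:HN2F7DRX1:1:1101:1452:1000:CAAGAG 1:N:0:AATGACGC"))
```

A check like this could emit a warning when the first reads of a file are classified as "read_name", since that is the layout in which the sample index ends up in the UMI position.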

EDIT: I might have been completely off. I'll close it for now.

@FelixKrueger
Owner

Thanks Lars, I'll await further developments. Hope all is well?

@lars-work-sund
Author

lars-work-sund commented Sep 12, 2024

Hi Felix, I'm doing well, I hope you are too :)

I double-checked, and this is an issue. Tools like Illumina's bcl-convert and umi-tools place the UMI like this (umi-tools by default uses _ instead of : as the separator):
@A00001:001:HN2F7DRX1:1:1101:1452:1000:CAAGAG 1:N:0:AATGACGC
Normally bowtie2 (and other aligners, it seems) drops everything after the space, so the corresponding SAM record ID would be:
A00001:001:HN2F7DRX1:1:1101:1452:1000:CAAGAG

The function fix_IDs (Bismark/bismark, line 6207 in 37e2cad: sub fix_IDs{) replaces spaces with underscores, so the SAM record ID is
A00001:001:HN2F7DRX1:1:1101:1452:1000:CAAGAG_1:N:0:AATGACGC
This causes the sample index to be treated as the UMI. No warning or error is given (by either deduplicate_bismark or umi-tools dedup), and the estimated number of duplicates is massively inflated.
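
The effect can be reproduced with a few lines of Python. This is a simplified sketch rather than the actual Bismark or umi-tools logic: the helper names are hypothetical, the space-to-underscore step only mimics the behaviour described above, and it assumes the deduplication step reads the UMI from the last colon-separated field of the read ID.

```python
def sam_id_truncated(header: str) -> str:
    # Mimics an aligner that drops everything after the first space.
    return header.lstrip("@").split(" ", 1)[0]

def sam_id_spaces_to_underscores(header: str) -> str:
    # Mimics the described behaviour: keep the full header, spaces become underscores.
    return header.lstrip("@").replace(" ", "_")

def last_colon_field(read_id: str) -> str:
    # Assumption: the UMI is taken from the final ':'-separated field.
    return read_id.rsplit(":", 1)[-1]

header = "@A00001:001:HN2F7DRX1:1:1101:1452:1000:CAAGAG 1:N:0:AATGACGC"

print(last_colon_field(sam_id_truncated(header)))              # CAAGAG (the UMI)
print(last_colon_field(sam_id_spaces_to_underscores(header)))  # AATGACGC (the sample index)
```

Because every read in a sample shares the same index, treating it as the UMI makes reads at the same position look like duplicates of one another.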

This is especially problematic because the following workflow runs into the same problem:

  1. using umi-tools to extract UMIs
  2. aligning with Bismark
  3. deduplicating with umi-tools

Here is an example:

| UMI placement method | % Unique reads |
| --- | --- |
| Manually placed at the end | 77.33% |
| Added by bcl-convert | 5.69% |
| Added by bcl-convert + magical --icpc flag | 77.34% |
