[help] Can Bismark handle diploid reference genome please? #697

Open
lizhe-gis opened this issue Sep 6, 2024 · 2 comments

Comments

@lizhe-gis

Dear Felix,

Thank you so much for developing this great tool! I rely on Bismark heavily for my research :)

Recently we developed a haplotype-resolved diploid human genome, in which one copy is paternal and the other maternal. I imagine that if I map WGBS data to this reference, most reads will have secondary alignments because the paternal and maternal copies are so similar. I understand that Bowtie 2 and HISAT2 can both assign such reads randomly, but from reading previous posts and the alignment flags, I understand that --ambig_bam currently does not give any methylation information.

Would it be possible for Bismark to randomly assign a read to one location, and include its methylation information, when its two alignments have exactly the same, highest possible score, please?

Thank you very much!

Best Regards,
Zhe

@FelixKrueger
Owner

I am afraid there is currently no functionality to randomly assign reads to repetitive regions (which is in effect what you have if both alleles contain the exact same sequence).

All I can think of currently is a sequential approach: first align the data to your haplotype-resolved diploid genome while specifying --unmapped. Reads that align specifically to one of the two alleles should align, while reads aligning to regions that are exactly shared will be rejected as ambiguous and end up in new 'unmapped' FastQ files.
In a second round, the unmapped reads could be aligned to only one of the copies, or maybe even to the standard reference genome, assuming an even split between the two alleles. This approach might suffer from a discrepancy between the coordinate systems, however...
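
For illustration only, the two rounds described above could be sketched as a pair of command lines. This is a minimal sketch for single-end data that only builds and prints the commands. All directory and file names are hypothetical placeholders, and both genome folders are assumed to have been prepared with bismark_genome_preparation beforehand. The FastQ file for the second round is passed in explicitly; its real name should be taken from whatever the first round writes to its output directory.

```python
import shlex


def first_round_cmd(diploid_genome_dir, reads_fastq, out_dir):
    """Round 1: align to the diploid genome and keep non-unique reads.

    --unmapped asks Bismark to write reads it could not place uniquely
    to new FastQ files in the output directory.
    """
    return ["bismark", "--genome", diploid_genome_dir,
            "--unmapped", "-o", out_dir, reads_fastq]


def second_round_cmd(single_copy_genome_dir, unmapped_fastq, out_dir):
    """Round 2: align the round-1 leftovers to a single genome copy."""
    return ["bismark", "--genome", single_copy_genome_dir,
            "-o", out_dir, unmapped_fastq]


# Hypothetical paths, for illustration only.
round1 = first_round_cmd("genomes/diploid", "sample.fq.gz", "round1")
round2 = second_round_cmd("genomes/paternal_only",
                          "round1/UNMAPPED_READS.fq.gz",  # placeholder name
                          "round2")

for cmd in (round1, round2):
    print(shlex.join(cmd))
```

Reads placed in round 2 would sit on the coordinates of whichever single copy was used, which is where the coordinate discrepancy mentioned above comes from.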

@lizhe-gis
Author

Dear Felix,

Thank you very much for your kind reply!

Indeed, I am also worried about the coordinate system :P Currently, the only solution I can think of for the homologous regions is to map all the reads separately to the pat- and mat- reference genomes, which solves the coordinate problem. The results will then be haplotype-averaged methylation status (akin to the effect of random assignment, just with double the coverage), so I need to be cautious when interpreting imprinted regions (e.g. be alert for ~50% methylated regions, though this could also be caused by cell culture heterogeneity, I suppose). Similarly, I am worried about the centromeric regions, where the sequences are highly repetitive but known to differ in methylation status.

Best Regards,
Zhe
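
As a small worked example of the haplotype-averaging concern above, the sketch below pools methylation calls across two alleles. The counts are made up for illustration, and `pooled_methylation` is a hypothetical helper, not part of Bismark. It shows why a pooled value near 50% cannot by itself distinguish an imprinted site from a heterogeneous cell population.

```python
def pooled_methylation(alleles):
    """Return the pooled methylation fraction across alleles.

    alleles: list of (methylated_calls, unmethylated_calls), one per allele.
    """
    methylated = sum(m for m, _ in alleles)
    total = sum(m + u for m, u in alleles)
    if total == 0:
        raise ValueError("no coverage at this site")
    return methylated / total


# Imprinted site: one allele fully methylated, the other unmethylated.
imprinted = [(10, 0), (0, 10)]
# Heterogeneous culture: both alleles half methylated.
heterogeneous = [(5, 5), (5, 5)]

print(pooled_methylation(imprinted))      # 0.5
print(pooled_methylation(heterogeneous))  # 0.5
```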
