Problems genotyping loci containing multiple SNPs #43

Open
ericopolo opened this issue Jul 19, 2024 · 1 comment

@ericopolo

ericopolo commented Jul 19, 2024

I've just noticed something very odd in the resulting VCF file (the first one, containing both linked and unlinked SNPs). For a given locus containing multiple SNPs (i.e. more than one polymorphism in the same "bubble"), every diploid sample is either heterozygous at ALL SNPs or homozygous at ALL SNPs, which isn't realistic at all.

I have a ddRAD dataset containing over a million loci (about 60% of which have two or more polymorphisms) and 25 samples, and not a single sample shows a combination of heterozygous and homozygous sites within the same locus. I observed this with both discoSnp and discoSnpRad (the parameters in this particular case were k_31_c_3_D_0_P_5_m_5). Looking at the .fa file (the first output generated), I can see why the VCF can't carry any further information about the exact bases at each SNP: each locus has only one genotype per sample (0/0, 0/1, 1/1 or ./.), which forces any conversion tool to assume the same genotype for every SNP in that locus. I'm using the latest available version (2.6.2).
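
For anyone wanting to reproduce this check on their own VCF, here is a minimal Python sketch. It is not part of discoSnp; it assumes a plain-text, tab-separated VCF with a `GT` entry in the FORMAT column, and it assumes that the CHROM column identifies the locus (the `locus_of` argument is a hypothetical hook to change that if your file encodes the locus elsewhere).

```python
from collections import defaultdict

def zygosity(gt):
    """Classify a diploid VCF GT string as 'het' or 'hom'; None if missing."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    return "hom" if alleles[0] == alleles[1] else "het"

def mixed_loci(vcf_lines, locus_of=lambda fields: fields[0]):
    """Return {locus: [samples]} for samples that mix het and hom calls
    across the SNPs of one locus.

    Assumption: locus_of(fields) gives the locus identifier of a record;
    by default this is the CHROM column.
    """
    samples = []
    seen = defaultdict(lambda: defaultdict(set))  # locus -> sample -> {'het', 'hom'}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        gt_index = fields[8].split(":").index("GT")
        for name, call in zip(samples, fields[9:]):
            z = zygosity(call.split(":")[gt_index])
            if z:
                seen[locus_of(fields)][name].add(z)
    return {
        locus: sorted(name for name, zs in per_sample.items() if len(zs) == 2)
        for locus, per_sample in seen.items()
        if any(len(zs) == 2 for zs in per_sample.values())
    }
```

Passing an open file handle for the VCF to `mixed_loci` returns a dictionary; an empty dictionary corresponds to the situation described above, where no sample combines heterozygous and homozygous calls within a locus.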

What's happening? Maybe that information is being lost between file conversions? Or maybe it's irretrievable somehow? Or am I missing something, like a parameter in the pipeline?

Thank you very much in advance!

Érico.

@pierrepeterlongo
Collaborator

Dear Érico.

Thanks for your message.

While discovering a bubble, discoSnp and discoSnpRad always extend both paths with the same nucleotide. If a bubble closes this way, other potential paths are not explored.
Hence a bubble combining close heterozygous and homozygous SNPs (e.g. an individual ..A..C.. + ..T..G.. and an individual ..A..C.. + ..T..C..) will create this graph:
image
In this case two distinct bubbles would be found: one with the first SNP A/T and one with the second SNP C/G.
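
The two-individual example can be written out as a small Python toy. This is only an illustration of the reasoning, not discoSnp code: the names `individuals`, `next_alleles` and `genotype_per_snp` are made up here, and each haplotype is reduced to its alleles at the two SNP positions.

```python
# Toy data from the example: each individual carries two haplotypes,
# written as (allele at SNP 1, allele at SNP 2).
individuals = {
    "individual_1": [("A", "C"), ("T", "G")],
    "individual_2": [("A", "C"), ("T", "C")],
}

def next_alleles(haplotypes):
    """For each allele at SNP 1, collect the alleles seen after it at SNP 2."""
    followers = {}
    for first, second in haplotypes:
        followers.setdefault(first, set()).add(second)
    return followers

def genotype_per_snp(haplotypes):
    """Genotype each SNP independently, as if it were its own bubble."""
    calls = []
    for site in range(len(haplotypes[0])):
        alleles = sorted({hap[site] for hap in haplotypes})
        if len(alleles) == 1:
            calls.append("hom " + alleles[0])
        else:
            calls.append("het " + "/".join(alleles))
    return calls

all_haplotypes = [hap for haps in individuals.values() for hap in haps]
# The A path is followed only by C, while the T path is followed by C or G,
# so the two paths cannot both be extended with the same nucleotide throughout.
print(next_alleles(all_haplotypes))

for name, haps in individuals.items():
    print(name, genotype_per_snp(haps))
```

Taken one SNP at a time, the first individual is heterozygous at both sites, while the second is heterozygous at A/T and homozygous at the second site.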

In a more complex case, if all branchings are possible, this generates what we call a "symmetrically branching bubble", as in this case:
image
Here the two SNPs are dumped as two distinct bubbles.

Hence, as you can see, none of these bubbles contains both heterozygous and homozygous SNPs. Those SNPs are found, but separately. This can be seen as a limitation of the tool, but it is also a way to avoid a combinatorial explosion in the number of distinct bubbles reported.

I hope this helps.
Pierre
