Problems genotyping loci containing multiple SNPs #43

ericopolo · 2024-07-19T23:10:35Z

I've just noticed something very odd while inspecting the resulting VCF file (the first one, containing both linked and unlinked SNPs). For a given locus containing multiple SNPs (i.e. more than one polymorphism in the same "bubble"), every diploid sample is either heterozygous for ALL SNPs or homozygous for ALL SNPs, which isn't realistic at all.

I have a ddRAD dataset containing over a million loci (of which about 60% have two or more polymorphisms) and 25 samples, and not a single sample has a combination of both heterozygous and homozygous sites in the same locus. I observed this using both discoSnp and discoSnpRad (the parameters I used in this particular case were k_31_c_3_D_0_P_5_m_5). Looking at the .fa file (the first generated output), I can see why the VCF can't carry any further information about the exact bases at each SNP: each locus has only one genotype (either 0/0, 0/1, 1/1 or ./.) per sample, which forces any conversion tool to assume the same genotype for every SNP in that locus. I'm using the latest available version (2.6.2).
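For anyone who wants to check their own VCF for the same pattern, here is a minimal Python sketch. It rests on one assumption that is not confirmed anywhere in this thread: that all SNPs from the same locus share the same value in the first (CHROM) column. If your VCF identifies loci differently, change the `locus` line. The function name `mixed_zygosity_loci` is made up for this example.

```python
from collections import defaultdict


def mixed_zygosity_loci(vcf_lines):
    """Return {locus: [sample indices]} for samples that are heterozygous
    at one SNP of a locus and homozygous at another SNP of the same locus."""
    # locus -> sample index -> set of observed states ("het" and/or "hom")
    seen = defaultdict(lambda: defaultdict(set))
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        locus = fields[0]  # ASSUMPTION: CHROM column identifies the locus
        for i, sample in enumerate(fields[9:]):
            # GT is the first colon-separated sub-field of a sample column
            alleles = sample.split(":")[0].replace("|", "/").split("/")
            if len(alleles) != 2 or "." in alleles:
                continue  # missing or non-diploid call
            seen[locus][i].add("het" if alleles[0] != alleles[1] else "hom")
    result = {}
    for locus, samples in seen.items():
        mixed = sorted(i for i, states in samples.items() if len(states) == 2)
        if mixed:
            result[locus] = mixed
    return result


# Toy records (invented for illustration, not taken from real output)
toy = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2",
    "locus_1\t10\t.\tA\tT\t.\t.\t.\tGT\t0/1\t0/1",
    "locus_1\t14\t.\tC\tG\t.\t.\t.\tGT\t0/1\t0/0",
]
print(mixed_zygosity_loci(toy))
```

An empty result on a real dataset would correspond to what I describe above: no sample mixes heterozygous and homozygous calls within a locus.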

What's happening? Maybe that information is being lost between file conversions? Or maybe it's irretrievable somehow? Or am I missing something, like a parameter in the pipeline?

Thank you very much in advance!

Érico.

pierrepeterlongo · 2024-08-12T15:36:17Z

Dear Érico.

Thanks for your message.

While discovering a bubble, discoSnp and discoSnpRad always extend both paths with the same nucleotide. If a bubble closes this way, other potential paths are not explored.
Hence, a region accumulating close heterozygous and homozygous SNPs (e.g. one individual with ..A..C.. + ..T..G.. and another individual with ..A..C.. + ..T..C..) will create this graph:

In this case, two distinct bubbles would be found: one with the first SNP (A/T) and one with the second SNP (C/G).
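The two-individual example above can be worked through with a small toy sketch. This is only an illustration of the genotypes involved, not discoSnp's actual algorithm; the function `per_snp_genotypes` is hypothetical, and each haplotype is reduced to its two variable bases.

```python
def per_snp_genotypes(hap_a, hap_b, ref_hap):
    """Genotype each SNP position of a diploid sample.
    Allele 0 = base found in ref_hap, allele 1 = any other base."""
    calls = []
    for a, b, r in zip(hap_a, hap_b, ref_hap):
        alleles = sorted(int(base != r) for base in (a, b))
        calls.append(f"{alleles[0]}/{alleles[1]}")
    return calls


# Individual 1: ..A..C.. + ..T..G..   Individual 2: ..A..C.. + ..T..C..
# "AC" is arbitrarily taken as the 0-allele haplotype (an assumption).
ind1 = per_snp_genotypes("AC", "TG", ref_hap="AC")
ind2 = per_snp_genotypes("AC", "TC", ref_hap="AC")
print(ind1)  # heterozygous at both SNPs
print(ind2)  # heterozygous at the first SNP, homozygous at the second
```

Individual 2 cannot be described by a single genotype on a two-SNP bubble "AC vs TG", because its second haplotype (TC) matches neither path. Reporting A/T and C/G as two separate bubbles lets each one be genotyped on its own.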

In a more complex case, if all branchings are possible, this generates what we call a "symmetrically branching bubble", as in this case:

Here the two SNPs are dumped as two distinct bubbles.

Hence, as you can see, none of these bubbles contains both heterozygous and homozygous SNPs. Those SNPs are found, but separately. This can be seen as a limitation of the tool, but it is also a way to avoid a combinatorial explosion in the number of distinct bubbles reported.
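To give a rough feeling for that explosion, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that with n close SNPs every combination of alleles exists as a path, and that every pair of distinct paths would be reported as its own bubble. This is a naive count, not a description of discoSnp's internals, and `count_paths_and_pairs` is a made-up name.

```python
from itertools import combinations, product


def count_paths_and_pairs(n_snps):
    """Count allele-combination paths over n_snps biallelic SNPs and the
    number of distinct pairs of paths (naive 'one bubble per pair' count)."""
    paths = list(product((0, 1), repeat=n_snps))
    pairs = list(combinations(paths, 2))
    return len(paths), len(pairs)


for n in (2, 3, 4):
    n_paths, n_pairs = count_paths_and_pairs(n)
    print(f"{n} SNPs: {n_paths} paths, {n_pairs} path pairs, "
          f"versus {n} single-SNP bubbles")
```

Under these assumptions the number of path pairs grows as 2^n (2^n - 1) / 2, while the number of single-SNP bubbles grows only linearly with n.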

I hope this helps.
Pierre
