I've just noticed something very odd while inspecting the resulting VCF file (the first one, containing both linked and unlinked SNPs). For a given locus containing multiple SNPs (i.e. more than one polymorphism in the same "bubble"), every diploid sample is either heterozygous or homozygous for ALL SNPs, which isn't realistic at all.
I have a ddRAD dataset containing over a million loci (of which about 60% have >= 2 polymorphisms) and 25 samples, and not a single sample has a combination of both heterozygous and homozygous sites in the same locus. I observed this using both discoSnp and discoSnpRad (the parameters I used in this particular case were k_31_c_3_D_0_P_5_m_5). Looking at the .fa file (the first generated output), I can see why the VCF couldn't get any further information about the exact bases at each SNP: for each locus there is only one genotype (either 0/0, 0/1, 1/1 or ./.) per sample, which forces any conversion tool to assume the same genotype for every SNP in that locus. I'm using the latest available version (2.6.2).
What's happening? Maybe that information is being lost between file conversions? Or maybe it's irretrievable somehow? Or am I missing something, like a parameter in the pipeline?
Thank you very much in advance!
Érico.
While discovering a bubble, discoSnp and discoSnpRad always extend both paths with the same nucleotide. If a bubble closes this way, other potential paths are not explored.
Hence a bubble combining close heterozygous and homozygous SNPs (e.g. one individual with ..A..C.. + ..T..G.. and another individual with ..A..C.. + ..T..C..) will create this graph:
In this case two distinct bubbles would be found: the one with the first SNP A/T and the one with the second SNP C/G.
In a more complex case, if all branchings are possible, this generates what we call a "symmetrically branching bubble", as in this case:
Here the two SNPs are dumped as two distinct bubbles.
Hence, as you can see, none of these bubbles contains both heterozygous and homozygous SNPs. Those SNPs are found, but separately. This can be seen as a limitation of the tool, but it is also a way to avoid a combinatorial explosion in the number of distinct bubbles reported.
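To make the behaviour described above concrete, here is a toy Python sketch. It is not discoSnp's code and does not build a de Bruijn graph: it assumes each haplotype is reduced to a short string of its alleles at the variable positions only, and it models just the rule stated above (both paths are extended with the same nucleotide whenever that is possible, and only diverge again when it is not). The function names `toy_bubbles`, `next_alleles` and `extend` are made up for this illustration.

```python
from itertools import combinations

def next_alleles(haps, path):
    """Alleles that can follow `path` according to the observed haplotypes."""
    return {h[len(path)] for h in haps if h.startswith(path)}

def extend(haps, p1, p2, length):
    """Extend two paths in lockstep, preferring a shared nucleotide (toy rule)."""
    if len(p1) == length:
        yield p1, p2
        return
    n1, n2 = next_alleles(haps, p1), next_alleles(haps, p2)
    common = n1 & n2
    if common:
        # Same-nucleotide extension is possible: other paths are not explored.
        for c in sorted(common):
            yield from extend(haps, p1 + c, p2 + c, length)
    else:
        # No shared nucleotide: the paths must differ again (a linked SNP).
        for x in sorted(n1):
            for y in sorted(n2):
                yield from extend(haps, p1 + x, p2 + y, length)

def toy_bubbles(haplotypes):
    """Return a set of tuples; each tuple lists the SNP positions of one toy bubble."""
    haps = set(haplotypes)
    length = len(next(iter(haps)))
    found = set()
    # A bubble starts wherever two observed haplotypes share a prefix and then branch.
    prefixes = {h[:i] for h in haps for i in range(length)}
    for prefix in prefixes:
        start = len(prefix)
        alleles = sorted({h[start] for h in haps if h.startswith(prefix)})
        for a, b in combinations(alleles, 2):
            for p1, p2 in extend(haps, prefix + a, prefix + b, length):
                found.add(tuple(i for i in range(length) if p1[i] != p2[i]))
    return found

# Example from the reply: haplotypes A..C, T..G and T..C are all present.
print(sorted(toy_bubbles(["AC", "TG", "TC"])))  # [(0,), (1,)]
# Perfectly linked alleles: only A..C and T..G exist.
print(sorted(toy_bubbles(["AC", "TG"])))        # [(0, 1)]
```

Under these assumptions, the two SNPs end up in one bubble only when their alleles are perfectly linked, in which case a sample that is heterozygous for one SNP is necessarily heterozygous for the other. As soon as a recombinant haplotype such as T..C is present, the SNPs are reported as separate single-SNP bubbles, which is consistent with never seeing mixed heterozygous and homozygous genotypes inside one multi-SNP locus.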