Enforcing a standard for reporting REF and ALT allele depths #78
Comments
I had to deal with this DP issue in my review last year. It was a big headache. I had to treat the VCF generated by each SNP caller as a different format and parse it conditionally (see this code block). This means that the script won't work with new SNP callers or when the existing SNP callers change their tags. As to the proposal, I also prefer to get counts on both strands. Nonetheless, for strand-specific counts, we'd better not reuse |
Base counts sound specific to SNPs, why not AD, ADF and ADR? |
I agree that "BC" is not a good name. I have no preference over ADF+ADR vs ADS. I'd just like to see SNP callers all use the same tag(s) to report read depth. |
I think you should keep the forward and reverse counts in separate tags so you can use |
Good point, @mcshane. So, you propose keep |
Agree |
Obviously from the GATK perspective, keeping AD the same is the easiest for us. We don't currently use the forward and reverse allele depths, but users might find that informative. |
So, the current proposal is to reserve either of the following sets of format keys:
or these
Of the two variants I prefer the first because it is more space-efficient, and if a stranded version is needed, one always wants to see the depth on both strands. EDIT:
|
@pd3: I'm assuming you mean "Allele dosage on re_v_ strand". Also, could you explain how ADS would work, since we would need 2R floats? Would it be enough to have ##FORMAT=<ID=AD,Number=R,Type=Float,Description="Allele dosage"> and infer the rev count from the difference? (if we want to be space On Tue, May 19, 2015 at 9:11 AM, pd3 [email protected] wrote:
|
I did not read the thread above properly; it seemed to suggest that @mcshane was proposing the ADS tag by referencing it from an unrelated pull request. I also missed the Number=2R concern. Since the specification has only Number=R fields, I agree it is practical to go with ADF and ADR tags so that allele trimming and sanity checking work. |
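To make the allele-trimming and sanity-checking point concrete, here is a minimal Python sketch. It assumes AD, ADF and ADR are each Number=R integer lists (reference first) and that AD is the sum of the two strands. The function names and the per-strand numbers are hypothetical, chosen only for illustration, and are not taken from any existing tool.

```python
from typing import Dict, List


def check_strand_depths(ad: List[int], adf: List[int], adr: List[int]) -> bool:
    """Return True if AD equals ADF + ADR for every allele (assumed relationship)."""
    if not (len(ad) == len(adf) == len(adr)):
        return False
    return all(total == fwd + rev for total, fwd, rev in zip(ad, adf, adr))


def trim_alleles(fields: Dict[str, List[int]], keep: List[int]) -> Dict[str, List[int]]:
    """Keep only the allele indices in `keep` (0 = REF) for every Number=R field."""
    return {tag: [values[i] for i in keep] for tag, values in fields.items()}


# Illustrative sample with one REF and two ALT alleles (made-up strand split).
sample = {"AD": [173, 141, 2], "ADF": [90, 70, 1], "ADR": [83, 71, 1]}
trimmed = trim_alleles(sample, keep=[0, 1])  # drop the second ALT allele
```

Because every tag has exactly one value per allele, the same index list trims all three fields, which would not be as simple with a single Number=2R-style tag.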
The proposal would be for something like
|
+1 |
Can we extend this proposal to include the total depths in INFO similarly to FORMAT?
|
+1 |
+1 to both INFO and FORMAT. |
Agreed. |
+1 for FORMAT. not sure for INFO. |
@atks samtools already uses two flavours of this annotation (INFO/DP4,DPR), I think it makes sense to standardize both the FORMAT and the INFO columns. I believe there is a consensus about this, so I am closing this thread. Please reopen if not. |
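As a rough illustration of how the INFO-level totals could relate to the per-sample FORMAT values, the sketch below sums FORMAT/AD across samples. That INFO/AD is exactly this sum is an assumption made for the example, and the function name is hypothetical.

```python
from typing import List, Optional


def info_allele_depths(sample_ads: List[Optional[List[int]]]) -> List[int]:
    """Sum per-sample FORMAT/AD lists into one per-allele total.

    Samples with a missing value (None, i.e. "." in the VCF) are skipped.
    """
    totals: List[int] = []
    for ad in sample_ads:
        if ad is None:
            continue
        if not totals:
            # First sample with data fixes the number of alleles.
            totals = [0] * len(ad)
        if len(ad) != len(totals):
            raise ValueError("all samples must report one depth per allele")
        totals = [t + d for t, d in zip(totals, ad)]
    return totals
```

For example, `info_allele_depths([[173, 141], None, [20, 0]])` adds the two samples that have data and ignores the missing one.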
- meta-information lines must be key=value pairs (#67)
- an ID attribute is required in structured header lines, unique within its type
- the above point newly requires ID in the reserved PEDIGREE tag
- new reserved AD, ADF, and ADR FORMAT and INFO fields added, resolves #78
- reorder list of INFO and FORMAT tags alphabetically
- removed UNICODE-characters-not-supported sentence from BCF specification, in partial response to #65
I am not able to implement this easily due to the fact that So I am reticent. But there may be a way to resolve it. Wish I had seen this earlier. Freebayes uses QA, QR, AO, and RO for quality sums and allele depth. |
Well it looks like there is strong consensus about this :) So what about quality sums? AQ? |
* the unseen allele is to be specified as <*> rather than X or <X> as has been the case in mpileup for a while. See samtools/hts-specs@4a91745
  Note that bcftools call supports X, <X> and <*>, see samtools/bcftools@802ff30
* add options to output AD,ADF,ADR (samtools/hts-specs#78)
* deprecate DV,DP4,DPR annotations as largely superseded by the new AD tags
The current VCF specification does a very admirable job of defining the standards by which variant callers should report genotypes (GT), genotype likelihoods (GL), and overall observed depth (DP) for each sample at a given variant site.
However, in my opinion, the specification is lacking in that it fails to define a standard for reporting the sequencing depths observed for each sample and allele. This lack of specificity has allowed variant callers to each define their own convention. Consequently, downstream tools are left to write custom code that attempts to handle the rules defined by each variant caller.
For instance, GATK uses AD, which reports allele depths as a comma-separated list (173 for the reference and 141 for the alternate in this example):

Freebayes, however, uses two different tags, RO and AO, for reference and alternate depth, respectively.

Other tools such as VARSCAN and Platypus employ yet other conventions.
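The kind of caller-specific handling this forces on downstream tools can be sketched as follows. This is a minimal Python illustration, not the cyvcf code: the function name is hypothetical, the sample is assumed to be a dict of already-split FORMAT strings, and only the GATK and Freebayes conventions described above are covered.

```python
from typing import Dict, List


def allele_depths(sample: Dict[str, str]) -> List[int]:
    """Return [ref_depth, alt1_depth, ...] from one sample's FORMAT fields."""
    if "AD" in sample:
        # GATK-style: one comma-separated list, REF first, then each ALT.
        return [int(x) for x in sample["AD"].split(",")]
    if "RO" in sample and "AO" in sample:
        # Freebayes-style: RO holds the REF depth, AO one value per ALT.
        return [int(sample["RO"])] + [int(x) for x in sample["AO"].split(",")]
    raise KeyError("no recognised allele-depth tags in sample")
```

Every additional caller convention needs another branch here, which is the maintenance burden a single standard tag would remove.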
Given that the inspection of allele depths is crucial to quality control and data interpretation, I think it would be best to define a standard for this. In my opinion, the GATK convention makes the most sense, as it works for single or multiple alternate alleles, so long as the reference allele is always reported first.
The standard could allow support for either reporting the total depths for each allele:
or optionally allow another delimiter (;?) defining the depth on each strand (positive first) for each allele:

It would be great if we could make progress towards a standard here, as it would prevent hideous code such as: https://github.com/arq5x/cyvcf/blob/master/cyvcf/parser.pyx#L264-L295
Best,
Aaron Quinlan
USTAR Center for Genetic Discovery
Department of Human Genetics and Biomedical Informatics
University of Utah