Enforcing a standard for reporting REF and ALT allele depths #78

arq5x · 2015-03-27T14:23:08Z

The current VCF specification does a very admirable job of defining the standards by which variant callers should report genotypes (GT), genotype likelihoods (GL), and overall observed depth (DP) for each sample at a given variant site.

However, in my opinion, the specification is lacking in that it fails to define a standard for reporting the sequencing depths observed for each sample and allele. This lack of specificity has allowed variant callers to each define their own convention. Consequently, downstream tools are left to write custom code that attempts to handle the rules defined by each variant caller.

For instance, GATK uses AD, which reports allele depths as a comma-separated list (173 for the reference and 141 for the alternate in this example:

chr1    873762  .       T   G   100 .    AC=1;AF=0.50;DP=315 GT:AD:DP:GQ:PL   0/1:173,141:282:99:255,0,255

Freebayes, however uses two different tags, RO and AO for reference and alternate depth, respectively.

GT:GQ:DP:RO:QR:AO:QA:GL 0/0:0:5:5:132:0:0:0,-0.32169,-4.34301

Other tools such as VARSCAN and Platypus employ yet other conventions.

Given that the inspection of allele depths are are crucial to quality control and data interpretation, I think it would be best to define a standard for this. In my opinion, the GATK convention makes most sense, as it works for single or multiple alternate alleles, so long as the reference allele is always reported first.

The standard could allow support for either reporting the total depths for each allele:

GT:AD  0/1:173,141

or optionally allow another delimiter (; ?) defining the depth on each strand (positive first) for each allele:

GT:AD  0/1:100;73,70;71

It would be great if we could make progress towards a standard here, as it would prevent hideous code such as: https://github.com/arq5x/cyvcf/blob/master/cyvcf/parser.pyx#L264-L295

Best,
Aaron Quinlan
USTAR Center for Genetic Discovery
Department of Human Genetics and Biomedical Informatics
University of Utah

The text was updated successfully, but these errors were encountered:

lh3 · 2015-03-27T17:07:28Z

I have to deal with this DP issue in my review last year. It was a big headache. I had to treat the VCF generated by each SNP caller as a different format and parses it conditionally (see this code block). This means that script won't work with new SNP callers or when the existing SNP callers change their tags.

As to the proposal, I also prefer to get counts on both strands. Nonetheless, for strand-specific counts, we'd better not reuse AD. Adding ";" will also turn AD from an array to a string, making it less efficient. How about adding BC (base counts), which gives an array of forward and reverse counts, in the order of #refAlleleFor, #refAlleleRev, #allele1For, #allele1Rev, ...? Alternatively, we can separate BC to BCF and BCR for forward/reverse counts only?

atks · 2015-03-27T17:31:37Z

Base counts sound specific to SNPs, why not AD, ADF and ADR?

arq5x · 2015-03-27T17:47:30Z

@lh3 - yep, that sounds good, but I do agree with @atks that base counts is too specific for general alleles. How about ADS "allele depths, stranded" or SAD "stranded allele depths"?

lh3 · 2015-03-27T17:53:12Z

I agree that "BC" is not a good name. I have no preference over ADF+ADR vs ADS. I'd just like to see SNP callers all use the same tag(s) to report read depth.

mcshane · 2015-03-27T17:55:55Z

I think you should keep the forward and reverse counts in separate tags so you can use Number=A or Number=R in the header rather than having things of length 2A or 2R. As another datapoint, samtools uses DPR for depth per-allele including reference (R for Number=R).

arq5x · 2015-03-27T17:59:51Z

Good point, @mcshane. So, you propose keep AD as is (i.e., Number=R, comma-separated list of total depth for each allele). Plus, establiish new Number=R tags for depths on the forward (ADF) and reverse (ADR) strands? I like that.

mcshane · 2015-03-27T19:00:36Z

Agree AD/ADF/ADR reads better than DPR. These fields tend to blow out the size of VCF/BCFs quite significantly for large numbers of samples.

ldgauthier · 2015-05-19T12:52:46Z

Obviously from the GATK perspective, keeping AD the same is the easiest for us. We don't currently use the forward and reverse allele depths, but users might find that informative.

pd3 · 2015-05-19T13:11:14Z

So, the current proposal is to reserve either the following format keys:

##FORMAT=<ID=AD,Number=R,Type=Float,Description="Allele dosage">
##FORMAT=<ID=ADS,Number=R,Type=Float,Description="Allele dosage on fwd and rev strand">

or these

##FORMAT=<ID=AD,Number=R,Type=Float,Description="Allele dosage">
##FORMAT=<ID=ADF,Number=R,Type=Float,Description="Allele dosage on fwd strand">
##FORMAT=<ID=ADR,Number=R,Type=Float,Description="Allele dosage on rev strand">

From the two variants I prefer the first because it is more space-efficient and if a stranded version is needed, one always wants to see the depth on both strands.

EDIT:

"rev", not "ref" strand.
ADS was not really proposed, my mistake, see also below

yfarjoun · 2015-05-19T13:17:34Z

@pd3: I'm assuming you mean "Allele dosage on re_v_ strand"

Also, could you explain how ADS would work since we would need 2R floats
for it.

Would be be enough to have

##FORMAT=<ID=AD,Number=R,Type=Float,Description="Allele dosage">
##FORMAT=<ID=ADF,Number=R,Type=Float,Description="Allele dosage on fwd
strand">

and infer the rev count from the difference? (if we want to be space
efficient)

On Tue, May 19, 2015 at 9:11 AM, pd3 [email protected] wrote:

So, the current proposal is to reserve either the following format keys:

##FORMAT=<ID=AD,Number=R,Type=Float,Description="Allele dosage">
##FORMAT=<ID=ADS,Number=R,Type=Float,Description="Allele dosage on fwd and rev strand">

or these

##FORMAT=<ID=AD,Number=R,Type=Float,Description="Allele dosage">
##FORMAT=<ID=ADF,Number=R,Type=Float,Description="Allele dosage on fwd strand">
##FORMAT=<ID=ADR,Number=R,Type=Float,Description="Allele dosage on ref strand">

From the two variants I prefer the first because it is more
space-efficient and if a stranded version is needed, one always wants to
see the depth on both strands.

—
Reply to this email directly or view it on GitHub
#78 (comment).

pd3 · 2015-05-19T13:23:22Z

I did not read the thread above properly, it seemed to suggest that @mcshane was proposing the ADS tag by referencing it from an unrelated pull request. I also missed the Number=2R concern. Since the specification has only Number=R fields, I agree it is practical to go with ADF and ADR tags so that allele trimming and sanity checking works.

mcshane · 2015-05-19T13:33:51Z

The proposal would be for something like

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
##FORMAT=<ID=ADF,Number=R,Type=Integer,Description="Allelic depths on the forward strand">
##FORMAT=<ID=ADR,Number=R,Type=Integer,Description="Allelic depths on the reverse strand">

ldgauthier · 2015-05-19T13:36:06Z

+1

pd3 · 2015-05-19T14:58:00Z

Can we extend this proposal to include the total depths in INFO similarly to FORMAT?

##INFO=<ID=AD,Number=R,Type=Integer,Description="Total allelic depths">
##INFO=<ID=ADF,Number=R,Type=Integer,Description="Total allelic depths on the forward strand">
##INFO=<ID=ADR,Number=R,Type=Integer,Description="Total allelic depths on the reverse strand">

eitanbanks · 2015-05-20T20:55:47Z

+1

lh3 · 2015-05-20T21:00:26Z

+1 to both INFO and FORMAT.

arq5x · 2015-05-20T21:03:59Z

+1 to both INFO and FORMAT.

Agreed.

atks · 2015-05-21T00:26:41Z

+1 for FORMAT. not sure for INFO.

pd3 · 2015-05-21T08:45:35Z

@atks samtools already uses two flavours of this annotation (INFO/DP4,DPR), I think it makes sense to standardize both the FORMAT and the INFO columns.

I believe there is a consensus about this, so I am closing this thread. Please reopen if not.

- meta-information lines must be key=value pairs (#67) - an ID attribute is required in structured header lines, unique within its type - the above point newly requires ID in the reserved PEDIGREE tag - new reserved AD, ADF, and ADR FORMAT and INFO fields added, resolves #78 - reorder list of INFO and FORMAT tags alphabetically - removed UNICODE-characters-not-supported sentence from BCF specification, in partial response to #65

ekg · 2015-07-17T14:34:23Z

I am not able to implement this easily due to the fact that Number=R is not something that can be dumped into tabular format. It also breaks filtering.

So I am reticent. But there may be a way to resolve it.

Wish I had seen this earlier. Freebayes uses QA, QR, AO, and RO for quality sums and allele depth.

ekg · 2015-07-17T14:40:47Z

Well it looks like there is strong consensus about this :)

So what about quality sums? AQ?

* the unseen allele is to be specified as <*> rather then X or <X> as has been the case in mpileup for a while. See samtools/hts-specs@4a91745 Note that bcftools call supports X, <X> and <*>, see samtools/bcftools@802ff30 * add options to output AD,ADF,ADR (samtools/hts-specs#78) * deprecate DV,DP4,DPR annotations as largely superseded by the new AD tags

mcshane mentioned this issue May 12, 2015

change allele dosage annotation from AD to ADS to avoid collision wit… richarddurbin/pbwt#31

Merged

pd3 closed this as completed May 21, 2015

pd3 mentioned this issue Aug 24, 2015

Standardise the description of Reference and Alternative alleles in sample details #38

Closed

mcshane mentioned this issue Feb 17, 2016

Writing out FORMAT/DPR tag in addition to FORMAT/DP and FORMAT/AO freebayes/freebayes#247

Merged

heuermh mentioned this issue Dec 10, 2016

Add variant and variant annotation conversion functions fnothaft/adam#11

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enforcing a standard for reporting REF and ALT allele depths #78

Enforcing a standard for reporting REF and ALT allele depths #78

arq5x commented Mar 27, 2015

lh3 commented Mar 27, 2015

atks commented Mar 27, 2015

arq5x commented Mar 27, 2015

lh3 commented Mar 27, 2015

mcshane commented Mar 27, 2015

arq5x commented Mar 27, 2015

mcshane commented Mar 27, 2015

ldgauthier commented May 19, 2015

pd3 commented May 19, 2015

yfarjoun commented May 19, 2015

pd3 commented May 19, 2015

mcshane commented May 19, 2015

ldgauthier commented May 19, 2015

pd3 commented May 19, 2015

eitanbanks commented May 20, 2015

lh3 commented May 20, 2015

arq5x commented May 20, 2015

atks commented May 21, 2015

pd3 commented May 21, 2015

ekg commented Jul 17, 2015

ekg commented Jul 17, 2015

Enforcing a standard for reporting REF and ALT allele depths #78

Enforcing a standard for reporting REF and ALT allele depths #78

Comments

arq5x commented Mar 27, 2015

lh3 commented Mar 27, 2015

atks commented Mar 27, 2015

arq5x commented Mar 27, 2015

lh3 commented Mar 27, 2015

mcshane commented Mar 27, 2015

arq5x commented Mar 27, 2015

mcshane commented Mar 27, 2015

ldgauthier commented May 19, 2015

pd3 commented May 19, 2015

yfarjoun commented May 19, 2015

pd3 commented May 19, 2015

mcshane commented May 19, 2015

ldgauthier commented May 19, 2015

pd3 commented May 19, 2015

eitanbanks commented May 20, 2015

lh3 commented May 20, 2015

arq5x commented May 20, 2015

atks commented May 21, 2015

pd3 commented May 21, 2015

ekg commented Jul 17, 2015

ekg commented Jul 17, 2015