Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enforcing a standard for reporting REF and ALT allele depths #78

Closed
arq5x opened this issue Mar 27, 2015 · 21 comments
Closed

Enforcing a standard for reporting REF and ALT allele depths #78

arq5x opened this issue Mar 27, 2015 · 21 comments

Comments

@arq5x
Copy link

arq5x commented Mar 27, 2015

The current VCF specification does a very admirable job of defining the standards by which variant callers should report genotypes (GT), genotype likelihoods (GL), and overall observed depth (DP) for each sample at a given variant site.

However, in my opinion, the specification is lacking in that it fails to define a standard for reporting the sequencing depths observed for each sample and allele. This lack of specificity has allowed variant callers to each define their own convention. Consequently, downstream tools are left to write custom code that attempts to handle the rules defined by each variant caller.

For instance, GATK uses AD, which reports allele depths as a comma-separated list (173 for the reference and 141 for the alternate in this example:

chr1    873762  .       T   G   100 .    AC=1;AF=0.50;DP=315 GT:AD:DP:GQ:PL   0/1:173,141:282:99:255,0,255

Freebayes, however uses two different tags, RO and AO for reference and alternate depth, respectively.

GT:GQ:DP:RO:QR:AO:QA:GL 0/0:0:5:5:132:0:0:0,-0.32169,-4.34301

Other tools such as VARSCAN and Platypus employ yet other conventions.

Given that the inspection of allele depths are are crucial to quality control and data interpretation, I think it would be best to define a standard for this. In my opinion, the GATK convention makes most sense, as it works for single or multiple alternate alleles, so long as the reference allele is always reported first.

The standard could allow support for either reporting the total depths for each allele:

GT:AD  0/1:173,141

or optionally allow another delimiter (; ?) defining the depth on each strand (positive first) for each allele:

GT:AD  0/1:100;73,70;71

It would be great if we could make progress towards a standard here, as it would prevent hideous code such as: https://github.com/arq5x/cyvcf/blob/master/cyvcf/parser.pyx#L264-L295

Best,
Aaron Quinlan
USTAR Center for Genetic Discovery
Department of Human Genetics and Biomedical Informatics
University of Utah

@lh3
Copy link
Member

lh3 commented Mar 27, 2015

I have to deal with this DP issue in my review last year. It was a big headache. I had to treat the VCF generated by each SNP caller as a different format and parses it conditionally (see this code block). This means that script won't work with new SNP callers or when the existing SNP callers change their tags.

As to the proposal, I also prefer to get counts on both strands. Nonetheless, for strand-specific counts, we'd better not reuse AD. Adding ";" will also turn AD from an array to a string, making it less efficient. How about adding BC (base counts), which gives an array of forward and reverse counts, in the order of #refAlleleFor, #refAlleleRev, #allele1For, #allele1Rev, ...? Alternatively, we can separate BC to BCF and BCR for forward/reverse counts only?

@atks
Copy link

atks commented Mar 27, 2015

Base counts sound specific to SNPs, why not AD, ADF and ADR?

@arq5x
Copy link
Author

arq5x commented Mar 27, 2015

@lh3 - yep, that sounds good, but I do agree with @atks that base counts is too specific for general alleles. How about ADS "allele depths, stranded" or SAD "stranded allele depths"?

@lh3
Copy link
Member

lh3 commented Mar 27, 2015

I agree that "BC" is not a good name. I have no preference over ADF+ADR vs ADS. I'd just like to see SNP callers all use the same tag(s) to report read depth.

@mcshane
Copy link

mcshane commented Mar 27, 2015

I think you should keep the forward and reverse counts in separate tags so you can use Number=A or Number=R in the header rather than having things of length 2A or 2R. As another datapoint, samtools uses DPR for depth per-allele including reference (R for Number=R).

@arq5x
Copy link
Author

arq5x commented Mar 27, 2015

Good point, @mcshane. So, you propose keep AD as is (i.e., Number=R, comma-separated list of total depth for each allele). Plus, establiish new Number=R tags for depths on the forward (ADF) and reverse (ADR) strands? I like that.

@mcshane
Copy link

mcshane commented Mar 27, 2015

Agree AD/ADF/ADR reads better than DPR. These fields tend to blow out the size of VCF/BCFs quite significantly for large numbers of samples.

@ldgauthier
Copy link

Obviously from the GATK perspective, keeping AD the same is the easiest for us. We don't currently use the forward and reverse allele depths, but users might find that informative.

@pd3
Copy link
Member

pd3 commented May 19, 2015

So, the current proposal is to reserve either the following format keys:

##FORMAT=<ID=AD,Number=R,Type=Float,Description="Allele dosage">
##FORMAT=<ID=ADS,Number=R,Type=Float,Description="Allele dosage on fwd and rev strand">

or these

##FORMAT=<ID=AD,Number=R,Type=Float,Description="Allele dosage">
##FORMAT=<ID=ADF,Number=R,Type=Float,Description="Allele dosage on fwd strand">
##FORMAT=<ID=ADR,Number=R,Type=Float,Description="Allele dosage on rev strand">

From the two variants I prefer the first because it is more space-efficient and if a stranded version is needed, one always wants to see the depth on both strands.

EDIT:

  • "rev", not "ref" strand.
  • ADS was not really proposed, my mistake, see also below

@yfarjoun
Copy link
Contributor

@pd3: I'm assuming you mean "Allele dosage on re_v_ strand"

Also, could you explain how ADS would work since we would need 2R floats
for it.

Would be be enough to have

##FORMAT=<ID=AD,Number=R,Type=Float,Description="Allele dosage">
##FORMAT=<ID=ADF,Number=R,Type=Float,Description="Allele dosage on fwd
strand">

and infer the rev count from the difference? (if we want to be space
efficient)

On Tue, May 19, 2015 at 9:11 AM, pd3 [email protected] wrote:

So, the current proposal is to reserve either the following format keys:

##FORMAT=<ID=AD,Number=R,Type=Float,Description="Allele dosage">
##FORMAT=<ID=ADS,Number=R,Type=Float,Description="Allele dosage on fwd and rev strand">

or these

##FORMAT=<ID=AD,Number=R,Type=Float,Description="Allele dosage">
##FORMAT=<ID=ADF,Number=R,Type=Float,Description="Allele dosage on fwd strand">
##FORMAT=<ID=ADR,Number=R,Type=Float,Description="Allele dosage on ref strand">

From the two variants I prefer the first because it is more
space-efficient and if a stranded version is needed, one always wants to
see the depth on both strands.


Reply to this email directly or view it on GitHub
#78 (comment).

@pd3
Copy link
Member

pd3 commented May 19, 2015

I did not read the thread above properly, it seemed to suggest that @mcshane was proposing the ADS tag by referencing it from an unrelated pull request. I also missed the Number=2R concern. Since the specification has only Number=R fields, I agree it is practical to go with ADF and ADR tags so that allele trimming and sanity checking works.

@mcshane
Copy link

mcshane commented May 19, 2015

The proposal would be for something like

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
##FORMAT=<ID=ADF,Number=R,Type=Integer,Description="Allelic depths on the forward strand">
##FORMAT=<ID=ADR,Number=R,Type=Integer,Description="Allelic depths on the reverse strand">

@ldgauthier
Copy link

+1

@pd3
Copy link
Member

pd3 commented May 19, 2015

Can we extend this proposal to include the total depths in INFO similarly to FORMAT?

##INFO=<ID=AD,Number=R,Type=Integer,Description="Total allelic depths">
##INFO=<ID=ADF,Number=R,Type=Integer,Description="Total allelic depths on the forward strand">
##INFO=<ID=ADR,Number=R,Type=Integer,Description="Total allelic depths on the reverse strand">

@eitanbanks
Copy link

+1

@lh3
Copy link
Member

lh3 commented May 20, 2015

+1 to both INFO and FORMAT.

@arq5x
Copy link
Author

arq5x commented May 20, 2015

+1 to both INFO and FORMAT.

Agreed.

@atks
Copy link

atks commented May 21, 2015

+1 for FORMAT. not sure for INFO.

@pd3
Copy link
Member

pd3 commented May 21, 2015

@atks samtools already uses two flavours of this annotation (INFO/DP4,DPR), I think it makes sense to standardize both the FORMAT and the INFO columns.

I believe there is a consensus about this, so I am closing this thread. Please reopen if not.

@pd3 pd3 closed this as completed May 21, 2015
pd3 added a commit that referenced this issue Jun 2, 2015
- meta-information lines must be key=value pairs (#67)
- an ID attribute is required in structured header lines, unique within its type
- the above point newly requires ID in the reserved PEDIGREE tag
- new reserved AD, ADF, and ADR FORMAT and INFO fields added, resolves #78
- reorder list of INFO and FORMAT tags alphabetically
- removed UNICODE-characters-not-supported sentence from BCF specification, in partial response to #65
@ekg
Copy link

ekg commented Jul 17, 2015

I am not able to implement this easily due to the fact that Number=R is not something that can be dumped into tabular format. It also breaks filtering.

So I am reticent. But there may be a way to resolve it.

Wish I had seen this earlier. Freebayes uses QA, QR, AO, and RO for quality sums and allele depth.

@ekg
Copy link

ekg commented Jul 17, 2015

Well it looks like there is strong consensus about this :)

So what about quality sums? AQ?

mcshane added a commit to mcshane/samtools that referenced this issue Aug 5, 2015
* the unseen allele is to be specified as <*> rather then X or <X>
  as has been the case in mpileup for a while.
  See samtools/hts-specs@4a91745
  Note that bcftools call supports X, <X> and <*>,
  see samtools/bcftools@802ff30

* add options to output AD,ADF,ADR (samtools/hts-specs#78)

* deprecate DV,DP4,DPR annotations as largely superseded by
  the new AD tags
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants