VCFv4.3 - first batch of changes #88

pd3 · 2015-06-02T15:17:05Z

Please see also the list of open issues here:
vcf

…r byte

- meta-information lines must be key=value pairs (#67) - an ID attribute is required in structured header lines, unique within its type - the above point newly requires ID in the reserved PEDIGREE tag - new reserved AD, ADF, and ADR FORMAT and INFO fields added, resolves #78 - reorder list of INFO and FORMAT tags alphabetically - removed UNICODE-characters-not-supported sentence from BCF specification, in partial response to #65

droazen · 2015-06-02T16:30:35Z

Please note that this pull request was opened at my request to provide a centralized location for discussion of the existing and proposed changes in VCF 4.3. It should NOT be merged yet, as there are still a few open issues under consideration.

droazen · 2015-06-02T16:33:50Z

Here is a link to a PDF version of the current VCF 4.3 draft (built from this branch), to make it easier to see what's changed (changes are highlighted in red):

https://gist.github.com/droazen/71d394379607674764f8/raw/d9e43db33fa2dbbf4f0db060aa57bf379526dc1a/VCFv4.3.draft.pdf

pgrosu · 2015-06-02T16:36:04Z

Thanks @droazen - much easier to read :)

droazen · 2015-06-02T16:37:08Z

Comments on changes in the existing draft should go directly in this thread. Comments on open issues ( vcf ) not yet incorporated into the draft should go on the page for each respective issue.

vruano · 2015-06-02T17:55:29Z

Why GP is from 0 to 1? Does not seem practical when you are dealing with very small probs. and we already have a mixture of log10 and Phred scale annotations.

vruano · 2015-06-02T18:07:09Z

In 2.1 VCF Tag naming conventions...

I think is a bit too absolutist to say that The "X" suffix means "Blah". I think that the standard should be flexible and allow users to, for example, have some annotations that finish in "L", that are not likelihoods.

I would change the three points to: "If you mean 'Blah' the annotation name should (or must?) finish with 'X'.

So from "The 'L' suffix means 'Likelihoods'" -> "The name of likelihood containing annotations should finish with the 'L' suffix"

vruano · 2015-06-02T18:08:45Z

As with GP, why CNP is in 0 to 1 scale?

bhandsaker · 2015-06-02T18:11:37Z

Section 5.4.11 needs to be updated so that the #PEDIGREE lines have IDs as required by the changes to section 1.2.10.

bhandsaker · 2015-06-02T18:21:13Z

Section 1.4.1 (subsection 5 - ALT) seems inconsistent with new section 5.5 as to whether the "unspecified" allele is represented as "" or as "<>".

I would also suggest adding to section 1.4.1, subsection 5, that "The '*' allele is reserved to indicate that the allele is missing due to an overlapping deletion or some other variant incompatible with the listed alleles."

The overlapping deletion does not necessarily need to be upstream, although this is the easiest case to visualize. Consider two offset deletions of several Kb each. Both the first and second deletion records may use '' to represent a missing allele, since the situation is symmetrical, yet for the first record the conflicting allele is downstream (i.e. appears later in the file). It is also possible that the conflicting variants are not just deletions. For example, two overlapping inversions might also use the '' pseudo-allele to indicate that an allele is unspecified, missing, or described in a different VCF record.

pd3 · 2015-06-03T19:07:30Z

@vruano @atks I updated the draft to d6b46df, it now includes the points discussed in #83.

@droazen There are quite a few parallel topics already in this thread. Wouldn't it be easier to navigate if each was open as a separate issues?

pd3 · 2015-06-03T19:13:45Z

@bhandsaker The * was added already in 4.2 and its purpose seems to be different from <*>. Is * being used at all?

@bhandsaker Good point about the ID and PEDIGREE. It would be good to hear from someone who is actually using it.

reedacartwright · 2015-08-30T06:10:45Z

I'm using the PEDIGREE tags and find that IDs are important as they allow me to specify where in a pedigree an mutation occurred:

DNT=AAxAA>AC;DNL=LB-M3;DNQ=40

These INFO tags describe that the predict de novo type is AAxAA>AC, which occurred in one of the branches leading to LB-M3

The only issue with the PEDIGREE tags is that I cannot store additional information. According to the specs, everything other than ID describes an ancestor. In order to support additional pedigree information it would be useful to have another tag, e.g:

##PEDIGREE=<ID=LB-N3,Father=GL-N1,Mother=GL-N2>
##PEDIGREEDATA=<NID=LB-N3,Type=float,Description="Mutation Rates",Father=1e-8,Mother=1e-9>

Where NID is used in PEDIGREEDATA so there can be multiple lines that contain data for the same node in the pedigree.

pd3 · 2015-09-01T07:22:13Z

Could the existing SAMPLE field be used for this perhaps?

reedacartwright · 2015-09-03T19:55:09Z

I'm not sure if SAMPLE is the proper field to use. My pedigree involves a mix of both germline meiotic events as well as somatic mitotic events. Not every node in my pedigree represents a sample. Some represents libraries (since we support technical replicates of the same sample). Others represent inferred ancestral states that occurred during family history or somatic development.

For me conceptually, the SAMPLE field provides information about the biological samples that were sequenced. Whereas, the data I need to store is about the analysis that was done. E.g. we had 2 libraries for this sample and we assumed that the error rate due to sequencing was 1e-8, etc.

I've already extended the "sample" columns in the vcf to output information about my nodes, including inferred ancestral states and library-specific depths. I have no issues reusing the SAMPLE field as well, but would like to propose some changes to the wording of the VCF format to support a broader concept of what a sample is. If that seems reasonable, I will work on a proposal.

pd3 · 2015-09-04T08:08:31Z

I see what you mean now, you are right that SAMPLE is not right for the job here. However, the specification does not explicitly say that everything other than ID refers to the ancestor, it only lists some examples. How about using the META field also for PEDIGREE and have something like this:

##META=<ID=MR,Type=Float,Number=.,Description="Mutation Rates">
##PEDIGREE=<ID=LB-N3,Father=GL-N1,Mother=GL-N2,MR=1e-8,1e-9>

The PEDIGREEDATA field is a valid solution as well, only the NID attribute would have to become ID.

reedacartwright · 2015-09-04T18:33:30Z

I think the ###META field makes the most sense, since it allows me to keep the meta data with the pedigree information. Is its usage defined in the spec?

However, I would be concerned about the ordering of rates in your example in case some conversion decided to swap the order of Father and Mother without changing the MR pattern. Is there a guarantee in the spec that that the attributes can be retrieved in the order written and that order will be maintained if converted between formats?

##PEDIGREEDATA might be a safer option if I can't rely on the order of attributes mattering in a VCF. The reason why I decided to use NID instead of ID because ID's are supposed to be unique withing a field type. For example, the following would not be allowed:

##PEDIGREE=<ID=LB-N3,Father=GL-N1,Mother=GL-N2>
##PEDIGREEDATA=<ID=LB-N3,Type=float,Description="Mutation Rates",Father=1e-8,Mother=1e-9>
##PEDIGREEDATA=<ID=LB-N3,Type=float,Description="Indel Rates",Father=1e-9,Mother=1e-9>

pd3 · 2015-09-09T07:46:36Z

You are making a good point about the order of attributes, the specification is silent about this. How about splitting the field into FatherMR and MotherMR? I am afraid it was agreed that ID is mandatory, so the original proposal with NID cannot be accepted as is.

The implicit filter PASS was described inconsistently throughout BCFv2.1. It is encoded as the first entry in the dictionary, not the last.

reedacartwright · 2015-09-09T07:58:19Z

Yes, we could use separate ##META lines for mother and father mutation rate.

We could also use ID to uniquely define a ##PEDIGREEDATA line and use PedigreeID to connect the ##PEDIGREEDATA line to a line in the pedigree. It is a little verbose, but perhaps less verbose than defining separate tags for each mother, father, and original value.

##PEDIGREE=<ID=LB-N3,Father=GL-N1,Mother=GL-N2>
##PEDIGREEDATA=<ID=MR-LB-N3,PedigreeID=LB-N3,Type=float,Description="Mutation Rates",Father=1e-8,Mother=1e-9>
##PEDIGREEDATA=<ID=IR-LB-N3,PedigreeID=LB-N3,Type=float,Description="Indel Rates",Father=1e-9,Mother=1e-9>

pd3 · 2015-09-09T08:11:08Z

Yeah, both are possible. The degree of verbosity depends on the number of samples in the study. If there are many samples, the FatherMR and MotherMR are defined only once with the META solution. Also no new tag is required

##META=<ID=FatherMR,Type=Float,Number=1,Description="Paternal Mutation Rate">
##META=<ID=MotherMR,Type=Float,Number=1,Description="Maternal Mutation Rate">
##PEDIGREE=<ID=LB-N1,Father=GL-N1,Mother=GL-N2,FatherMR=1e-8,MotherMR=1e-9>    
##PEDIGREE=<ID=LB-N2,Father=GL-N3,Mother=GL-N4,FatherMR=1e-8,MotherMR=1e-8>
##PEDIGREE=<ID=LB-N3,Father=GL-N5,Mother=GL-N6,FatherMR=1e-9,MotherMR=1e-8>

reedacartwright · 2015-09-11T00:45:43Z

I will give it a go using your META solution.

pd3 · 2015-09-24T07:30:48Z

OK, it seems we are ready to merge the changes. If there are no further objections, I'll do it tomorrow.

tk2 · 2015-09-28T13:21:39Z

@droazen as the GATK rep - are you in agreement with merging?

jmarshall · 2015-10-02T14:06:36Z

In §1.3 (Data types),

Data types supported by VCF are: Integer (32-bit, signed) […] For the Integer type, the values from −2³¹ to −2³¹ + 7 cannot be stored in the binary version, see 6.3.3.

It's not too good to have things that are representable in one format but not the other. So we should say that -2³¹ to −2³¹ + 7 are disallowed in both VCF and BCF.

In §1.6.1 (Fixed fields, §1.4.1 in previous spec versions),

There are 8 fixed fields per record. All data lines are tab-delimited with no tab character at the end of the line. The last data line should end with a line separator. In all cases, missing values are specified with a dot (‘.’).

The first sentence is about fixed fields. The next two sentences (re tabs and lines) are about the whole line, so should be lifted to §1.6 (Data lines).

The final sentence (re missing value dots (‘.’)) is hopefully intended to apply to all fields. However at present being in this section it only applies to the 8 fixed fields. It should also be lifted to §1.6, or a similar sentence needs to be added to §1.6.2 (Genotype fields).

jmarshall · 2015-10-05T13:18:59Z

Re missing value dots, see the conversation that motivated this: samtools/htsjdk#340.

VCFv4.3 - first batch of changes

Introduce the term "unstructured meta-information line", and reword this section so it describes the two flavours of meta-information line clearly. Specify that an unstructured value must not start with `<` (so that structured/unstructured are easily distinguished) and must be non-empty. Remove `<>` from unstructured `##pedigreeDB` example. PR samtools#88 removed the `<>` from one instance of `##pedigreeDB=<url>` presumably on the grounds that they were merely metasyntactic variable notation and not intended to appear literally, but missed this instance.

PR samtools#88 removed the `<>` from `##pedigreeDB=<url>` in VCFv4.3.tex, presumably on the grounds that they were merely metasyntactic variable notation and not intended to appear literally. As some readers still refer to these older documents, remove the misleading notation here too.

…nes (#620) * Clarify structured vs unstructured meta-information lines Introduce the term "unstructured meta-information line", and reword this section so it describes the two flavours of meta-information line clearly. Specify that an unstructured value must not start with `<` (so that structured/unstructured are easily distinguished) and must be non-empty. Remove `<>` from unstructured `##pedigreeDB` example. PR #88 removed the `<>` from one instance of `##pedigreeDB=<url>` presumably on the grounds that they were merely metasyntactic variable notation and not intended to appear literally, but missed this instance. * Remove `<>` from VCF test file `##pedigreeDB` lines The specification now consistently reflects that `##pedigreeDB`'s value should not be delimited by angle brackets (despite being an URL!). Adjust the failed_meta_pedigreedb_002.vcf files as the claimed cause of failure is the invalid URL hostname rather than the angle brackets. * Fix misleading `##pedigreeDB=<url>` notation in older VCF specifications PR #88 removed the `<>` from `##pedigreeDB=<url>` in VCFv4.3.tex, presumably on the grounds that they were merely metasyntactic variable notation and not intended to appear literally. As some readers still refer to these older documents, remove the misleading notation here too. * Mention structured lines that are not defined by the VCF specification

pd3 and others added 18 commits July 28, 2014 16:03

Added VCFv4.3 draft, changes marked in red color

af41ed9

Adding subsection about gVCF files.

2d3f041

Updated gVCF section, added list of changes

4ca3424

Added a gVCF section plus minor edits

efed15f

Forbid duplicate fields, UTF8 the only encoding.

a09e56f

Removed forgotten copy and paste error, resolves #41

af36fb1

Introduced symbolic allele "<*>" in gVCF

4a91745

Add ##META definitions

7789a1b

Include META header lines in the list of changes

250a88c

Clarification: no negative values in GT array except the end-of-vecto…

1080e8a

…r byte

Clarified GL ordering. Resolves #58

66d45d5

Merge branch 'master' into VCFv4.3

ce805dd

Fix #82

aabe553

Fix #82

61d0f01

Fix SVLEN example, resolves #84

672b512

Fix SVLEN examplein VCFv4.3, resolves #84

73482c0

Update minor version number, resolves #63

8f6b7cf

GL ordering made clearer as discussed in #83

d6b46df

Fix inconsistent description of PASS in BCF.

187745f

The implicit filter PASS was described inconsistently throughout BCFv2.1. It is encoded as the first entry in the dictionary, not the last.

pd3 added a commit that referenced this pull request Sep 9, 2015

Clarifications made as requested by #88 (comment)

cb4605d

Removed red coloring from the text, VCFv4.3 is ready to be merged

d3a28e2

pd3 added 3 commits October 10, 2015 13:49

Modified example, coordinates must be ordered

861e235

Move sentences which apply to all data lines from 1.6.1 to 1.6

ad14194

Stronger wording about disallowed INT32 values

fe0222c

pd3 added a commit that referenced this pull request Oct 10, 2015

Merge pull request #88 from samtools/VCFv4.3

bd89f29

VCFv4.3 - first batch of changes

pd3 merged commit bd89f29 into master Oct 10, 2015

yfarjoun deleted the VCFv4.3 branch July 22, 2018 01:29

This was referenced Nov 12, 2021

VCF v4.3 ##META header line syntax #558

Open

Other misc. VCF test data queries. #608

Open

jmarshall mentioned this pull request Jan 5, 2022

Clarify description of structured vs unstructured meta-information lines #620

Merged

jmarshall mentioned this pull request May 9, 2022

VCF Header: must Number be before Type? #642

Closed

jmarshall mentioned this pull request Mar 27, 2023

vcf: Contig name rules contradiction #711

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VCFv4.3 - first batch of changes #88

VCFv4.3 - first batch of changes #88

pd3 commented Jun 2, 2015

droazen commented Jun 2, 2015

droazen commented Jun 2, 2015

pgrosu commented Jun 2, 2015

droazen commented Jun 2, 2015

vruano commented Jun 2, 2015

vruano commented Jun 2, 2015

vruano commented Jun 2, 2015

bhandsaker commented Jun 2, 2015

bhandsaker commented Jun 2, 2015

pd3 commented Jun 3, 2015

pd3 commented Jun 3, 2015

reedacartwright commented Aug 30, 2015

pd3 commented Sep 1, 2015

reedacartwright commented Sep 3, 2015

pd3 commented Sep 4, 2015

reedacartwright commented Sep 4, 2015

pd3 commented Sep 9, 2015

reedacartwright commented Sep 9, 2015

pd3 commented Sep 9, 2015

reedacartwright commented Sep 11, 2015

pd3 commented Sep 24, 2015

tk2 commented Sep 28, 2015

jmarshall commented Oct 2, 2015

jmarshall commented Oct 5, 2015

VCFv4.3 - first batch of changes #88

VCFv4.3 - first batch of changes #88

Conversation

pd3 commented Jun 2, 2015

droazen commented Jun 2, 2015

droazen commented Jun 2, 2015

pgrosu commented Jun 2, 2015

droazen commented Jun 2, 2015

vruano commented Jun 2, 2015

vruano commented Jun 2, 2015

vruano commented Jun 2, 2015

bhandsaker commented Jun 2, 2015

bhandsaker commented Jun 2, 2015

pd3 commented Jun 3, 2015

pd3 commented Jun 3, 2015

reedacartwright commented Aug 30, 2015

pd3 commented Sep 1, 2015

reedacartwright commented Sep 3, 2015

pd3 commented Sep 4, 2015

reedacartwright commented Sep 4, 2015

pd3 commented Sep 9, 2015

reedacartwright commented Sep 9, 2015

pd3 commented Sep 9, 2015

reedacartwright commented Sep 11, 2015

pd3 commented Sep 24, 2015

tk2 commented Sep 28, 2015

jmarshall commented Oct 2, 2015

jmarshall commented Oct 5, 2015