VCFv4.3 - first batch of changes #88
- meta-information lines must be key=value pairs (#67)
- an ID attribute is required in structured header lines, unique within its type
- the above point newly requires ID in the reserved PEDIGREE tag
- new reserved AD, ADF, and ADR FORMAT and INFO fields added, resolves #78
- reorder list of INFO and FORMAT tags alphabetically
- removed UNICODE-characters-not-supported sentence from BCF specification, in partial response to #65
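To make the second and third points concrete, here is a minimal sketch of how a validator might check that every structured header line carries an ID and that IDs are unique within their type. The function name `check_header_ids` and both regular expressions are hypothetical, not part of any existing VCF library, and the parsing is deliberately naive: it assumes that no quoted Description contains the text `,ID=`.

```python
import re
from collections import defaultdict

# Hypothetical patterns: a structured line looks like "##key=<...>".
STRUCTURED = re.compile(r"^##([^=]+)=<(.*)>$")
# An ID attribute must start the body or follow a comma (so "NID=" does not match).
ID_FIELD = re.compile(r"(?:^|,)ID=([^,]+)")

def check_header_ids(lines):
    """Return a list of problems: missing IDs, or IDs repeated within one line type."""
    seen = defaultdict(set)
    problems = []
    for line in lines:
        m = STRUCTURED.match(line)
        if not m:
            continue  # unstructured lines such as ##fileformat=... carry no ID
        key, body = m.groups()
        id_match = ID_FIELD.search(body)
        if id_match is None:
            problems.append(f"{key}: missing ID")
            continue
        ident = id_match.group(1)
        if ident in seen[key]:
            problems.append(f"{key}: duplicate ID {ident}")
        seen[key].add(ident)
    return problems
```

Note that the same ID (for example AD) may appear once as an INFO field and once as a FORMAT field, because uniqueness is only required within a type.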
Please note that this pull request was opened at my request to provide a centralized location for discussion of the existing and proposed changes in VCF 4.3. It should NOT be merged yet, as there are still a few open issues under consideration. |
Here is a link to a PDF version of the current VCF 4.3 draft (built from this branch), to make it easier to see what's changed (changes are highlighted in red): |
Thanks @droazen - much easier to read :) |
Comments on changes in the existing draft should go directly in this thread. Comments on open issues ( vcf ) not yet incorporated into the draft should go on the page for each respective issue. |
Why is GP on a 0 to 1 scale? That does not seem practical when you are dealing with very small probabilities, and we already have a mixture of log10 and Phred-scaled annotations. |
In 2.1 VCF Tag naming conventions... I think it is a bit too absolutist to say that the "X" suffix means "Blah". I think the standard should be flexible and allow users to, for example, have some annotations that end in "L" that are not likelihoods. I would change the three points to: "If you mean 'Blah', the annotation name should (or must?) end with 'X'". So "The 'L' suffix means 'Likelihoods'" -> "The names of annotations containing likelihoods should end with the 'L' suffix" |
As with GP, why is CNP on a 0 to 1 scale? |
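To illustrate the concern about small probabilities, here is a small sketch of converting between the linear, log10 and Phred scales. The helper names are made up for this example and nothing here is prescribed by the specification; it only shows that a value which vanishes when written with a few decimal places remains easy to write on the Phred scale.

```python
import math

def prob_to_phred(p):
    """Phred scale: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p)

def phred_to_prob(q):
    """Inverse of prob_to_phred."""
    return 10.0 ** (-q / 10.0)

def prob_to_log10(p):
    """log10 scale, as used by some existing annotations."""
    return math.log10(p)

small = 1e-8
linear_text = f"{small:.4f}"   # fixed-point text with 4 decimals loses the value
phred = prob_to_phred(small)   # roughly 80 on the Phred scale
```

Whether this matters in practice depends on how writers format floating-point values; scientific notation such as `1e-8` avoids the loss on the linear scale as well.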
Section 5.4.11 needs to be updated so that the #PEDIGREE lines have IDs as required by the changes to section 1.2.10. |
Section 1.4.1 (subsection 5 - ALT) seems inconsistent with new section 5.5 as to whether the "unspecified" allele is represented as "*" or as "<*>". I would also suggest adding to section 1.4.1, subsection 5, that "The '*' allele is reserved to indicate that the allele is missing due to an overlapping deletion or some other variant incompatible with the listed alleles." The overlapping deletion does not necessarily need to be upstream, although this is the easiest case to visualize. Consider two offset deletions of several Kb each. Both the first and second deletion records may use '*' to represent a missing allele, since the situation is symmetrical, yet for the first record the conflicting allele is downstream (i.e. appears later in the file). It is also possible that the conflicting variants are not just deletions. For example, two overlapping inversions might also use the '*' pseudo-allele to indicate that an allele is unspecified, missing, or described in a different VCF record. |
@bhandsaker The @bhandsaker Good point about the |
I'm using the PEDIGREE tags and find that IDs are important as they allow me to specify where in a pedigree a mutation occurred:
These INFO tags describe that the predicted de novo type is The only issue with the PEDIGREE tags is that I cannot store additional information. According to the specs, everything other than ID describes an ancestor. In order to support additional pedigree information it would be useful to have another tag, e.g:
Where NID is used in PEDIGREEDATA so there can be multiple lines that contain data for the same node in the pedigree. |
Could the existing SAMPLE field be used for this perhaps? |
I'm not sure if SAMPLE is the proper field to use. My pedigree involves a mix of both germline meiotic events and somatic mitotic events. Not every node in my pedigree represents a sample. Some represent libraries (since we support technical replicates of the same sample). Others represent inferred ancestral states that occurred during family history or somatic development. For me, conceptually, the SAMPLE field provides information about the biological samples that were sequenced, whereas the data I need to store is about the analysis that was done. E.g. we had 2 libraries for this sample and we assumed that the error rate due to sequencing was 1e-8, etc. I've already extended the "sample" columns in the vcf to output information about my nodes, including inferred ancestral states and library-specific depths. I have no issues reusing the SAMPLE field as well, but would like to propose some changes to the wording of the VCF format to support a broader concept of what a sample is. If that seems reasonable, I will work on a proposal. |
I see what you mean now, you are right that SAMPLE is not right for the job here. However, the specification does not explicitly say that everything other than ID refers to the ancestor, it only lists some examples. How about using the META field also for PEDIGREE and have something like this:
The PEDIGREEDATA field is a valid solution as well, only the NID attribute would have to become ID. |
I think the However, I would be concerned about the ordering of rates in your example in case some conversion decided to swap the order of Father and Mother without changing the MR pattern. Is there a guarantee in the spec that the attributes can be retrieved in the order written and that the order will be maintained if converted between formats?
|
You are making a good point about the order of attributes; the specification is silent about this. How about splitting the field into FatherMR and MotherMR? I am afraid it was agreed that ID is mandatory, so the original proposal with NID cannot be accepted as is. |
The implicit filter PASS was described inconsistently throughout BCFv2.1. It is encoded as the first entry in the dictionary, not the last.
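As a rough illustration of what "first entry in the dictionary" means, the sketch below builds a string dictionary from FILTER, INFO and FORMAT header lines with PASS fixed at index 0. The function name and regular expression are hypothetical, the sketch assumes ID is the first attribute on each line, and it ignores any explicit IDX attributes, so treat it as a simplification rather than a faithful BCF implementation.

```python
import re

# Hypothetical pattern: assumes ID is the first attribute of the structured line.
TAG_LINE = re.compile(r"^##(FILTER|INFO|FORMAT)=<ID=([^,>]+)")

def build_string_dictionary(header_lines):
    """Map each FILTER/INFO/FORMAT ID to an integer, with PASS implicitly at 0."""
    index = {"PASS": 0}  # PASS comes first even if no FILTER line declares it
    for line in header_lines:
        m = TAG_LINE.match(line)
        if m and m.group(2) not in index:
            # IDs shared between line types (e.g. INFO and FORMAT) share one entry
            index[m.group(2)] = len(index)
    return index
```

With this layout an explicit `##FILTER=<ID=PASS,...>` line later in the header does not move PASS away from index 0.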
Yes, we could use separate We could also use
|
Yeah, both are possible. The degree of verbosity depends on the number of samples in the study. If there are many samples, the FatherMR and MotherMR are defined only once with the META solution. Also no new tag is required
|
I will give it a go using your META solution. |
OK, it seems we are ready to merge the changes. If there are no further objections, I'll do it tomorrow. |
@droazen as the GATK rep - are you in agreement with merging? |
In §1.3 (Data types),
It's not too good to have things that are representable in one format but not the other. So we should say that −2^31 to −2^31 + 7 are disallowed in both VCF and BCF. In §1.6.1 (Fixed fields, §1.4.1 in previous spec versions),
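The proposed rule can be sketched as a simple range check. The constant and function names below are made up for illustration; the only facts taken from the comment above are that the eight values from −2^31 to −2^31 + 7 would be disallowed, and the sketch assumes the upper bound is the usual 32-bit signed maximum.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1
RESERVED_COUNT = 8  # the eight lowest 32-bit values, -2^31 .. -2^31 + 7

def is_allowed_integer(n):
    """True if n fits in 32 bits and avoids the eight reserved lowest values."""
    return INT32_MIN + RESERVED_COUNT <= n <= INT32_MAX
```

A writer applying the same check for both VCF and BCF output would never produce a value that one format can represent and the other cannot.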
The first sentence is about fixed fields. The next two sentences (re tabs and lines) are about the whole line, so should be lifted to §1.6 (Data lines). The final sentence (re missing value dots (‘.’)) is hopefully intended to apply to all fields. However at present being in this section it only applies to the 8 fixed fields. It should also be lifted to §1.6, or a similar sentence needs to be added to §1.6.2 (Genotype fields). |
Re missing value dots, see the conversation that motivated this: samtools/htsjdk#340. |
Introduce the term "unstructured meta-information line", and reword this section so it describes the two flavours of meta-information line clearly. Specify that an unstructured value must not start with `<` (so that structured/unstructured are easily distinguished) and must be non-empty. Remove `<>` from unstructured `##pedigreeDB` example. PR samtools#88 removed the `<>` from one instance of `##pedigreeDB=<url>` presumably on the grounds that they were merely metasyntactic variable notation and not intended to appear literally, but missed this instance.
PR samtools#88 removed the `<>` from `##pedigreeDB=<url>` in VCFv4.3.tex, presumably on the grounds that they were merely metasyntactic variable notation and not intended to appear literally. As some readers still refer to these older documents, remove the misleading notation here too.
…nes (#620)
* Clarify structured vs unstructured meta-information lines
  Introduce the term "unstructured meta-information line", and reword this section so it describes the two flavours of meta-information line clearly. Specify that an unstructured value must not start with `<` (so that structured/unstructured are easily distinguished) and must be non-empty. Remove `<>` from unstructured `##pedigreeDB` example. PR #88 removed the `<>` from one instance of `##pedigreeDB=<url>` presumably on the grounds that they were merely metasyntactic variable notation and not intended to appear literally, but missed this instance.
* Remove `<>` from VCF test file `##pedigreeDB` lines
  The specification now consistently reflects that `##pedigreeDB`'s value should not be delimited by angle brackets (despite being an URL!). Adjust the failed_meta_pedigreedb_002.vcf files as the claimed cause of failure is the invalid URL hostname rather than the angle brackets.
* Fix misleading `##pedigreeDB=<url>` notation in older VCF specifications
  PR #88 removed the `<>` from `##pedigreeDB=<url>` in VCFv4.3.tex, presumably on the grounds that they were merely metasyntactic variable notation and not intended to appear literally. As some readers still refer to these older documents, remove the misleading notation here too.
* Mention structured lines that are not defined by the VCF specification
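The structured/unstructured distinction described in these commits can be sketched as a small classifier. The function name `classify_meta_line` is hypothetical and the checks are limited to what the commit message states: a line is key=value, an unstructured value is non-empty and does not start with `<`. The requirement that a structured value also ends with `>` is an assumption of this sketch.

```python
def classify_meta_line(line):
    """Return 'structured' or 'unstructured' for a ##key=value meta-information line."""
    if not line.startswith("##"):
        raise ValueError("meta-information lines start with ##")
    key, sep, value = line[2:].partition("=")
    if not sep or not key:
        raise ValueError("meta-information lines must be key=value pairs")
    if value == "":
        raise ValueError("value must be non-empty")
    if value.startswith("<"):
        # Assumption of this sketch: a structured value is wrapped in <...>
        if not value.endswith(">"):
            raise ValueError("structured value must end with >")
        return "structured"
    return "unstructured"
```

Under this reading, `##pedigreeDB=<url>` written literally with angle brackets would be treated as structured, which is why the notation was considered misleading.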
Please see also the list of open issues here:
vcf