-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VCF Header: must Number be before Type? #642
Comments
It feels to me like HTSJDK is over-validating here. Does anyone really care whether it's:
or
All recent versions of the spec have language like this:
I.e. they provide an example of an INFO header line and then refer to "first four keys", perhaps implying ordering, but I wouldn't read it that way. Input from the spec maintainers would be greatly appreciated. |
This particular exception message (“Tag … in wrong order”) was added fairly recently in samtools/htsjdk@93250d5, but this only refactored an existing check that previously produced a more vague failure message. That check has existed since VCF parsing was added to htsjdk in 2013 in samtools/htsjdk@f411234; earlier details would only be in some other repository perhaps internal to the Broad. The VCF spec does not say anything explicit about the ordering of these fields. As @tfenne notes, the relevant current text in §1.4 is
and in the relevant subsection of that:
The latter text is essentially unchanged from the original wiki text that would have been current when this htsjdk check was first implemented; the former was added in PR #88. PR #620 is a step towards clarifying this text, but does not address ordering. The current spec does contain the following for
with
but this is misleading: the required fields for A similar question was asked about SAM header tag ordering a year ago (#571). No major implementation has enforced any particular ordering for these, so we were able to clarify that ordering here is immaterial with impunity (PR #572):
|
I'll add my vote for the fields being order-less. Many years ago I attempted to use SAM with multiple copies of the same aux tag type and met breakage from various tools that loaded them into a hash table. The spec at the time did not forbid this, so I (wrongly) assumed it was an appropriate thing to do. As I recall it was then clarified. I don't know if VCF has addressed this issue or not (either in INFO/FORMAT fields or in header lines), but if not then suggest it's something to clarify at the same time. |
I'll also add a vote in favor of clarifying in the spec that the tags shouldn't be ordered -- I don't really think the order of the keys should matter in a list of named key-value pairs (I intuitively want to think of it as a map data structure when I look at the string representation). The VCF file in question likely had its origin in an early version version of GATK-SV (possibly from even before it was named GATK-SV) but we changed the behavior a while back to put the tags in the traditionally expected order. |
The VCF parts of the test suite were inherited from EBIvariation/vcf-validator, which explains why it contains no passing test files with these subfields ordered differently:— Rather disappointingly, EBI's VCF Validator has also taken the specification text to mean “subfields are ordered as shown in the one (1) example”:
|
FWIW here's my interpretation of the specification text and my opinion: What the text actually says is
The words in bold are the only ones that explicitly involve ordering. (1)'s “after” only defines the ordering of the extra fields vs the default fields; it says nothing about the ordering within the default fields. (2)'s “first” just identifies which fields in the following verbatim example line are the ones that are required; it says nothing (IMHO) about the ordering within those first four fields. Thus the specification does not explicitly say anything about the ordering within these four required fields, so the conservative interpretation is to assume that any ordering is valid. Moreover, in VCF:
Thus to my mind it is clear that the spirit of VCF is not to impose constraints, such as ordering constraints, where they are not naturally required. So in the absence of explicit text specifying the ordering within these four required fields, IMHO the natural and conservative assumption is that these four fields' ordering is similarly unconstrained. (In the light of these other parts of VCF, interpreting “…described as follows (first four keys are required…” as specifying that the first four keys are ordered as shown in the description is IMHO a rather astonishing overreach.) All the examples show Hence IMHO the spec should be updated to say one of the following:
As a matter of practicality, we should also add a non-normative footnote mentioning that HTSJDK ≤ x.y.z required the first four INFO (et al) subfields to appear in the order in which they are shown in the example in §1.4.2. (And also raise HTSJDK and VCF Validator issues!) |
I agree with the proposed change, however playing devil's advocate, would you take issue with someone saying?
I agree it's perhaps a bit of an over reach within the language of the specification, but being generous I can see how someone may interpret it in another way. Basically it's an ambiguous statement. The flip side is I don't even see why someone would bother to validate this anyway even if they believed the specification implied it. It feels rather over-zealous pedantry given there is nothing detrimental in accepting the data as-is. |
I too agree that the ordering shouldn't matter but also agree that specs are unclear and lean towards an interpretation that does require a fixed header ordering. I'd like to resolve this with a PR that does the following:
Any objections to including this as part of the 4.4 feedback? Does anyone feel like this is something we need to push into 4.3 as well? |
I think we should expand this to say the ordering of any subfields conveys no meaning. This includes I'd like to say the ordering doesn't matter at all but there's the practical consideration that in every implementation the order of the |
You're mixing two things there: the order of fields within a record and the order of the records themselves. The order of the contig records does matter and is defined in the spec, at least for the BCF part, as it replaces the string with a number indicating the Nth contig line. As for the order within fields, why do you feel requiring them to be alphabetical is a good idea? It seems totally unintuitive to me. One hand hand you're saying the order should confer no meaning, and then dictating that a specific order must be obeyed during writes. This adds extra complication which by definition serves zero purpose. |
@d-cameron I agree with your points (1) and (3) - allowing any order and disallowing duplicate keys. I think (2) would be better served by providing recommendations rather than requirements, and I think a more logical ordering than alphabetical would be preferable. For example I would recommend:
|
If we accept that the order of the fields convey no meaning, is there any point in requiring any ordering at all? |
The reason I proposed an ordering was for round-trip stability. Different tools will output headers in different order which makes it annoying to diff a VCF (in my particular case it loading/saving a VCF in htsjdk and BioConductor/VariantAnnotation). I guess we could do something like the SAM section 2 recommendation in which a particular convention is preferred, but not required by the specifications. |
Maybe, but that would need a required ordering for INFO and FORMAT fields too, or at least a statement that the output must not change the input order. All of this puts unnecessary burdon on the implementation, and may rule out certain efficient implementation methods. So I'm against such things. Instead the right solution IMO is to provide tools that can compare files for equality of content, not equality of byte-streams. Eg htslib has a "compare_sam.pl" test script, used in the test harness. |
I must admit I don't understand the notion that the required fields must come before the optional fields, given the required fields aren't required to be in a specific order. It's not like the other parts of VCF and SAM where you can say "column 4 means this" and "column 5 is that". They're all key=value pairs and can swap around the columns, so practically speaking I would expect implementations to simply load them into a hash or if they're just searching for one item will do a linear scan. That said, it's already in the spec and something is probably validating them so I don't really feel an urgency to change that. I do however think it's unnecessary to try and tighten it further. What probably does need adding is explicit denial of repeated keys; eg appending another description when one already exists may naively feel like a reasonable thing to do, but it obviously breaks any implementation storing key=val in a (basic) hash table. |
I think I suggested that, so I'll respond. I don't have strong feelings about this, and would be quite happy for the spec to be clarified on the two points we've all agreed on (repeated keys are not allowed, keys can be in any order). The reason I suggested a recommended ordering is that VCFs are often still used as human-readable files, and I know I look at them by eye fairly frequently. When doing so it is generally useful to have ID/Number/Type before the other keys, with ID first. I don't think that should be required, but I would have a general preference for recommending it to make it easier on human readers. |
We should raise unrelated issues — such as Also, as a practical matter, I would recommend the VCF maintainers complete their review of PR #620 and merge it before starting to draft changes to address this subfield ordering issue, which will affect the same text. The core question here is how the VCF specification should rule on whether differing orders of the four required Apparently in practice just about everybody writing out VCF files does follow the ordering shown in the spec examples and prior extant VCF files, because it has taken about 10 years for this issue to be reported! At the heart of it, the VCF maintainers have two choices:
It seems that the consensus is towards (1), the “elegant” approach. Even with @d-cameron's ideas about requiring/recommending a particular ordering when writing, he agrees about accepting arbitrary orderings (at least, of these first four) when reading — so even that proposal has the consequences that HTSJDK must change and that files taking advantage of the arbitrary orderings will fail with older HTSJDK versions. The wider question is whether to embellish this decision with a recommended ordering. I agree with @d-cameron's point about round-trip stability being desirable, but also with @jkbonfield's point about allowing freedom of implementation choices (i.e., that this is naturally an unordered hash table!). IMHO round-trip stability is desirable and one thing to weigh up when implementing, and best left as a stated or unstated “quality of implementation” issue. (Note that round-trip stability can be provided by e.g. recommending filters to “preserve the same order as in the input record” or by recommending a particular order.) I also The VCF spec uses the shouty MAY/SHOULD/MUST RFC 2119 terminology, so recommending an ordering here would be a candidate for using SHOULD (at most) rather than MUST. |
Since this clarification was not backported to VCF 4.3, does it still stand that the required fields are ordered in VCF < 4.4? It is still a failing test case in 4.3: failed_meta_info_003.vcf. |
In my opinion, the 4.1 through 4.3 text is most naturally interpreted as not requiring the particular field ordering in question — see #642 (comment). IMHO that was also the consensus amongst the maintainers. Alas, we have inherited a test case that believes otherwise (see #642 (comment)). So we should adjust {4.1,4.2,4.3}/failed/failed_meta_info_003.vcf so that they pass instead, or possibly just remove these test cases. The 4.4 text was clarified, but the 4.3 and prior texts remain arguably unclear. IMHO we should indeed backport this clarification to the spec documents for all active VCF versions (which means 4.3 and probably 4.2 too). |
Background
We were trying to process a VCF with htsjdk from here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr20.filtered.SNV_INDEL_SV_phased_panel.vcf.gz
And I got the following exception:
This is likely a result of gatk-sv (see this issue).
The offending line in the header is:
##INFO=<ID=END2,Type=Integer,Number=1,Description="Position of breakpoint on CHR2">
As you can see the
Type
is described before theNumber
. Is this allowed by the spec?cc: @tfenne
The text was updated successfully, but these errors were encountered: