VCF Header: must Number be before Type? #642

Closed
nh13 opened this issue May 6, 2022 · 19 comments

@nh13
Member

nh13 commented May 6, 2022

Background

We were trying to process a VCF with htsjdk from here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr20.filtered.SNV_INDEL_SV_phased_panel.vcf.gz

And I got the following exception:

Exception in thread "main" htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: Your input file has a malformed header: Tag Type in wrong order (was #2, expected #3) in line <ID=END2,Type=Integer,Number=1,Description="Position of breakpoint on CHR2">, for input source: file:///tmp/1kGP_high_coverage_Illumina.chr20.filtered.SNV_INDEL_SV_phased_panel.vcf.gz
	at htsjdk.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:264)
	at htsjdk.tribble.TribbleIndexedFeatureReader.<init>(TribbleIndexedFeatureReader.java:103)
	at htsjdk.tribble.TribbleIndexedFeatureReader.<init>(TribbleIndexedFeatureReader.java:128)
	at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:121)
	at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:81)
        ...
Caused by: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Tag Type in wrong order (was #2, expected #3) in line <ID=END2,Type=Integer,Number=1,Description="Position of breakpoint on CHR2">
	at htsjdk.variant.vcf.VCF4Parser.parseLine(VCFHeaderLineTranslator.java:172)
	at htsjdk.variant.vcf.VCFHeaderLineTranslator.parseLine(VCFHeaderLineTranslator.java:58)
	at htsjdk.variant.vcf.VCFCompoundHeaderLine.<init>(VCFCompoundHeaderLine.java:215)
	at htsjdk.variant.vcf.VCFInfoHeaderLine.<init>(VCFInfoHeaderLine.java:56)
	at htsjdk.variant.vcf.AbstractVCFCodec.parseHeaderFromLines(AbstractVCFCodec.java:192)
	at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:111)
	at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:79)
	at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:37)
	at htsjdk.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:262)

This is likely a result of gatk-sv (see this issue).

The offending line in the header is:

##INFO=<ID=END2,Type=Integer,Number=1,Description="Position of breakpoint on CHR2">

As you can see the Type is described before the Number. Is this allowed by the spec?

cc: @tfenne

@tfenne
Member

tfenne commented May 8, 2022

It feels to me like HTSJDK is over-validating here. Does anyone really care whether it's:

##INFO=<ID=END2,Type=Integer,Number=1,Description="Position of breakpoint on CHR2">

or

##INFO=<ID=END2,Number=1,Type=Integer,Description="Position of breakpoint on CHR2">
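
To make the comparison concrete, here is a minimal Python sketch (not from the thread, and not how HTSJDK parses headers) that reads a structured header line into a key-to-value mapping. The function name `parse_structured_line` and the regular expression are illustrative assumptions; the pattern handles plain and double-quoted values but not bracketed lists such as the `Values=[...]` form seen on ##META lines.

```python
import re

# One key=value pair: the value is either a double-quoted string
# (which may contain commas) or a run of characters up to the next comma.
_PAIR = re.compile(r'([A-Za-z_][A-Za-z0-9_.]*)=("(?:[^"\\]|\\.)*"|[^,]*)')

def parse_structured_line(line):
    """Return (header_kind, {key: value}) for a line like ##INFO=<...>."""
    m = re.fullmatch(r'##(\w+)=<(.*)>', line.strip())
    if m is None:
        raise ValueError(f"not a structured header line: {line!r}")
    kind, body = m.groups()
    fields = {key: value.strip('"') for key, value in _PAIR.findall(body)}
    return kind, fields

type_first = '##INFO=<ID=END2,Type=Integer,Number=1,Description="Position of breakpoint on CHR2">'
number_first = '##INFO=<ID=END2,Number=1,Type=Integer,Description="Position of breakpoint on CHR2">'

# Dict equality ignores insertion order, so both spellings compare equal.
same = parse_structured_line(type_first) == parse_structured_line(number_first)
print(same)
```

Under this reading the two lines carry identical information; only a parser that tracks key positions can tell them apart.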

All recent versions of the spec have language like this:

INFO fields are described as follows (first four keys are required, source and version are recommended):
##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="source",Version="version">

I.e. they provide an example of an INFO header line and then refer to "first four keys", perhaps implying ordering, but I wouldn't read it that way.

Input from the spec maintainers would be greatly appreciated.

@jmarshall
Member

jmarshall commented May 9, 2022

This particular exception message (“Tag … in wrong order”) was added fairly recently in samtools/htsjdk@93250d5, but this only refactored an existing check that previously produced a more vague failure message. That check has existed since VCF parsing was added to htsjdk in 2013 in samtools/htsjdk@f411234; earlier details would only be in some other repository perhaps internal to the Broad.

The VCF spec does not say anything explicit about the ordering of these fields. As @tfenne notes, the relevant current text in §1.4 is

For all of the structured lines (##INFO, ##FORMAT, ##FILTER, etc.), extra fields can be included after the default fields. For example:

##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="description",Version="128">

In the above example, the extra fields of “Source” and “Version” are provided.

and in the relevant subsection of that:

INFO fields are described as follows (first four keys are required, source and version are recommended):

##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="source",Version="version">

The latter text is essentially unchanged from the original wiki text that would have been current when this htsjdk check was first implemented; the former was added in PR #88. PR #620 is a step towards clarifying this text, but does not address ordering.

The current spec does contain the following for ##META headers:

It is possible to define sample to genome mappings as shown below:

##META=<ID=Assay,Type=String,Number=.,Values=[WholeGenome, Exome]>

with Type and Number in the opposite order. (However that's all it says about ##META and nobody knows what keys are “required” here or what this is actually intended to mean… See also #88 (comment) and the following conversation for the genesis of ##META, including a conclusion that the ordering of these subfields is not specified in general.)

The test suite in test/vcf does not exercise ##INFO or other structured headers with different subfield orderings. [Edit: Sigh… In fact there are test/vcf/4.*/failed/failed_meta_info_003.vcf.] There is test/vcf/4.*/failed/failed_meta_alt_004.vcf:

##CauseOfFailure=Incorrect order of ALT fields
##ALT=<ID=DEL,Type=String,Number=1,Description="Deletion">

but this is misleading: the required fields for ##ALT are ID and Description, so the cause of failure is that two coincidentally named (“Type” and “Number”) extra fields are violating the “extra fields can be included after the default [a.k.a. required] fields” text.

A similar question was asked about SAM header tag ordering a year ago (#571). No major implementation has enforced any particular ordering for these, so we were able to clarify that ordering here is immaterial with impunity (PR #572):

The order of the [SAM] header tagged fields is not significant. Whatever order they appear in, the meaning is the same; and any order for the tagged fields is valid. The specification (§1.3, “The header section”) currently does not say anything about this (hence you can assume that this means no particular ordering is invalid).

[from this comment]

@jkbonfield
Contributor

I'll add my vote for the fields being order-less.

Many years ago I attempted to use SAM with multiple copies of the same aux tag type and met breakage from various tools that loaded them into a hash table. The spec at the time did not forbid this, so I (wrongly) assumed it was an appropriate thing to do. As I recall it was then clarified.

I don't know if VCF has addressed this issue or not (either in INFO/FORMAT fields or in header lines), but if not then I suggest it's something to clarify at the same time.

@cwhelan

cwhelan commented May 9, 2022

I'll also add a vote in favor of clarifying in the spec that the tags shouldn't be ordered -- I don't really think the order of the keys should matter in a list of named key-value pairs (I intuitively want to think of it as a map data structure when I look at the string representation).

The VCF file in question likely had its origin in an early version of GATK-SV (possibly from even before it was named GATK-SV), but we changed the behavior a while back to put the tags in the traditionally expected order.

@jmarshall
Member

jmarshall commented May 9, 2022

The test suite in test/vcf does not exercise ##INFO or other structured headers with different subfield orderings.

The VCF parts of the test suite were inherited from EBIvariation/vcf-validator, which explains why it contains no passing test files with these subfields ordered differently:—

Rather disappointingly, EBI's VCF Validator has also taken the specification text to mean “subfields are ordered as shown in the one (1) example”:

meta_info = 'ID=' %meta_id identifier $err(meta_id_err)
              ',Number=' %meta_number meta_field_num $err(meta_info_number_err)
              ',Type=' %meta_type meta_field_type $err(meta_info_type_err)
              ',Description=' %meta_description '"' meta_field_desc '"' $err(meta_desc_err)
              (',' identifier $err(meta_id_err) '="' meta_field_desc '"' $err(meta_desc_err))* ;

https://github.com/EBIvariation/vcf-validator/blob/3e1bd9b4500e84ef9e7c67aa89dc0f9469775d21/src/vcf/vcf_v43.ragel#L113-L117
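
As a rough illustration of what a fixed-order rule like the one above does in practice, the following Python sketch approximates it with a regular expression. This is a hypothetical simplification written for this thread, not a translation of the Ragel grammar: the character classes are looser than the real `identifier` and `meta_field_*` rules, and the names `STRICT_INFO` and `accepted_by_strict_order` are made up.

```python
import re

# Hypothetical regex approximation of a fixed-order INFO rule:
# ID, Number, Type, Description, then optional quoted extra fields.
STRICT_INFO = re.compile(
    r'##INFO=<ID=[^,]+'
    r',Number=[^,]+'
    r',Type=[^,]+'
    r',Description="[^"]*"'
    r'(?:,[^,=]+="[^"]*")*>'
)

def accepted_by_strict_order(line):
    """True if the line matches the fixed ID/Number/Type/Description order."""
    return STRICT_INFO.fullmatch(line) is not None

number_first = '##INFO=<ID=END2,Number=1,Type=Integer,Description="Position of breakpoint on CHR2">'
type_first = '##INFO=<ID=END2,Type=Integer,Number=1,Description="Position of breakpoint on CHR2">'

print(accepted_by_strict_order(number_first))
print(accepted_by_strict_order(type_first))
```

Because each key is matched at a fixed position, the Type-before-Number line from the original report fails even though every required key is present.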

@jmarshall
Member

FWIW here's my interpretation of the specification text and my opinion:

What the text actually says is

  1. “Extra fields” (Source, Version, …) can be included after the “default fields” (ID, Number, Type, and Description).
  2. First four keys (namely ID, Number, Type, and Description) are required

The words “after” in (1) and “First” in (2) are the only ones that explicitly involve ordering. (1)'s “after” only defines the ordering of the extra fields vs the default fields; it says nothing about the ordering within the default fields. (2)'s “first” just identifies which fields in the following verbatim example line are the ones that are required; it says nothing (IMHO) about the ordering within those first four fields.

Thus the specification does not explicitly say anything about the ordering within these four required fields, so the conservative interpretation is to assume that any ordering is valid.

Moreover, in VCF:

  • The ##contig|INFO|FILTER|FORMAT|etc headers may appear in any order (the only constraint is that ##fileformat must be the first line) and different categories may even be interspersed.
  • I think it's generally accepted that different subsets of INFO subfields will appear on different variant records, and that the ordering of the INFO subfields on a line conveys no meaning and may vary between records.
  • For each record line, the FORMAT field specifies the subset of genotype fields that appear on that line and the ordering in which they appear on that line. The only constraint is that GT (if present) must appear first. I think it's generally accepted that this ordering also conveys no meaning and may vary between records.
  • The (underspecified) ##META header (§1.4.8) uses almost the same four subfields, ID/Number/Type/Values (with Values in place of Description), but the examples in the spec show Type/Number the other way around.

Thus to my mind it is clear that the spirit of VCF is not to impose constraints, such as ordering constraints, where they are not naturally required. So in the absence of explicit text specifying the ordering within these four required fields, IMHO the natural and conservative assumption is that these four fields' ordering is similarly unconstrained. (In the light of these other parts of VCF, interpreting “…described as follows (first four keys are required…” as specifying that the first four keys are ordered as shown in the description is IMHO a rather astonishing overreach.)

All the examples show ID=xyz appearing first, and IMHO it might be justifiable to specify that ID should be the first subfield in the header line. OTOH IMHO the restriction that is explicitly written out — that additional “extra” fields must appear after the default “required” fields (the fantastic four in the case of INFO) — is pointless and could be relaxed. I agree with the other responses that structured header lines have the form of a key-value pairs map data structure which is naturally unordered.

Hence IMHO the spec should be updated to say one of the following:

  1. The key=value pairs within the <> on a structured meta-information line may appear in any order
  2. The key=value pairs within the <> on a structured meta-information line may appear in any order, except that all required subfields must appear before any extra subfields.
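
The difference between the two wordings can be sketched as two small Python checks. This is only an illustration under stated assumptions: the function names are hypothetical, the required-key tuple is the INFO case from the discussion above, and the `Source="caller"` value in the second example line is invented.

```python
import re

REQUIRED_INFO_KEYS = ("ID", "Number", "Type", "Description")
_PAIR = re.compile(r'([A-Za-z_][A-Za-z0-9_.]*)=("(?:[^"\\]|\\.)*"|[^,]*)')

def info_keys(line):
    """Return the subfield keys of an ##INFO line in the order written."""
    m = re.fullmatch(r'##INFO=<(.*)>', line.strip())
    if m is None:
        raise ValueError(f"not an ##INFO line: {line!r}")
    return [key for key, _ in _PAIR.findall(m.group(1))]

def valid_any_order(line):
    """Option 1: every required key is present; position is irrelevant."""
    return set(REQUIRED_INFO_KEYS) <= set(info_keys(line))

def valid_required_first(line):
    """Option 2: required keys, in any order, precede all extra keys."""
    head = info_keys(line)[:len(REQUIRED_INFO_KEYS)]
    return set(head) == set(REQUIRED_INFO_KEYS)

swapped = '##INFO=<ID=END2,Type=Integer,Number=1,Description="Position of breakpoint on CHR2">'
# Invented example: an extra field placed among the required ones.
extra_early = '##INFO=<ID=END2,Source="caller",Number=1,Type=Integer,Description="Position of breakpoint on CHR2">'

print(valid_any_order(swapped), valid_required_first(swapped))
print(valid_any_order(extra_early), valid_required_first(extra_early))
```

The line from the original report passes both checks; only a line that places an extra field before a required one separates the two options.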

As a matter of practicality, we should also add a non-normative footnote mentioning that HTSJDK ≤ x.y.z required the first four INFO (et al) subfields to appear in the order in which they are shown in the example in §1.4.2. (And also raise HTSJDK and VCF Validator issues!)

@jkbonfield
Contributor

I agree with the proposed change, however playing devil's advocate, would you take issue with someone saying?

The first four letters of the alphabet are b, a, d and c?
The first four Fibonacci numbers are 1, 2, 3 and 1?

I agree it's perhaps a bit of an overreach within the language of the specification, but being generous I can see how someone may interpret it in another way. Basically it's an ambiguous statement. The flip side is I don't even see why someone would bother to validate this anyway even if they believed the specification implied it. It feels rather over-zealous pedantry given there is nothing detrimental in accepting the data as-is.

@d-cameron
Contributor

I too agree that the ordering shouldn't matter but also agree that specs are unclear and lean towards an interpretation that does require a fixed header ordering. I'd like to resolve this with a PR that does the following:

  1. require parser to accept keys in any order
  2. require parser to write the pre-defined keys first, then extra fields in alphabetical order
  3. disallow duplicated keys within a record (so you can't have 3 IDs in a single row)
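
Point (2) could look roughly like the Python sketch below: pre-defined keys are written first in the order shown in the spec example, then any extra keys alphabetically. The function name is hypothetical, and the choice of which values to wrap in double quotes is an assumption based on the example line quoted earlier in the thread. Taking a dict as input means duplicate keys cannot occur by construction.

```python
PREDEFINED_INFO_KEYS = ("ID", "Number", "Type", "Description")
# Assumption: these values are written in double quotes, as in the
# spec example line quoted earlier in the thread.
QUOTED_KEYS = {"Description", "Source", "Version"}

def format_info_line(fields):
    """Serialize a {key: value} mapping as an ##INFO header line."""
    ordered = [k for k in PREDEFINED_INFO_KEYS if k in fields]
    ordered += sorted(k for k in fields if k not in PREDEFINED_INFO_KEYS)
    parts = []
    for key in ordered:
        value = fields[key]
        if key in QUOTED_KEYS:
            value = '"' + value + '"'
        parts.append(f"{key}={value}")
    return "##INFO=<" + ",".join(parts) + ">"

# Keys deliberately supplied out of order; the Source value is invented.
fields = {
    "Version": "128",
    "Type": "Integer",
    "Source": "example",
    "ID": "END2",
    "Description": "Position of breakpoint on CHR2",
    "Number": "1",
}
print(format_info_line(fields))
```

Whatever order the keys arrive in, the output is the same byte string, which is the round-trip property this kind of rule is aiming for.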

Any objections to including this as part of the 4.4 feedback? Does anyone feel like this is something we need to push into 4.3 as well?

@d-cameron
Contributor

the ordering of the INFO subfields on a line conveys no meaning

I think we should expand this to say the ordering of any subfields conveys no meaning. This includes ## metainformation header, INFO fields, as well as FORMAT fields.

I'd like to say the ordering doesn't matter at all, but there's the practical consideration that in every implementation the order of the ##contig headers defines the sort order of the records (even though the specs don't technically say anything about this).

@jkbonfield
Contributor

You're mixing two things there: the order of fields within a record and the order of the records themselves.

The order of the contig records does matter and is defined in the spec, at least for the BCF part, as it replaces the string with a number indicating the Nth contig line.

As for the order within fields, why do you feel requiring them to be alphabetical is a good idea? It seems totally unintuitive to me. On one hand you're saying the order should confer no meaning, and then dictating that a specific order must be obeyed during writes. This adds extra complication which by definition serves zero purpose.

@tfenne
Member

tfenne commented May 10, 2022

@d-cameron I agree with your points (1) and (3) - allowing any order and disallowing duplicate keys. I think (2) would be better served by providing recommendations rather than requirements, and I think a more logical ordering than alphabetical would be preferable.

For example I would recommend:

  • ID should always be first
  • The required fields per header type should ideally come next (without specifying ordering within them)
  • The optional fields per header type should come after required fields

@tcezard
Contributor

tcezard commented May 10, 2022

If we accept that the order of the fields conveys no meaning, is there any point in requiring any ordering at all?
The required fields should be just that ... required.
Or is it the case that parsers benefit from lazy parsing of the line by ignoring everything after the required fields?

@d-cameron
Contributor

d-cameron commented May 12, 2022

is there any point in requiring any ordering at all

The reason I proposed an ordering was for round-trip stability. Different tools will output headers in different order, which makes it annoying to diff a VCF (in my particular case, loading/saving a VCF in htsjdk and BioConductor/VariantAnnotation). I guess we could do something like the SAM section 2 recommendation in which a particular convention is preferred, but not required by the specifications.

@jkbonfield
Contributor

jkbonfield commented May 13, 2022

Maybe, but that would need a required ordering for INFO and FORMAT fields too, or at least a statement that the output must not change the input order. All of this puts unnecessary burden on the implementation, and may rule out certain efficient implementation methods. So I'm against such things.

Instead the right solution IMO is to provide tools that can compare files for equality of content, not equality of byte-streams. Eg htslib has a "compare_sam.pl" test script, used in the test harness.
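
For VCF headers, a content-level comparison might be sketched in Python as follows. This is an illustration only and is unrelated to how compare_sam.pl works; the function names are hypothetical. It assumes subfield order within a structured line never matters and that line order matters only for ##contig lines, in line with the earlier point that contig order is significant.

```python
import re

_PAIR = re.compile(r'([A-Za-z_][A-Za-z0-9_.]*)=("(?:[^"\\]|\\.)*"|[^,]*)')

def header_content(lines):
    """Reduce header lines to a form that ignores subfield order.

    Returns (unordered_items, contigs); ##contig lines keep their
    relative order, all other lines are compared as a set.
    """
    unordered, contigs = set(), []
    for line in lines:
        line = line.strip()
        m = re.fullmatch(r'##(\w+)=<(.*)>', line)
        # Structured lines become (kind, frozenset of pairs); others stay text.
        item = (m.group(1), frozenset(_PAIR.findall(m.group(2)))) if m else line
        if m and m.group(1) == "contig":
            contigs.append(item)
        else:
            unordered.add(item)
    return unordered, contigs

def same_header_content(a, b):
    return header_content(a) == header_content(b)

header_a = [
    '##fileformat=VCFv4.3',
    '##INFO=<ID=END2,Type=Integer,Number=1,Description="Position of breakpoint on CHR2">',
    '##contig=<ID=chr1>',
    '##contig=<ID=chr2>',
]
header_b = [
    '##fileformat=VCFv4.3',
    '##contig=<ID=chr1>',
    '##INFO=<ID=END2,Number=1,Type=Integer,Description="Position of breakpoint on CHR2">',
    '##contig=<ID=chr2>',
]
print(same_header_content(header_a, header_b))
```

The two example headers differ as byte streams but compare equal here, whereas swapping the two ##contig lines would make them differ.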

@jkbonfield
Contributor

If we accept that the order of the fields conveys no meaning, is there any point in requiring any ordering at all? The required fields should be just that ... required. Or is it the case that parsers benefit from lazy parsing of the line by ignoring everything after the required fields?

I must admit I don't understand the notion that the required fields must come before the optional fields, given the required fields aren't required to be in a specific order. It's not like the other parts of VCF and SAM where you can say "column 4 means this" and "column 5 is that". They're all key=value pairs and can swap around the columns, so practically speaking I would expect implementations to simply load them into a hash or if they're just searching for one item will do a linear scan.

That said, it's already in the spec and something is probably validating them so I don't really feel an urgency to change that. I do however think it's unnecessary to try and tighten it further. What probably does need adding is explicit denial of repeated keys; eg appending another description when one already exists may naively feel like a reasonable thing to do, but it obviously breaks any implementation storing key=val in a (basic) hash table.
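
The hash-table breakage described above is easy to reproduce. In this hypothetical Python sketch (the example line and the `duplicate_keys` helper are made up for illustration), loading the pairs straight into a dict silently keeps only the last Description, while counting keys first exposes the repeat.

```python
import re
from collections import Counter

_PAIR = re.compile(r'([A-Za-z_][A-Za-z0-9_.]*)=("(?:[^"\\]|\\.)*"|[^,]*)')

def subfield_pairs(line):
    """Return the (key, value) pairs of a structured line, in order."""
    m = re.fullmatch(r'##(\w+)=<(.*)>', line.strip())
    if m is None:
        raise ValueError(f"not a structured header line: {line!r}")
    return _PAIR.findall(m.group(2))

def duplicate_keys(line):
    """Return the sorted list of keys that occur more than once."""
    counts = Counter(key for key, _ in subfield_pairs(line))
    return sorted(key for key, n in counts.items() if n > 1)

# Invented example with a second Description appended.
line = '##INFO=<ID=END2,Number=1,Type=Integer,Description="first",Description="second">'

naive = dict(subfield_pairs(line))   # later value overwrites the earlier one
print(naive["Description"])
print(duplicate_keys(line))
```

A parser that wants to reject such lines has to check for repeats before, or while, building the map.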

@tfenne
Member

tfenne commented May 13, 2022

I must admit I don't understand the notion that the required fields must come before the optional fields

I think I suggested that, so I'll respond. I don't have strong feelings about this, and would be quite happy for the spec to be clarified on the two points we've all agreed on (repeated keys are not allowed, keys can be in any order). The reason I suggested a recommended ordering is that VCFs are often still used as human-readable files, and I know I look at them by eye fairly frequently. When doing so it is generally useful to have ID/Number/Type before the other keys, with ID first. I don't think that should be required, but I would have a general preference for recommending it to make it easier on human readers.

@jmarshall
Member

jmarshall commented May 24, 2022

We should raise unrelated issues — such as ##contig ordering defining sort order, and disallowing duplicate subfield keys within a structured meta-information line (though an eventual PR may wish to address that one at the same time) — as separate issues, so that the discussion here doesn't get distracted.

Also, as a practical matter, I would recommend the VCF maintainers complete their review of PR #620 and merge it before starting to draft changes to address this subfield ordering issue, which will affect the same text.


The core question here is how the VCF specification should rule on whether differing orders of the four required ##INFO subfields should be allowed. HTSJDK has always enforced the ID/ Number/ Type/ Description order, but the OP has encountered a tool that has produced a VCF file with a different ordering that works fine with bcftools but is rejected by HTSJDK.

Apparently in practice just about everybody writing out VCF files does follow the ordering shown in the spec examples and prior extant VCF files, because it has taken about 10 years for this issue to be reported!

At the heart of it, the VCF maintainers have two choices:

  1. (The “elegant” choice.) Clarify that arbitrary orderings of ID/Number/Type/Description are acceptable. This would require a change to HTSJDK to relax this check (and similarly to EBI's VCF Validator), endorse the existing HTSlib behaviour, and bless any VCF files using a not-per-the-spec-example ordering — while accepting that any such files would be rejected by older versions of HTSJDK-based tools.

  2. (The “pragmatic” choice.) Restrict the specification to require that only the ID, Number, Type, Description ordering is valid. This would bless the existing HTSJDK behaviour, designate HTSlib's long-time behaviour as overgenerous, and outlaw any existing VCF files using a different ordering. (The evidence suggests that there may not be many such files; or perhaps it is the case that there are groups happily using such files with bcftools and other non-Java pipelines — we don't know.)

It seems that the consensus is towards (1), the “elegant” approach. Even with @d-cameron's ideas about requiring/recommending a particular ordering when writing, he agrees about accepting arbitrary orderings (at least, of these first four) when reading — so even that proposal has the consequences that HTSJDK must change and that files taking advantage of the arbitrary orderings will fail with older HTSJDK versions.

The wider question is whether to embellish this decision with a recommended ordering.

I agree with @d-cameron's point about round-trip stability being desirable, but also with @jkbonfield's point about allowing freedom of implementation choices (i.e., that this is naturally an unordered hash table!). IMHO round-trip stability is desirable and one thing to weigh up when implementing, and best left as a stated or unstated “quality of implementation” issue. (Note that round-trip stability can be provided by e.g. recommending filters to “preserve the same order as in the input record” or by recommending a particular order.)

I also mildly agree with @tfenne's point about the canonical ordering being useful when eyeballing ##INFO lines, though I think only really ID-first is important here: the exact ordering of Number/Type/Description isn't really important, though it is best for the eyeballs if the ordering is at least consistent within the headers of a particular VCF file. Again, this is mostly a “quality of implementation” issue rather than being necessarily critical to the file format.
[Edited to add: re-reading Tim's comment, I see his eyeballs are defending the required-then-optional ordering and suggesting ID-first, but not commenting on ordering within the other three required fields Number/Type/Description. So I guess I agree completely, and the only spec change I might recommend would be relaxing the informally written “extra fields can be included after the default fields” (which probably constitutes a MUST, but who knows exactly what the intention was) to a SHOULD.]

The VCF spec uses the shouty MAY/SHOULD/MUST RFC 2119 terminology, so recommending an ordering here would be a candidate for using SHOULD (at most) rather than MUST.

@zaeleus

zaeleus commented Feb 6, 2023

Since this clarification was not backported to VCF 4.3, does it still stand that the required fields are ordered in VCF < 4.4? It is still a failing test case in 4.3: failed_meta_info_003.vcf.

@jmarshall
Member

In my opinion, the 4.1 through 4.3 text is most naturally interpreted as not requiring the particular field ordering in question — see #642 (comment). IMHO that was also the consensus amongst the maintainers.

Alas, we have inherited a test case that believes otherwise (see #642 (comment)). So we should adjust {4.1,4.2,4.3}/failed/failed_meta_info_003.vcf so that they pass instead, or possibly just remove these test cases.

The 4.4 text was clarified, but the 4.3 and prior texts remain arguably unclear. IMHO we should indeed backport this clarification to the spec documents for all active VCF versions (which means 4.3 and probably 4.2 too).

@jkbonfield jkbonfield moved this to To do (backlog) in GA4GH File Formats Apr 24, 2023