Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

equals in VCFInfoHeaderLine is using description.equals(other.description) ? #929

Open
lindenb opened this issue Jul 7, 2017 · 6 comments
Labels
VCFHeader Refactor Issues fixed by the header/line refactoring branch

Comments

@lindenb
Copy link
Contributor

lindenb commented Jul 7, 2017

Subject of the issue

A GATK program bugs because there are two fields with the same VCFInfoHeaderLine but the description is different. One is from my original VCF, the other is injected by GATK

##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in  genotypes">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">

Verify

The following code inserts two VCFInfoHeaderLine in a Set. They only differ on their description. The size of the set is 2.

import java.util.HashSet;
import java.util.Set;

import htsjdk.variant.vcf.VCFHeaderLineCount;
import htsjdk.variant.vcf.VCFHeaderLineType;
import htsjdk.variant.vcf.VCFInfoHeaderLine;

public class Test {
public static void main(String[] args) {
    VCFInfoHeaderLine ac1 = new VCFInfoHeaderLine("AC", VCFHeaderLineCount.A, VCFHeaderLineType.Integer,
            "ONE");
    VCFInfoHeaderLine ac2 = new VCFInfoHeaderLine("AC", VCFHeaderLineCount.A, VCFHeaderLineType.Integer,
            "TWO");
    Set<VCFInfoHeaderLine> set = new HashSet<>();
    set.add(ac1);
    set.add(ac2);
    
    System.err.println(set);
    // displays 2 values:
    // [INFO=<ID=AC,Number=A,Type=Integer,Description="ONE">, INFO=<ID=AC,Number=A,Type=Integer,Description="TWO">]
    }
}

Expected behaviour

I think the field description shouldn't be considered (?)

equals is defined in CompoundHeaderLine as

    @Override
    public boolean equals(final Object o) {
     (...)
        final VCFCompoundHeaderLine that = (VCFCompoundHeaderLine) o;
        return equalsExcludingDescription(that) &&
               description.equals(that.description);
    }

I can submit a PR to override equals in VCFInfoHeaderLine (and VCFFormatHeaderLine ? ) to set

    @Override
    public boolean equals(final Object o) {
     (...)
        final VCFInfoHeaderLine that = (VCFInfoHeaderLine) o;
        return equalsExcludingDescription(that)  ;
    }

@cmnbroad
Copy link
Collaborator

cmnbroad commented Jul 7, 2017

@lindenb There is a PR that has a pretty big refactoring of the header line hierarchy in an attempt to fix a number of issues similar to this one. (One of them was that the current implementation doesn't preserve additional tags for compound header lines at all, i.e., what if one of the lines has a version tag, or some other tag). I'd propose that we resolve this issue in the context of the review discussion for that PR. Unfortunately, the PR large because there are so many interdependent issues. Hopefully get to reviewing it soon.

@cmnbroad
Copy link
Collaborator

cmnbroad commented Jul 7, 2017

BTW, I'd be curious to know more about which GATK version and tool/code path failed. There are tests in both Picard and GATK where this exact case occurs (a set containing header lines that are identical except for description is used to create a header). Internally, the current VCFHeader implementation would only use one of the lines (because the ID must be unique for a given key), but it also preserves and roundtrips the other one. So there can be surprising results.

@lindenb
Copy link
Contributor Author

lindenb commented Jul 7, 2017

@cmnbroad

BTW, I'd be curious to know more about which GATK version and tool/code path failed

GATK SelectVariants Version=3.7-0-gcfedb67

@cmnbroad
Copy link
Collaborator

cmnbroad commented Jul 7, 2017

@lindenb Yeah, I've seen this issue with SelectVariants in both GATK3 and 4. Thanks.

@lindenb
Copy link
Contributor Author

lindenb commented Jul 7, 2017

closing as #835

@lindenb lindenb closed this as completed Jul 7, 2017
@cmnbroad cmnbroad reopened this Jul 10, 2017
@cmnbroad
Copy link
Collaborator

I think we should keep tracking this to ensure that it gets resolved as part of #835.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
VCFHeader Refactor Issues fixed by the header/line refactoring branch
Projects
None yet
Development

No branches or pull requests

2 participants