Fix to Enable SelectVariants to drop sites with * allele as only ALT #5129

Merged: 6 commits, Aug 30, 2018

Conversation

kvinter1
Contributor

Attempt to eliminate VCF lines whose only ALT allele is * when using SelectVariants (a change to SelectVariants itself). Also adds new tests to SelectVariantsIntegrationTest to check that this holds across many combinations of arguments.

Working on completing a test in SelectVariantsIntegrationTest
Working on completing a test in SelectVariantsIntegrationTest
Work in progress
Work in progress
Including test to ensure alleles that are only monomorphic, in respect to samples included, will have lines deleted.
@kvinter1 kvinter1 requested a review from vdauwera August 22, 2018 21:37
@ldgauthier
Contributor

Looks like I also had a dupe at #4258. I swear I had a branch where I started on this, but I can't find it. I remember raising the question of expected behavior when --remove-unused-alternates is not specified. @vdauwera do you think we should always drop lonely *s? That's a lot easier.

Contributor

@ldgauthier ldgauthier left a comment

Thanks for the fix Katie! I like the expected behavior, but I think one of your tests is not what you intended.

@Test
public void testRemoveMonomorphAfterSNSelect() throws IOException {
final String testFile = getToolTestDataDir() + "spanning_deletion.vcf";
final String samplesFile = "NA1";
Contributor

Really this is a sample name rather than a samplesFile so can you change the variable name?

baseTestString(" -sn " + samplesFile + " --remove-unused-alternates --exclude-non-variants", testFile),
Collections.singletonList(getToolTestDataDir() + "expected/" + "testSelectVariants_RemoveSingleSpanDelAlleleNoSpanDel.vcf")
);
spec.executeTest("test will not remove variant line where '*' is only ALT allele because none exist" + testFile, this);
Contributor

I don't see how this is different from the above test.

Contributor

Did you want to use a different input file where the * is in a heterozygous genotype?

Contributor Author

My mistake. They are the same. It turns out the first test already covers both cases, so I deleted the second instance.

@@ -569,8 +569,8 @@ public void apply(VariantContext vc, ReadsContext readsContext, ReferenceContext
}
final VariantContext filteredGenotypeToNocall = setFilteredGenotypesToNocall ? builder.make(): sub;

// Not excluding non-variants or subsetted polymorphic variants AND including filtered loci or subsetted variant is not filtered
if ((!XLnonVariants || filteredGenotypeToNocall.isPolymorphicInSamples()) && (!XLfiltered || !filteredGenotypeToNocall.isFiltered())) {
// Not excluding non-variants OR (subsetted polymorphic variants AND not spanning deletion) AND (including filtered loci OR subsetted variant) is not filtered
Contributor

This is still really hard to read. Maybe give examples of things that will pass here?
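
One way to answer the request for examples is a minimal Java sketch of the predicate as the reworded comment describes it, with a few cases that pass and fail. This is not the GATK code: `KeepRecordSketch`, `keepRecord`, and the boolean parameters are hypothetical stand-ins, and only the names `XLnonVariants` and `XLfiltered` come from the diff above.

```java
// Hypothetical stand-in for the check in SelectVariants.apply().
// Every name here is illustrative; none of it is a GATK API.
final class KeepRecordSketch {

    /**
     * Keep the record when
     * (non-variants are allowed OR (polymorphic in the selected samples AND ALT is not only '*'))
     * AND (filtered sites are allowed OR the record is unfiltered).
     */
    static boolean keepRecord(boolean xlNonVariants,
                              boolean xlFiltered,
                              boolean polymorphicInSamples,
                              boolean onlySpanningDeletionAlt,
                              boolean filtered) {
        final boolean passesVariantCheck =
                !xlNonVariants || (polymorphicInSamples && !onlySpanningDeletionAlt);
        final boolean passesFilterCheck = !xlFiltered || !filtered;
        return passesVariantCheck && passesFilterCheck;
    }

    public static void main(String[] args) {
        // XLnonVariants set, ALT = T, sample carries T: kept.
        System.out.println(keepRecord(true, false, true, false, false));
        // XLnonVariants set, ALT = * only: dropped.
        System.out.println(keepRecord(true, false, true, true, false));
        // XLnonVariants set, sample is hom-ref: dropped.
        System.out.println(keepRecord(true, false, false, false, false));
        // No exclusion flags, ALT = * only: kept, since nothing was asked to be removed.
        System.out.println(keepRecord(false, false, true, true, false));
        // XLfiltered set and the record is filtered: dropped.
        System.out.println(keepRecord(false, true, true, false, true));
    }
}
```

Splitting the condition into two named booleans is one possible way to make the in-code comment unnecessary; whether that fits the surrounding method is a judgment call for the author.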

@kvinter1
Contributor Author

@ldgauthier Thanks for the comments! I made the changes you requested and deleted the redundant test. Let me know if you have any more edits in mind.

@codecov-io

Codecov Report

Merging #5129 into master will increase coverage by 0.013%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##              master     #5129       +/-   ##
===============================================
+ Coverage     86.655%   86.668%   +0.013%     
- Complexity     29046     30619     +1573     
===============================================
  Files           1808      1834       +26     
  Lines         134686    140269     +5583     
  Branches       14938     15772      +834     
===============================================
+ Hits          116712    121568     +4856     
- Misses         12559     13160      +601     
- Partials        5415      5541      +126
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...rs/variantutils/SelectVariantsIntegrationTest.java | 100% <100%> (ø) | 69 <3> (+3) ⬆️ |
| ...der/tools/walkers/variantutils/SelectVariants.java | 79.646% <100%> (+0.06%) | 119 <3> (+4) ⬆️ |
| ...ats/collections/SimpleCountCollectionUnitTest.java | 83.784% <0%> (-2.883%) | 4% <0%> (ø) |
| ...ngine/spark/AddContextDataToReadSparkUnitTest.java | 92.453% <0%> (-2.142%) | 12% <0%> (+6%) |
| ...ils/read/markduplicates/sparkrecords/Fragment.java | 90.909% <0%> (-1.948%) | 12% <0%> (+6%) |
| ...r/utils/read/markduplicates/sparkrecords/Pair.java | 98.276% <0%> (-1.724%) | 41% <0%> (+15%) |
| ...rkduplicates/MarkDuplicatesSparkUtilsUnitTest.java | 95.238% <0%> (-1.314%) | 18% <0%> (+7%) |
| ...bender/tools/walkers/qc/PileupIntegrationTest.java | 89.189% <0%> (-1.287%) | 12% <0%> (+6%) |
| ...institute/hellbender/utils/io/IOUtilsUnitTest.java | 86.458% <0%> (-0.79%) | 32% <0%> (+6%) |
| ...ncotator/mafOutput/MafOutputRendererConstants.java | 98.333% <0%> (-0.686%) | 2% <0%> (+1%) |

... and 97 more

Contributor

@ldgauthier ldgauthier left a comment

Looks good to me!

@ldgauthier
Contributor

@vdauwera Is Robert on github? Is this (i.e. expected behavior, arg name review, etc.) going to be one of his responsibilities going forward?

@kvinter1
Contributor Author

@sooheelee Could you review? Thanks!

@sooheelee
Contributor

Sure @kvinter1, I can review.

@sooheelee sooheelee self-requested a review August 29, 2018 19:45
@sooheelee
Contributor

sooheelee commented Aug 29, 2018

I just tested your branch @kvinter1 with some data at hand with the following command:

./gatk SelectVariants \
-V /Users/shlee/Downloads/gatk_bundle_1807/2-germline/input_vcfs/trio.vcf.gz \
-sn NA12878 \
--exclude-non-variants \
--remove-unused-alternates \
-O trio_excludeNVrmvUA.vcf.gz

And when I grep for the spanning deletion with gzcat trio_excludeNVrmvUA.vcf.gz | grep -v '##' | grep '*' | awk '$5="*"' | wc -l, I see two records pop up:

20 19013133 . C * 2084.57 . AC=1,1;AF=0.500,0.500;AN=2;BaseQRankSum=1.46;ClippingRankSum=0.00;DP=48;ExcessHet=3.9794;FS=0.000;MQ=44.84;MQRankSum=-4.256e+00;QD=18.29;ReadPosRankSum=-3.990e-01;SOR=0.672 GT:AD:DP:GQ:PL 1/2:0,0,18:48:1:732,732,732,1,0,178
20 25939208 . A * 352.14 . AC=1,1;AF=0.500,0.500;AN=2;BaseQRankSum=1.14;ClippingRankSum=0.00;DP=12;ExcessHet=3.0103;FS=21.733;MQ=31.92;MQRankSum=-3.331e+00;QD=11.74;ReadPosRankSum=1.99;SOR=1.570 GT:AD:DP:GQ:PL 1/2:1,0,6:12:15:195,156,177,15,0,923

I assume testing functionality is what you meant by review? As for updating the JavaDoc portion, could you write a draft, e.g. a summary of the functionality implemented? I noticed there were no changes to the doc portions. I can then review your synopsis for clarity and style.

P.S. I should mention that, apart from those two, your branch removed 511 such unwanted records. Nice.

Contributor

@sooheelee sooheelee left a comment

Left in the conversation thread.

@ldgauthier
Contributor

@sooheelee can you share that input file with Katie so she can add the het-non-ref case to the integration tests?
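
To illustrate why the het-non-ref case deserves its own test, here is a rough Java sketch of the reasoning. It assumes that removing unused alternates means keeping only the ALT alleles that the selected sample's genotype refers to. The class and method names are hypothetical and this is not how GATK implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not GATK code.
final class UnusedAltSketch {

    /** Returns the ALT alleles referenced by the genotype (index 0 is REF, 1 is the first ALT). */
    static List<String> usedAlts(List<String> alts, int[] genotype) {
        final List<String> kept = new ArrayList<>();
        for (int i = 0; i < alts.size(); i++) {
            for (int alleleIndex : genotype) {
                if (alleleIndex == i + 1) {
                    kept.add(alts.get(i));
                    break;
                }
            }
        }
        return kept;
    }

    /** True when the only remaining ALT allele is the spanning deletion '*'. */
    static boolean onlySpanningDeletion(List<String> alts) {
        return alts.size() == 1 && alts.get(0).equals("*");
    }

    public static void main(String[] args) {
        final List<String> alts = Arrays.asList("T", "*");

        // Het-non-ref genotype 1/2 uses both T and *, so the site still has a real ALT.
        System.out.println(onlySpanningDeletion(usedAlts(alts, new int[] {1, 2})));

        // Genotype 0/2 uses only *, so after trimming the site is a lonely '*'.
        System.out.println(onlySpanningDeletion(usedAlts(alts, new int[] {0, 2})));
    }
}
```

Under these assumptions a 1/2 genotype keeps both alleles and the record should survive, while a 0/2 genotype leaves `*` as the only ALT and the record should be dropped.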

@sooheelee
Contributor

It is publicly available in the GATK workshop bundle but here it is for convenience:

forKatieFromSooHee.zip

@sooheelee
Contributor

sooheelee commented Aug 30, 2018

Just learned that awk '$5="*"' is an assignment, so it rewrites T,* as *, and the correct usage is a match: awk '$5~"*"'.
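
For a check that cannot rewrite the ALT column, the same count can be done with an exact comparison (in awk that would be `$5=="*"`). Below is a small Java sketch of that check. The class name and the toy records are invented for illustration and are not taken from the trio VCF.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the sample lines in main() are made up for illustration.
final class AltColumnSketch {

    /** Counts non-header VCF lines whose ALT field (column 5) is exactly "*". */
    static long countStarOnlyAlt(List<String> vcfLines) {
        return vcfLines.stream()
                .filter(line -> !line.startsWith("#"))
                .map(line -> line.split("\t"))
                .filter(fields -> fields.length >= 5 && fields[4].equals("*"))
                .count();
    }

    public static void main(String[] args) {
        final List<String> lines = Arrays.asList(
                "##fileformat=VCFv4.2",
                "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
                "20\t100\t.\tC\t*\t50\t.\t.",     // only '*': counted
                "20\t200\t.\tA\tT,*\t50\t.\t.",   // T plus '*': not counted
                "20\t300\t.\tG\tT\t50\t.\t.");    // plain SNP: not counted
        System.out.println(countStarOnlyAlt(lines));
    }
}
```

Because the comparison reads the field instead of assigning to it, a record such as T,* is left as it is and is not miscounted as a lone `*`.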
