Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SelectVariants JEXL filter fixes and refactor #8092

Merged
merged 16 commits into from
Jan 7, 2023
Merged

Conversation

takutosato
Copy link
Contributor

@takutosato takutosato commented Nov 14, 2022

This PR is an attempt at improving SelectVariants by

  • Rewriting unclear code;
  • Adding documentation where needed; and
  • Adding tests.

The initial motivation for this code change (#7497) was to improve performance of SelectVariants by adding an option to do the "INFO-level filtering" before doing "genotype filtering." Our assumption was that this would improve performance because then we would avoid the expensive genotype "fully-decode" operation, which turns string format fields into appropriate object/types (int, array, etc.). This is (we think) done in VariantContext.fullyDecode().

This turned out not to be possible for the following reasons. First, there are roughly four types of genotype subsetting you could do:

a) By the sample names (--sample-name NA12878)
b) JEXL (--select GQ > 0)
c) JEXL by accessing the variant context object (--select vc.getGenotype('NA12878').getGQ() > 1)
d) Others (e.g. --remove-fraction-genotype)

a) does not need "fully-decode." It turns out b) was never supported (GATK currently removes all variants and succeed.) And from my experiments, c) does not seem to ever trigger calling VariantContext.fullyDecode(). In fact the only code path I can see that calls fullyDecode() is by setting the fully-decode SelectVariants argument, which seems to just call fullyDecode at the beginning just for the sake of calling it (or so it appears to me. The utility of this command line argument is highly dubious.)

It's possible that apache code does something similar to fully decoding that could affect performance.

All that is to say that we cannot achieve performance improvement with our original blueprint simply because this expensive "fullyDecode" operation seems to be a mythical operation that is never used in reality.

So while I could not speed up SelectVariants, I cleaned up the code and added the following new arguments:

  • --select-genotype: with this new genotype-specific JEXL argument, we support filtering by genotype fields like 'GQ > 0', where the behavior in the multi-sample case is 'GQ > 0' in at least one sample. I have not added the ability to do 'GQ > 0 for all samples' but it should be a simple (but not easy…) exercise in boolean operations.
  • applyJexlFiltersBeforeFilteringGenotypes: if set to true, we do the JEXL checking before we subset by samples. In my tests, performance improvement from this option was very modest. Subsetting a ~3k 1kg SV vcf to a single sample was about 30 seconds faster (out of ~20 min total run time) than the default. I kept it in the PR because I thought some user might find it useful, but I wouldn't be opposed to removing it.

Tests needed:

  • Filter by genotypes with a new flag --genotype-select, with the default behavior being 'passes if at least one sample passes'
  • Multiple --select expressions should be combined with logical-or
  • Test string annotations (e.g. ALGORITHM == 'depth')
  • Jexl involving with logical-and (e.g. AC > 0 && AF > 0.01)
  • Access genotypes directly e.g. vc.getsample('NA12878')
  • DP > 0 as --genotype-select and as --select
  • Combine --select and --select-genotypes
  • Code path that uses "fully-decode"
  • Failing cases (reference genotype fields in --select and vice versa)
  • --applyJexlFiltersBeforeFilteringGenotypes. Does this actually give us performance advantage?
  • Add a test for select-random-fraction

Comment on lines 538 to 583
// START Chris/Louis/David
// Attempt 1, does not work
// final ArgumentsBuilder args1 = new ArgumentsBuilder()
// .add("V", testVcf.toString())
// .add("O", testOutput.getAbsolutePath())
// .add("--select", "'AC > 1'")
// .add("--select", "'AF > 0'");
// // .add("select", "'GQ > 0'"); // tsato: this does not work for some reason...
// runCommandLine(args1, SelectVariants.class.getSimpleName());

// Attempt 2 (remove single quotes), does not work
// final ArgumentsBuilder args2 = new ArgumentsBuilder()
// .add("V", testVcf.toString())
// .add("O", testOutput.getAbsolutePath())
// .add("--select", "AC > 1")
// .add("--select", "AF > 0");
// // .add("select", "'GQ > 0'"); // tsato: this does not work for some reason...
// runCommandLine(args2, SelectVariants.class.getSimpleName());

// Atempt 3 (remove spaces). commandline parser lets this through but not the JEXL parser
// final ArgumentsBuilder args3 = new ArgumentsBuilder()
// .add("V", testVcf.toString())
// .add("O", testOutput.getAbsolutePath())
// .addRaw("--select 'AC>1'")
// .addRaw("--select 'AF>0'");
// // .add("select", "'GQ > 0'"); // tsato: this does not work for some reason...
// runCommandLine(args3, SelectVariants.class.getSimpleName());

// Attempt 4 (give a list to runCommandLine), works
// runCommandLine(Arrays.asList("-V", testVcf.toString(),
// "-O", testOutput.getAbsolutePath(),
// "--select","AC > 1"),
// SelectVariants.class.getSimpleName());

// Attempt 5, works!
// runCommandLine(Arrays.asList("-V", testVcf.toString(),
// "-O", testOutput.getAbsolutePath(),
// "--select", "ALGORITHMS == 'depth'"),
// SelectVariants.class.getSimpleName());

// Attempt 6, works!
runCommandLine(Arrays.asList("-V", testVcf.toString(),
"-O", testOutput.getAbsolutePath(),
"--select", "SVTYPE == 'DEL'"),
SelectVariants.class.getSimpleName());
// END Chris/Louis/David
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droazen @lbergelson @cmnbroad this block of code has the command line parsing issue I mentioned. My work around is Attempt 5 and 6, where I feed runCommandLine a list of arguments rather than an argument builder. Is that good enough? Otherwise I suspect we need to update ArgumentsBuilder to support JEXL command lines.

@gatk-bot
Copy link

gatk-bot commented Nov 14, 2022

Github actions tests reported job failures from actions build 3463372954
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud 8 3463372954.10 logs
cloud 11 3463372954.11 logs
integration 11 3463372954.12 logs
integration 8 3463372954.0 logs

@takutosato takutosato marked this pull request as draft November 19, 2022 20:15
@gatk-bot
Copy link

gatk-bot commented Nov 19, 2022

Github actions tests reported job failures from actions build 3505269846
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud 11 3505269846.11 logs
unit 11 3505269846.13 logs
integration 11 3505269846.12 logs
integration 8 3505269846.0 logs

@takutosato takutosato marked this pull request as ready for review November 28, 2022 18:21
@gatk-bot
Copy link

gatk-bot commented Nov 28, 2022

Github actions tests reported job failures from actions build 3567484153
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 11 3567484153.13 logs
integration 11 3567484153.12 logs

@gatk-bot
Copy link

gatk-bot commented Nov 28, 2022

Github actions tests reported job failures from actions build 3567534442
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 11 3567534442.13 logs
integration 11 3567534442.12 logs

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@takutosato Back to you with my comments. Changes look really good overall, and definitely an improvement over the existing tool! Go ahead and merge this after addressing my comments and tests pass. You'll need to rebase the branch to get rid of some unrelated test failures.

* specified samples are extracted and the INFO field annotations are updated.
* See example commands above for detailed usage examples. The expressions given to this argument
* should either refer to INFO fields, or access FORMAT field with the VariantContext object
* e.g. --select "vc.getGenotype('NA12878').getGQ() > 35"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be a good idea to include an example with vc.getGenotype().getGQ() in the main tool docs above, along with a comment comparing that syntax to the new -select-genotype approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, --select should be -select here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@Argument(shortName="select", doc="One or more criteria to use when selecting the data", optional=true)
public static final String SELECT_SHORT_NAME = "select";
public static final String SELECT_LONG_NAME = "select-expressions";
@Argument(fullName=SELECT_LONG_NAME, shortName=SELECT_SHORT_NAME, doc="", optional=true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc string should not be empty for this argument

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit worried about breaking some users' script by not allowing --select, so I might revert this new long name change.

*/
public static final String GENOTYPE_SELECT_SHORT_NAME = "select-genotype";
public static final String GENOTYPE_SELECT_LONG_NAME = "select-genotype-expressions";
@Argument(fullName=GENOTYPE_SELECT_LONG_NAME, shortName=GENOTYPE_SELECT_SHORT_NAME, doc="", optional=true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc string should not be empty

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* on INFO fields only, which may improve speed when working with a large cohort vcf that contains genotypes for
* thousands of samples (format fields).
*/
@Argument(fullName="jexl-first", shortName="jexl-first", doc="", optional=true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fill in doc string

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long argument name should perhaps be more descriptive here -- eg., --apply-jexl-filters-first

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done x2

@@ -480,7 +523,7 @@ public void onTraversalStart() {

final List<String> genotypeField = getHeaderForVariants().getGenotypeSamples();
if(!ParsingUtils.isSorted(genotypeField)){
if(genotypeField.size() > 10) {
if(!genotypeField.isEmpty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that the previous threshold of 10 was deliberate, so that we don't warn unnecessarily about performance when there are only a few samples.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, done.

final int threshold = 1;
final Predicate<VariantContext> filter = vc -> vc.getGenotypes().stream().anyMatch(g -> g.getGQ() > threshold);
final Pair<VCFHeader, List<VariantContext>> preFilter = VariantContextTestUtils.readEntireVCFIntoMemory(svHGDBMultiSampleVcf.toString());
final int expectedNumVCs = (int) preFilter.getRight().stream().filter(filter).count();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In these new tests you added, have you confirmed in each case that the number of expected records is > 0? Otherwise the tests will not be as meaningful as they could be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added extra assert statements to check this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

SelectVariants.class.getSimpleName());
final Pair<VCFHeader, List<VariantContext>> result2 = VariantContextTestUtils.readEntireVCFIntoMemory(testOutput2.getAbsolutePath());
final int actualNumVCs2 = result2.getRight().size();
Assert.assertEquals(expectedNumVCs2, actualNumVCs2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This second test case should probably be a separate test method rather than embedded within testCombineInfoAndGenotypeJexl()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

SelectVariants.class.getSimpleName());
final Pair<VCFHeader, List<VariantContext>> result3 = VariantContextTestUtils.readEntireVCFIntoMemory(testOutput3.getAbsolutePath());
final int actualNumVCs3 = result3.getRight().size();
Assert.assertEquals(0, actualNumVCs3);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be pulled out into a separate named test method of its own.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done




// Assert.assertEquals(expectedNumVCs2, actualNumVCs2);D
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete commented-out code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// runCommandLine(Arrays.asList("-V", testVcf.toString(),
// "-O", testOutput.getAbsolutePath(),
// "--select", "ALGORITHMS == 'depth'"),
// SelectVariants.class.getSimpleName());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete commented-out code (but consult with @lbergelson on why ArgumentsBuilder didn't work here)

Comment on lines 525 to 534
final Predicate<VariantContext> filter = vc -> vc.getAttributeAsInt("AC", -1) > 0 ||
vc.getAttributeAsString("SVTYPE", "").equals("DEL");

final Pair<VCFHeader, List<VariantContext>> preFilter = VariantContextTestUtils.readEntireVCFIntoMemory(svHGDBMultiSampleVcf.toString());
final int expectedCount = (int) preFilter.getRight().stream().filter(filter).count();

runCommandLine(Arrays.asList("-V", svHGDBMultiSampleVcf.toString(),
"-O", testOutput.getAbsolutePath(),
"--select", "SVTYPE == 'DEL'",
"--select", "AC > 0"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad what do you think? The issue here is that "--select" should no longer be a valid command line but runCommandLine does not complain about it.

@@ -2,13 +2,7 @@

import htsjdk.samtools.SAMSequenceDictionary;
import htsjdk.tribble.util.ParsingUtils;
import htsjdk.variant.variantcontext.Allele;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put the explicit imports back

@@ -603,6 +654,11 @@ public void apply(VariantContext vc, ReadsContext readsContext, ReferenceContext
return;
}

if (applyJexlFiltersBeforeFilteringGenotypes && !passesJexlFilters(vc)) {
return;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be tested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@codecov
Copy link

codecov bot commented Jan 7, 2023

Codecov Report

Merging #8092 (78c226d) into master (90bf668) will increase coverage by 0.023%.
The diff coverage is 92.237%.

❗ Current head 78c226d differs from pull request most recent head b0dcf74. Consider uploading reports for the commit b0dcf74 to get more accurate results

Additional details and impacted files
@@               Coverage Diff               @@
##              master     #8092       +/-   ##
===============================================
+ Coverage     86.636%   86.659%   +0.023%     
- Complexity     38944     39021       +77     
===============================================
  Files           2335      2337        +2     
  Lines         182675    182958      +283     
  Branches       20053     20080       +27     
===============================================
+ Hits          158262    158549      +287     
+ Misses         17373     17362       -11     
- Partials        7040      7047        +7     
Impacted Files Coverage Δ
...ender/tools/walkers/filters/VariantFiltration.java 92.593% <ø> (ø)
...der/tools/walkers/variantutils/SelectVariants.java 83.911% <86.765%> (+2.516%) ⬆️
...rs/variantutils/SelectVariantsIntegrationTest.java 96.928% <94.702%> (-0.783%) ⬇️
.../scalable/data/LabeledVariantAnnotationsDatum.java 68.421% <0.000%> (-3.801%) ⬇️
...rs/vqsr/scalable/TrainVariantAnnotationsModel.java 77.778% <0.000%> (-2.991%) ⬇️
...bender/utils/runtime/AsynchronousStreamWriter.java 81.633% <0.000%> (-2.041%) ⬇️
...ct/CreateSomaticPanelOfNormalsIntegrationTest.java 96.396% <0.000%> (-1.305%) ⬇️
...stitute/hellbender/utils/config/ConfigFactory.java 73.750% <0.000%> (-1.250%) ⬇️
...ools/walkers/annotator/VariantAnnotatorEngine.java 86.260% <0.000%> (-0.999%) ⬇️
...org/broadinstitute/hellbender/utils/MathUtils.java 80.622% <0.000%> (-0.957%) ⬇️
... and 80 more

@takutosato takutosato merged commit 9f77b1f into master Jan 7, 2023
@takutosato takutosato deleted the ts_select_variants branch January 7, 2023 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants