Modify the behavior of getNextReferenceVertex for non-ref paths #4889

Merged

DonFreed
Contributor

There was a subtle change in the behavior of getNextReferenceVertex in the migration from GATK 3.x to GATK 4.x that may be worth another look.

The original 3.x behavior is described in the function's comment: if @param allowNonRefPaths is true, allow sub-paths that are non-reference if there is only a single outgoing edge. The 4.x migration copied the comment but changed the behavior to: if @param allowNonRefPaths is true, allow sub-paths that are non-reference and pick a random outgoing edge.

This pull request restores the 3.x behavior, which we believe is the design intent.
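For readers who want to see the difference in isolation, here is a minimal sketch of the two selection rules. It is not the GATK code: edges are plain strings, and the class and method names (NextVertexSketch, pickAny, pickOnlyIfUnique) are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

public class NextVertexSketch {
    // 4.x rule before this PR: take any non-blacklisted edge, chosen arbitrarily.
    static Optional<String> pickAny(List<String> outgoing, Set<String> blacklisted) {
        return outgoing.stream()
                .filter(e -> !blacklisted.contains(e))
                .findAny();
    }

    // 3.x rule restored by this PR: follow an edge only when it is the sole candidate.
    static Optional<String> pickOnlyIfUnique(List<String> outgoing, Set<String> blacklisted) {
        List<String> candidates = outgoing.stream()
                .filter(e -> !blacklisted.contains(e))
                .limit(2) // two candidates are enough to know the choice is ambiguous
                .collect(Collectors.toList());
        return candidates.size() == 1 ? Optional.of(candidates.get(0)) : Optional.empty();
    }

    public static void main(String[] args) {
        List<String> fork = Arrays.asList("edgeA", "edgeB");
        Set<String> none = Collections.emptySet();
        // With two candidates, the unique-only rule declines to choose.
        System.out.println(pickOnlyIfUnique(fork, none).isPresent()); // false
        // The any-edge rule always returns something when a candidate exists.
        System.out.println(pickAny(fork, none).isPresent()); // true
    }
}
```

With a fork of two usable edges, the unique-only rule returns nothing, while the any-edge rule returns one of them without any guarantee about which.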

@codecov-io

codecov-io commented Jun 12, 2018

Codecov Report

Merging #4889 into master will decrease coverage by 0.257%.
The diff coverage is 50%.

@@               Coverage Diff               @@
##              master     #4889       +/-   ##
===============================================
- Coverage     80.421%   80.164%   -0.257%     
+ Complexity     17820     17772       -48     
===============================================
  Files           1089      1089               
  Lines          64161     64161               
  Branches       10344     10344               
===============================================
- Hits           51599     51434      -165     
- Misses          8501      8672      +171     
+ Partials        4061      4055        -6
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...ools/walkers/haplotypecaller/graphs/BaseGraph.java | 82.239% <50%> (ø) | 94 <2> (ø) ⬇️ |
| ...s/spark/ParallelCopyGCSDirectoryIntoHDFSSpark.java | 0% <0%> (-74.257%) | 0% <0%> (-17%) |
| ...nder/tools/spark/pipelines/PrintVariantsSpark.java | 0% <0%> (-66.667%) | 0% <0%> (-2%) |
| ...oadinstitute/hellbender/utils/test/XorWrapper.java | 13.043% <0%> (-65.217%) | 2% <0%> (-7%) |
| ...oadinstitute/hellbender/utils/gcs/BucketUtils.java | 54.194% <0%> (-25.806%) | 30% <0%> (-10%) |
| ...nder/tools/spark/BaseRecalibratorSparkSharded.java | 0% <0%> (-22.807%) | 0% <0%> (-2%) |
| ...titute/hellbender/utils/test/MiniClusterUtils.java | 78.947% <0%> (-10.526%) | 6% <0%> (-1%) |
| ...der/engine/spark/datasources/ReadsSparkSource.java | 77.083% <0%> (-3.125%) | 31% <0%> (ø) |
| ...adinstitute/hellbender/engine/ReadsDataSource.java | 89.394% <0%> (-3.03%) | 61% <0%> (-2%) |
| ...broadinstitute/hellbender/utils/test/BaseTest.java | 62.838% <0%> (-2.703%) | 36% <0%> (-3%) |
... and 3 more

@lbergelson
Member

@davidbenjamin Are you able to review this?

@davidbenjamin
Contributor

@lbergelson Sure!

- final Optional<E> edge = outgoingEdges.stream().filter(e -> !blacklistedEdgeSet.contains(e)).findAny();
- return edge.isPresent() ? getEdgeTarget(edge.get()) : null;
+ final List<E> edges = outgoingEdges.stream().filter(e -> !blacklistedEdgeSet.contains(e)).limit(2).collect(Collectors.toList());
+ return edges.size() == 1 ? getEdgeTarget(edges.get(0)) : null;
Contributor

Getting the first edge is still essentially random in that it depends on the order in which reads arrived. Why not be principled and traverse the edge with the greatest weight, which seems like a good heuristic if you're trying to arrive back at the reference?

final Optional<E> edge = outgoingEdges.stream().max(Comparator.comparingInt(BaseEdge::getMultiplicity));

Contributor

Sorry, I misread the code. But I still think my suggestion to do a greedy search before giving up on finding the reference is reasonable and more robust to things like high depth where you might have lots of low-weight non-reference edges.
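As a rough illustration of the heaviest-edge heuristic suggested here, the sketch below picks the outgoing edge with the largest multiplicity. The Edge class and the heaviest method are hypothetical stand-ins, not GATK types; only the idea of comparing edges by multiplicity comes from the discussion.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class HeaviestEdgeSketch {
    // Hypothetical stand-in for a graph edge; only the weight matters here.
    static final class Edge {
        final String target;
        final int multiplicity;

        Edge(String target, int multiplicity) {
            this.target = target;
            this.multiplicity = multiplicity;
        }

        int getMultiplicity() {
            return multiplicity;
        }
    }

    // Returns the outgoing edge with the greatest multiplicity, or empty if there are none.
    // Note: max() with a non-reversed comparator selects the largest value.
    static Optional<Edge> heaviest(List<Edge> outgoing) {
        return outgoing.stream().max(Comparator.comparingInt(Edge::getMultiplicity));
    }

    public static void main(String[] args) {
        List<Edge> outgoing = Arrays.asList(
                new Edge("noise1", 1),
                new Edge("likelyRef", 30),
                new Edge("noise2", 2));
        // Many low-weight edges do not stop the heavy edge from being chosen.
        heaviest(outgoing).ifPresent(e -> System.out.println(e.target)); // likelyRef
    }
}
```

Unlike the unique-only rule, this always makes a choice when any edge exists, so low-weight noise edges at high depth would not block the walk.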

Contributor Author

Right, the current behavior is to pick a random edge. The PR reverts to the 3.x behavior, which is to return a non-reference edge iff there is exactly one non-reference edge.

It would be great to try a more exhaustive search. However, my intuition is that we are unlikely to make it back to the reference once the paths start forking, and the computational cost can be substantial. We can use this PR to revert to the deterministic logic and save the implementation (and evaluation) of a greedy search for a future issue.
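To make the deferred idea concrete, here is one possible shape for a bounded greedy walk. Everything in it is an assumption for illustration: the graph is a plain adjacency map of weights, and walkToReference, referenceVertices, and maxSteps are hypothetical names. The step budget is one way to cap the cost mentioned above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class GreedyWalkSketch {
    // Follows the heaviest outgoing edge at each step until a reference vertex
    // is reached, a dead end is hit, or the step budget runs out.
    static Optional<String> walkToReference(Map<String, Map<String, Integer>> graph,
                                            Set<String> referenceVertices,
                                            String start,
                                            int maxSteps) {
        String current = start;
        for (int step = 0; step < maxSteps; step++) {
            Map<String, Integer> outgoing = graph.getOrDefault(current, Collections.emptyMap());
            Optional<Map.Entry<String, Integer>> best =
                    outgoing.entrySet().stream().max(Map.Entry.comparingByValue());
            if (!best.isPresent()) {
                return Optional.empty(); // dead end: no outgoing edges
            }
            current = best.get().getKey();
            if (referenceVertices.contains(current)) {
                return Optional.of(current);
            }
        }
        return Optional.empty(); // budget exhausted before reaching the reference
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> graph = new HashMap<>();
        graph.put("X", new HashMap<>());
        graph.get("X").put("Y", 5);
        graph.get("X").put("Z", 1);
        graph.put("Y", new HashMap<>());
        graph.get("Y").put("R", 3);
        Set<String> reference = new HashSet<>(Collections.singleton("R"));
        // X -> Y (weight 5 beats 1) -> R, which is a reference vertex.
        System.out.println(walkToReference(graph, reference, "X", 5).isPresent()); // true
    }
}
```

A greedy walk commits to one branch per fork, so it stays linear in the step budget but can miss a route that an exhaustive search would find.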

Contributor Author

@davidbenjamin thoughts?

Contributor

@DonFreed I'm fine with the 3.x behavior. Anything fancier can wait for a later thorough revision of the dangling end merging code.

@lbergelson lbergelson merged commit 54aa82e into broadinstitute:master Jun 19, 2018
@lbergelson
Member

Thank you @DonFreed!

@davidbenjamin thank you for reviewing. If you feel we should make the changes you've suggested, could you open an issue so we don't forget?

@DonFreed DonFreed deleted the df_getNextReferenceVertex_behavior branch June 19, 2018 18:32
lbergelson pushed a commit that referenced this pull request Jun 19, 2018
There was a subtle change in the behavior of getNextReferenceVertex in the migration from GATK 3.x to GATK 4.x.

The original 3.x behavior is described in the function's comment: if @param allowNonRefPaths is true, allow sub-paths that are non-reference if there is only a single outgoing edge. The 4.x migration copied the comment but changed the behavior to: if @param allowNonRefPaths is true, allow sub-paths that are non-reference and pick a random outgoing edge.

This pull request restores the 3.x behavior.