Fixed a bug in pileup mode relating to number of haplotypes #8489

jonn-smith · 2023-08-21T21:35:23Z

Recent refactoring seems to have introduced a bug in pileup mode that failed to enforce the limit on the number of haplotypes to be considered. With this patch:

HaplotypeCaller once again respects the limit on haplotypes before genotyping.
Changed some HashSets to LinkedHashSets to preserve determinism.

gatk-bot · 2023-08-21T22:04:01Z

Github actions tests reported job failures from actions build 5931710172
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
unit	17.0.6+10	5931710172.12	logs
unit	17.0.6+10	5931710172.1	logs
variantcalling	17.0.6+10	5931710172.2	logs

gatk-bot · 2023-08-21T22:13:19Z

Github actions tests reported job failures from actions build 5931724901
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
unit	17.0.6+10	5931724901.12	logs
unit	17.0.6+10	5931724901.1	logs
variantcalling	17.0.6+10	5931724901.2	logs

gatk-bot · 2023-08-22T03:45:35Z

Github actions tests reported job failures from actions build 5933685416
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
variantcalling	17.0.6+10	5933685416.2	logs

jamesemery

One minor change and a request to update the unit tests to reflect the new reality a little better.

@davidbenjamin This is a reversion to a fix you made recently fixing the non-determninism... it turns out this was in some edge cases producing assembly windows with 50k+ haplotypes (because we were taking everything that ties at the 5 ranked score without limit...) Now we are arbitrarily breaking based on string comparision which at least prevents us from exploding...

Do you have any ideas for an alternative tie-breaking scheme for haploytpes? It seems these ties were happening much more than expected (probably because most haplotypes end up with (length - k) as their scores because there kmers appear at least somewhere... This also would seem to bias us to mostly select the longest haplotypes in every case for inserting events....

...va/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/AssemblyBasedCallerUtils.java

...roadinstitute/hellbender/tools/walkers/haplotypecaller/AssemblyBasedCallerUtilsUnitTest.java

jamesemery · 2023-08-23T15:41:25Z

...es/org/broadinstitute/hellbender/tools/haplotypecaller/expected.pileupCallerDRAGEN.gatk4.vcf

-20	10099055	.	T	C	47.64	.	AC=1;AF=0.500;AN=2;BaseQRankSum=-1.460;DP=19;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=55.52;MQRankSum=-3.419;QD=2.65;ReadPosRankSum=-0.830;SOR=0.330	GT:AD:DP:GQ:PL	0/1:15,3:18:55:55,0,557
-20	10099079	.	C	T	40.64	.	AC=1;AF=0.500;AN=2;BaseQRankSum=0.694;DP=23;ExcessHet=0.0000;FS=2.192;MLEAC=1;MLEAF=0.500;MQ=56.33;MQRankSum=-3.254;QD=1.77;ReadPosRankSum=-0.568;SOR=1.460	GT:AD:DP:GQ:PL	0/1:19,4:23:48:48,0,551
-20	10099140	.	G	T	583.64	.	AC=1;AF=0.500;AN=2;BaseQRankSum=-0.176;DP=49;ExcessHet=0.0000;FS=6.794;MLEAC=1;MLEAF=0.500;MQ=60.07;MQRankSum=1.344;QD=12.42;ReadPosRankSum=-0.466;SOR=1.944	GT:AD:DP:GQ:PL	0/1:28,19:47:99:591,0,932
+20	10099046	.	T	C	31.64	.	AC=1;AF=0.500;AN=2;BaseQRankSum=-1.842;DP=17;ExcessHet=0.0000;FS=3.701;MLEAC=1;MLEAF=0.500;MQ=56.07;MQRankSum=-3.254;QD=1.86;ReadPosRankSum=-0.597;SOR=0.090	GT:AD:DP:GQ:PL	0/1:15,2:17:39:39,0,564


these diferences are mostly noise to do with exactly what haplotypes come out... it is interesting the ties at the 5th score level seem to be very common... We really need to replace this heuristic with something not quite so terrible... @davidbenjamin

@jamesemery It seems to be that the intent here is to corral the combinatorial explosion of paths in the de Bruijn graph that don't correspond to actual phasing. However, by using the same kmer size as in assembly the heuristic is completely toothless! It's basically a tautology that all the kmers of an assembled haplotype show up in some read, since otherwise the kmer wouldn't exist in the graph.

A better approach would be to use the same heuristic but with larger kmers than used in assembly. For example, using k = 30 would get rid of spurious phasing within 30 bases.

Also, I don't think that using a larger k is merely a tie-breaker for the extant approach. Rather, what we have now should be replaced completely.

I don't think I am done pondering this but that's a good start.

One might also ask whether an excess of haplotypes means that instead of filtering we should just redo assembly with a larger k, or dynamically switch to linked de Bruijn graphs. That could well be.

Of course, in pileup mode that's moot.

That seems like a good observation... It would also make it harder for every haplotype to be maximally represented by reads! I think it might also be worth considering normalizing to the lenght of the haplotype as well to prevent ourselves from being biased towards longer haplotypes that have incidental (and distant) insertions being over-represented in the choices here... alternatively we could flip the score on its head... where the score is the absolute number of missing kmers from the reads, which should more or less be capped at ~(2k -1) but not have this gross property that it prefers longer hapltotypes

I agree that counting non-missing kmers and thereby rewarding length is wrong.

Since pileup haplotypes are constructed by injecting pileup events into base haplotypes, why not compute the score at that point as follows: 1) make a single kmer containing the pileup event and any nearby events on the base haplotype 2) check whether this kmer is supported in the reads. If not, it's a spurious phasing. 3) If too many pileup haplotypes pass, repeat with larger k.

@davidbenjamin i'm going to open a ticket to re-visit this heuristic... i think the band-aid here is sufficient and necessary to get into the next release.

jamesemery

One last minute fix then i think this can be merged

jamesemery · 2023-08-24T16:35:49Z

...va/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/AssemblyBasedCallerUtils.java

+        // final ordering determined by string representation (for determinism).
+        return onlyNewHaplotypes.stream()
+                .filter(h -> scores.get(h) >= minimumPassingScore)
+                .sorted(Comparator.comparing(Haplotype::getBaseString))


Actually i think you need to change this... you need Comparitor.comparing(score).thenComparing (need to check the syntax on that) because now this is shuffling the score > minPassingScore haplotypes which is probably fine most of the time but could end up not selecting the top scorign hap.

jamesemery

Once you fix that subtlety in the comparator feel free to merge this branch.

gatk-bot · 2023-08-28T15:28:28Z

Github actions tests reported job failures from actions build 6000851764
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
variantcalling	17.0.6+10	6000851764.2	logs

* Limited the number of haplotypes returned in pileup mode to the number requested. * Modified pileup assembly code to be more deterministic.

jonn-smith added 5 commits August 18, 2023 11:03

Modified pileup assembly code to be more deterministic.

2e7d4c1

Added in some debug log statements.

a97af5b

Added debugging code - remember to remove this later.

2a7c987

Added another debug log statement.

3867584

Removed debug logging.

9089e75

jonn-smith requested a review from jamesemery August 21, 2023 21:37

jonn-smith assigned jamesemery Aug 21, 2023

Fixed tests so they respect the number of haplotypes to be returned.

5c9000e

Updated expected results for some pileup mode integration tests.

5febb02

jamesemery requested changes Aug 23, 2023

View reviewed changes

Addressing reviewer comments.

abf536f

jamesemery reviewed Aug 24, 2023

View reviewed changes

jamesemery approved these changes Aug 24, 2023

View reviewed changes

jamesemery mentioned this pull request Aug 24, 2023

ReVisit the PileupCaller haplotype filtering heuristics #8494

Open

Fixing minor bug in haplotype sorting.

3266315

Updating some integration tests again with new comparator results.

c092c5b

jonn-smith merged commit c9bf941 into master Aug 28, 2023
20 checks passed

jonn-smith deleted the jts_haplotypecaller_pileup_num_haplotypes_fix branch August 28, 2023 17:11

rickymagner pushed a commit that referenced this pull request Nov 28, 2023

Fixed a bug in pileup mode relating to number of haplotypes (#8489)

74afe13

* Limited the number of haplotypes returned in pileup mode to the number requested. * Modified pileup assembly code to be more deterministic.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed a bug in pileup mode relating to number of haplotypes #8489

Fixed a bug in pileup mode relating to number of haplotypes #8489

jonn-smith commented Aug 21, 2023

gatk-bot commented Aug 21, 2023 •

edited

Loading

gatk-bot commented Aug 21, 2023 •

edited

Loading

gatk-bot commented Aug 22, 2023

jamesemery left a comment

jamesemery Aug 23, 2023

davidbenjamin Aug 23, 2023

davidbenjamin Aug 23, 2023 •

edited

Loading

jamesemery Aug 23, 2023

davidbenjamin Aug 23, 2023

jamesemery Aug 24, 2023

davidbenjamin Aug 24, 2023

jamesemery left a comment

jamesemery Aug 24, 2023

jamesemery left a comment

gatk-bot commented Aug 28, 2023

Fixed a bug in pileup mode relating to number of haplotypes #8489

Fixed a bug in pileup mode relating to number of haplotypes #8489

Conversation

jonn-smith commented Aug 21, 2023

gatk-bot commented Aug 21, 2023 • edited Loading

gatk-bot commented Aug 21, 2023 • edited Loading

gatk-bot commented Aug 22, 2023

jamesemery left a comment

Choose a reason for hiding this comment

jamesemery Aug 23, 2023

Choose a reason for hiding this comment

davidbenjamin Aug 23, 2023

Choose a reason for hiding this comment

davidbenjamin Aug 23, 2023 • edited Loading

Choose a reason for hiding this comment

jamesemery Aug 23, 2023

Choose a reason for hiding this comment

davidbenjamin Aug 23, 2023

Choose a reason for hiding this comment

jamesemery Aug 24, 2023

Choose a reason for hiding this comment

davidbenjamin Aug 24, 2023

Choose a reason for hiding this comment

jamesemery left a comment

Choose a reason for hiding this comment

jamesemery Aug 24, 2023

Choose a reason for hiding this comment

jamesemery left a comment

Choose a reason for hiding this comment

gatk-bot commented Aug 28, 2023

gatk-bot commented Aug 21, 2023 •

edited

Loading

gatk-bot commented Aug 21, 2023 •

edited

Loading

davidbenjamin Aug 23, 2023 •

edited

Loading