Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed a bug in pileup mode relating to number of haplotypes #8489

Merged
merged 10 commits into from
Aug 28, 2023

Conversation

jonn-smith
Copy link
Collaborator

Recent refactoring seems to have introduced a bug in pileup mode that failed to enforce the limit on the number of haplotypes to be considered. With this patch:

  • HaplotypeCaller once again respects the limit on haplotypes before genotyping.
  • Changed some HashSets to LinkedHashSets to preserve determinism.

@gatk-bot
Copy link

gatk-bot commented Aug 21, 2023

Github actions tests reported job failures from actions build 5931710172
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 17.0.6+10 5931710172.12 logs
unit 17.0.6+10 5931710172.1 logs
variantcalling 17.0.6+10 5931710172.2 logs

@gatk-bot
Copy link

gatk-bot commented Aug 21, 2023

Github actions tests reported job failures from actions build 5931724901
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 17.0.6+10 5931724901.12 logs
unit 17.0.6+10 5931724901.1 logs
variantcalling 17.0.6+10 5931724901.2 logs

@gatk-bot
Copy link

Github actions tests reported job failures from actions build 5933685416
Failures in the following jobs:

Test Type JDK Job ID Logs
variantcalling 17.0.6+10 5933685416.2 logs

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor change and a request to update the unit tests to reflect the new reality a little better.

@davidbenjamin This is a reversion to a fix you made recently fixing the non-determninism... it turns out this was in some edge cases producing assembly windows with 50k+ haplotypes (because we were taking everything that ties at the 5 ranked score without limit...) Now we are arbitrarily breaking based on string comparision which at least prevents us from exploding...

Do you have any ideas for an alternative tie-breaking scheme for haploytpes? It seems these ties were happening much more than expected (probably because most haplotypes end up with (length - k) as their scores because there kmers appear at least somewhere... This also would seem to bias us to mostly select the longest haplotypes in every case for inserting events....

20 10099055 . T C 47.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.460;DP=19;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=55.52;MQRankSum=-3.419;QD=2.65;ReadPosRankSum=-0.830;SOR=0.330 GT:AD:DP:GQ:PL 0/1:15,3:18:55:55,0,557
20 10099079 . C T 40.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.694;DP=23;ExcessHet=0.0000;FS=2.192;MLEAC=1;MLEAF=0.500;MQ=56.33;MQRankSum=-3.254;QD=1.77;ReadPosRankSum=-0.568;SOR=1.460 GT:AD:DP:GQ:PL 0/1:19,4:23:48:48,0,551
20 10099140 . G T 583.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.176;DP=49;ExcessHet=0.0000;FS=6.794;MLEAC=1;MLEAF=0.500;MQ=60.07;MQRankSum=1.344;QD=12.42;ReadPosRankSum=-0.466;SOR=1.944 GT:AD:DP:GQ:PL 0/1:28,19:47:99:591,0,932
20 10099046 . T C 31.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.842;DP=17;ExcessHet=0.0000;FS=3.701;MLEAC=1;MLEAF=0.500;MQ=56.07;MQRankSum=-3.254;QD=1.86;ReadPosRankSum=-0.597;SOR=0.090 GT:AD:DP:GQ:PL 0/1:15,2:17:39:39,0,564
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these diferences are mostly noise to do with exactly what haplotypes come out... it is interesting the ties at the 5th score level seem to be very common... We really need to replace this heuristic with something not quite so terrible... @davidbenjamin

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery It seems to be that the intent here is to corral the combinatorial explosion of paths in the de Bruijn graph that don't correspond to actual phasing. However, by using the same kmer size as in assembly the heuristic is completely toothless! It's basically a tautology that all the kmers of an assembled haplotype show up in some read, since otherwise the kmer wouldn't exist in the graph.

A better approach would be to use the same heuristic but with larger kmers than used in assembly. For example, using k = 30 would get rid of spurious phasing within 30 bases.

Also, I don't think that using a larger k is merely a tie-breaker for the extant approach. Rather, what we have now should be replaced completely.

I don't think I am done pondering this but that's a good start.

Copy link
Contributor

@davidbenjamin davidbenjamin Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One might also ask whether an excess of haplotypes means that instead of filtering we should just redo assembly with a larger k, or dynamically switch to linked de Bruijn graphs. That could well be.

Of course, in pileup mode that's moot.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like a good observation... It would also make it harder for every haplotype to be maximally represented by reads! I think it might also be worth considering normalizing to the lenght of the haplotype as well to prevent ourselves from being biased towards longer haplotypes that have incidental (and distant) insertions being over-represented in the choices here... alternatively we could flip the score on its head... where the score is the absolute number of missing kmers from the reads, which should more or less be capped at ~(2k -1) but not have this gross property that it prefers longer hapltotypes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that counting non-missing kmers and thereby rewarding length is wrong.

Since pileup haplotypes are constructed by injecting pileup events into base haplotypes, why not compute the score at that point as follows: 1) make a single kmer containing the pileup event and any nearby events on the base haplotype 2) check whether this kmer is supported in the reads. If not, it's a spurious phasing. 3) If too many pileup haplotypes pass, repeat with larger k.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidbenjamin i'm going to open a ticket to re-visit this heuristic... i think the band-aid here is sufficient and necessary to get into the next release.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last minute fix then i think this can be merged

// final ordering determined by string representation (for determinism).
return onlyNewHaplotypes.stream()
.filter(h -> scores.get(h) >= minimumPassingScore)
.sorted(Comparator.comparing(Haplotype::getBaseString))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually i think you need to change this... you need Comparitor.comparing(score).thenComparing (need to check the syntax on that) because now this is shuffling the score > minPassingScore haplotypes which is probably fine most of the time but could end up not selecting the top scorign hap.

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once you fix that subtlety in the comparator feel free to merge this branch.

@gatk-bot
Copy link

Github actions tests reported job failures from actions build 6000851764
Failures in the following jobs:

Test Type JDK Job ID Logs
variantcalling 17.0.6+10 6000851764.2 logs

@jonn-smith jonn-smith merged commit c9bf941 into master Aug 28, 2023
20 checks passed
@jonn-smith jonn-smith deleted the jts_haplotypecaller_pileup_num_haplotypes_fix branch August 28, 2023 17:11
rickymagner pushed a commit that referenced this pull request Nov 28, 2023
* Limited the number of haplotypes returned in pileup mode to the number requested.
* Modified pileup assembly code to be more deterministic.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants