Improve Scramble accuracy for BWA and Dragen 3.7.8 #722
Conversation
Looks good and clearly represents a lot of work, nice job. I had a few questions, especially about the way we search for nearby indels, mostly just to understand what went into all the choices. I was wondering if it might be worthwhile to package up some of the documentation / figures / slides you made while doing this, put them on github in an issue or wiki page, and link to it from the code in case people have questions about why things are implemented this way (or for future maintainers).
@@ -260,18 +261,21 @@ The following sections briefly describe each module and highlights inter-depende

## <a name="gather-sample-evidence">GatherSampleEvidence</a>
*Formerly Module00a*

Runs raw evidence collection on each sample with the following SV callers: [Manta](https://github.com/Illumina/manta), [Wham](https://github.com/zeeev/wham), and/or [MELT](https://melt.igs.umaryland.edu/). For guidance on pre-filtering prior to `GatherSampleEvidence`, refer to the [Sample Exclusion](#sample-exclusion) section.
Runs raw evidence collection on each sample with the following SV callers: [Manta](https://github.com/Illumina/manta), [Wham](https://github.com/zeeev/wham), [Scramble](https://github.com/GeneDx/scramble), and/or [MELT](https://melt.igs.umaryland.edu/). For guidance on pre-filtering prior to `GatherSampleEvidence`, refer to the [Sample Exclusion](#sample-exclusion) section.
I know you say not to above, but this list makes it sound like Scramble + MELT would be OK.
Maybe rephrase to:
Runs raw evidence collection on each sample with the following SV callers: [Manta](https://github.com/Illumina/manta), [Wham](https://github.com/zeeev/wham), and an MEI caller ([Scramble](https://github.com/GeneDx/scramble) or [MELT](https://melt.igs.umaryland.edu/)). For guidance on pre-filtering prior to `GatherSampleEvidence`, refer to the [Sample Exclusion](#sample-exclusion) section.
I noticed this too and am fixing it in a separate PR transferring this over to the website.
Note: a list of sample IDs must be provided. Refer to the [sample ID requirements](#sampleids) for specifications of allowable sample IDs. IDs that do not meet these requirements may cause errors.

#### Inputs:
* Per-sample BAM or CRAM files aligned to hg38. Index files (`.bai`) must be provided if using BAMs.

#### Outputs:
* Caller VCFs (Manta, MELT, and/or Wham)
* Caller VCFs (Manta, Scramble, MELT, and/or Wham)
Again, the "and/or" is a little confusing, I think. Maybe just provide the list of callers supported?
Same here
* Binned read counts file
* Split reads (SR) file
* Discordant read pairs (PE) file
* Scramble intermediate clusters file and table (not needed downstream)
"not needed downstream but useful for examining candidate sites when high sensitivity is required"?
Or something similar describing the main use of having the file as an output?
And here
@@ -51,6 +52,19 @@ RUN wget -q https://github.com/arq5x/bedtools2/releases/download/v$BEDTOOLS_VERS
mv bedtools.static /opt/bedtools/bin/bedtools && \
chmod a+x /opt/bedtools/bin/bedtools

# install bwa
# must do from source because of compiler error in latest release (see https://github.com/lh3/bwa/issues/387)
I looked up this issue and it looks like it might have been fixed in the new 0.7.18 release in the last few weeks (https://github.com/lh3/bwa/releases/tag/v0.7.18). Might be worth switching this back to installing a release rather than building from source, for simplicity and build-time's sake?
It doesn't look like a prebuilt binary got included with that release, unfortunately.
"ApplyManualVariantFilter.bcftools_filter": "SVTYPE==\"DEL\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"wham\"", | ||
"ApplyManualVariantFilter.filter_name": "filter_wham_only_del" | ||
"ApplyManualVariantFilter.bcftools_filter": "(SVTYPE==\"DEL\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"wham\") || (ALT==\"<INS:ME:SVA>\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"scramble\" && HIGH_SR_BACKGROUND==1)", | ||
"ApplyManualVariantFilter.filter_name": "hard_filter" |
I'd prefer something a little more descriptive but I'm not sure what the best name would be. What about something like "high_algorithm_fp_rate"?
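For anyone wanting to reproduce this hard filter outside the WDL, here is a minimal sketch assuming the expression is applied by excluding matching records with `bcftools view -e`; the actual `ApplyManualVariantFilter` task may wire this up differently, and the file names are placeholders.

```python
# Sketch: drop records matching the hard-filter expression with bcftools.
# Assumes bcftools is on PATH and the input VCF is indexed; paths are placeholders.
import subprocess

hard_filter = (
    '(SVTYPE=="DEL" && COUNT(ALGORITHMS)==1 && ALGORITHMS=="wham") || '
    '(ALT=="<INS:ME:SVA>" && COUNT(ALGORITHMS)==1 && ALGORITHMS=="scramble" && HIGH_SR_BACKGROUND==1)'
)

subprocess.run(
    ["bcftools", "view", "-e", hard_filter, "-O", "z", "-o", "filtered.vcf.gz", "input.vcf.gz"],
    check=True,
)
```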
header = make_header(reference_path=arguments.reference, sample_id=arguments.sample)
logging.info("Loading Scramble table...")
sorted_bed_lines = preprocess_scramble_table(path=arguments.table)
The method doesn't actually sort the records, do we need to?
Right, I think maybe they used to be sorted but the sorting is not necessary. Changed to just `bed_lines`.
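For context, a heavily hedged sketch of the kind of table-to-BED preprocessing under discussion, assuming a tab-separated Scramble table whose first column is a `chrom:pos` insertion coordinate (as in Scramble's `_MEIs.txt`); the fields parsed and the coordinate handling here are assumptions for illustration, not the script's actual logic.

```python
# Illustrative only: converts a Scramble-style table into unsorted BED-like
# tuples, assuming column 0 is "chrom:pos" and column 1 is the MEI family.
# The real preprocess_scramble_table() may parse different fields and handle
# coordinates differently.
def preprocess_scramble_table(path):
    bed_lines = []
    with open(path) as f:
        next(f)  # skip header line
        for line in f:
            fields = line.rstrip("\n").split("\t")
            chrom, pos = fields[0].split(":")
            mei_family = fields[1]
            # Assume pos is 1-based here; emit a single-base, half-open BED interval
            bed_lines.append((chrom, int(pos) - 1, int(pos), mei_family))
    return bed_lines
```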
ref_sequences_index_dict = {seq: i for i, seq in enumerate(f.references)}

# Buffer of active items within cluster_distance
buffer = list()
Might not make any practical difference in runtime, but `buffer`, `new_buffer`, and `data` feel like they might be more efficient as `deque` objects rather than lists, since they have insertions to the front and end of the lists.
Good suggestion, I've changed these to `deque`.
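For reference, a minimal sketch (not the pipeline's actual clustering code) of the buffered-scan pattern above, showing why `deque` helps when items are evicted from the front of the buffer as the scan moves along the contig; `nearby_groups` and its arguments are illustrative names.

```python
from collections import deque

def nearby_groups(positions, cluster_distance):
    """For each position (sorted ascending), yield the earlier positions
    still within cluster_distance. Illustrative only: the real script
    buffers richer records and merges them into clusters, but the
    eviction pattern is the same."""
    buffer = deque()
    for pos in positions:
        # Evict items that have fallen out of the window; deque.popleft()
        # is O(1), whereas list.pop(0) shifts the whole list.
        while buffer and pos - buffer[0] > cluster_distance:
            buffer.popleft()
        yield pos, list(buffer)
        buffer.append(pos)

# Two well-separated groups: 900 starts with an empty buffer again
for pos, nearby in nearby_groups([100, 130, 150, 900, 920], cluster_distance=300):
    print(pos, nearby)
```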
return False
start = max(record.pos - indel_window, 1)
end = record.pos + indel_window
num_indels = _get_num_indels(reads=samfile.fetch(record.chrom, start, end), min_reads=min_reads,
This will fetch any reads that overlap the window, so you'll be counting indels that could technically be more than `indel_window` bases away from `record.pos` if, e.g., a read only overlaps the window by a few bases. Is this actually what you want? Seems like you might want to only look at reads that actually overlap the position itself, right? Either way, it would be good to clarify in the comments or argument descriptions what is really happening.
Yeah, this is another case where repetitive contexts dominated some choices, but they are a substantial source of false positives. I found that looking at just the position was too stringent.
It's a good point that the search extends past the window, since we pull in reads that overlap the window and count all indels, so I will clarify this in the comment.
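To make the behavior concrete, here is a minimal sketch of the window-based indel check under discussion, assuming pysam and an indexed BAM/CRAM; `count_nearby_indels` and `min_indel_size` are illustrative names, not the script's actual identifiers.

```python
import pysam

# CIGAR operation codes for insertion and deletion
CIGAR_INS, CIGAR_DEL = 1, 2

def count_nearby_indels(samfile, chrom, record_pos, indel_window, min_indel_size=1):
    """Count indel operations in reads overlapping [record_pos - indel_window,
    record_pos + indel_window]. Because fetch() returns any read that overlaps
    the window, counted indels can lie more than indel_window bases from
    record_pos when a read only overlaps the window's edge."""
    start = max(record_pos - indel_window, 1)
    end = record_pos + indel_window
    num_indels = 0
    for read in samfile.fetch(chrom, start, end):
        if read.is_unmapped or read.cigartuples is None:
            continue
        # Count I/D operations anywhere in the read, i.e. possibly outside the window
        num_indels += sum(1 for op, length in read.cigartuples
                          if op in (CIGAR_INS, CIGAR_DEL) and length >= min_indel_size)
    return num_indels

# Usage (paths and coordinates are placeholders):
# with pysam.AlignmentFile("sample.bam") as samfile:
#     n = count_nearby_indels(samfile, "chr1", 1_000_000, indel_window=50)
```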
}
}

task RealignSoftClippedReads {
This task is pretty tied to Scramble, maybe move it to `Scramble.wdl`?
I tried moving it in, but it adds a lot of WDL code. One thing that's kind of awkward is having to pass in the scramble_table from the previous step. I think there's less code this way, and it avoids adding a lot of mutually exclusive inputs to the Scramble wdl itself (all the bwa files, etc.).
# DRAGEN-3.7.8: 60
Int alignment_score_cutoff

Float min_clipped_reads_fraction = 0.22
What went into choosing this parameter? It seems pretty important. Is it worth considering an alternate approach of setting it lower and then applying a filter to calls that don't meet this threshold, so that they'll be there in the VCF for cohorts where high sensitivity is desired?
I agree this would be a useful feature. The only issue with lowering the threshold is cost: a lower threshold can bring in a lot more clusters, which will drive up run time proportionally for all users. I think it would be a reasonable compromise to lower the threshold fed to Scramble a tad (say to 0.15) so that the variants at least appear in the Scramble table, and then filter at 0.22 for the VCF. I don't think this is critical, but I'm going to open a ticket for it.
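A minimal sketch of the kind of coverage-scaled threshold being discussed, i.e. requiring roughly a fraction (0.22) of the mean per-base coverage as clipped reads, with an absolute floor; the function name, floor value, and rounding are assumptions for illustration, not the WDL's exact logic.

```python
# Illustrative only: the pipeline's actual coverage adjustment may differ.
def scramble_min_clipped_reads(mean_coverage: float,
                               min_clipped_reads_fraction: float = 0.22,
                               floor: int = 3) -> int:
    """Return the minimum clipped-read count for a Scramble cluster,
    scaled by coverage but never below an absolute floor (assumed here)."""
    return max(floor, round(min_clipped_reads_fraction * mean_coverage))

# e.g. ~30x coverage -> 7 clipped reads required; ~10x -> floor of 3
print(scramble_min_clipped_reads(30.0), scramble_min_clipped_reads(10.0))
```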
Looks good, thanks for addressing my comments.
Updates methods for Scramble calling. This includes several critical components:

* `GatherSampleEvidence` now automatically detects whether Dragen 3.7.8 was used from the CRAM header (or if the `is_dragen_3_7_8` flag is set explicitly). If so, it realigns all soft-clipped reads near Scramble calls with BWA and re-runs Scramble on the realignment. The cost of this is very small and it eliminates this artifact.
* An update to `ResolveComplexVariants` to prioritize Manta over Scramble. A major source of Scramble false positives was at deletion sites. This filter catches such cases missed in the raw call filter from part (1).
* A hard filter in `ApplyManualVariantFilter` to remove Scramble-only SVAs flagged as `HIGH_SR_BACKGROUND`. This raised SVA precision from ~0.65 to ~0.85 in our tests, with a tolerably small loss of sensitivity. In the future, improvements and/or retraining of the GQ Recalibrator model may obviate the need for this hard filter.

This PR also includes changes to the README to recommend Scramble for MEI calling and deprecate MELT support.

These changes were extensively tested and validated on CRAMs aligned with both BWA and Dragen 3.7.8, including a full run of 1000 Genomes Phase 3 on 3.7.8.
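As a rough illustration of the header-based detection described above, here is a sketch using pysam to look for a DRAGEN 3.7.8 `@PG` record in the CRAM header; the exact matching rules used by `GatherSampleEvidence` may differ, and the paths are placeholders.

```python
import pysam

def is_dragen_3_7_8(alignment_path: str, reference_path: str) -> bool:
    """Return True if any @PG header record looks like DRAGEN version 3.7.8.

    Illustrative only: checks the program name/command line for 'dragen' and
    a version string starting with '3.7.8'."""
    with pysam.AlignmentFile(alignment_path, reference_filename=reference_path) as f:
        for pg in f.header.to_dict().get("PG", []):
            name = (pg.get("PN", "") + " " + pg.get("CL", "")).lower()
            version = pg.get("VN", "")
            if "dragen" in name and version.startswith("3.7.8"):
                return True
    return False

# Usage (paths are placeholders):
# print(is_dragen_3_7_8("sample.cram", "Homo_sapiens_assembly38.fasta"))
```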