
Improve Scramble accuracy for BWA and Dragen 3.7.8 #722

Merged: 1 commit into main, Oct 25, 2024

Conversation

mwalker174 (Collaborator):

Updates methods for Scramble calling. This includes several critical components:

  1. Filtering of calls within reference MEIs that overlap indels and small deletion SVs. These were a major source of false positives for Scramble on WGS data. The filter identifies such cases from the BAM/CRAM and from Manta calls.
  2. Improved SVLEN estimation. The previous method was overly simplistic; length estimation now accounts for the orientation of soft clips and the alignment strand. Note that this method's accuracy is greatly limited in cases where split reads are found on only one side of the breakpoint.
  3. Deduplication. Scramble often emits duplicate calls; calls within a short window are now collapsed appropriately (a minimal sketch appears after the testing note below).
  4. Dragen 3.7.8 realignment. An apparent regression was introduced somewhere after Dragen 3.4.12, and present as of 3.7.8, in which small indels cause erroneous soft-clipping. This caused a massive number of false positive Scramble calls. GatherSampleEvidence now automatically detects whether Dragen 3.7.8 was used by inspecting the CRAM header (or the is_dragen_3_7_8 flag can be set explicitly). If so, it realigns all soft-clipped reads near Scramble calls using BWA and re-runs Scramble on the realigned reads (see the sketch immediately following this list). The cost of this step is very small, and it eliminates the artifact.
  5. Updates the overlapping-breakpoint filter in ResolveComplexVariants to prioritize Manta over Scramble. A major source of Scramble false positives was at deletion sites; this filter catches cases missed by the raw call filter from part (1).
  6. SVA precision was initially low even after implementation of the above filters. A hard filter has been added to ApplyManualVariantFilter to remove Scramble-only SVAs flagged as HIGH_SR_BACKGROUND. This raised SVA precision from ~0.65 to ~0.85 in our tests, with a tolerably small loss of sensitivity. In the future, improvements to and/or retraining of the GQ Recalibrator model may obviate the need for this hard filter.
  7. Filters secondary/supplementary reads from consideration in the Scramble tool. These were introducing further false positives. The Dockerfile has been updated accordingly.
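For illustration, a minimal sketch (assuming pysam and a hypothetical CRAM path) of how an aligner version like Dragen 3.7.8 might be detected from the CRAM header's @PG records; the pipeline's actual detection logic may differ:

```python
import pysam

def is_dragen_3_7_8(cram_path: str) -> bool:
    """Return True if any @PG header record reports DRAGEN version 3.7.8."""
    with pysam.AlignmentFile(cram_path, "rc") as f:
        for pg in f.header.to_dict().get("PG", []):
            program = (pg.get("PN", "") + " " + pg.get("CL", "")).lower()
            if "dragen" in program and pg.get("VN", "").startswith("3.7.8"):
                return True
    return False

# Hypothetical usage:
# print(is_dragen_3_7_8("sample.cram"))
```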

This PR also includes changes to the README to recommend Scramble for MEI calling and deprecate MELT support.

These changes were extensively tested and validated on CRAMs aligned with both BWA and Dragen 3.7.8, including a full run of 1000 Genomes Phase 3 on 3.7.8.
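As referenced in item 3 above, a minimal sketch of window-based deduplication; the tuple layout and support field here are hypothetical, not Scramble's actual record format:

```python
def dedup_calls(calls, window=50):
    """Collapse same-type calls within `window` bp, keeping the call with the
    most supporting clipped reads. `calls` must be coordinate-sorted tuples of
    (chrom, pos, mei_type, n_clipped_reads)."""
    kept = []
    for call in calls:
        chrom, pos, mei_type, support = call
        if kept:
            k_chrom, k_pos, k_type, k_support = kept[-1]
            if chrom == k_chrom and mei_type == k_type and pos - k_pos <= window:
                if support > k_support:
                    kept[-1] = call  # duplicate cluster: keep the better-supported call
                continue
        kept.append(call)
    return kept

calls = [("chr1", 1000, "ALU", 5), ("chr1", 1020, "ALU", 9), ("chr1", 5000, "L1", 4)]
print(dedup_calls(calls))  # [('chr1', 1020, 'ALU', 9), ('chr1', 5000, 'L1', 4)]
```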

cwhelan (Member) left a comment:

Looks good and clearly represents a lot of work, nice job. I had a few questions, especially about the way we search for nearby indels, mostly just to understand what went into the choices. I was wondering if it might be worthwhile to package up some of the documentation/figures/slides you made while doing this, put them on GitHub in an issue or wiki page, and link to them from the code in case people have questions about why things are implemented this way (or for future maintainers).

@@ -260,18 +261,21 @@ The following sections briefly describe each module and highlights inter-depende
## <a name="gather-sample-evidence">GatherSampleEvidence</a>
*Formerly Module00a*

- Runs raw evidence collection on each sample with the following SV callers: [Manta](https://github.com/Illumina/manta), [Wham](https://github.com/zeeev/wham), and/or [MELT](https://melt.igs.umaryland.edu/). For guidance on pre-filtering prior to `GatherSampleEvidence`, refer to the [Sample Exclusion](#sample-exclusion) section.
+ Runs raw evidence collection on each sample with the following SV callers: [Manta](https://github.com/Illumina/manta), [Wham](https://github.com/zeeev/wham), [Scramble](https://github.com/GeneDx/scramble), and/or [MELT](https://melt.igs.umaryland.edu/). For guidance on pre-filtering prior to `GatherSampleEvidence`, refer to the [Sample Exclusion](#sample-exclusion) section.
cwhelan (Member):

I know you say not to above, but this list makes it sound like Scramble + MELT would be OK.

Maybe rephrase to:

Runs raw evidence collection on each sample with the following SV callers: [Manta](https://github.com/Illumina/manta), [Wham](https://github.com/zeeev/wham), and an MEI caller ([Scramble](https://github.com/GeneDx/scramble) or [MELT](https://melt.igs.umaryland.edu/)). For guidance on pre-filtering prior to `GatherSampleEvidence`, refer to the [Sample Exclusion](#sample-exclusion) section.

mwalker174 (Collaborator, Author):

I noticed this too and am fixing it in a separate PR that transfers this documentation over to the website.


Note: a list of sample IDs must be provided. Refer to the [sample ID requirements](#sampleids) for specifications of allowable sample IDs. IDs that do not meet these requirements may cause errors.

#### Inputs:
* Per-sample BAM or CRAM files aligned to hg38. Index files (`.bai`) must be provided if using BAMs.

#### Outputs:
- * Caller VCFs (Manta, MELT, and/or Wham)
+ * Caller VCFs (Manta, Scramble, MELT, and/or Wham)
cwhelan (Member):

Again, the "and/or" is a little confusing, I think. Maybe just provide the list of supported callers?

mwalker174 (Collaborator, Author):

Same here

* Binned read counts file
* Split reads (SR) file
* Discordant read pairs (PE) file
* Scramble intermediate clusters file and table (not needed downstream)
cwhelan (Member):

"not needed downstream but useful for examining candidate sites when high sensitivity is required"?

Or something similar describing the main use of having the file as an output?

mwalker174 (Collaborator, Author):

And here

@@ -51,6 +52,19 @@ RUN wget -q https://github.com/arq5x/bedtools2/releases/download/v$BEDTOOLS_VERS
mv bedtools.static /opt/bedtools/bin/bedtools && \
chmod a+x /opt/bedtools/bin/bedtools

# install bwa
# must do from source because of compiler error in latest release (see https://github.com/lh3/bwa/issues/387)
cwhelan (Member):

I looked up this issue, and it looks like it might have been fixed in a new release, 0.7.18, in the last few weeks (https://github.com/lh3/bwa/releases/tag/v0.7.18). Might it be worth switching this back to installing a release rather than building from source, for simplicity and build-time's sake?

mwalker174 (Collaborator, Author):

It doesn't look like a build got included with that release, unfortunately.

"ApplyManualVariantFilter.bcftools_filter": "SVTYPE==\"DEL\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"wham\"",
"ApplyManualVariantFilter.filter_name": "filter_wham_only_del"
"ApplyManualVariantFilter.bcftools_filter": "(SVTYPE==\"DEL\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"wham\") || (ALT==\"<INS:ME:SVA>\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"scramble\" && HIGH_SR_BACKGROUND==1)",
"ApplyManualVariantFilter.filter_name": "hard_filter"
cwhelan (Member):

I'd prefer something a little more descriptive but I'm not sure what the best name would be. What about something like "high_algorithm_fp_rate"?


header = make_header(reference_path=arguments.reference, sample_id=arguments.sample)
logging.info("Loading Scramble table...")
sorted_bed_lines = preprocess_scramble_table(path=arguments.table)
cwhelan (Member):

The method doesn't actually sort the records; do we need to?

mwalker174 (Collaborator, Author):

Right, I think maybe they used to be sorted, but the sorting is not necessary. Changed to just `bed_lines`.

ref_sequences_index_dict = {seq: i for i, seq in enumerate(f.references)}

# Buffer of active items within cluster_distance
buffer = list()
cwhelan (Member):

Might not make any practical difference in runtime, but `buffer`, `new_buffer`, and `data` feel like they might be more efficient as deque objects rather than lists, since they have insertions at the front and end of the lists.

mwalker174 (Collaborator, Author):

Good suggestion, I've changed these to deque
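For context, a minimal sketch of the deque pattern being suggested: evicting expired items from the front is O(1) with `collections.deque`, versus O(n) with `list.pop(0)`. Names and the clustering criterion here are illustrative:

```python
from collections import deque

def count_active_neighbors(positions, cluster_distance=300):
    """For each coordinate-sorted position, count prior positions within
    cluster_distance, evicting expired entries from the deque's front."""
    buffer = deque()
    counts = []
    for pos in positions:
        while buffer and pos - buffer[0] > cluster_distance:
            buffer.popleft()  # O(1) eviction from the front
        counts.append(len(buffer))
        buffer.append(pos)
    return counts

print(count_active_neighbors([100, 150, 900, 950, 5000]))  # [0, 1, 0, 1, 0]
```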

return False
start = max(record.pos - indel_window, 1)
end = record.pos + indel_window
num_indels = _get_num_indels(reads=samfile.fetch(record.chrom, start, end), min_reads=min_reads,
cwhelan (Member):

This will fetch any reads that overlap the window, so you'll be counting indels that could technically be more than indel_window bases away from record.pos if, e.g., a read only overlaps the window by a few bases.

Is this actually what you want? It seems like you might want to look only at reads that actually overlap the position itself, right?

Either way, it would be good to clarify in the comments or argument descriptions what is really happening.

mwalker174 (Collaborator, Author):

Yeah, this is another case where repetitive contexts dominated some choices, but they are a substantial source of false positives. I found that looking at just the position was too stringent.

It's a good point that the search extends past the window, since we pull in reads that overlap the window and count all of their indels, so I will clarify this in the comment.
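To make the fetch-window behavior concrete, a hedged sketch of counting indel-carrying reads around a call with pysam; function and parameter names are illustrative, not the actual script's:

```python
import pysam

INDEL_OPS = {1, 2}  # pysam CIGAR codes: 1 = insertion (I), 2 = deletion (D)

def count_indel_reads(samfile, chrom, pos, indel_window=50, min_indel_size=1):
    """Count reads with an indel among all reads overlapping
    [pos - indel_window, pos + indel_window]. Note that fetch() returns any
    read overlapping the window, so counted indels may fall just outside it."""
    start = max(pos - indel_window, 1)
    end = pos + indel_window
    n = 0
    for read in samfile.fetch(chrom, start, end):
        if read.cigartuples and any(op in INDEL_OPS and length >= min_indel_size
                                    for op, length in read.cigartuples):
            n += 1
    return n

with pysam.AlignmentFile("sample.bam") as f:  # hypothetical input
    print(count_indel_reads(f, "chr1", 123456))
```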

}
}

task RealignSoftClippedReads {
cwhelan (Member):

This task is pretty tied to Scramble; maybe move it to Scramble.wdl?

mwalker174 (Collaborator, Author):

I tried moving it in, but it adds a lot of WDL code. One thing that's kind of awkward is having to pass in the scramble_table from the previous step. I think there's less code this way, and it avoids adding a lot of mutually exclusive inputs to the Scramble WDL itself (all the BWA files, etc.).

# DRAGEN-3.7.8: 60
Int alignment_score_cutoff

Float min_clipped_reads_fraction = 0.22
cwhelan (Member):

What went into choosing this parameter? It seems pretty important. Is it worth considering an alternate approach of setting it lower and then applying a filter to calls that don't meet this threshold, so that they'll still be present in the VCF for cohorts where high sensitivity is desired?

mwalker174 (Collaborator, Author):

I agree this would be a useful feature. The only issue with lowering the threshold is cost: a lower threshold can bring in a lot more clusters, which will drive up run time proportionally for all users. I think it would be a reasonable compromise to lower the threshold fed to Scramble a tad (say, to 0.15) so that the variants at least appear in the Scramble table, and then filter at 0.22 for the VCF. I don't think this is critical, but I'm going to open a ticket for it.

Squashed commit messages:

Add scramble filter

Realignment

Add bwa to mini docker

Build bwa from git

Use /opt/bin

Start in /opt

Runtime parameters

Fix alignment command

bump memory

Make scramble vcf python script

Scramble make vcf task added

Fix realignment bai path

Missing sv_pipeline_docker input

Fix scramble table output

Buffered clustering

Fix reverses

Use samtools view -M; realign from Scramble table instead of vcf

Fix interval generation and only realign soft-clipped reads

Slop 1 base

Slop 150 bases

And merge

Change scramble priority in overlap bp filter; update dockers

Reduce realignment cores to 4

Sort scramble table

Fix wdl

Update json templates

Bump realignment memory to 12gb

Update script to filter on indels

Indel filtering switch

Update

Add MEI intervals and manta/wham filtering

Delete comments

Optimize mei tree loading; consume table directly

Add scramble_min_clipped_reads

Fix reads_index_

Update scramble vcf script args

Add missing vcf script input in GatherSampleEvidence

LINE1 to L1 alt

Give preference to MEI over Manta insertions in breakpoint overlap filter

Update dockers with last build

Update dockers

Reduce memory usage in RealignSoftClippedReads

Bump realignment memory

Add manta_vcf_input in case manta isn't run but scramble is

Increase min_clipped_reads to 5

Bump memory

Update scramble commit

Bump MakeScrambleVcf memory to 4

Reduce min reads to 3

Add mei tree padding and alignment_score_cutoff

Select alignment score cutoff based on aligner

Update dockers

Remove debugging lines

Add scramble_mei_bed

Add scramble_mei_bed to test template

Bump ScramblePart1 to 3 GB

Adjust scramble clipped reads threshold based on coverage

Top level scramble_min_clipped_reads_fraction optional

Update dockers

Add mw_scramble_filter to ResolveComplexVariants in dockstore.yml

Update dockers

Update dockstore yml

Final touches

Update templates; add SVA filter

Revert unnecessary changes

Lint

Update readme

Update single sample pipeline and gathersampleevidence batch

Update readme

Readme grammar

Scramble intermediates description in README

Add comments to scramble vcf script

Minor fix

Fix GatherSampleEvidence optional comparison

Fix double declaration of mei_bed in GATKSVPipelineSingleSample

Update Scramble commit to latest master

Update ref panel vcfs

Reviewer comments
cwhelan (Member) left a comment:

Looks good, thanks for addressing my comments.

mwalker174 merged commit 5ea22a5 into main on Oct 25, 2024.
9 checks passed.
mwalker174 deleted the mw_scramble_filter branch on October 25, 2024 at 15:41.