Improve Scramble accuracy for BWA and Dragen 3.7.8 #722

Merged
merged 1 commit into from
Oct 25, 2024
18 changes: 11 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,6 @@ A structural variation discovery pipeline for Illumina short-read whole-genome s
* A workflow execution system supporting the [Workflow Description Language](https://openwdl.org/) (WDL), either:
  * [Cromwell](https://github.com/broadinstitute/cromwell) (v36 or higher). A dedicated server is highly recommended.
  * or [Terra](https://terra.bio/) (note preconfigured GATK-SV workflows are not yet available for this platform)
* Recommended: [MELT](https://melt.igs.umaryland.edu/). Due to licensing restrictions, we cannot provide a public docker image or reference panel VCFs for this algorithm.
* Recommended: [cromshell](https://github.com/broadinstitute/cromshell) for interacting with a dedicated Cromwell server.
* Recommended: [WOMtool](https://cromwell.readthedocs.io/en/stable/WOMtool/) for validating WDL/json files.

Expand Down Expand Up @@ -122,16 +121,18 @@ There are two scripts for running the full pipeline:

#### Building inputs
Example workflow inputs can be found in `/inputs`. Build using `scripts/inputs/build_default_inputs.sh`, which
generates input jsons in `/inputs/build`. Except the MELT docker image, all required resources are available in public
generates input jsons in `/inputs/build`. All required resources are available in public
Google buckets.

#### MELT
**Important**: The example input files contain MELT inputs that are NOT public (see [Requirements](#requirements)). These include:
**Important**: MELT has been replaced with [Scramble](https://github.com/GeneDx/scramble) for mobile element calling. While it is still possible to run GATK-SV with MELT, we no longer support it as a caller. It will be fully deprecated in the future.

Due to licensing restrictions, we cannot redistribute MELT binaries or input files, including the docker image. Some default input files contain MELT inputs that are NOT public (see [Requirements](#requirements)) including:

* `GATKSVPipelineSingleSample.melt_docker` and `GATKSVPipelineBatch.melt_docker` - MELT docker URI (see [Docker readme](https://github.com/talkowski-lab/gatk-sv-v1/blob/master/dockerfiles/README.md))
* `GATKSVPipelineSingleSample.ref_std_melt_vcfs` - Standardized MELT VCFs ([GatherBatchEvidence](#gather-batch-evidence))

The input values are provided only as an example and are not publicly accessible. In order to include MELT, these values must be provided by the user. MELT can be disabled by deleting these inputs and setting `GATKSVPipelineBatch.use_melt` to `false`.
The input values are provided only as placeholders. In some workflows, MELT must be enabled explicitly, by providing the optional MELT inputs and/or by setting an option such as `GATKSVPipelineBatch.use_melt` to `true`. We do not recommend running both Scramble and MELT together.
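
The recommendation against running both MEI callers can be checked before a workflow is submitted. The following is a minimal sketch, not part of GATK-SV: the helper names `_flag` and `check_mei_callers` are hypothetical, and it assumes the inputs JSON has already been loaded into a dict whose flags are either booleans or the strings `"true"`/`"false"`, as in the test inputs touched by this PR. Flags that are absent are skipped, because the workflow defaults are not shown in this diff.

```python
def _flag(inputs, key):
    """Return True/False for a flag that is set, or None when the key is absent."""
    if key not in inputs:
        return None
    value = inputs[key]
    if isinstance(value, bool):
        return value
    # The example inputs store flags as strings such as "true" / "false".
    return str(value).strip().lower() == "true"


def check_mei_callers(inputs, workflow="GATKSVPipelineBatch"):
    """Return warnings about the MEI caller flags in an inputs dict.

    Hypothetical helper for illustration; only explicitly set flags are checked.
    """
    use_melt = _flag(inputs, f"{workflow}.use_melt")
    use_scramble = _flag(inputs, f"{workflow}.use_scramble")
    warnings = []
    if use_melt and use_scramble:
        warnings.append("MELT and Scramble are both enabled; running them together is not recommended.")
    if use_melt and not inputs.get(f"{workflow}.melt_docker"):
        warnings.append("MELT is enabled but no melt_docker input was provided.")
    return warnings
```

An empty list means no problem was detected among the flags that were explicitly set.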

#### Execution
We recommend running the pipeline on a dedicated [Cromwell](https://github.com/broadinstitute/cromwell) server with a [cromshell](https://github.com/broadinstitute/cromshell) client. A batch run can be started with the following commands:
Expand All @@ -151,7 +152,7 @@ where `cromwell_config.json` is a Cromwell [workflow options file](https://cromw

## <a name="overview">Pipeline Overview</a>
The pipeline consists of a series of modules that perform the following:
* [GatherSampleEvidence](#gather-sample-evidence): SV evidence collection, including calls from a configurable set of algorithms (Manta, MELT, and Wham), read depth (RD), split read positions (SR), and discordant pair positions (PE).
* [GatherSampleEvidence](#gather-sample-evidence): SV evidence collection, including calls from a configurable set of algorithms (Manta, Scramble, and Wham), read depth (RD), split read positions (SR), and discordant pair positions (PE).
* [EvidenceQC](#evidence-qc): Dosage bias scoring and ploidy estimation
* [GatherBatchEvidence](#gather-batch-evidence): Copy number variant calling using cn.MOPS and GATK gCNV; B-allele frequency (BAF) generation; call and evidence aggregation
* [ClusterBatch](#cluster-batch): Variant clustering
Expand Down Expand Up @@ -249,18 +250,21 @@ The following sections briefly describe each module and highlights inter-depende
## <a name="gather-sample-evidence">GatherSampleEvidence</a>
*Formerly Module00a*

Runs raw evidence collection on each sample with the following SV callers: [Manta](https://github.com/Illumina/manta), [Wham](https://github.com/zeeev/wham), and/or [MELT](https://melt.igs.umaryland.edu/). For guidance on pre-filtering prior to `GatherSampleEvidence`, refer to the [Sample Exclusion](#sample-exclusion) section.
Runs raw evidence collection on each sample with the following SV callers: [Manta](https://github.com/Illumina/manta), [Wham](https://github.com/zeeev/wham), [Scramble](https://github.com/GeneDx/scramble), and/or [MELT](https://melt.igs.umaryland.edu/). For guidance on pre-filtering prior to `GatherSampleEvidence`, refer to the [Sample Exclusion](#sample-exclusion) section.
Member

I know you say not to above but this list makes it sound like Scramble + MELT would be OK.

Maybe rephrase to,

Runs raw evidence collection on each sample with the following SV callers: [Manta](https://github.com/Illumina/manta), [Wham](https://github.com/zeeev/wham), and an MEI caller ([Scramble](https://github.com/GeneDx/scramble) or [MELT](https://melt.igs.umaryland.edu/)). For guidance on pre-filtering prior to `GatherSampleEvidence`, refer to the [Sample Exclusion](#sample-exclusion) section.

Collaborator Author

I noticed this too and am fixing it in a separate PR transferring this over to the website


The `scramble_clusters` and `scramble_table` files are generated as outputs for troubleshooting purposes but are not consumed by any downstream workflows.

Note: a list of sample IDs must be provided. Refer to the [sample ID requirements](#sampleids) for specifications of allowable sample IDs. IDs that do not meet these requirements may cause errors.

#### Inputs:
* Per-sample BAM or CRAM files aligned to hg38. Index files (`.bai`) must be provided if using BAMs.

#### Outputs:
* Caller VCFs (Manta, MELT, and/or Wham)
* Caller VCFs (Manta, Scramble, MELT, and/or Wham)
Member

Again the and/or is a little confusing I think. Maybe just provide the list of callers supported?

Collaborator Author

Same here

* Binned read counts file
* Split reads (SR) file
* Discordant read pairs (PE) file
* Scramble intermediate clusters file and table (not needed downstream)
Member

"not needed downstream but useful for examining candidate sites when high sensitivity is required"?

Or something similar describing the main use of having the file as an output?

Collaborator Author

And here


## <a name="evidence-qc">EvidenceQC</a>
*Formerly Module00b*
Expand Down
2 changes: 1 addition & 1 deletion dockerfiles/scramble/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@ RUN mkdir -p /opt && cd /opt && \
ENV LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

# install scramble
ARG SCRAMBLE_COMMIT="f320d604ac030e4a7fa96b0663bcae02994c7d94"
ARG SCRAMBLE_COMMIT="56b5ae849d16ec1fc83ea1426b0ffc356ee6d99c"
RUN mkdir /app && cd /app \
&& git clone https://github.com/mwalker174/scramble-gatk-sv.git \
&& cd scramble-gatk-sv \
Expand Down
17 changes: 16 additions & 1 deletion dockerfiles/sv-base-mini/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ ARG UBUNTU_RELEASE="22.04"
ARG HTSLIB_VERSION="1.15.1"
ARG BEDTOOLS_VERSION="2.31.0"
ARG VCFTOOLS_VERSION="0.1.16"
ARG BWA_COMMIT="139f68fc4c3747813783a488aef2adc86626b01b"

ARG APT_REQUIRED_PACKAGES="/opt/apt-required-packages.list"

Expand All @@ -14,7 +15,7 @@ ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get -qqy update --fix-missing && \
apt-get -qqy dist-upgrade && \
apt-get -qqy install --no-install-recommends \
ca-certificates autoconf automake bzip2 g++ make wget pkgconf python2 \
ca-certificates autoconf automake bzip2 g++ git make wget pkgconf python2 \
libssl-dev libbz2-dev libcurl4-openssl-dev liblzma-dev libncurses-dev zlib1g-dev libdeflate-dev

# install samtools
Expand Down Expand Up @@ -51,6 +52,19 @@ RUN wget -q https://github.com/arq5x/bedtools2/releases/download/v$BEDTOOLS_VERS
mv bedtools.static /opt/bedtools/bin/bedtools && \
chmod a+x /opt/bedtools/bin/bedtools

# install bwa
# must do from source because of compiler error in latest release (see https://github.com/lh3/bwa/issues/387)
Member

I looked up this issue and it looks like it might have been fixed in a new release, 0.7.18, in the last few weeks (https://github.com/lh3/bwa/releases/tag/v0.7.18). Might be worth switching this back to installing a release rather than building from source, for the sake of simplicity and build time?

Collaborator Author

It doesn't look like a build got included with that release unfortunately

ARG BWA_COMMIT
RUN cd /opt && \
    git clone https://github.com/lh3/bwa.git && \
    cd bwa && \
    git checkout $BWA_COMMIT && \
    make -s && \
    cd .. && \
    mkdir -p /opt/bin && \
    mv /opt/bwa/bwa /opt/bin/ && \
    rm -r bwa
ENV PATH=/opt/bin:$PATH

############### stage 1: copy tools and install needed non-dev libraries
FROM ubuntu:$UBUNTU_RELEASE
Expand Down Expand Up @@ -100,3 +114,4 @@ RUN tabix --version
RUN bcftools --version
RUN bedtools --version
RUN vcftools --version
RUN which bwa
Original file line number Diff line number Diff line change
@@ -1,5 +1,11 @@
{
"GatherSampleEvidence.primary_contigs_list": "${workspace.primary_contigs_list}",
"GatherSampleEvidence.reference_bwa_alt": "${workspace.reference_bwa_alt}",
"GatherSampleEvidence.reference_bwa_amb": "${workspace.reference_bwa_amb}",
"GatherSampleEvidence.reference_bwa_ann": "${workspace.reference_bwa_ann}",
"GatherSampleEvidence.reference_bwa_bwt": "${workspace.reference_bwa_bwt}",
"GatherSampleEvidence.reference_bwa_pac": "${workspace.reference_bwa_pac}",
"GatherSampleEvidence.reference_bwa_sa": "${workspace.reference_bwa_sa}",
"GatherSampleEvidence.reference_fasta": "${workspace.reference_fasta}",
"GatherSampleEvidence.reference_index": "${workspace.reference_index}",
"GatherSampleEvidence.reference_dict": "${workspace.reference_dict}",
Expand All @@ -12,6 +18,7 @@

"GatherSampleEvidence.manta_region_bed": "${workspace.manta_region_bed}",
"GatherSampleEvidence.manta_region_bed_index": "${workspace.manta_region_bed_index}",
"GatherSampleEvidence.mei_bed": "${workspace.mei_bed}",
"GatherSampleEvidence.sd_locs_vcf": "${workspace.sd_locs_vcf}",
"GatherSampleEvidence.melt_standard_vcf_header": "${workspace.melt_standard_vcf_header}",

Expand All @@ -22,6 +29,7 @@
"GatherSampleEvidence.sv_pipeline_docker": "${workspace.sv_pipeline_docker}",
"GatherSampleEvidence.manta_docker": "${workspace.manta_docker}",
"GatherSampleEvidence.wham_docker": "${workspace.wham_docker}",
"GatherSampleEvidence.scramble_docker": "${workspace.scramble_docker}",
"GatherSampleEvidence.genomes_in_the_cloud_docker" : "${workspace.genomes_in_the_cloud_docker}",
"GatherSampleEvidence.gatk_docker" : "${workspace.gatk_docker}",
"GatherSampleEvidence.gatk_docker_pesr_override": "${workspace.gatk_docker_pesr_override}",
Expand Down
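
The hunk above adds six `reference_bwa_*` inputs to `GatherSampleEvidence`, and the same six suffixes recur in the other input files in this PR. A quick way to confirm that an inputs file carries all of them is sketched below. The helper `missing_bwa_inputs` is hypothetical, it assumes the inputs JSON has been loaded into a dict, and whether each input is required or optional in the WDL is not shown in this diff.

```python
# Suffixes of the BWA reference inputs added in this PR.
BWA_INPUT_SUFFIXES = ("alt", "amb", "ann", "bwt", "pac", "sa")


def missing_bwa_inputs(inputs, prefix="GatherSampleEvidence"):
    """Return the expected reference_bwa_* keys that are absent or empty.

    Hypothetical helper for illustration; `prefix` is the workflow (or
    workflow.subworkflow) name that precedes each input key.
    """
    expected = [f"{prefix}.reference_bwa_{suffix}" for suffix in BWA_INPUT_SUFFIXES]
    return [key for key in expected if not inputs.get(key)]
```

For the batch pipeline inputs, the prefix would be `GATKSVPipelineBatch.GatherSampleEvidenceBatch`, matching the keys used in the test input templates.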
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,12 @@ primary_contigs_fai {{ reference_resources.primary_contigs_fai }}
primary_contigs_list {{ reference_resources.primary_contigs_list }}
protein_coding_gtf {{ reference_resources.protein_coding_gtf }}
reference_build {{ reference_resources.reference_build }}
reference_bwa_alt {{ reference_resources.reference_bwa_alt }}
reference_bwa_amb {{ reference_resources.reference_bwa_amb }}
reference_bwa_ann {{ reference_resources.reference_bwa_ann }}
reference_bwa_bwt {{ reference_resources.reference_bwa_bwt }}
reference_bwa_pac {{ reference_resources.reference_bwa_pac }}
reference_bwa_sa {{ reference_resources.reference_bwa_sa }}
reference_dict {{ reference_resources.reference_dict }}
reference_fasta {{ reference_resources.reference_fasta }}
reference_index {{ reference_resources.reference_index }}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -3,8 +3,6 @@
"GATKSVPipelineSingleSample.batch" : "${this.sample_id}",
"GATKSVPipelineSingleSample.bam_or_cram_file" : "${this.bam_or_cram_file}",

"GATKSVPipelineSingleSample.use_melt": "false",

"GATKSVPipelineSingleSample.cutoffs" : "${workspace.ref_panel_cutoffs}",

"GATKSVPipelineSingleSample.genotype_pesr_pesr_sepcutoff" : "${workspace.ref_panel_genotype_pesr_pesr_sepcutoff}",
Expand Down Expand Up @@ -49,7 +47,14 @@
"GATKSVPipelineSingleSample.max_ref_panel_carrier_freq": 0.03,
"GATKSVPipelineSingleSample.manta_region_bed" : "${workspace.reference_manta_region_bed}",
"GATKSVPipelineSingleSample.manta_region_bed_index" : "${workspace.reference_manta_region_bed_index}",
"GATKSVPipelineSingleSample.mei_bed": "${workspace.mei_bed}",
"GATKSVPipelineSingleSample.sd_locs_vcf" : "${workspace.reference_sd_locs_vcf}",
"GATKSVPipelineSingleSample.reference_bwa_alt" : "${workspace.reference_bwa_alt}",
"GATKSVPipelineSingleSample.reference_bwa_amb" : "${workspace.reference_bwa_amb}",
"GATKSVPipelineSingleSample.reference_bwa_ann" : "${workspace.reference_bwa_ann}",
"GATKSVPipelineSingleSample.reference_bwa_bwt" : "${workspace.reference_bwa_bwt}",
"GATKSVPipelineSingleSample.reference_bwa_pac" : "${workspace.reference_bwa_pac}",
"GATKSVPipelineSingleSample.reference_bwa_sa" : "${workspace.reference_bwa_sa}",
"GATKSVPipelineSingleSample.reference_dict" : "${workspace.reference_dict}",
"GATKSVPipelineSingleSample.reference_fasta" : "${workspace.reference_fasta}",
"GATKSVPipelineSingleSample.reference_index" : "${workspace.reference_index}",
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,12 @@ reference_name {{ reference_resources.name }}
reference_allosome_file {{ reference_resources.allosome_file }}
reference_autosome_file {{ reference_resources.autosome_file }}
reference_bin_exclude {{ reference_resources.bin_exclude }}
reference_bwa_alt {{ reference_resources.reference_bwa_alt }}
reference_bwa_amb {{ reference_resources.reference_bwa_amb }}
reference_bwa_ann {{ reference_resources.reference_bwa_ann }}
reference_bwa_bwt {{ reference_resources.reference_bwa_bwt }}
reference_bwa_pac {{ reference_resources.reference_bwa_pac }}
reference_bwa_sa {{ reference_resources.reference_bwa_sa }}
reference_cnmops_exclude_list {{ reference_resources.cnmops_exclude_list }}
reference_contig_ploidy_priors {{ reference_resources.contig_ploidy_priors }}
reference_copy_number_autosomal_contigs {{ reference_resources.copy_number_autosomal_contigs }}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,6 @@
"ApplyManualVariantFilter.vcf" : {{ test_batch.clean_vcf | tojson }},
"ApplyManualVariantFilter.prefix" : {{ test_batch.name | tojson }},
"ApplyManualVariantFilter.sv_base_mini_docker":{{ dockers.sv_base_mini_docker | tojson }},
"ApplyManualVariantFilter.bcftools_filter": "SVTYPE==\"DEL\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"wham\"",
"ApplyManualVariantFilter.filter_name": "filter_wham_only_del"
"ApplyManualVariantFilter.bcftools_filter": "(SVTYPE==\"DEL\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"wham\") || (ALT==\"<INS:ME:SVA>\" && COUNT(ALGORITHMS)==1 && ALGORITHMS==\"scramble\" && HIGH_SR_BACKGROUND==1)",
"ApplyManualVariantFilter.filter_name": "high_algorithm_fp_rate"
}
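
The updated `bcftools_filter` expression combines two single-caller conditions: Wham-only deletions, and Scramble-only SVA insertions with `HIGH_SR_BACKGROUND` set. The Python predicate below restates that boolean logic for readability. It is a sketch under stated assumptions: each record is a plain dict with hypothetical keys `SVTYPE`, `ALT`, `ALGORITHMS` (a list of caller names), and `HIGH_SR_BACKGROUND` (a boolean), and it does not reproduce bcftools expression semantics in full.

```python
def matches_manual_filter(record):
    """Return True when a record satisfies the combined filter expression.

    Illustrative restatement only; the workflow itself applies the
    expression with bcftools, not with this function.
    """
    algorithms = record.get("ALGORITHMS", [])

    # (SVTYPE=="DEL" && COUNT(ALGORITHMS)==1 && ALGORITHMS=="wham")
    wham_only_del = record.get("SVTYPE") == "DEL" and algorithms == ["wham"]

    # (ALT=="<INS:ME:SVA>" && COUNT(ALGORITHMS)==1 && ALGORITHMS=="scramble"
    #  && HIGH_SR_BACKGROUND==1)
    scramble_only_sva = (
        record.get("ALT") == "<INS:ME:SVA>"
        and algorithms == ["scramble"]
        and bool(record.get("HIGH_SR_BACKGROUND", False))
    )

    return wham_only_del or scramble_only_sva
```

Records supported by more than one algorithm never match, since both branches require exactly one entry in `ALGORITHMS`.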
Original file line number Diff line number Diff line change
@@ -1,9 +1,4 @@
{
"GATKSVPipelineBatch.use_manta": "true",
"GATKSVPipelineBatch.use_wham": "true",
"GATKSVPipelineBatch.use_melt": "true",
"GATKSVPipelineBatch.use_scramble": "false",

"GATKSVPipelineBatch.name": {{ test_batch.name | tojson }},
"GATKSVPipelineBatch.ped_file": {{ test_batch.ped_file | tojson }},
"GATKSVPipelineBatch.samples": {{ test_batch.samples | tojson }},
Expand Down Expand Up @@ -54,6 +49,14 @@
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.melt_standard_vcf_header": {{ reference_resources.melt_std_vcf_header | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.wham_include_list_bed_file": {{ reference_resources.wham_include_list_bed_file | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.preprocessed_intervals": {{ reference_resources.preprocessed_intervals | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.mei_bed": {{ reference_resources.mei_bed | tojson }},

"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_alt": {{ reference_resources.reference_bwa_alt | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_amb": {{ reference_resources.reference_bwa_amb | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_ann": {{ reference_resources.reference_bwa_ann | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_bwt": {{ reference_resources.reference_bwa_bwt | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_pac": {{ reference_resources.reference_bwa_pac | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_sa": {{ reference_resources.reference_bwa_sa | tojson }},

"GATKSVPipelineBatch.EvidenceQC.wgd_scoring_mask": {{ reference_resources.wgd_scoring_mask | tojson }},
"GATKSVPipelineBatch.EvidenceQC.run_vcf_qc": "false",
Expand Down
Original file line number Diff line number Diff line change
@@ -1,9 +1,4 @@
{
"GATKSVPipelineBatch.use_manta": "true",
"GATKSVPipelineBatch.use_wham": "true",
"GATKSVPipelineBatch.use_melt": "true",
"GATKSVPipelineBatch.use_scramble": "false",

"GATKSVPipelineBatch.name": {{ test_batch.name | tojson }},
"GATKSVPipelineBatch.ped_file": {{ test_batch.ped_file | tojson }},
"GATKSVPipelineBatch.samples": {{ test_batch.samples | tojson }},
Expand Down Expand Up @@ -48,6 +43,14 @@
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.melt_standard_vcf_header": {{ reference_resources.melt_std_vcf_header | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.wham_include_list_bed_file": {{ reference_resources.wham_include_list_bed_file | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.preprocessed_intervals": {{ reference_resources.preprocessed_intervals | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.mei_bed": {{ reference_resources.mei_bed | tojson }},

"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_alt": {{ reference_resources.reference_bwa_alt | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_amb": {{ reference_resources.reference_bwa_amb | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_ann": {{ reference_resources.reference_bwa_ann | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_bwt": {{ reference_resources.reference_bwa_bwt | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_pac": {{ reference_resources.reference_bwa_pac | tojson }},
"GATKSVPipelineBatch.GatherSampleEvidenceBatch.reference_bwa_sa": {{ reference_resources.reference_bwa_sa | tojson }},

"GATKSVPipelineBatch.EvidenceQC.wgd_scoring_mask": {{ reference_resources.wgd_scoring_mask | tojson }},
"GATKSVPipelineBatch.EvidenceQC.run_vcf_qc": "false",
Expand Down