Sarek bcftools normalization #1682

Open
wants to merge 36 commits into
base: dev
Changes from 15 commits
4772da1
First modification to contribute to the bcftools/norm module in Sarek
JC-Delmas Apr 25, 2024
451aaec
Changes in the GERMLINE_VCFS_NORM process
JC-Delmas Apr 25, 2024
d97726b
Add fasta argument to POST_VARIANTCALLING process.
JC-Delmas Apr 25, 2024
e034ff0
add fasta input as argument
JC-Delmas Apr 25, 2024
8469832
remove vcfs in the GERMLINE_VCFS_NORM process, replaced by germline_v…
JC-Delmas Apr 25, 2024
e885888
First modification to contribute to the bcftools/norm module in Sarek
JC-Delmas Apr 25, 2024
9e94a05
Changes in the GERMLINE_VCFS_NORM process
JC-Delmas Apr 25, 2024
2bdba7e
Add fasta argument to POST_VARIANTCALLING process.
JC-Delmas Apr 25, 2024
1214f10
add fasta input as argument
JC-Delmas Apr 25, 2024
b7ba4f2
remove vcfs in the GERMLINE_VCFS_NORM process, replaced by germline_v…
JC-Delmas Apr 25, 2024
34bf47b
Update workflows/sarek/main.nf
JC-Delmas Apr 25, 2024
6dff9af
Resolved merge conflict by keeping changes from branch 34bf47baa9d61f…
JC-Delmas Apr 30, 2024
d289261
Refactor normalization and concatenation of VCF files
JC-Delmas Apr 30, 2024
c78af62
Modify and adjust two scripts to add normalization and integrate FAST…
JC-Delmas May 16, 2024
d646ec3
Added normalization for all vcfs
Patricie34 Oct 9, 2024
8fb64b2
Fixed linting issues and updated schema parameters
Patricie34 Oct 11, 2024
92094af
Update conf/modules/post_variant_calling.config
Patricie34 Oct 11, 2024
fbbfe1b
edit of normalization steps
Patricie34 Oct 11, 2024
24791dc
Fixed linting issues
Patricie34 Oct 15, 2024
50f1b4b
Merge remote-tracking branch 'upstream/dev' into sarek_bcftools_norm
Patricie34 Oct 15, 2024
fb4bb1e
Sync with dev_branch
Patricie34 Oct 15, 2024
a80cf11
Updated CHANGELOG.md
Patricie34 Oct 15, 2024
b0f6c12
Update conf/modules/post_variant_calling.config
Patricie34 Oct 16, 2024
3bcc27b
Update nextflow.config
Patricie34 Oct 16, 2024
f3c6ac6
Changed module.config
Patricie34 Oct 16, 2024
f9c815d
Changelog.md updated
Patricie34 Oct 16, 2024
f60d60d
Fixed params.normalize
Patricie34 Oct 16, 2024
c0a6ffc
Update CHANGELOG.md
Patricie34 Oct 18, 2024
188cf86
pytesttags.yml changed
Patricie34 Oct 18, 2024
1fe12e3
edited test_normalize_vcfs.yml
Patricie34 Oct 18, 2024
f9e5204
Separated vcf_normalization
Patricie34 Oct 22, 2024
7c96c98
Merge branch 'dev' into sarek_bcftools_norm
maxulysse Nov 4, 2024
b5909f2
module.config edited
Patricie34 Nov 5, 2024
ea7d25a
extra file removed
Patricie34 Nov 5, 2024
391f1ea
post_variantcalling edited
Patricie34 Nov 5, 2024
0bdb5d4
added annotation for vcfs_normalized
Patricie34 Nov 5, 2024
49 changes: 49 additions & 0 deletions conf/modules/post_variant_calling.config
@@ -1,3 +1,5 @@


Patricie34 marked this conversation as resolved.
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Config file for defining DSL2 per module options and publishing paths
@@ -16,6 +18,30 @@

process {

withName: 'GERMLINE_VCFS_NORM'{
ext.args = { [
'--multiallelics -both', //split multiallelic sites into biallelic records and both SNPs and indels should be merged separately into two records
'--rm-dup all' //output only the first instance of a record which is present multiple times
].join(' ') }
ext.when = { params.concatenate_vcfs }
publishDir = [
mode: params.publish_dir_mode,
path: { "${params.outdir}/variant_calling/concat/${meta.id}/" }
]
}

withName: 'VCFS_NORM'{
ext.args = { [
'--multiallelics -both', //split multiallelic sites into biallelic records and both SNPs and indels should be merged separately into two records
'--rm-dup all' //output only the first instance of a record which is present multiple times
].join(' ') }
ext.when = { params.normalized_vcfs }
publishDir = [
mode: params.publish_dir_mode,
path: { "${params.outdir}/variant_calling/normalized/${meta.id}/" }
]
}

Contributor

This could be condensed into a single configuration. Why are you publishing the normalised vcfs into two different subdirectories?

Contributor

The Tabix process below is not published in the same way, so we would end up with the tbi in a different directory.

Author

I explained it below: there are two different processes, one for normalisation and concatenation of germline vcfs and one for normalisation of all vcfs.
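
One way the condensing suggested above might look, as a minimal sketch rather than the change that was actually adopted. It assumes that `withName` selectors accept a regular expression and that settings from several matching selectors are combined. The shared `bcftools norm` arguments move into one block, while the per-alias `ext.when` guards and publish paths from the diff stay separate.

```groovy
process {

    // Shared bcftools norm arguments for both aliases (the selector is a regular expression)
    withName: 'GERMLINE_VCFS_NORM|VCFS_NORM' {
        ext.args = { [
            '--multiallelics -both', // split multiallelic sites into biallelic records
            '--rm-dup all'           // keep only the first instance of a duplicated record
        ].join(' ') }
    }

    // Alias-specific guard and publish path, copied from the diff
    withName: 'GERMLINE_VCFS_NORM' {
        ext.when   = { params.concatenate_vcfs }
        publishDir = [
            mode: params.publish_dir_mode,
            path: { "${params.outdir}/variant_calling/concat/${meta.id}/" }
        ]
    }

    withName: 'VCFS_NORM' {
        ext.when   = { params.normalized_vcfs }
        publishDir = [
            mode: params.publish_dir_mode,
            path: { "${params.outdir}/variant_calling/normalized/${meta.id}/" }
        ]
    }
}
```

Whether the two publish directories should also be merged is the open question in this thread; the sketch leaves them as they are in the diff.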

withName: 'GERMLINE_VCFS_CONCAT'{
ext.args = { "-a" }
ext.when = { params.concatenate_vcfs }
@@ -34,11 +60,25 @@ process {
]
}

withName: 'VCFS__SORT'{
Patricie34 marked this conversation as resolved.
ext.prefix = { "${meta.id}.norm" }
ext.when = { params.normalized_vcfs }
publishDir = [
mode: params.publish_dir_mode,
path: { "${params.outdir}/variant_calling/normalized/${meta.id}/" }
]
}

withName: 'TABIX_EXT_VCF' {
ext.prefix = { "${input.baseName}" }
ext.when = { params.concatenate_vcfs }
}

withName: 'TABIX_VCF' {
ext.prefix = { "${input.baseName}" }
ext.when = { params.normalized_vcfs }
}

Contributor

These can be combined.

Contributor

Looking at the module, I think you can actually output the tbi in the same process as the vcf, so there is no need to spin up an extra process for it.

Author

I've checked the bcftools_norm module and it needs both a vcf and a tbi as inputs, so I guess we can't exclude it. Or do you mean to output the tbi from the variant callers directly? What could be excluded, though, is the tabix step at the end after sorting, right? The tbi is output from the bcftools norm process and passed on to bcftools sort, so we should end up with a sorted vcf and a tbi at the end.
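
A hedged sketch of what "output the tbi in the same process" could mean in configuration terms. It assumes the bcftools version bundled with the updated module supports `--write-index`, and that the module declares the resulting index as an output; neither is confirmed by this diff, so both should be checked against the module after updating it.

```groovy
process {

    withName: 'VCFS_NORM' {
        ext.args = { [
            '--multiallelics -both',
            '--rm-dup all',
            '--output-type z',   // bgzip-compressed VCF, which a tabix index requires
            '--write-index=tbi'  // assumption: supported by the bundled bcftools version
        ].join(' ') }
        ext.when   = { params.normalized_vcfs }
        publishDir = [
            mode: params.publish_dir_mode,
            path: { "${params.outdir}/variant_calling/normalized/${meta.id}/" }
        ]
    }
}
```

If the sort step runs afterwards, its output would still need its own index, so the same consideration applies to whichever process writes the final file.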

withName: 'TABIX_GERMLINE_VCFS_CONCAT_SORT'{
ext.prefix = { "${meta.id}.germline" }
ext.when = { params.concatenate_vcfs }
@@ -47,4 +87,13 @@ process {
path: { "${params.outdir}/variant_calling/concat/${meta.id}/" }
]
}

withName: 'TABIX_VCFS_INDEX'{
ext.prefix = { "${meta.id}.norm" }
ext.when = { params.normalized_vcfs }
publishDir = [
mode: params.publish_dir_mode,
path: { "${params.outdir}/variant_calling/norm/${meta.id}/" }
]
}
}
7 changes: 7 additions & 0 deletions modules/nf-core/bcftools/norm/environment.yml
Member

Can you run nf-core modules update bcftools/norm? That's an old version of the module.

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

60 changes: 60 additions & 0 deletions modules/nf-core/bcftools/norm/main.nf

61 changes: 61 additions & 0 deletions modules/nf-core/bcftools/norm/meta.yml

1 change: 1 addition & 0 deletions nextflow.config
@@ -71,6 +71,7 @@ params {
ignore_soft_clipped_bases = false // no --dont-use-soft-clipped-bases for GATK Mutect2
joint_germline = false // g.vcf & joint germline calling are not run by default if HaplotypeCaller is selected
joint_mutect2 = false // if true, enables patient-wise multi-sample somatic variant calling
normalized_vcfs = false // by default we don't normalize the vcf-files
only_paired_variant_calling = false // if true, skips germline variant calling for normal-paired sample
sentieon_dnascope_emit_mode = 'variant' // default value for Sentieon dnascope
sentieon_dnascope_pcr_indel_model = 'CONSERVATIVE'
14 changes: 12 additions & 2 deletions subworkflows/local/post_variantcalling/main.nf
@@ -3,23 +3,33 @@
//

include { CONCATENATE_GERMLINE_VCFS } from '../vcf_concatenate_germline/main'
include { NORMALIZE_VCFS } from '../vcf_normalization/main'

workflow POST_VARIANTCALLING {

take:
vcfs
fasta
concatenate_vcfs

normalized_vcfs

main:
versions = Channel.empty()

if (concatenate_vcfs){
CONCATENATE_GERMLINE_VCFS(vcfs)
CONCATENATE_GERMLINE_VCFS(vcfs, fasta)

vcfs = vcfs.mix(CONCATENATE_GERMLINE_VCFS.out.vcfs)
versions = versions.mix(CONCATENATE_GERMLINE_VCFS.out.versions)
}

if (normalized_vcfs){
NORMALIZE_VCFS(vcfs, fasta)

vcfs = vcfs.mix(NORMALIZE_VCFS.out.vcfs)
versions = versions.mix(NORMALIZE_VCFS.out.versions)
}

emit:
vcfs // post processed vcfs

42 changes: 29 additions & 13 deletions subworkflows/local/vcf_concatenate_germline/main.nf
@@ -1,42 +1,58 @@
//
// CONCATENATE Germline VCFs
//

// Concatenation of germline vcf-files
include { ADD_INFO_TO_VCF } from '../../../modules/local/add_info_to_vcf/main'
include { TABIX_BGZIPTABIX as TABIX_EXT_VCF } from '../../../modules/nf-core/tabix/bgziptabix/main'
include { BCFTOOLS_CONCAT as GERMLINE_VCFS_CONCAT } from '../../../modules/nf-core/bcftools/concat/main'
include { BCFTOOLS_SORT as GERMLINE_VCFS_CONCAT_SORT } from '../../../modules/nf-core/bcftools/sort/main'
include { TABIX_TABIX as TABIX_GERMLINE_VCFS_CONCAT_SORT } from '../../../modules/nf-core/tabix/tabix/main'
include { ADD_INFO_TO_VCF } from '../../../modules/local/add_info_to_vcf/main'
include { TABIX_BGZIPTABIX as TABIX_EXT_VCF } from '../../../modules/nf-core/tabix/bgziptabix/main'
include { BCFTOOLS_NORM as GERMLINE_VCFS_NORM } from '../../../modules/nf-core/bcftools/norm/main'
include { BCFTOOLS_CONCAT as GERMLINE_VCFS_CONCAT } from '../../../modules/nf-core/bcftools/concat/main'
include { BCFTOOLS_SORT as GERMLINE_VCFS_CONCAT_SORT } from '../../../modules/nf-core/bcftools/sort/main'
include { TABIX_TABIX as TABIX_GERMLINE_VCFS_CONCAT_SORT } from '../../../modules/nf-core/tabix/tabix/main'

workflow CONCATENATE_GERMLINE_VCFS {

take:
vcfs
fasta

main:
versions = Channel.empty()

// Concatenate vcf-files
// Add additional information to VCF files
ADD_INFO_TO_VCF(vcfs)

// Compress the VCF files with bgzip
TABIX_EXT_VCF(ADD_INFO_TO_VCF.out.vcf)

// Normalize the VCF files with BCFTOOLS_NORM
GERMLINE_VCFS_NORM(vcf: ADD_INFO_TO_VCF.out.vcf, fasta: fasta)

// Compress the normalized VCF files with bgzip
TABIX_EXT_VCF(GERMLINE_VCFS_NORM.out.vcf)

// Index the compressed normalized VCF files
TABIX_GERMLINE_VCFS_CONCAT_SORT(TABIX_EXT_VCF.out.gz)

// Gather vcfs and vcf-tbis for concatenating germline-vcfs
germline_vcfs_with_tbis = TABIX_EXT_VCF.out.gz_tbi.map{ meta, vcf, tbi -> [ meta.subMap('id'), vcf, tbi ] }.groupTuple()
germline_vcfs_with_tbis = TABIX_GERMLINE_VCFS_CONCAT_SORT.out.map { meta, vcf, tbi -> [meta.subMap('id'), vcf, tbi] }.groupTuple()

// Concatenate the VCF files
GERMLINE_VCFS_CONCAT(germline_vcfs_with_tbis)

// Sort the concatenated VCF files
GERMLINE_VCFS_CONCAT_SORT(GERMLINE_VCFS_CONCAT.out.vcf)

// Index the sorted concatenated VCF files
TABIX_GERMLINE_VCFS_CONCAT_SORT(GERMLINE_VCFS_CONCAT_SORT.out.vcf)

// Gather versions of all tools used
versions = versions.mix(ADD_INFO_TO_VCF.out.versions)
versions = versions.mix(TABIX_EXT_VCF.out.versions)
versions = versions.mix(GERMLINE_VCFS_NORM.out.versions)
versions = versions.mix(GERMLINE_VCFS_CONCAT.out.versions)
versions = versions.mix(GERMLINE_VCFS_CONCAT.out.versions)
versions = versions.mix(GERMLINE_VCFS_CONCAT_SORT.out.versions)
versions = versions.mix(TABIX_GERMLINE_VCFS_CONCAT_SORT.out.versions)

emit:
vcfs = germline_vcfs_with_tbis // post processed vcfs

vcfs = TABIX_GERMLINE_VCFS_CONCAT_SORT.out.gz_tbi // post-processed VCFs
versions // channel: [ versions.yml ]
}
}
46 changes: 46 additions & 0 deletions subworkflows/local/vcf_normalization/main.nf
@@ -0,0 +1,46 @@
// Normalize all unannotated VCFs

// Import modules
include { ADD_INFO_TO_VCF } from '../../../modules/local/add_info_to_vcf/main'
include { TABIX_BGZIPTABIX as TABIX_VCF } from '../../../modules/nf-core/tabix/bgziptabix/main'
include { BCFTOOLS_NORM as VCFS_NORM } from '../../../modules/nf-core/bcftools/norm/main'
include { BCFTOOLS_SORT as VCFS_SORT } from '../../../modules/nf-core/bcftools/sort/main'
include { TABIX_TABIX as TABIX_VCFS_INDEX } from '../../../modules/nf-core/tabix/tabix/main'

// Workflow to normalize, compress, and index VCF files
workflow NORMALIZE_VCFS {

take:
vcfs
fasta

main:
versions = Channel.empty()

// Add additional information to VCF files
ADD_INFO_TO_VCF(vcfs)
Contributor

We are doing the same thing in the concatenation step. Can you check what happens if someone concatenates and then normalises? I have the feeling this will end up with a bunch of redundant information. On that note, the current order is: concat, then normalise. Are we sure it shouldn't be the other way around?

Author

In the original PR, there was a vcf_concatenate process that performed normalization first and then concatenation of germline vcfs. Since I wanted to normalize all vcfs without concatenation, I decided to keep the original process and add an additional one (vcf_normalization/main.nf) specifically for normalization. Also, Sarek already includes a process for concatenating germline vcfs, so I thought it should remain unchanged.

However, I am a bit confused because you mentioned that concatenation occurs before normalization. I initially thought that based on the boolean parameters (params.concatenate, params.normalized_vcfs), one could choose which process to run. If both processes run, I expected two different outputs: concatenated and normalized germline vcfs, and normalized vcfs (for all). It seems like I might have misunderstood the intended workflow.

Could you please explain where the order of processes comes from?
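
On where the order comes from: in the POST_VARIANTCALLING diff, `vcfs` is reassigned with `vcfs.mix(CONCATENATE_GERMLINE_VCFS.out.vcfs)` before `NORMALIZE_VCFS(vcfs, fasta)` is called, so when both parameters are set the normalisation step also receives the concatenated files. The fragment below is one possible way to keep the two branches independent, not the fix that was merged. `post_vcfs` is a hypothetical name, and the emit name `normalized_vcfs` is taken from vcf_normalization/main.nf in this PR.

```groovy
include { CONCATENATE_GERMLINE_VCFS } from '../vcf_concatenate_germline/main'
include { NORMALIZE_VCFS            } from '../vcf_normalization/main'

workflow POST_VARIANTCALLING {

    take:
    vcfs
    fasta
    concatenate_vcfs
    normalized_vcfs

    main:
    versions = Channel.empty()

    // Hypothetical accumulator: collects results without reassigning the input channel
    post_vcfs = vcfs

    if (concatenate_vcfs) {
        CONCATENATE_GERMLINE_VCFS(vcfs, fasta)

        post_vcfs = post_vcfs.mix(CONCATENATE_GERMLINE_VCFS.out.vcfs)
        versions  = versions.mix(CONCATENATE_GERMLINE_VCFS.out.versions)
    }

    if (normalized_vcfs) {
        // Receives only the variant caller output, never the concatenated VCFs
        NORMALIZE_VCFS(vcfs, fasta)

        // The emit in vcf_normalization/main.nf is named normalized_vcfs, not vcfs
        post_vcfs = post_vcfs.mix(NORMALIZE_VCFS.out.normalized_vcfs)
        versions  = versions.mix(NORMALIZE_VCFS.out.versions)
    }

    emit:
    vcfs = post_vcfs // post processed vcfs
    versions         // channel: [ versions.yml ]
}
```

With this shape, enabling both parameters would give the two separate outputs the author describes: concatenated germline VCFs and normalised VCFs for all callers.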


// Normalize the VCF files with BCFTOOLS_NORM
normalized_vcf = VCFS_NORM(vcf: ADD_INFO_TO_VCF.out.vcf)

// Compress the normalized VCF files with bgzip
compressed_vcf = TABIX_VCF(normalized_vcf)

// Sort the compressed normalized VCF files
sorted_vcf = VCFS_SORT(compressed_vcf)

// Index the sorted VCF files
sorted_indexed_vcf = TABIX_VCFS_INDEX(sorted_vcf)

// Gather versions of all tools used
versions = versions.mix(ADD_INFO_TO_VCF.out.versions)
versions = versions.mix(VCFS_NORM.out.versions)
versions = versions.mix(TABIX_VCF.out.versions)
versions = versions.mix(VCFS_SORT.out.versions)
versions = versions.mix(TABIX_VCFS_INDEX.out.versions)

emit:
normalized_vcfs = sorted_indexed_vcf // Post-processed sorted VCFs
versions // Channel: [versions.yml]
}

6 changes: 5 additions & 1 deletion workflows/sarek/main.nf
@@ -794,7 +794,11 @@ workflow SAREK {

// POST VARIANTCALLING
POST_VARIANTCALLING(BAM_VARIANT_CALLING_GERMLINE_ALL.out.vcf_all,
params.concatenate_vcfs)
BAM_VARIANT_CALLING_TUMOR_ONLY_ALL.out.vcf_all,
BAM_VARIANT_CALLING_SOMATIC_ALL.out.vcf_all,
fasta,
params.concatenate_vcfs,
params.normalized_vcfs)

// Gather vcf files for annotation and QC
vcf_to_annotate = Channel.empty()