Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sarek bcftools normalization #1682

Open
wants to merge 36 commits into
base: dev
Choose a base branch
from

Conversation

Patricie34
Copy link

@Patricie34 Patricie34 commented Oct 9, 2024

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/sarek branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@Patricie34
Copy link
Author

Hi all,

I've modified the normalization step to include all VCFs, not just the germline ones. For this, I used the pull request from JC-Delmas as a base. I am aware that this still requires a lot of work, and I would greatly appreciate any advice or feedback you can provide.

Thank you!

Patricie

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you run nf-core modules update bcftools/norm that's an old version of the modules

Copy link

github-actions bot commented Oct 10, 2024

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 0bdb5d4

+| ✅ 215 tests passed       |+
#| ❔  11 tests were ignored |#
!| ❗   4 tests had warnings |!

❗ Test warnings:

  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.2
  • Run at 2024-11-05 12:24:49

@maxulysse
Copy link
Member

@nf-core-bot fix linting pretty please 🙏

@maxulysse
Copy link
Member

We're missing CHANGELOG + tests + subway map

@maxulysse
Copy link
Member

@nf-core-bot fix linting pretty please 🙏

nextflow Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to commit this file

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file is still there

Comment on lines 19 to 42
withName: 'GERMLINE_VCFS_NORM'{
ext.args = { [
'--multiallelics - both', //split multiallelic sites into biallelic records and both SNPs and indels should be merged separately into two records
'--rm-dup all' //output only the first instance of a record which is present multiple times
].join(' ') }
ext.when = { params.concatenate_vcfs }
publishDir = [
mode: params.publish_dir_mode,
path: { "${params.outdir}/variant_calling/concat/${meta.id}/" }
]
}

withName: 'VCFS_NORM'{
ext.args = { [
'--multiallelics - both', //split multiallelic sites into biallelic records and both SNPs and indels should be merged separately into two records
'--rm-dup all' //output only the first instance of a record which is present multiple times
].join(' ') }
ext.when = { params.normalized_vcfs }
publishDir = [
mode: params.publish_dir_mode,
path: { "${params.outdir}/variant_calling/normalized/${meta.id}/" }
]
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be condensed into a single configuration. Why are you publishing the normalised vcfs into two different subdirectories?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Tabix below is not published in the same way, we would end up with the tbi in a different directory.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explained it below, two different processes for either normalisation and concatenation of germline vcfs or normalisation of all vcfs.

Comment on lines 70 to 79
withName: 'TABIX_EXT_VCF' {
ext.prefix = { "${input.baseName}" }
ext.when = { params.concatenate_vcfs }
}

withName: 'TABIX_VCF' {
ext.prefix = { "${input.baseName}" }
ext.when = { params.normalized_vcfs }
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these ones can be combined

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at the module, I think you can actually output the tbi in the same process as the vcf, so no need to spin up an extra process for it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked the bcftools_norm module and it needs as inputs vcf and tbi. So I guess, we can't exclude it. Or do you mean to output tbi from variant callers directly? But what could be excluded is tabix at the end after sorting, right? Because tbi is ouput from bcftools norm process and is transferred to bcftools sort, so it should end up with sorted vcf and tbi at the end

nextflow.config Outdated Show resolved Hide resolved
versions = Channel.empty()

// Add additional information to VCF files
ADD_INFO_TO_VCF(vcfs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are doing the same thing in the concatenation step. Can you check what happens if someone concatenates and the normalisaes? I have the feeling this will end a bunch of redundant information. On that note, the current order is: concat then normalise. Are we sure it shouldn't be the other way around?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the original PR, there was a vcf_concatenate process that performed normalization first and then concatenation of germline vcfs. Since I wanted to normalize all vcfs without concatenation, I decided to keep the original process and add an additional one (vcf_normalization/main.nf) specifically for normalization. Also, Sarek already includes a process for concatenating germline vcfs, so I thought it should remain unchanged.

However, I am a bit confused because you mentioned that concatenation occurs before normalization. I initially thought that based on the boolean parameters (params.concatenate, params.normalized_vcfs), one could choose which process to run. If both processes run, I expected two different outputs: concatenated and normalized germline vcfs, and normalized vcfs (for all). It seems like I might have misunderstood the intended workflow.

Could you please explain where the order of processes comes from?

Copy link
Contributor

@FriederikeHanssen FriederikeHanssen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of things we should check:

  • Compression and indexing should be possible with bcftools natively now, saving compute time by not spinning up a new process
  • What happens if concat + norm is run. Should norm be run before/after?
  • parameter should be normalise_vcfs imo since it is something the pipeline will do
  • Can you update the changelog and sort in the number in ascending order?
  • Simplify the modules.config

CHANGELOG.md Outdated Show resolved Hide resolved
@maxulysse
Copy link
Member

Can you add this in sarek/tests/config/pytesttags.yml after the concatenate_vcfs trigger?

normalize_vcfs:
  - conf/modules/post_variant_calling.config
  - modules/nf-core/bcftools/concat/**
  - modules/nf-core/bcftools/mpileup/**
  - modules/nf-core/bcftools/norm/**
  - modules/nf-core/bcftools/sort/**
  - modules/nf-core/deepvariant/**
  - modules/nf-core/freebayes/**
  - modules/nf-core/gatk4/haplotypecaller/**
  - modules/nf-core/gatk4/mergevcfs/**
  - modules/nf-core/manta/germline/**
  - modules/nf-core/samtools/mpileup/**
  - modules/nf-core/strelka/germline/**
  - modules/nf-core/tabix/bgziptabix/**
  - modules/nf-core/tabix/tabix/**
  - modules/nf-core/tiddit/sv/**
  - subworkflows/local/bam_variant_calling_deepvariant/**
  - subworkflows/local/bam_variant_calling_freebayes/**
  - subworkflows/local/bam_variant_calling_germline_all/**
  - subworkflows/local/bam_variant_calling_germline_manta/**
  - subworkflows/local/bam_variant_calling_haplotypecaller/**
  - subworkflows/local/bam_variant_calling_mpileup/**
  - subworkflows/local/bam_variant_calling_single_strelka/**
  - subworkflows/local/bam_variant_calling_single_tiddit/**
  - subworkflows/local/bam_variant_calling_somatic_all/**
  - subworkflows/local/bam_variant_calling_tumor_only_all/**
  - subworkflows/local/post_variantcalling/**
  - subworkflows/local/vcf_concatenate_germline/**
  - tests/csv/3.0/mapped_joint_bam.csv
  - tests/test_normalize_vcfs.yml

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants