-
Notifications
You must be signed in to change notification settings - Fork 412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sarek bcftools normalization #1682
base: dev
Are you sure you want to change the base?
Conversation
…cfs_with_tbis and fasta.
…cfs_with_tbis and fasta.
resolve issue with spacing Co-authored-by: Maxime U Garcia <[email protected]>
Hi all, I've modified the normalization step to include all VCFs, not just the germline ones. For this, I used the pull request from JC-Delmas as a base. I am aware that this still requires a lot of work, and I would greatly appreciate any advice or feedback you can provide. Thank you! Patricie |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you run nf-core modules update bcftools/norm
that's an old version of the modules
|
@nf-core-bot fix linting pretty please 🙏 |
We're missing CHANGELOG + tests + subway map |
@nf-core-bot fix linting pretty please 🙏 |
nextflow
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to commit this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this file is still there
withName: 'GERMLINE_VCFS_NORM'{ | ||
ext.args = { [ | ||
'--multiallelics - both', //split multiallelic sites into biallelic records and both SNPs and indels should be merged separately into two records | ||
'--rm-dup all' //output only the first instance of a record which is present multiple times | ||
].join(' ') } | ||
ext.when = { params.concatenate_vcfs } | ||
publishDir = [ | ||
mode: params.publish_dir_mode, | ||
path: { "${params.outdir}/variant_calling/concat/${meta.id}/" } | ||
] | ||
} | ||
|
||
withName: 'VCFS_NORM'{ | ||
ext.args = { [ | ||
'--multiallelics - both', //split multiallelic sites into biallelic records and both SNPs and indels should be merged separately into two records | ||
'--rm-dup all' //output only the first instance of a record which is present multiple times | ||
].join(' ') } | ||
ext.when = { params.normalized_vcfs } | ||
publishDir = [ | ||
mode: params.publish_dir_mode, | ||
path: { "${params.outdir}/variant_calling/normalized/${meta.id}/" } | ||
] | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be condensed into a single configuration. Why are you publishing the normalised vcfs into two different subdirectories?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Tabix below is not published in the same way, we would end up with the tbi in a different directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I explained it below, two different processes for either normalisation and concatenation of germline vcfs or normalisation of all vcfs.
withName: 'TABIX_EXT_VCF' { | ||
ext.prefix = { "${input.baseName}" } | ||
ext.when = { params.concatenate_vcfs } | ||
} | ||
|
||
withName: 'TABIX_VCF' { | ||
ext.prefix = { "${input.baseName}" } | ||
ext.when = { params.normalized_vcfs } | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these ones can be combined
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking at the module, I think you can actually output the tbi in the same process as the vcf, so no need to spin up an extra process for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've checked the bcftools_norm module and it needs as inputs vcf and tbi. So I guess, we can't exclude it. Or do you mean to output tbi from variant callers directly? But what could be excluded is tabix at the end after sorting, right? Because tbi is ouput from bcftools norm process and is transferred to bcftools sort, so it should end up with sorted vcf and tbi at the end
versions = Channel.empty() | ||
|
||
// Add additional information to VCF files | ||
ADD_INFO_TO_VCF(vcfs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are doing the same thing in the concatenation step. Can you check what happens if someone concatenates and the normalisaes? I have the feeling this will end a bunch of redundant information. On that note, the current order is: concat then normalise. Are we sure it shouldn't be the other way around?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the original PR, there was a vcf_concatenate process that performed normalization first and then concatenation of germline vcfs. Since I wanted to normalize all vcfs without concatenation, I decided to keep the original process and add an additional one (vcf_normalization/main.nf) specifically for normalization. Also, Sarek already includes a process for concatenating germline vcfs, so I thought it should remain unchanged.
However, I am a bit confused because you mentioned that concatenation occurs before normalization. I initially thought that based on the boolean parameters (params.concatenate, params.normalized_vcfs), one could choose which process to run. If both processes run, I expected two different outputs: concatenated and normalized germline vcfs, and normalized vcfs (for all). It seems like I might have misunderstood the intended workflow.
Could you please explain where the order of processes comes from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of things we should check:
- Compression and indexing should be possible with bcftools natively now, saving compute time by not spinning up a new process
- What happens if concat + norm is run. Should norm be run before/after?
- parameter should be
normalise_vcfs
imo since it is something the pipeline will do - Can you update the changelog and sort in the number in ascending order?
- Simplify the modules.config
Co-authored-by: Friederike Hanssen <[email protected]>
Co-authored-by: Friederike Hanssen <[email protected]>
Can you add this in sarek/tests/config/pytesttags.yml after the
|
Co-authored-by: Maxime U Garcia <[email protected]>
PR checklist
nf-core lint
).nextflow run . -profile test,docker --outdir <OUTDIR>
).nextflow run . -profile debug,test,docker --outdir <OUTDIR>
).docs/usage.md
is updated.docs/output.md
is updated.CHANGELOG.md
is updated.README.md
is updated (including new tool citations and authors/contributors).