
I want to combine data using MakeCohortVcf after processing 500 samples multiple times in cohort mode #715

Open · c2997108 opened this issue Aug 20, 2024 · 5 comments

Comments

@c2997108

I was able to complete processing of 500 samples in cohort mode using GATKSVPipelineBatch.wdl and would like to merge the results according to the instructions in the README.
I want to stay within the GATK-SV workflow for merging, because the cohort-mode results are very good and remove a lot of noise.
How do I pass the outputs of GATKSVPipelineBatch.wdl as inputs to MakeCohortVcf? An example input.json, or a script to create one, would be very helpful.

@mwalker174 (Collaborator) commented Aug 20, 2024

Hi @c2997108, GATKSVPipelineBatch runs MakeCohortVcf already. If you ran all 500 samples together, then the output you're looking for is the "clean_vcf." We have some new downstream filtering steps that you can run to further reduce false positives; they are documented in #695, although with only 500 samples the benefits might be marginal.
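To illustrate where this appears, the final outputs JSON of a Cromwell run would contain an entry along these lines (the exact output key and the path are illustrative placeholders, not values from a real run):

```json
{
  "GATKSVPipelineBatch.clean_vcf": "gs://my-bucket/my-cohort/my-cohort.cleaned.vcf.gz"
}
```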

@c2997108 (Author)

Thanks for your reply, @mwalker174. Yes, as you say, each batch of 500 samples is merged and the output clean_vcf files are great. But I have 2,000 samples, and the README says to process 100-500 samples per batch, so I want to merge the outputs of the four batches of 500.
Or could I use GATKSVPipelineBatch.wdl to process all 2,000 samples at once?

@mwalker174 (Collaborator) commented Aug 20, 2024

I see, so GATKSVPipelineBatch is designed for cohorts consisting of a single batch. If there are multiple batches, we run each module individually. One important step is MergeBatchSites, which is run prior to genotyping and ensures that all sites across the cohort get genotyped in every batch. Better documentation and a Terra workspace are coming soon for this.

If you saved all the outputs from GATKSVPipelineBatch, you should be able to pick up at that step. See the Quickstart section of the README for directions on how to build example inputs. You'd want to start re-running from MergeBatchSites.

If re-running is not an option for you, you could attempt to simply cluster the cleaned VCFs from all four batches using GATK SVCluster. However, this is not a recommended practice.
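To make the pick-up point concrete, a MergeBatchSites input.json for four batches might look roughly like the sketch below. The input names (cohort, pesr_vcfs, depth_vcfs, sv_pipeline_docker) are my reading of MergeBatchSites.wdl and should be verified against your checkout (for example with `womtool inputs`); all buckets, paths, and the docker tag are placeholders:

```json
{
  "MergeBatchSites.cohort": "my_cohort",
  "MergeBatchSites.pesr_vcfs": [
    "gs://my-bucket/batch1/batch1.filtered_pesr.vcf.gz",
    "gs://my-bucket/batch2/batch2.filtered_pesr.vcf.gz",
    "gs://my-bucket/batch3/batch3.filtered_pesr.vcf.gz",
    "gs://my-bucket/batch4/batch4.filtered_pesr.vcf.gz"
  ],
  "MergeBatchSites.depth_vcfs": [
    "gs://my-bucket/batch1/batch1.filtered_depth.vcf.gz",
    "gs://my-bucket/batch2/batch2.filtered_depth.vcf.gz",
    "gs://my-bucket/batch3/batch3.filtered_depth.vcf.gz",
    "gs://my-bucket/bat4/batch4.filtered_depth.vcf.gz"
  ],
  "MergeBatchSites.sv_pipeline_docker": "us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:<version>"
}
```

The per-batch VCFs listed here correspond to the filtered PE/SR and depth outputs produced for each batch before genotyping.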

@c2997108 (Author)

I am looking forward to the better documentation and Terra workspace.
I had thought MergeBatchSites.wdl was not used in GATKSVPipelineBatch.wdl, but it turns out to be needed when merging batches.
I just want to confirm, since this doesn't seem to be clearly stated in the documentation: do I feed the outputs of GATKSVPipelinePhase1.wdl from each batch of the GATKSVPipelineBatch.wdl workflow into MergeBatchSites.wdl for merging, and then continue from GenotypeBatch.wdl onwards in the GATKSVPipelineBatch.wdl workflow?

@epiercehoffman (Collaborator)

Yes, that's correct: you can take your existing GATKSVPipelinePhase1 outputs, run MergeBatchSites on the filtered_pesr_vcf and filtered_depth_vcf files from each batch, then run GenotypeBatch and onwards, using the outputs of MergeBatchSites as the cohort_depth_vcf and cohort_pesr_vcf inputs. While the Terra workspace and documentation are still in progress, the draft of the workspace dashboard might still be useful documentation for a multi-batch cohort.
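For concreteness, the relevant wiring in each batch's GenotypeBatch input.json would then look something like this sketch. Only the two cohort-level inputs named above are shown; GenotypeBatch takes many other batch-specific inputs, the `batch` input name here is assumed, and all paths are placeholders:

```json
{
  "GenotypeBatch.batch": "batch1",
  "GenotypeBatch.cohort_pesr_vcf": "gs://my-bucket/merged/my_cohort.cohort_pesr.vcf.gz",
  "GenotypeBatch.cohort_depth_vcf": "gs://my-bucket/merged/my_cohort.cohort_depth.vcf.gz"
}
```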
