
I want to combine data using MakeCohortVcf after processing 500 samples multiple times in cohort mode #715

Open · c2997108 opened this issue Aug 20, 2024 · 5 comments

Comments

@c2997108

I was able to complete processing of 500 samples in cohort mode using GATKSVPipelineBatch.wdl and would like to merge the results according to the instructions in the README.
I want to stay within the GATK-SV workflow for merging, because the cohort-mode results are very good and remove a lot of noise.
How do I pass the outputs of GATKSVPipelineBatch.wdl as inputs to MakeCohortVcf? An example input.json, or a script to create one, would be very helpful.

@mwalker174 (Collaborator) commented Aug 20, 2024

Hi @c2997108, GATKSVPipelineBatch runs MakeCohortVcf already. If you ran all 500 samples together, then the output you're looking for is the "clean_vcf." We have some new downstream filtering steps that you can run to further reduce false positives; they are documented in #695, although with only 500 samples the benefits might be marginal.
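To illustrate where this appears, the final outputs JSON of a Cromwell run would contain an entry along these lines (the exact output key and the path are illustrative placeholders, not values from a real run):

```json
{
  "GATKSVPipelineBatch.clean_vcf": "gs://my-bucket/my-cohort/my-cohort.cleaned.vcf.gz"
}
```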

@c2997108 (Author)

Thanks for your reply, @mwalker174. Yes, as you say, each batch of 500 samples is merged and the output clean_vcf files are great. But I have 2,000 samples, and the README says to process 100-500 samples per batch, so I want to merge the outputs of the four batches of 500.
Or could I use GATKSVPipelineBatch.wdl to process all 2,000 samples at once?

@mwalker174 (Collaborator) commented Aug 20, 2024

I see, so GATKSVPipelineBatch is designed for cohorts consisting of a single batch. If there are multiple batches, we run each module individually. One important step is MergeBatchSites, which is run prior to genotyping and ensures that all sites across the cohort get genotyped in every batch. Better documentation and a Terra workspace are coming soon for this.

If you saved all the outputs from GATKSVPipelineBatch, you should be able to pick up at that step. See the Quickstart section of the README for directions on how to build example inputs. You'd want to start re-running from MergeBatchSites.

If re-running is not an option for you, you could attempt to simply cluster the cleaned VCFs from all four batches using GATK SVCluster. However, this is not a recommended practice.
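To make the pick-up point concrete, a MergeBatchSites input.json for four batches might look roughly like the sketch below. The input names (cohort, pesr_vcfs, depth_vcfs, sv_pipeline_docker) are my reading of MergeBatchSites.wdl and should be verified against your checkout (for example with `womtool inputs`); all buckets, paths, and the docker tag are placeholders:

```json
{
  "MergeBatchSites.cohort": "my_cohort",
  "MergeBatchSites.pesr_vcfs": [
    "gs://my-bucket/batch1/batch1.filtered_pesr.vcf.gz",
    "gs://my-bucket/batch2/batch2.filtered_pesr.vcf.gz",
    "gs://my-bucket/batch3/batch3.filtered_pesr.vcf.gz",
    "gs://my-bucket/batch4/batch4.filtered_pesr.vcf.gz"
  ],
  "MergeBatchSites.depth_vcfs": [
    "gs://my-bucket/batch1/batch1.filtered_depth.vcf.gz",
    "gs://my-bucket/batch2/batch2.filtered_depth.vcf.gz",
    "gs://my-bucket/batch3/batch3.filtered_depth.vcf.gz",
    "gs://my-bucket/bat4/batch4.filtered_depth.vcf.gz"
  ],
  "MergeBatchSites.sv_pipeline_docker": "us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:<version>"
}
```

The per-batch VCFs listed here correspond to the filtered PE/SR and depth outputs produced for each batch before genotyping.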

@c2997108 (Author)

I am looking forward to the better documentation and Terra workspace.
I had thought MergeBatchSites.wdl was not used in GATKSVPipelineBatch.wdl, but it turns out to be needed when merging batches.
I just want to confirm, since this doesn't seem to be clearly stated in the documentation: do I feed the outputs of GATKSVPipelinePhase1.wdl from each batch of the GATKSVPipelineBatch.wdl workflow into MergeBatchSites.wdl for merging, and then continue from GenotypeBatch.wdl onwards in the GATKSVPipelineBatch.wdl workflow?

@epiercehoffman (Collaborator)

Yes, that's correct: you can take your existing GATKSVPipelinePhase1 outputs, run MergeBatchSites on the filtered_pesr_vcf and filtered_depth_vcf files from each batch, then run GenotypeBatch and onwards, using the outputs of MergeBatchSites as the cohort_depth_vcf and cohort_pesr_vcf inputs. While the Terra workspace and documentation are still in progress, the draft of the workspace dashboard might still be useful documentation for a multi-batch cohort.
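For concreteness, the relevant wiring in each batch's GenotypeBatch input.json would then look something like this sketch. Only the two cohort-level inputs named above are shown; GenotypeBatch takes many other batch-specific inputs, the `batch` input name here is assumed, and all paths are placeholders:

```json
{
  "GenotypeBatch.batch": "batch1",
  "GenotypeBatch.cohort_pesr_vcf": "gs://my-bucket/merged/my_cohort.cohort_pesr.vcf.gz",
  "GenotypeBatch.cohort_depth_vcf": "gs://my-bucket/merged/my_cohort.cohort_depth.vcf.gz"
}
```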
