I want to combine data using MakeCohortVcf after processing 500 samples multiple times in cohort mode #715
Comments
Hi @c2997108, GATKSVPipelineBatch already runs MakeCohortVcf. If you ran all 500 samples together, then the output you're looking for is the "clean_vcf". We have some new downstream filtering steps, documented in #695, that you can run to further reduce false positives, although with only 500 samples the benefits might be marginal.
Thanks for your reply, mwalker174. Yes, as you say, each run of 500 samples is merged, and the output clean_vcf files are great. But I have 2000 samples, and the README says to process 100-500 samples per batch, so I want to merge the outputs of four 500-sample batches.
I see, so GATKSVPipelineBatch is designed for cohorts consisting of a single batch. If there are multiple batches, we run each module individually. One important step is MergeBatchSites, which is run prior to genotyping and ensures that all sites across the cohort get genotyped into every batch. Better documentation and a Terra workspace are coming soon for this. If you saved all the outputs from GATKSVPipelineBatch, you should be able to pick up at that step. See the Quickstart section of the readme for directions on how to build example inputs. You'd want to start re-running from MergeBatchSites. If re-running is not an option for you, you could attempt to simply cluster the cleaned vcfs from all 4 batches using GATK SVCluster. However, this is not a recommended practice.
I am looking forward to "Better documentation and a Terra workspace".
Yes, that's correct: you can take your existing GATKSVPipelinePhase1 outputs, run MergeBatchSites on the filtered_pesr_vcf and filtered_depth_vcf files from each batch, then run GenotypeBatch and onwards, using the outputs of MergeBatchSites as the cohort_depth_vcf and cohort_pesr_vcf inputs.

While the Terra workspace and documentation are still in progress, the draft of the workspace dashboard might still be useful documentation for a multi-batch cohort.
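For anyone wanting a script to assemble the inputs, here is a rough Python sketch of the steps described above. It is not an official example. The names filtered_pesr_vcf, filtered_depth_vcf, cohort_pesr_vcf and cohort_depth_vcf come from this thread; the fully qualified JSON keys (such as `MergeBatchSites.pesr_vcfs`), the cohort name, and all file paths are assumptions for illustration and should be checked against the WDL inputs in your checkout. Other required inputs (docker images, reference files and so on) are left out.

```python
import json


def build_merge_batch_sites_inputs(cohort, batch_outputs):
    """Collect the per-batch filtered VCFs into one MergeBatchSites input dict.

    batch_outputs: list of dicts, one per batch, each holding the
    filtered_pesr_vcf and filtered_depth_vcf paths saved from that batch.
    NOTE: the "MergeBatchSites.*" key names are assumed, not confirmed.
    """
    required = ("filtered_pesr_vcf", "filtered_depth_vcf")
    for index, outputs in enumerate(batch_outputs):
        missing = [key for key in required if key not in outputs]
        if missing:
            raise ValueError(f"batch {index} is missing outputs: {missing}")
    return {
        "MergeBatchSites.cohort": cohort,
        "MergeBatchSites.pesr_vcfs": [b["filtered_pesr_vcf"] for b in batch_outputs],
        "MergeBatchSites.depth_vcfs": [b["filtered_depth_vcf"] for b in batch_outputs],
    }


def build_genotype_batch_overrides(merged_pesr_vcf, merged_depth_vcf):
    """Inputs that point GenotypeBatch at the cohort-level merged sites.

    The same two values are used for every batch.
    NOTE: the "GenotypeBatch." prefix is assumed, not confirmed.
    """
    return {
        "GenotypeBatch.cohort_pesr_vcf": merged_pesr_vcf,
        "GenotypeBatch.cohort_depth_vcf": merged_depth_vcf,
    }


if __name__ == "__main__":
    # Hypothetical paths for four batches; replace with your saved outputs.
    batches = [
        {
            "filtered_pesr_vcf": f"gs://example-bucket/batch{i}/batch{i}.filtered_pesr.vcf.gz",
            "filtered_depth_vcf": f"gs://example-bucket/batch{i}/batch{i}.filtered_depth.vcf.gz",
        }
        for i in range(1, 5)
    ]
    inputs = build_merge_batch_sites_inputs("my_cohort", batches)
    print(json.dumps(inputs, indent=2))
```

After MergeBatchSites finishes, the two merged VCFs it produces would be passed to `build_genotype_batch_overrides` and merged into each batch's GenotypeBatch inputs, so that every batch is genotyped against the same cohort-wide site list.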
I was able to complete the processing of 500 samples in cohort mode using GATKSVPipelineBatch.wdl and would like to merge the results according to the instructions in the README.
I would like to follow the GATK-SV workflow for merging because the result of GATK-SV cohort mode is very nice and removes a lot of noise.
How do I pass the output of GATKSVPipelineBatch.wdl as input to MakeCohortVcf? An example of input.json or a script to create input.json would be very nice.