-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Docs/Args cleanup for HaplotypeCaller and other tools, and SplitNCigarReads change #3891
Conversation
The dsde-pipelines repo should be updated when the arg changes go in (all the GenomicsDBImport args are long args that have been changed, for example). |
Codecov Report
@@ Coverage Diff @@
## master #3891 +/- ##
===============================================
+ Coverage 78.974% 79.206% +0.232%
- Complexity 16534 18009 +1475
===============================================
Files 1059 1174 +115
Lines 59160 64889 +5729
Branches 9616 10224 +608
===============================================
+ Hits 46721 51396 +4675
- Misses 8703 9562 +859
- Partials 3736 3931 +195
|
e0a5336
to
c371f99
Compare
@sooheelee I think this is ready to go. Want to double-check? I just made a couple doc fixes, but tests were passing and that shouldn't change. |
I still need to review the read transformer stuff -- will have a look when I get a chance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ldgauthier We no longer have an RNAProgramGroup but rather the two tools are now in two different categories and I have noted these in the review. I have various comments, many of them minor, and so if you have better things to do, let us know.
Also, I just assume the example commands have been kebabbed as needed. Thank you for these updates!
@@ -56,9 +56,14 @@ | |||
* | |||
* This is an implementation of {@link HaplotypeCaller} using spark to distribute the computation. | |||
* It is still in an early stage of development and does not yet support all the options that the non-spark version does. | |||
* | |||
* Specifically it does not support the --dbsnp, --comp, and --bamOutput options. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what this section looks like:
We need formatting elements as I outline to Yossi in #3853, using HaplotypeCaller as the example.
- Period at end of line 55 if intended as first sentence in paragraph.
- Break elements if you want to paragraph the next section.
- Is the
{@link}
element a placeholder for us to later amend @vdauwera? - Use
<pre> </pre>
to surround code blocks.
* for one sample | ||
* 3. The path to the GenomicsDB workspace must be specified | ||
* 4. User may optionally specify paths to which to write JSON files | ||
* Import single-sample GVCFs into GenomicsDB before joint genotyping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar formatting suggestions to above.
* | ||
* The current GATK4 Best Practices for SNP and INDEL Calling use GenomicsDBImport to merge GVCFs from multiple samples, thus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use --> uses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
End sentence with samples.
Remove thus
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The Best Practices ... use" or "The Best Practices Workflow ... uses" I stand by my subject-verb agreement.
I think I meant to finish that sentence with "thus replacing CombineGVCFs"
* | ||
* The current GATK4 Best Practices for SNP and INDEL Calling use GenomicsDBImport to merge GVCFs from multiple samples, thus | ||
* | ||
* GenomicsDB is a utility built on top of TileDB, both open-source projects started by the Intel Science and Technology |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name is now Intel-HLS, where HLS=health and life sciences.
I think you will find the blurb from the ASHG2017 poster helpful:
GATK4 supports use of the GenomicsDB datastore towards joint variant calling. This functionality comes from the Intel-Broad Center for Genomics and enables scaling exemplified by the joint calling of gnomAD’s 20k WGS cohort. The datastore transposes sample-centric variant information across genomic loci to make data more accessible to tools and to manual probing. See https://github.com/Intel-HLS/GenomicsDB/wiki for details and Article#10061 to get started.
- So let's emphasize the collaboration and name the Intel-Broad Center for Genomics instead.
- I do like your explanation of sparse data.
* GenomicsDB contains code to specialize TileDB for genomics applications, such as VCF parsing and INFO field annotation | ||
* calculation. For more details about GenomicsDB see the github wiki at https://github.com/Intel-HLS/GenomicsDB/wiki/ | ||
* | ||
* The GenomicsDB format is not meant to be accessed directly. To query the contents of the datastore, use SelectVariants. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, remember to use <p></p>
elements to demarcate paragraphs.
@@ -70,7 +79,7 @@ | |||
@CommandLineProgramProperties( | |||
summary = "Counts filtered reads at het sites for allele specific expression estimate", | |||
oneLineSummary = "Generates table of filtered base counts at het sites for allele specific expression", | |||
programGroup = ReadProgramGroup.class | |||
programGroup = RNAProgramGroup.class |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RNAProgramGroup --> CoverageAnalysisProgramGroup
/** | ||
* Created by gauthier on 11/28/17. | ||
*/ | ||
public final class RNAProgramGroup implements CommandLineProgramGroup { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a category. The tools under it are dispersed to other categories now. Thanks for the initiative in this though. Appreciate it.
@@ -85,22 +85,22 @@ | |||
* | |||
* <h4>Refine genotypes based on the discovered allele frequency in an input VCF containing many samples</h4> | |||
* <pre> | |||
* gatk-launch --javaOptions "-Xmx4g" CalculateGenotypePosteriors \ | |||
* ./gatk --javaOptions "-Xmx4g" CalculateGenotypePosteriors \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, also to below.
* -V input.vcf \ | ||
* -selectType SNP \ | ||
* -O output.vcf | ||
* </pre> | ||
* | ||
* <h4>Query Chromosome 20 Variants from a GenomicsDB</h4> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome you remembered to add this example.
@@ -78,8 +78,8 @@ | |||
* | |||
* <h4>Applying rcelibration/filtering to SNPs</h4> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rcelibration --> recalibration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments on your comments. Otherwise everything else is fine to change.
* | ||
* The current GATK4 Best Practices for SNP and INDEL Calling use GenomicsDBImport to merge GVCFs from multiple samples, thus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The Best Practices ... use" or "The Best Practices Workflow ... uses" I stand by my subject-verb agreement.
I think I meant to finish that sentence with "thus replacing CombineGVCFs"
* -L chr1:1000-10000 \ | ||
* -sampleNameMap cohort.sample_map \ | ||
* -readerThreads 5 | ||
* </pre> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
./gatk is only valid if you're running from the same directory as the script. Presumably users will add that directory to their path and be able to invoke gatk
anywhere.
I thought you wanted commands representative of how we typically run the tools, but if that's not true then you can take the memory arg out.
I missed the kebab change here.
* <li>A single interval must be provided</li> | ||
* <li>Each input GVCF must contain only one sample</li> | ||
* <li>Input GVCFs cannot contain multiple entries for a single genomic position</li> | ||
* <li>Currently, a GenomicsDB must be used from the directory structure in which it was created. If copied between file systems, the absolute path must be the same at the destination.</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My 1) maps to her 2).
I'm not 100% sure that her 1) is true in general, but it is if you're interacting with the GDB through GATK.
Her 3) is worth adding
My 3) is irrelevant if people are using HC GVCFs, but it's still true that if you, for example, split multi-allelics there's no guarantee what will happen.
For my 4) this was a huge issue when I debug stuff on the cloud. You can't download the GDB unless you put it in /cromwell_root/<cromwell_server>/<workflow_name>/<workflow_hash>/<task_name>/<shard_number>/genomicsdb.tar, which isn't possible if you don't have root access to create the /cromwell_root directory. If you ever want to give a GDB to someone else they need to exactly recapitulate the absolute path that you have.
@lbergelson may also have other caveats related to the 20K fire. Like sample names must be unique? sample_map file must use the sample name in the GVCF?
* <li>Currently, a GenomicsDB must be used from the directory structure in which it was created. If copied between file systems, the absolute path must be the same at the destination.</li> | ||
* </ul> | ||
* | ||
* <h3>Developer Note</h3> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a class in the repo, but not a runable tool. This note is for people who want to develop walkers that take GDBs and input.
@@ -48,21 +49,19 @@ | |||
* | |||
* <h3>Usage example</h3> | |||
* | |||
* <h4>Perform joint genotyping on a set of GVCFs enumerated in the command line</h4> | |||
* <h4>Perform joint genotyping on a single single-sample GVCF or a single multi-sample GVCF</h4> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you're running a pipeline that runs on cohorts and encounters a cohort of n=1. Also, allele-specific annotations are only available through the GVCF workflow, not from HaplotypeCaller alone.
@@ -76,19 +75,22 @@ | |||
* for non-diploid organisms.</p> | |||
* | |||
*/ | |||
@CommandLineProgramProperties(summary = "Perform joint genotyping on one or more samples pre-called with HaplotypeCaller", oneLineSummary = "Perform joint genotyping on one or more samples pre-called with HaplotypeCaller", programGroup = VariantProgramGroup.class) | |||
@CommandLineProgramProperties(summary = "Perform joint genotyping on one or more samples pre-called with HaplotypeCaller", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This summary is old and now inaccurate. The GATK4 tool cannot take multiple HaplotypeCaller GVCFs. "Perform joint genotyping on a single-sample GVCF from HaplotypeCaller or a multi-sample GVCF from CombineGVCFs"
* The current GATK4 Best Practices for SNP and INDEL Calling use GenomicsDBImport to merge GVCFs from multiple samples, thus | ||
* <p>The GATK4 Best Practice Workflow for SNP and Indel calling uses GenomicsDBImport to merge GVCFs from multiple samples. | ||
* GenomicsDBImport offers the same functionality as CombineGVCFs and comes from the <i>Intel-Broad Center for Genomics</i>. | ||
* The datastore transposes sample-centric variant information across genomic loci to make data more accessible to tools and to manual probing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There really is no manual probing. God help you if you start to dig into the subdirectories. The big selling point is that the storage footprint is much smaller than the output of CombineGVCFs.
06352d4
to
77ee39d
Compare
77ee39d
to
83a1082
Compare
I just remembered @ldgauthier we need a special note for GenomicsDBImport from #3137. To quote @droazen:
Also need to propogate the Xmx option to the 2nd example command. Shall I do this? |
One more thing, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review complete, back to @ldgauthier
@@ -60,32 +60,54 @@ | |||
*/ | |||
public static final int NO_INTERVAL_SHARDING = -1; | |||
|
|||
//NOTE: these argument names are referenced by HaplotypeCallerSpark | |||
public static final String SHARD_SIZE_LONG_NAME = "read-shard-size"; | |||
public static final String SHARD_SIZE_SHORT_NAME = "readShardSize"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the short names be camel-case while the long names are kebab-style seems very confusing. I would just make the short names equal to the long names for all but the most common engine arguments (eg., -R
, -L
, etc.). If you omit the declaration of the short names completely, they will default to the long names, which would allow you to just delete half of these constants.
Same comment applies to the rest of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checked with @vdauwera, and she agrees.
public static final String PROFILE_OUT_LONG_NAME = "activity-profile-out"; | ||
public static final String PROFILE_OUT_SHORT_NAME = "APO"; | ||
public static final String ASSEMBLY_REGION_OUT_LONG_NAME = "assembly-region-out"; | ||
public static final String ASSEMBLY_REGION_OUT_SHORT_NAME = "ARO"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure these debugging args for activity profile/assembly region output are worthy of having special short names.
* -V data/gvcfs/father.g.vcf.gz \ | ||
* -V data/gvcfs/son.g.vcf.gz \ | ||
* -genomicsDBWorkspace my_database \ | ||
* --intervals 20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use the more familiar -L
here I think.
* -V data/gvcfs/mother.g.vcf.gz \ | ||
* -V data/gvcfs/father.g.vcf.gz \ | ||
* -V data/gvcfs/son.g.vcf.gz \ | ||
* -genomicsDBWorkspace my_database \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-genomicsDBWorkspace
-> --genomicsdb-workspace-path
* <pre> | ||
* gatk --javaOptions "-Xmx4g -Xms4g" \ | ||
* GenomicsDBImport \ | ||
* -genomicsDBWorkspace my_database \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-genomicsDBWorkspace
-> --genomicsdb-workspace-path
@@ -36,7 +36,7 @@ | |||
* using the variational Bayes algorithm. | |||
*/ | |||
@Advanced | |||
@Argument(fullName = "maxGaussians", shortName = "mG", doc = "Max number of Gaussians for the positive model", optional = true) | |||
@Argument(fullName = "max-gaussians", shortName = "mG", doc = "Max number of Gaussians for the positive model", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of these other short names should definitely go, though, I think
|
||
/** | ||
* A read transformer to modify the mapping quality of reads | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mention in docs how the mapping quality is modified, exactly.
@@ -162,7 +163,7 @@ public void testDeletions() throws IOException { | |||
@Test | |||
public void testUnfilteredBecomesFilteredAndPass() throws IOException { | |||
final IntegrationTestSpec spec = new IntegrationTestSpec( | |||
baseTestString("unfilteredForFiltering.vcf", " --filterExpression 'FS > 60.0' --filterName SNP_FS "), | |||
baseTestString("unfilteredForFiltering.vcf", " -filter 'FS > 60.0' -filterName SNP_FS "), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use arg name constants here?
Arrays.asList(largeFileTestDir + "expected.NA12878.RNAseq.splitNcigarReads.maxMismatchesInOverhang0.bam")); | ||
spec.executeTest("test splits with overhangs 0 mismatches", this); | ||
} | ||
|
||
@Test | ||
public void testSplitsWithOverhangs5BasesInOverhang() throws Exception { | ||
IntegrationTestSpec spec = new IntegrationTestSpec( | ||
"--maxBasesInOverhang 5 -R " + b37_reference_20_21 + " -I " + largeFileTestDir + "NA12878.RNAseq.bam -O %s --processSecondaryAlignments", | ||
"--max-bases-in-overhang 5 -R " + b37_reference_20_21 + " -I " + largeFileTestDir + "NA12878.RNAseq.bam -O %s --process-secondary-alignments", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these tests use constants for the arg names, since we're touching them anyway?
args.add("--excludeFiltered"); | ||
args.add("-ts_filter_level"); | ||
args.add("-exclude-filtered"); | ||
args.add("-truth-sensitivity-filter-level"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use long names here
@@ -21,7 +23,8 @@ | |||
oneLineSummary = "Gathers scattered VQSLOD tranches into a single file", | |||
programGroup = VariantProgramGroup.class | |||
) | |||
@DocumentedFeature | |||
@DocumentedFeature(enable=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to comment on this: what's up with the enable=false
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will cause the doc for the feature to not be included in the output. The enable property was ported over from the GTAK3 doc system, though I'm not sure what purpose it served there. Although it will work here (it will hide the doc), you can achieve the same effect achieved by just removing the @DocumentedFeature
annotation altogether.
@droazen Do you really feel strongly about these short names? I left them because scripts that don't use the long names won't break. But if you really want to break literally ALL the WDLs involving GATK, I will change them but then I'm not going to be able to get this PR done by the deadline. |
@ldgauthier Well, I think that having camel-cased short names and kebab-style long names is pretty confusing, and this might be our only chance to fix them to make them consistent, but I defer to @vdauwera on the issue of whether fixing the short names now is worth the extra effort. @vdauwera Can you please weigh in here? |
Re-assigning to @jamesemery to address the outstanding review comments, since @ldgauthier has indicated that she won't have time this week. |
* --resource omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.hg38.sites.vcf.gz \ | ||
* --resource 1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \ | ||
* --resource dbsnp,known=true,training=false,truth=false,prior=2.0 Homo_sapiens_assembly38.dbsnp138.vcf.gz \ | ||
* -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \ | ||
* -mode SNP \ | ||
* --recalFile output.recal \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If i'm not mistaken, it looks like there isn't actually an argument for --recalFile on this class. Should I remove mention of it from the docs? Is there supposed to be one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, that's GATK3 style. That should be --output. I prefer to leave the extension as .recal (it gets parsed as a VCF, but it doesn't have all the data you'd expect), thought @sooheelee might have thoughts.
3086421
to
516bbc0
Compare
Sans some tests I possibly missed fixing, I think I have responded to all of the comments and cut back on a number of the short argument names. I have also made an attempt to fix the wdls but I have not meticulously tested that they now necessarily work. @droazen |
Is this ready to merge? @droazen @ldgauthier? |
Looks like checks have passed but we need a green light from @droazen. |
@sooheelee This is NOT ready to merge -- it needs another review pass from me, and likely some additional changes. |
Alright. Just hopeful that I can take stock of what remains for documentation this weekend. I am looking into an octopus merge now. |
Do not merge -- I'm taking this & the other argument-renaming branches over to modify them for consistency and get them merged before Monday. |
97dd4a3
to
1c8f2cc
Compare
Ugh, this branch was a complete nightmare to rebase... |
1c8f2cc
to
89f0f18
Compare
75b6721
to
9615f62
Compare
This one is (finally!) good to go once tests pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
* Lots of docs changed, args kebabed, integration tests args updated * Refactor SplitNCigarReads and add MQ read transformer
9615f62
to
eb1da39
Compare
* Lots of docs changed, args kebabed, integration tests args updated * Refactor SplitNCigarReads and add MQ read transformer
Updated docs with usage examples and kebabed long args for:
Elaborated on/fixed docs for:
Hid GatherTranches
Added a ReadTransformer to SplitNCigarReads to simplify the command from the old RNA best practices (https://software.broadinstitute.org/gatk/documentation/article.php?id=3891) NOTE: this slightly changes the default behavior @vdauwera
Unfortunately I squashed the SplitNCigarReads changes into the doc fixes. :( If that's a problem I can split into two commits.