All hands on deck: tool doc updates #3853

Closed
sooheelee opened this issue Nov 17, 2017 · 92 comments
@sooheelee commented Nov 17, 2017

Thank you everyone for your contributions towards this documentation effort.
Instructions from @vdauwera to follow at this Google doc
Favorite tool doc examples from @vdauwera NOW in her SOP doc.
Spreadsheet from @sooheelee to be posted here

@sooheelee commented Nov 20, 2017

I've tentatively categorized the tools; they are listed in spreadsheet format at:

https://docs.google.com/a/broadinstitute.org/spreadsheets/d/19SvP6DHyXewm8Cd47WsM3NUku_czP2rkh4L_6fd-Nac/edit?usp=sharing

  • [1] GATK4 and Picard tool categories are up for discussion. They are meant to be functional and will be used at https://software.broadinstitute.org/gatk/documentation/tooldocs/current/. First pass by Soo Hee. If you have a better idea, please write to this issue ticket.
  • [2] We can do better than the minimum, but at minimum each tool must have a summary description and an example command.
    • Authorship should not be picked up by gatkDocs (but it can remain in the javaDoc portion of the code so long as it is masked). If * @author Valentin Ruano-Rubio <[email protected]> is placed at the top of the doc, it causes the javaDoc not to show. Such lines should go at the end of the javaDoc portion. @vdauwera prefers that all author annotations be removed.
    • Tool commands should use gatk to invoke the launch script, not gatk-launch. The engine team tells me this change will be effective by the end of this month.
    • A number of tools need -Xmx to be defined, and this should be reflected in the example command(s). Hopefully, if your tool needs it, you already know it. Otherwise, see "Ensure that the -Xmx values in the example tool commands in GATKDocs are accurate" #3137.
  • [3] **AMENDED** Documentation of Picard tools in the Best Practices is a priority, as is categorization of Picard tools. In the forum tool list, Picard tools will be mixed alphabetically with GATK tools, with the PICARD label coming after the tool name.

To view the docs, build with ./gradlew clean gatkDoc, then view the local index in a browser.

@sooheelee commented Nov 21, 2017

@vdauwera The tools are categorized and listed in the Google Spreadsheet above. It is waiting for you to assign tech leads to tools for documentation.

One thing that @chandrans brought to my attention is that for BaseRecalibrator, one of the parameters (-bqsr) actually causes an error. One can no longer generate the second recalibration table with on-the-fly correction; instead, one must run the recalibrated BAM through BaseRecalibrator to generate the second recal table for plotting. This type of information is missing from the tool docs. Furthermore, updates I made to the BQSR slide deck (showcasing this -bqsr parameter) were based on information from a developer, and that information turns out to be incorrect now (perhaps it was correct at some point in development?). So I think it may be prudent for those responsible for tool docs to test the commands on data.

What the gatkDocs look like as of commit of Mon Nov 20 17:30:46 2017 -0500 where we upgraded htsjdk to 2.13.1:

gatkdoc.zip

Download and load the index.html into a web browser to click through the docs.

@sooheelee

Geraldine says she is busy catching up this week, so I think it best if tech leads assign the tools to members of their teams: @droazen @cwhelan @samuelklee @ldgauthier @vruano @yfarjoun @LeeTL1220.

@sooheelee

If we can agree on tool categorization sooner rather than later, this gives @cmnbroad time to engineer any changes that need engineering.

@samuelklee

Any chance we could break off the legacy CNV tools into their own group? There are many more of them than there will be in the new pipelines, and many of them are experimental, deprecated, unsupported, or for validation only, so I think it makes sense to hide them and perhaps be less stringent about their documentation requirements. Anything we can do to reduce the support burden before release would be great.

@sooheelee commented Nov 21, 2017

I just learned that KEBAB case is different from SNAKE case, @cmnbroad. Sorry if KEBAB is offensive, @cmnbroad, but it is meant to clarify syntax (e.g. https://lodash.com/docs#kebabCase). To be clear, Geraldine wants KEBAB case, which uses hyphens, and not SNAKE case, which uses underscores.

  • So --emitRefConfidence would become --emit-ref-confidence.
  • So --contamination_fraction_to_filter would become --contamination-fraction-to-filter.

@vruano will describe how he uses constants to manage parameters.

@vruano commented Nov 21, 2017

Since we are going to change many of those argument names (camelCase to kebab-case), I think we should take this opportunity to use constants to specify argument names in the code and reuse them in our test code, so that further changes to argument names don't break tests.

Take as an example CombineReadCounts. Extract enclosed below.

It might also be beneficial to add public constants for the default values.

public final class CombineReadCounts extends CommandLineProgram {

    public static final String READ_COUNT_FILES_SHORT_NAME = StandardArgumentDefinitions.INPUT_SHORT_NAME;
    public static final String READ_COUNT_FILES_FULL_NAME  = StandardArgumentDefinitions.INPUT_LONG_NAME;
    public static final String READ_COUNT_FILE_LIST_SHORT_NAME = "inputList";
    public static final String READ_COUNT_FILE_LIST_FULL_NAME = READ_COUNT_FILE_LIST_SHORT_NAME;
    public static final String MAX_GROUP_SIZE_SHORT_NAME = "MOF";
    public static final String MAX_GROUP_SIZE_FULL_NAME = "maxOpenFiles";
    public static final int DEFAULT_MAX_GROUP_SIZE = 100;

    @Argument(
            doc =  "Coverage files to combine, they must contain all the targets in the input file (" +
                    TargetArgumentCollection.TARGET_FILE_LONG_NAME + ") and in the same order",
            shortName = READ_COUNT_FILE_LIST_SHORT_NAME,
            fullName  = READ_COUNT_FILE_LIST_FULL_NAME,
            optional = true
    )
    protected File coverageFileList;

    @Argument(
            doc = READ_COUNT_FILES_DOCUMENTATION, // constant defined elsewhere in the class (omitted from this extract)
            shortName = READ_COUNT_FILES_SHORT_NAME,
            fullName = READ_COUNT_FILES_FULL_NAME,
            optional = true
    )
    protected List<File> coverageFiles = new ArrayList<>();

    @Argument(
            doc = "Maximum number of files to combine simultaneously.",
            shortName = MAX_GROUP_SIZE_SHORT_NAME,
            fullName = MAX_GROUP_SIZE_FULL_NAME,
            optional = false
    )
    protected int maxMergeSize = DEFAULT_MAX_GROUP_SIZE;

    @ArgumentCollection
    protected TargetArgumentCollection targetArguments = new TargetArgumentCollection(() ->
            // orElse(null), not orElseGet(null): orElseGet expects a Supplier
            // and would throw a NullPointerException when the stream is empty
            composeAndCheckInputReadCountFiles(coverageFiles, coverageFileList).stream().findFirst().orElse(null));

    @Argument(
            doc = "Output file",
            shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME,
            fullName  = StandardArgumentDefinitions.OUTPUT_LONG_NAME,
            optional  = false
    )
    protected File outputFile;

    // ... remainder of the class omitted ...
}
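A minimal standalone sketch (not GATK code; all names below are invented for illustration) of the payoff: test code builds its command line from the same constants the tool declares, so a later camelCase-to-kebab-case rename only touches the constant definitions.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: argument names declared once as constants and reused
// by test code, so renaming an argument (e.g. "maxOpenFiles" to
// "max-open-files") does not break any hard-coded test strings.
final class ArgNameConstantsDemo {
    public static final String MAX_GROUP_SIZE_FULL_NAME = "max-open-files";
    public static final int DEFAULT_MAX_GROUP_SIZE = 100;

    // What a test would do: compose the argument list from the constants
    // instead of from string literals.
    public static List<String> exampleTestArgs() {
        return Arrays.asList(
                "--" + MAX_GROUP_SIZE_FULL_NAME,
                Integer.toString(DEFAULT_MAX_GROUP_SIZE));
    }
}
```

With this pattern, a rename only changes the constant's value; every test that composes arguments this way keeps passing unchanged.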

@sooheelee

@samuelklee Because our repo is open-source, even if we hide them from the docs, users end up asking questions about them. So no to hiding any tool that is in the repo.

  • More importantly, it is good to have documentation for ourselves for the tools we use internally and for legacy tools. Once a developer leaves, this is typically all that remains for the rest of us who may have to answer forum questions on the code or develop the functionality further. We get questions on all ranges of tool versions and minor releases.

Even when we deprecate a tool or feature, we give people fair warning that the tool/feature will be deprecated before actually removing it from the codebase.

  • I understand you would rather folks drive the new car, but the old clunky car model is out there on the road being driven by researchers right now. When they have questions, they come to the GATK forum for answers, and we need to have, at the least, tool documentation.
  • If time is limiting, the least we should do for these legacy tools is to state within the doc that (i) folks should use another tool XYZ instead and (ii) the tool/option will be deprecated in the near future.

Besides the BETA label, another option that will soon become available is the Experimental label for internal tools; @cmnbroad is implementing it now. It would be great to have additional categories, but @cmnbroad says this is as much as he has time to do for us, and perhaps this is for the best, because we don't want to clutter our docs with too many labels. Perhaps @vdauwera can weigh in with thoughts and options.

  • The Experimental label is not an appropriate label for legacy/deprecated tools, though.

@samuelklee

Fair points. I agree that legacy tools/versions that are part of a canonical or relatively widely used pipeline should have good documentation.

However, many of the CNV tools are basically prototypes: they have never been part of a pipeline, have no tutorial materials, and the chances that any external users have actually used them are probably extremely low. The sooner they are deprecated, the less the overall burden on both comms and methods. I don't think comms should need to feel protective of code or tools that developers are willing to scrap wholesale!

I'd like to cordon off or hide such tools so the program group doesn't get too cluttered; if we can do this in a way that doesn't require @cmnbroad to add more categories, that would be great. For example, we will have 5 tools that one might reasonably try to use for segmentation (PerformSegmentation, ModelSegments, PerformAlleleFractionSegmentation, PerformCopyRatioSegmentation, and PerformJointSegmentation). The first two are part of the legacy and new pipelines, respectively, but the last 3 were experimental prototypes. I think it's definitely confusing to present these 3 in the program group, and treating them the same as the other tools in terms of documentation is just extra work for everyone.

In any case, I definitely think an additional program group to separate the legacy and new tools is warranted, since many of the updated tools in the new pipeline have very similar names to the legacy tools. If this is OK with everyone, I'll just add a "LegacyCopyNumber" program group, which I don't think should require extra work on anyone else's part.

@vdauwera

Hiding / deprecating tools and their docs

@samuelklee To add to @sooheelee's answer, if there are any tools that you definitely want gone and already have a replacement for, I would encourage you to kill them off (ie delete from the code) before the 4.0 launch. While we're still in beta we can remove anything at the drop of a hat. Once 4.0 is out, we'll have a deprecation policy (exact details TBD) that will allow us to prune unwanted tools over time, but it will be less trivial. And as Soo Hee said, everything that's in the current code release MUST be documented. We used to hide tools/docs in the past and it caused us more headaches than not.

That being said, as part of that TBD deprecation policy it will probably make sense to make a "Deprecated" program group where tools go to die. If there are tools you plan to kill but don't want to do it before 4.0 is released for whatever reason, you could put them there. Documentation standards can be less stringent for tools in that bucket. To be clear I think the deprecation group name should be generic, ie not named to match any particular use case or functionality. That will help us avoid seeing deprecation buckets proliferate for each variant class/ use case. Does that sound like a reasonable compromise?

@vdauwera

Guidelines for converting arguments to kebab case

We're not following an external spec doc, so here are some guidelines to follow instead. Keep in mind that the main thing we're going for here is readability and consistency across tools, not absolute purity, so feel free to raise discussion on any cases where you feel the guidelines should be relaxed. Some things are more negotiable than others.

  1. Use all lower-case (yes, even for file formats).
  2. Use only dash (-) as separator, no underscores (because lots of newbies struggle to differentiate the two, and underscores take more effort to type than dashes).
  3. Separate words rather than smushing them together, eg use --do-this-thing rather than --dothisthing (this is really important for readability, especially for non-native English speakers).
  4. Avoid cryptic abbreviations and acronyms; eg use --do-this-thing rather than --dtt
  5. If you end up with --really-long-argument-names-that-take-up-half-a-line, please reach out and ask for a consult; maybe we can find a more succinct way of expressing what you need.
  6. If you run into any situation not covered above, please bring it up in this thread.
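As an aside, the mechanical part of rules 1-3 can be captured in a few lines. This is a throwaway sketch (not part of GATK) that handles simple camelCase and snake_case names; it would not handle runs of capitals such as acronyms:

```java
// Throwaway helper (not GATK code): converts simple camelCase or snake_case
// argument names to the kebab-case convention described above.
final class KebabCase {
    public static String toKebab(final String name) {
        return name
                .replaceAll("([a-z0-9])([A-Z])", "$1-$2") // split at camelCase boundaries
                .replace('_', '-')                        // snake_case separators become dashes
                .toLowerCase();
    }
}
```

For example, toKebab("emitRefConfidence") yields "emit-ref-confidence", and toKebab("contamination_fraction_to_filter") yields "contamination-fraction-to-filter".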

@vdauwera

Using constants for argument names

Sounds like a fantastic idea -- I encourage everyone to follow @vruano's lead on this one.

@samuelklee commented Nov 21, 2017

OK, great---I'll issue some PRs to delete some of the prototype tools soon and update the spreadsheet accordingly. A non-CNV-specific "Deprecated" program group seems reasonable to me if there is enough demand. If this is the only way to delineate the legacy CNV + ACNV pipeline from the new pipeline, I'm OK with it---but we should probably make the situation clear at any workshops, presentations, etc. between now and release that might focus on the legacy pipeline.

On a different note, are there any conventions for short names that we should follow?

@magicDGS

I propose that we still hide the example walkers from the command line and the docs. They are meant only for developers, to show how to use some kinds of walkers and to have a running tool for integration tests. Having them on the command line will encourage users to run them instead of using them for their intended development purpose...

In addition, I think this is a good moment to also introduce a sub-module structure (as I suggested in #3838) to separate artifacts for the different pipelines/framework bits (e.g., engine, Spark engine, experimental, example code, CNV pipeline, general tools, etc.). For the aim of this issue, this would be useful for setting documentation guidelines for each of the sub-modules: e.g., example code should be documented for developers but not for the final user; every @DocumentedFeature in the experimental module should have the @Experimental Barclay annotation; etc.

@cmnbroad

A couple of comments:

  • If we're going to add a generic "Deprecated" program group, it should be added in Picard rather than GATK, so it can be used by both. If it's defined in GATK, Picard can't use it, and deprecated Picard tools won't be grouped with deprecated GATK tools. (We currently have this situation with the SAM/BAM group and the VCF group: the Picard tools are grouped separately from the GATK ones, but #3824, "Identify Picard tools in help/doc using ' (Picard)' suffix", fixes that.)
  • There is a CommandLineProgramProperties attribute called "omitFromCommandLine". If it is set to true, the tool isn't displayed in the list of tools on the command line. Currently, all of the example tools, about 5 SV tools, and FixCallSetSampleOrdering have this set to true. I personally don't think we should clog up the command line with the example tools (they're for developers, not end users), but either way there should probably be a policy on how to employ this property.

@ldgauthier

To clarify the build process noted above: "view local index in browser" means open the index.html file at gatk/build/docs/gatkdoc/.

@sooheelee

We need the standard arguments to show up in the documentation for both Picard and GATK. Can @droazen or @cmnbroad please let me know whether this is happening?

@cmnbroad

The standard arguments for each tool are listed with that tool's arguments (if you look at the doc for a particular tool, you'll see an "Optional Common Arguments" heading, with the shared, common arguments listed there).

The GATK4 doc system doesn't generate a separate page for these like GATK3 did, and I think doing so would be of questionable value, since there are several classes of tools, each of which has its own set of "common" arguments (GATK Walker tools, GATK Spark tools, Picard tools, and some GATK "cowboy" tools that do their own thing).

We did discuss an alternative design a while back with @droazen and @vdauwera, but that was never implemented, and was a variant of the current design where the common args are included with each tool.

@sooheelee commented Nov 27, 2017

@cmnbroad and @vdauwera Barclay doesn't pull the USAGE_DETAILS portion of Picard tools into the gatkDocs. So Picard documentation is minimal, with just a summary description of each tool.

[screenshot: gatkDoc page for a Picard tool showing only the summary description]

It doesn't seem right to duplicate the same information in a tool doc: once in the asterisked javaDoc portion and once in USAGE_DETAILS for whatever system creates this view, which, as I understand it, will someday go by the wayside in favor of Picard documentation being offered only through https://software.broadinstitute.org/gatk/.

It seems we should use the asterisked gatkDoc portion for the GATK-specific documentation we want, e.g. commands that invoke Picard tools through the gatk launch script using GATK4 syntax, and pull the rest of the documentation from USAGE_DETAILS (the Picard jar command example).

I've prioritized Picard tools in a second tab of the shared Google spreadsheet towards Picard doc updates. Please let me know how we want to approach Picard tool doc updates @vdauwera.

@vruano commented Dec 6, 2017

I guess we need to do some work on the doclet code: abstract out the example code and use templates to transform it into the appropriate format/syntax depending on which project is generating the documentation.

Alternatively, and only if the documentation html is well formed (xhtml-like), we could in theory use XSLT transformation style sheets to convert embedded code examples encoded as xml/xhtml into the concrete syntax. Most major browsers support XSLT.

EDIT: The XSLT solution probably won't work, since even if we try to change the output from html to xhtml, the fact that we are injecting the javadoc's html would probably break the xhtml.

@sooheelee

Thanks for taking this on @vruano.

@vruano commented Dec 6, 2017

I created a separate issue for this: #3932.

@sooheelee commented Dec 7, 2017

If your PR is ready for review, please copy-paste the url to the PR into the spreadsheet, like @vruano did for Picard MergeVcfs, AND tag @sooheelee so I can find it easily.

@sooheelee

Just to be clear, folks: we are using gatk directly, not ./gatk, in the example commands. And if you can, for those of you yet to make your updates, please use compressed file examples. For those of you who've already put in changes, thank you; Comms can tidy those bits later.

<h3>Usage examples</h3>
<pre>
gatk --javaOptions "-Xmx4g" GenotypeGVCFs \
	-R Homo_sapiens_assembly38.fasta \
	-V combined.g.vcf.gz \
	-O cohort.vcf.gz
</pre>

<pre>
gatk GenotypeGVCFs \
	-R reference.fa \
	-V combined.g.vcf.gz \
	-O cohort.vcf.gz
</pre>

@sooheelee commented Dec 8, 2017

The Mutect2 filters list is currently absent from the FilterMutectCalls tool doc.

@vdauwera I think these need an explanation somewhere, perhaps the tool docs, much like the Read Filters and Variant Annotations. What say you?

[screenshot: Mutect2 filters list]

@vdauwera commented Dec 8, 2017

Those should show up in the M2 tool doc -- or FilterMutectCalls, whichever tool actually takes those args.

@sooheelee

Yes, they show up in FilterMutectCalls. Thanks for that pointer.

@sooheelee

Just a reminder to omit --javaOptions unless your tool needs it. If it does, remember to kebabify this too to --java-options.

@sooheelee

@vdauwera We need to also generate docs for Picard metrics. I think we can classify them separately, as we do Read Filters and Variant Annotations, e.g. as "Picard Metrics".

@cmnbroad Does Barclay pull these in as well?

@cmnbroad

@sooheelee If you add @DocumentedFeature to them and assign them a groupName, groupSummary, and summary, the top-level class javadoc will show up in the doc output. But since they don't have command line arguments, I think it would require additional dev work to make the field-level doc show up like it did in the old Picard doc.
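To make the shape of that concrete, here is a hedged sketch of what an annotated metrics class might look like. The @DocumentedFeature annotation below is a local stand-in declared with the same element names so the snippet compiles on its own; in real code you would import org.broadinstitute.barclay.help.DocumentedFeature instead. The class and field names are invented for the example.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Local stand-in for Barclay's @DocumentedFeature so this sketch is
// self-contained; the element names mirror the ones mentioned above.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface DocumentedFeature {
    String groupName();
    String groupSummary();
    String summary();
}

// Hypothetical metrics class: only this top-level summary would surface in
// the doc output; field-level doc would need extra dev work, as noted above.
@DocumentedFeature(
        groupName = "Metrics",
        groupSummary = "Metrics classes that describe tool outputs",
        summary = "Example duplication metrics (illustrative only).")
final class ExampleDuplicationMetrics {
    public long READ_PAIRS_EXAMINED;
}
```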

@magicDGS

@cmnbroad - Couldn't this be done by annotating them with @Argument as a temporary solution? Then a custom template for the metric group could be used...

@sooheelee

Since I have your attention on the matter, let me test what this looks like.

@sooheelee

@cmnbroad @magicDGS, I've moved the discussion to #3947.

@sooheelee

Hidden tools are coming out of the woodwork and need classification.

@sooheelee

If at least one example command for a Spark tool does not utilize Spark, then I think we need this statement in each tool doc to which it applies:

This tool can be run without explicitly specifying Spark options. That is to say, the given example command without Spark options will run locally. See Tutorial#10060 for an example of how to set up and run a Spark tool on a cloud Spark cluster.

copy-pastable format:

This tool can be run without explicitly specifying Spark options. That is to say, the given example command without Spark options will run locally. See <a href ="https://software.broadinstitute.org/gatk/documentation/article?id=10060">Tutorial#10060</a> for an example of how to set up and run a Spark tool on a cloud Spark cluster.

This tutorial isn't the best, as it only covers Google Dataproc, but it's what we currently have to get people started. Let's just say it's a placeholder until we get something better up.

@sooheelee commented Dec 18, 2017

@yfarjoun @vdauwera I've refined the tool categorization based on feedback on the tentative categorization. Thank you @yfarjoun for the review and feedback. The refinement is reflected in the new tab of the shared Google Spreadsheet: 1217Changes_categorization-and-assignments. I've separated out GATK vs Picard tools for each of the categories.

Here is a summary of the changes.

  1. Move these 11 tools to Diagnostics and QC:
    AnalyzeCovariates (from Alignment, Duplicate flagging and BQSR)
    GatherBQSRReports (from Alignment, Duplicate flagging and BQSR)
    FlagStat (from Read Data Manipulation)
    FlagStatSpark (from Read Data Manipulation)
    GetSampleName (from Read Data Manipulation)
    Picard BamIndexStats (from Read Data Manipulation)
    Picard CalculateReadGroupChecksum (from Read Data Manipulation)
    Picard CheckTerminatorBlock (from Read Data Manipulation)
    Picard CompareSAMs (from Read Data Manipulation)
    Picard ValidateSamFile (from Read Data Manipulation)
    Picard ViewSam (from Read Data Manipulation)

  2. Merge the 14 tools remaining in Alignment, Duplicate flagging and BQSR with the 37 tools in Read Data Manipulation (51 tools total), keeping the latter name.

  3. Move these out of Read Data Manipulation:
    CompareDuplicatesSpark (to DxQC)
    ConvertHeaderlessHadoopBamShardToBam (to Other)
    CreateHadoopBamSplittingIndex (to Other)

  4. Move ValidateVariants into Variant Evaluation. Also rename:
    Variant Evaluation and Refinement --> Variant Evaluation
    VCF Manipulation --> Variant Manipulation

Let us know your thoughts. Thank you.

@sooheelee commented Dec 18, 2017

Hey developers working on GATK4 doc updates (tagging tech leads here) @cwhelan @samuelklee @ldgauthier @vruano @yfarjoun @LeeTL1220. Just a reminder (from @vdauwera and @droazen) that we want to get rid of many of the short-form arguments and keep the long form only for the more obscure arguments; that is, we keep both forms for commonly used and well-understood arguments.

@sooheelee

Alright, everyone. If you can have your tooldoc updates merged by this Friday, that would be great. That gives me time to take stock of what is missing and update the documentation accordingly over the weekend.

@sooheelee

The Picard Program Group assignments broadinstitute/picard#1043 PR has been merged.

@sooheelee commented Jan 8, 2018

I've gone in and worked directly on the branches in #4025, #4068 and #4019, in the doc portions of the code. These are undergoing Travis testing and one of these is awaiting additional (dev) review.

@sooheelee

Thanks everyone for your contributions to updating the tool docs for yesterday's GATK4.0 release.

In my brief surveys of the status of the tool doc updates, I noticed a number of tools without example commands. I will fill in the missing ones going forward. Thanks again for your efforts.
