Update documentation for various tools (MW) #4026

Merged: 1 commit into master, Jan 9, 2018

Conversation

mwalker174 (Contributor)

Documentation update for the following tools:

BwaMemIndexImageCreator
ClipReads
CompareDuplicatesSpark
ConvertHeaderlessHadoopBamShardToBam
CreateHadoopBamSplittingIndex

sooheelee (Contributor)

Just seeing this. I've only been looking at PRs where I'm explicitly tagged in the comments (e.g. with @sooheelee), so I will take a quick look.

sooheelee (Contributor) left a comment

The content for ConvertHeaderlessHadoopBamShardToBam is particularly cryptic. Otherwise, I have general comments and changes to the program groups.

import org.broadinstitute.barclay.argparser.Argument;
import org.broadinstitute.barclay.argparser.BetaFeature;
import org.broadinstitute.barclay.argparser.CommandLineProgramProperties;
import org.broadinstitute.barclay.help.DocumentedFeature;
import org.broadinstitute.hellbender.cmdline.CommandLineProgram;
import org.broadinstitute.hellbender.cmdline.programgroups.TestSparkProgramGroup;
import org.broadinstitute.hellbender.cmdline.StandardArgumentDefinitions;
import org.broadinstitute.hellbender.cmdline.programgroups.ReadDataProgramGroup;

sooheelee (Contributor)

Actually, there is a consensus that this and a few other odd tools should now be in the OtherProgramGroup category. There is a new spreadsheet tab 1217Changes if you want to see the additional tweaks. Sorry for the late-breaking change.
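
In case it helps clarify what I mean by the program group change: it amounts to swapping the class referenced by programGroup in the tool's @CommandLineProgramProperties annotation. This is only a schematic sketch; the tool name, summary strings, and the OtherProgramGroup package path below are assumptions for illustration, not lines from this diff.

import org.broadinstitute.barclay.argparser.CommandLineProgramProperties;
import org.broadinstitute.hellbender.cmdline.CommandLineProgram;
import org.broadinstitute.hellbender.cmdline.programgroups.OtherProgramGroup;

// Hypothetical tool, shown only to illustrate where the program group is declared.
@CommandLineProgramProperties(
        summary = "Example summary text",
        oneLineSummary = "Example one-line summary",
        programGroup = OtherProgramGroup.class  // the requested change: point this at OtherProgramGroup
)
public final class ExampleTool extends CommandLineProgram {
    @Override
    protected Object doWork() {
        return null;
    }
}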

mwalker174 (Contributor, Author)

Done

@@ -17,37 +10,55 @@
import org.broadinstitute.barclay.help.DocumentedFeature;
import org.broadinstitute.hellbender.cmdline.CommandLineProgram;
import org.broadinstitute.hellbender.cmdline.StandardArgumentDefinitions;
import org.broadinstitute.hellbender.cmdline.programgroups.SparkProgramGroup;
import org.broadinstitute.hellbender.cmdline.programgroups.ReadDataProgramGroup;

sooheelee (Contributor)

This one too goes into the OtherProgramGroup.

mwalker174 (Contributor, Author)

Done

@@ -13,7 +13,7 @@
import org.broadinstitute.barclay.argparser.BetaFeature;
import org.broadinstitute.barclay.argparser.CommandLineProgramProperties;
import org.broadinstitute.barclay.help.DocumentedFeature;
import org.broadinstitute.hellbender.cmdline.programgroups.TestSparkProgramGroup;
import org.broadinstitute.hellbender.cmdline.programgroups.ReadDataProgramGroup;

sooheelee (Contributor)

This tool is now in the DiagnosticsAndQCProgramGroup.

mwalker174 (Contributor, Author)

Done

* Spark troubleshooting utility that converts a headerless Hadoop bam shard (eg., a part0000, part0001, etc. file
* produced by {@link ReadsSparkSink}) into a readable bam file by adding a header and a BGZF terminator.
* Convert a headerless Hadoop BAM shard into a readable BAM. This is a Spark troubleshooting utility that converts a
* headerless Hadoop BAM shard (eg., a part-r-00000, part-r-00001, etc. file

sooheelee (Contributor)

I'm afraid our users will not know what a shard is and this needs explanation alongside an explanation for the 'Hadoop' label.

Actually, many things about this tool are unclear, e.g. how a user may have created a headerless BAM shard to begin with. I'm not aware of a tool called ReadsSparkSink, so I assume it is some utility called by our Spark tools. Do we have an example tool that we can list specifically to help orient users? This example should be sufficient to clarify the use-case.

For example:

...file produced by a Spark tool, e.g. XXX, using ReadsSparkSink...

mwalker174 (Contributor, Author)

ReadsSparkSink is an implementation detail, so I think we can exclude it from the doc. Instead, I've added an explanation that a sharded BAM is produced by any Spark tool invoked with --sharded-output set to true. Also removed the reference to Hadoop since it is redundant - as far as I know, all shards are Hadoop in GATK.

As a side note, I noticed that this flag is still in camelCase (--shardedOutput). I'm assuming this will get converted to kebab-case.
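
To make the --sharded-output explanation concrete, a headerless shard could be produced by running a Spark tool with sharded output enabled. The command below is only a sketch: the choice of PrintReadsSpark, the exact flag spelling, and the output layout are assumptions for illustration rather than anything documented in this PR.

gatk PrintReadsSpark \
     -I input.bam \
     -O output_shards \
     --sharded-output true

Under those assumptions, output_shards would contain part-r-00000, part-r-00001, etc., which are the kind of files ConvertHeaderlessHadoopBamShardToBam accepts as --bam-shard.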

* already readable using samtools. Currently {@link ReadsSparkSink} saves the "shards" with a header for the
* {@link ReadsWriteFormat#SHARDED} case, and without a header for the {@link ReadsWriteFormat#SINGLE} case.
* {@link ReadsWriteFormat#SHARDED} case, and without a header for the {@link ReadsWriteFormat#SINGLE} case.</p>

sooheelee (Contributor)

Again, a rather cryptic description here. I leave it to your judgment.

mwalker174 (Contributor, Author)

Removed this because it is also an implementation detail.

* gatk ConvertHeaderlessHadoopBamShardToBam \
* --bam-shard shard.bam \
* --bam-with-header input.bam \
* --O output.bam

sooheelee (Contributor)

This example seems to imply you can use a file that is already in BAM format (has a header and terminator). I tested this command as you show it; it runs without error but produces an invalid output that samtools refuses to display. Also, --O should be -O.

Should this example access files that represent a Hadoop structure as mentioned above, e.g. with something like:

gatk ConvertHeaderlessHadoopBamShardToBam \
     --bam-shard part-r-00001 \
     --bam-with-header input.bam \
     -O part-r-00001.bam

Finally, could the bam-with-header instead be a reference dictionary? I think yes, as I tested this and it creates a file similar to the truncated one my setup produces. So add another example command to the Usage examples:

Use a reference dictionary file as the header:

gatk ConvertHeaderlessHadoopBamShardToBam \
     --bam-shard part-r-00001 \
     --bam-with-header reference.dict \
     -O part-r-00001.bam

Hopefully, starting from a real shard, this produces a non-truncated, full BAM.

mwalker174 (Contributor, Author) commented Jan 8, 2018

Fixed the -O argument. I've changed the shard name to part-r-00000.bam to be consistent with the current shard naming convention:

gatk ConvertHeaderlessHadoopBamShardToBam \
     --bam-shard part-r-00000.bam \
     --bam-with-header input.bam \
     -O output.bam

I'm hesitant to add a dictionary example to the doc because the code uses a SAM reader from htsjdk that expects a SAM/BAM file. It happens to work with a dictionary at the moment, but this is semantically different from the intended use of the reader.
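
To illustrate the point, here is a standalone sketch of the kind of htsjdk call involved; it is not the tool's actual code, and the class name and input path are made up.

import java.io.File;
import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;

public class HeaderPeek {
    public static void main(String[] args) throws Exception {
        // Open the --bam-with-header input with the default htsjdk SAM reader
        // and pull out its header, which is then re-attached to the shard.
        try (SamReader reader = SamReaderFactory.makeDefault().open(new File("input.bam"))) {
            SAMFileHeader header = reader.getFileHeader();
            System.out.println("Header has " + header.getSequenceDictionary().size() + " sequences");
        }
    }
}

A .dict file happens to parse because it is itself a SAM-header text, but that is incidental to how the reader is meant to be used.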


/**
* Create a Hadoop BAM splitting index and optionally a BAM index from a BAM file.
*

sooheelee (Contributor)

I'd like to know more about the significance of the 4096 splitting-index-granularity default value. For another time perhaps.
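
For reference, the granularity can be overridden on the command line. The sketch below assumes the argument is spelled --splitting-index-granularity and that -I/-O name the input BAM and output splitting index; those spellings are assumptions, not taken from this PR.

gatk CreateHadoopBamSplittingIndex \
     -I input.bam \
     -O input.bam.splitting-bai \
     --splitting-index-granularity 4096

The value shown simply echoes the 4096 default under discussion; a smaller granularity would presumably produce a finer-grained (and larger) splitting index.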

* gatk CompareDuplicatesSpark \
* -I input_reads_1.bam \
* -I2 input_reads_2.bam
* </pre>

sooheelee (Contributor)

Given that I can run this command without specifying any Spark args, add the blurb:

This tool can be run without explicitly specifying Spark options. That is to say, the given example command without Spark options will run locally. See <a href ="https://software.broadinstitute.org/gatk/documentation/article?id=10060">Tutorial#10060</a> for an example of how to set up and run a Spark tool on a cloud Spark cluster.

If possible, a Spark-enabled command would be nice to add. But the above blurb should be sufficient for now.
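
Something along these lines, perhaps; the cluster URL is a placeholder, and the pass-through Spark arguments after the -- separator (--spark-runner, --spark-master) follow general GATK Spark conventions rather than anything tested here.

gatk CompareDuplicatesSpark \
     -I input_reads_1.bam \
     -I2 input_reads_2.bam \
     -- \
     --spark-runner SPARK \
     --spark-master spark://<spark-master-host>:7077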

mwalker174 (Contributor, Author)

Added the blurb.

sooheelee (Contributor)

Back to you @mwalker174

@droazen droazen self-assigned this Jan 6, 2018

droazen (Contributor) commented Jan 6, 2018

Do not merge -- I'm taking this & the other argument-renaming branches over to modify them for consistency and get them merged before Monday.

sooheelee (Contributor)

I don't believe @mwalker174 has incorporated feedback as of yet.

droazen (Contributor) commented Jan 8, 2018

I'll address the comments myself when I rebase the branch later tonight.

TedBrookings (Contributor) left a comment

Again, just reviewing Java correctness. I see no problems with the code logic, and the changes look good to me.

mwalker174 (Contributor, Author)

@sooheelee Thank you for your review. I've addressed your suggestions.

@droazen The branch is now also rebased on the latest master.

codecov-io commented Jan 8, 2018

Codecov Report

Merging #4026 into master will increase coverage by 0.111%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##              master     #4026       +/-   ##
===============================================
+ Coverage     79.016%   79.127%   +0.111%     
- Complexity     16535     16785      +250     
===============================================
  Files           1056      1059        +3     
  Lines          59150     59733      +583     
  Branches        9615      9758      +143     
===============================================
+ Hits           46738     47265      +527     
- Misses          8673      8716       +43     
- Partials        3739      3752       +13
Impacted Files Coverage Δ Complexity Δ
...tute/hellbender/tools/BwaMemIndexImageCreator.java 71.429% <ø> (ø) 2 <0> (ø) ⬇️
...der/tools/spark/CreateHadoopBamSplittingIndex.java 88.462% <ø> (ø) 14 <0> (ø) ⬇️
...org/broadinstitute/hellbender/tools/ClipReads.java 90.385% <100%> (ø) 35 <0> (ø) ⬇️
...tools/spark/validation/CompareDuplicatesSpark.java 82.927% <100%> (ø) 24 <0> (ø) ⬇️
...er/tools/ConvertHeaderlessHadoopBamShardToBam.java 76.923% <100%> (ø) 2 <0> (ø) ⬇️
...institute/hellbender/utils/help/HelpConstants.java 2.381% <0%> (-1.786%) 1% <0%> (ø)
...itute/hellbender/tools/walkers/bqsr/ApplyBQSR.java 90.476% <0%> (-1.19%) 10% <0%> (+4%)
.../hellbender/tools/spark/BaseRecalibratorSpark.java 86.047% <0%> (-0.91%) 15% <0%> (+6%)
...hellbender/utils/test/VariantContextTestUtils.java 78.972% <0%> (-0.758%) 53% <0%> (-4%)
...s/spark/ParallelCopyGCSDirectoryIntoHDFSSpark.java 75.258% <0%> (-0.252%) 17% <0%> (ø)
... and 30 more

sooheelee (Contributor)

👍🏽
Please check with @droazen before merging.

droazen (Contributor) left a comment

Rebased one last time and made some minor edits. Good to merge once tests pass 👍

@droazen droazen merged commit 058f4e2 into master Jan 9, 2018
@droazen droazen deleted the mw_other_docs branch January 9, 2018 02:06