Update documentation for various tools (MW) #4026
Conversation
Just seeing this. I've only been looking at PRs where I am explicitly tagged within the comments (e.g. with @sooheelee). So I will take a quick look.
The content for ConvertHeaderlessHadoopBamShardToBam is particularly cryptic. I have general comments otherwise and changes to program groups.
import org.broadinstitute.barclay.argparser.Argument;
import org.broadinstitute.barclay.argparser.BetaFeature;
import org.broadinstitute.barclay.argparser.CommandLineProgramProperties;
import org.broadinstitute.barclay.help.DocumentedFeature;
import org.broadinstitute.hellbender.cmdline.CommandLineProgram;
import org.broadinstitute.hellbender.cmdline.programgroups.TestSparkProgramGroup;
import org.broadinstitute.hellbender.cmdline.StandardArgumentDefinitions;
import org.broadinstitute.hellbender.cmdline.programgroups.ReadDataProgramGroup;
Actually, there is a consensus that this and a few other odd tools should now be in the OtherProgramGroup category. There is a new spreadsheet tab, 1217Changes, if you want to see the additional tweaks. Sorry for the late-breaking change.
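For illustration only, a minimal sketch of what moving one of these tools into that category could look like, assuming an OtherProgramGroup class lives alongside the other program groups in org.broadinstitute.hellbender.cmdline.programgroups (the exact package and the tool name SomeOddTool are placeholders):

import org.broadinstitute.barclay.argparser.CommandLineProgramProperties;
import org.broadinstitute.hellbender.cmdline.CommandLineProgram;
import org.broadinstitute.hellbender.cmdline.programgroups.OtherProgramGroup; // assumed location

@CommandLineProgramProperties(
        summary = "Tool summary goes here",
        oneLineSummary = "One-line summary goes here",
        programGroup = OtherProgramGroup.class // was a Spark/test program group
)
public final class SomeOddTool extends CommandLineProgram {
    @Override
    protected Object doWork() {
        return null; // placeholder body; only the program group annotation changes
    }
}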
Done
@@ -17,37 +10,55 @@
import org.broadinstitute.barclay.help.DocumentedFeature;
import org.broadinstitute.hellbender.cmdline.CommandLineProgram;
import org.broadinstitute.hellbender.cmdline.StandardArgumentDefinitions;
import org.broadinstitute.hellbender.cmdline.programgroups.SparkProgramGroup;
import org.broadinstitute.hellbender.cmdline.programgroups.ReadDataProgramGroup;
This one too goes into the OtherProgramGroup.
Done
@@ -13,7 +13,7 @@
import org.broadinstitute.barclay.argparser.BetaFeature;
import org.broadinstitute.barclay.argparser.CommandLineProgramProperties;
import org.broadinstitute.barclay.help.DocumentedFeature;
import org.broadinstitute.hellbender.cmdline.programgroups.TestSparkProgramGroup;
import org.broadinstitute.hellbender.cmdline.programgroups.ReadDataProgramGroup;
This tool is now in the DiagnosticsAndQCProgramGroup.
Done
* Spark troubleshooting utility that converts a headerless Hadoop bam shard (eg., a part0000, part0001, etc. file
* produced by {@link ReadsSparkSink}) into a readable bam file by adding a header and a BGZF terminator.
* Convert a headerless Hadoop BAM shard into a readable BAM. This is a Spark troubleshooting utility that converts a
* headerless Hadoop BAM shard (eg., a part-r-00000, part-r-00001, etc. file
I'm afraid our users will not know what a shard is, and this needs an explanation, along with an explanation of the 'Hadoop' label.
Actually, many things about this tool are unclear, e.g. how a user may have created a headerless BAM shard to begin with. I'm not aware of a tool called ReadsSparkSink, so I assume it is some utility called by our Spark tools. Do we have an example tool that we can list specifically to help orient users? This example should be sufficient to clarify the use-case.
For example:
...file produced by a Spark tool, e.g. XXX, using ReadsSparkSink...
ReadsSparkSink is an implementation detail, so I think we can exclude it from the doc. Instead, I've added an explanation that a sharded BAM is produced by any Spark tool invoked with --sharded-output set to true. Also removed the reference to Hadoop since it is redundant - as far as I know, all shards are Hadoop in GATK.
As a side note, I noticed that this flag is still in camelCase (--shardedOutput). I'm assuming it will get converted to kebab-case.
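For context, a rough sketch of how such a shard could arise, following the explanation above (PrintReadsSpark and the output path are just placeholders, and the flag name assumes the kebab-case rename lands):

gatk PrintReadsSpark \
    -I input.bam \
    -O hdfs://namenode/out/reads \
    --sharded-output true

The output directory would then contain shard files such as part-r-00000, which is what ConvertHeaderlessHadoopBamShardToBam turns back into a readable BAM.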
* already readable using samtools. Currently {@link ReadsSparkSink} saves the "shards" with a header for the
* {@link ReadsWriteFormat#SHARDED} case, and without a header for the {@link ReadsWriteFormat#SINGLE} case.
* {@link ReadsWriteFormat#SHARDED} case, and without a header for the {@link ReadsWriteFormat#SINGLE} case.</p>
Again, a rather cryptic description here. I leave it to your judgment.
Removed this because it is also an implementation detail.
* gatk ConvertHeaderlessHadoopBamShardToBam \
* --bam-shard shard.bam \
* --bam-with-header input.bam \
* --O output.bam
This example seems to imply you can use a file that is already in BAM format (has a header and terminator). I tested this command as shown; it runs without error but produces an invalid output (which samtools refuses to display). Also, --O should be -O.
Should this example access files that represent a Hadoop structure as mentioned above, e.g. with something like:
gatk ConvertHeaderlessHadoopBamShardToBam \
--bam-shard part-r-00001 \
--bam-with-header input.bam \
-O part-r-00001.bam
Finally, could the bam-with-header instead be a reference dictionary? I think yes, as I tested this on a similarly truncated file that my setup produces. So add another example command to the Usage examples:
Use a reference dictionary file as the header:
gatk ConvertHeaderlessHadoopBamShardToBam \
--bam-shard part-r-00001 \
--bam-with-header reference.dict \
-O part-r-00001.bam
Hopefully, applied to a real shard part, this produces a non-truncated, full BAM.
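As a quick sanity check on whether a converted shard is actually readable (a sketch; assumes samtools is on the PATH and uses the hypothetical output name from the example above):

samtools quickcheck part-r-00001.bam && echo "BAM looks intact"
samtools view -H part-r-00001.bam | head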
Fixed the -O argument. I've changed the shard name to part-r-00000.bam to be consistent with the current shard naming convention:
gatk ConvertHeaderlessHadoopBamShardToBam \
--bam-shard part-r-00000.bam \
--bam-with-header input.bam \
-O output.bam
I'm hesitant to add a dictionary example to the doc because the code uses a SAM reader from htsjdk that expects a SAM/BAM file. It happens to work with a dictionary at the moment, but this is semantically different from the intended use of the reader.
/**
* Create a Hadoop BAM splitting index and optionally a BAM index from a BAM file.
*
I'd like to know more about the significance of the 4096 splitting-index-granularity default value. For another time perhaps.
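For reference, a sketch of how the granularity could be set explicitly (the -I/-O argument names follow the usual GATK conventions and the output file name is only illustrative):

gatk CreateHadoopBamSplittingIndex \
    -I input.bam \
    -O input.bam.sbi \
    --splitting-index-granularity 4096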
* gatk CompareDuplicatesSpark \
* -I input_reads_1.bam \
* -I2 input_reads_2.bam
* </pre>
Given that I can run this command without specifying any Spark args, add the blurb:
This tool can be run without explicitly specifying Spark options. That is to say, the given example command without Spark options will run locally. See <a href ="https://software.broadinstitute.org/gatk/documentation/article?id=10060">Tutorial#10060</a> for an example of how to set up and run a Spark tool on a cloud Spark cluster.
If possible, a Spark-enabled command would be nice to add. But the above blurb should be sufficient for now.
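For example, a Spark-enabled invocation might look roughly like this (a sketch only; the cluster master URL is a placeholder, and the Spark arguments are passed after the -- separator as is usual for GATK Spark tools):

gatk CompareDuplicatesSpark \
    -I input_reads_1.bam \
    -I2 input_reads_2.bam \
    -- \
    --spark-runner SPARK \
    --spark-master spark://my-cluster-master:7077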
Added the blurb.
Back to you @mwalker174
Do not merge -- I'm taking this & the other argument-renaming branches over to modify them for consistency and get them merged before Monday.
I don't believe @mwalker174 has incorporated feedback as of yet.
I'll address the comments myself when I rebase the branch later tonight.
Again, just reviewing Java correctness. I see no problems with the code logic, and the changes look good to me.
Force-pushed from e5bb6ad to 3e63d86.
@sooheelee Thank you for your review. I've addressed your suggestions. @droazen The branch is now also rebased on the latest master.
Codecov Report

Coverage Diff (master vs. #4026):

             master     #4026      +/-
Coverage     79.016%    79.127%    +0.111%
Complexity   16535      16785      +250
Files        1056       1059       +3
Lines        59150      59733      +583
Branches     9615       9758       +143
Hits         46738      47265      +527
Misses       8673       8716       +43
Partials     3739       3752       +13
👍🏽
Rebased one last time and made some minor edits. Good to merge once tests pass 👍
Documentation update for the following tools:
BwaMemIndexImageCreator
ClipReads
CompareDuplicatesSpark
ConvertHeaderlessHadoopBamShardToBam
CreateHadoopBamSplittingIndex