Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding in codec to read from Gencode GTF files. Fixes #3277 #3410

Merged
merged 8 commits into from
Sep 18, 2017

Conversation

jonn-smith
Copy link
Collaborator

Includes tests for both hg19 and hg38.

@droazen droazen self-assigned this Aug 4, 2017
@droazen
Copy link
Collaborator

droazen commented Aug 4, 2017

@jonn-smith There are a bunch of test failures on this branch in travis -- could you fix these before we review? Thanks!

@jonn-smith
Copy link
Collaborator Author

@droazen @cmnbroad @lbergelson

Sorry about that. Fixed the warning. Should be ready for review.

@codecov-io
Copy link

codecov-io commented Aug 4, 2017

Codecov Report

Merging #3410 into master will decrease coverage by 0.066%.
The diff coverage is 66.96%.

@@               Coverage Diff               @@
##              master     #3410       +/-   ##
===============================================
- Coverage     79.901%   79.835%   -0.066%     
- Complexity     17918     18315      +397     
===============================================
  Files           1199      1217       +18     
  Lines          65102     67142     +2040     
  Branches       10142     10550      +408     
===============================================
+ Hits           52017     53603     +1586     
- Misses          9045      9329      +284     
- Partials        4040      4210      +170
Impacted Files Coverage Δ Complexity Δ
...der/utils/codecs/GENCODE/GencodeGtfCDSFeature.java 100% <100%> (ø) 4 <4> (?)
...der/utils/codecs/GENCODE/GencodeGtfUTRFeature.java 100% <100%> (ø) 4 <4> (?)
...odecs/GENCODE/GencodeGtfSelenocysteineFeature.java 100% <100%> (ø) 4 <4> (?)
...ils/codecs/GENCODE/GencodeGtfStopCodonFeature.java 100% <100%> (ø) 4 <4> (?)
...ls/codecs/GENCODE/GencodeGtfStartCodonFeature.java 100% <100%> (ø) 4 <4> (?)
...er/utils/codecs/GENCODE/GencodeGtfExonFeature.java 46.939% <46.939%> (ø) 12 <12> (?)
...tils/codecs/GENCODE/GencodeGtfFeatureBaseData.java 48.235% <48.235%> (ø) 3 <3> (?)
...er/utils/codecs/GENCODE/GencodeGtfGeneFeature.java 53.333% <53.333%> (ø) 8 <8> (?)
...llbender/utils/codecs/GENCODE/GencodeGtfCodec.java 53.36% <53.36%> (ø) 49 <49> (?)
...ls/codecs/GENCODE/GencodeGtfTranscriptFeature.java 58.333% <58.333%> (ø) 12 <12> (?)
... and 37 more

@droazen
Copy link
Collaborator

droazen commented Aug 7, 2017

@jonn-smith Thanks -- will start my review now

Copy link
Collaborator

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First-pass review complete -- back to @jonn-smith for changes.

Looks really good overall! I might have found a couple of possible bugs here and there, and I had some refactoring/style-related comments as well.

* {@link htsjdk.tribble.Tribble} Codec to read data from a GENCODE GTF file.
*
* GENCODE GTF Files are defined here: https://www.gencodegenes.org/data_format.html
*
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document any limitations of the codec in the class-level javadoc. Eg., if tabix indexing doesn't work or hasn't been tested, mention that, and if tribble indexing has any issues, mention those as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also prominently document the fact that this codec, unlike most other tribble codecs, can consume multiple lines at a time in the process of decoding a Feature record. Include a brief overview of the high-level structure of a GTF record.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

*/
final public class GencodeGtfCodec extends AbstractFeatureCodec<GencodeGtfFeature, LineIterator> {

protected final Logger logger = LogManager.getLogger(this.getClass());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually make our loggers static final (passing in GencodeGtfCodec.class, in this case) -- unless you can think of a good reason to have a separate logger instance for every codec instance.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope - makes sense to me.

static final int NUM_COLUMNS = 9;

private long currentLineNum = 1;
protected List<String> header = new ArrayList<>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have to be protected? There are no subclasses of GencodeGtfCodec, as far as I can tell. Make private if possible, or at least make immutable. Or if it's protected for unit testing purposes, mark as @VisibleForTesting

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think making it private should be fine.

*
* Created by jonn on 7/21/17.
*/
final public class GencodeGtfCodec extends AbstractFeatureCodec<GencodeGtfFeature, LineIterator> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document for posterity why this codec extends AbstractFeatureCodec directly rather than the more common AsciiFeatureCodec.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

GencodeGtfGeneFeature gene = null;
GencodeGtfTranscriptFeature transcript = null;
final ArrayList< GencodeGtfExonFeature > exonStore = new ArrayList<>();
final ArrayList< GencodeGtfFeature > leafFeatureStore = new ArrayList<>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We typically use the interface type (List) on the left side when creating lists, unless downstream code is explicitly dependent on the List actually being an ArrayList.

Also, on a minor stylistic note, we typically write List<Whatever> rather than List< Whatever >. The latter brings back bad memories of ancient C++ where nested type parameters could get confused with the >> and << operators.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure I'm not using anything ArrayList-specific. I'll update them to be Lists.

Sounds good. Yeah... Old habits die hard. I've spent a lot of time in C++ land... Sometimes my hands just put the spaces in.

new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.CCDS),
new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.seleno),
new GencodeGtfFeature.OptionalField<>("ccdsid", "CCDS43034.1"),
new Gencod
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

canDecodeProvider?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.CCDS),
new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.seleno),
new GencodeGtfFeature.OptionalField<>("ccdsid", "CCDS43034.1"),
new Gencod
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before doing this assert, assert that expectedIterator.hasNext()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.CCDS),
new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.seleno),
new GencodeGtfFeature.OptionalField<>("ccdsid", "CCDS43034.1"),
new Gencod
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the loop, assert that the total number of features returned from the decode() method is equal to expected.size()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.CCDS),
new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.seleno),
new GencodeGtfFeature.OptionalField<>("ccdsid", "CCDS43034.1"),
new Gencod
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createTempDir() has already registered the directory for recursive deletion on exit, so this is redundant.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.CCDS),
new GencodeGtfFeature.OptionalField<>("tag", GencodeGtfFeature.FeatureTag.seleno),
new GencodeGtfFeature.OptionalField<>("ccdsid", "CCDS43034.1"),
new Gencod
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want this to test serializeToString() instead. toString() itself should only be for display purposes (no code should rely upon its return value being in a certain format).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@droazen droazen assigned jonn-smith and unassigned droazen Aug 8, 2017
@magicDGS
Copy link
Contributor

magicDGS commented Aug 9, 2017

Hi @jonn-smith, I wonder if some part of this code could be ported to a more general extensible GTF codec in HTSJDK to solve samtools/htsjdk#537 and allow the implementation of specifications like GENCODE.

I know that in HTSJDK takes longer to include some new code, but it will be nice to implement here the support for any GTF specification and then port the code in HTSJDK and keeping in GATK only the GENCODE codec.

Thanks in advance!

@droazen
Copy link
Collaborator

droazen commented Aug 9, 2017

@magicDGS The plan is to eventually push this codec down into htsjdk for wider use by the community, but we want to develop it here first for the sake of fast iteration. Since the GATK project that depends on this branch has a looming deadline, we're unable to generalize this code further at this time, but we can revisit this when the codec is moved into htsjdk.

@magicDGS
Copy link
Contributor

magicDGS commented Aug 9, 2017

Thanks @droazen!

@jonn-smith jonn-smith force-pushed the jts_GENCODE_tribble_codec_3277 branch from fe41911 to 1291e8a Compare August 11, 2017 19:32
@jonn-smith jonn-smith assigned droazen and unassigned jonn-smith Aug 11, 2017
Copy link
Collaborator

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final review complete, back to @jonn-smith. Found one bug, but otherwise mostly minor comments remaining. Merge after addressing them -- no need for another review round.

package org.broadinstitute.hellbender.utils.codecs.GENCODE;

import com.google.common.annotations.VisibleForTesting;
import com.sun.tools.internal.xjc.reader.xmlschema.bindinfo.BIConversion;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove stray com.sun import

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.broadinstitute.hellbender.exceptions.UserException;
import scala.tools.nsc.Global;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove stray scala import

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

break;
}

lastRecordWasTranscript = false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name of this flag (lastRecordWasTranscript) is confusing, since it's not actually keeping track of whether the previous encountered record was a transcript. Can you rename to reflect what it actually represents?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure - I'll call it needToFlushRecords

// If we have other records left over we should probably yell a lot,
// as this is bad.
//
// However, this should never actually happen.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throw a GATKException.ShouldNeverReachHereException in this case instead of a RuntimeException. It's good to keep this sanity check around for now while this is still under development.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.


// Now we validate our feature before returning it:
if ( ! validateGencodeGtfFeature( decodedFeature, versionNumber ) ) {
throw new RuntimeException("Decoded feature is not valid: " + decodedFeature);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UserException.MalformedFile or just UserException. Bad input is the user's fault.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!


public void setCds(GencodeGtfCDSFeature cds) {
if (this.cds != null) {
throw new RuntimeException("Attempting to set cds, but it already contains a value!");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use GATKException for errors that are internal (not the user's fault). Applies throughout this PR. (the exception is checking methods arguments, for which IllegalArgumentException is fine)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. Fixed.

/**
* Populate this GencodeGtfFeature with the given data.
*/
protected GencodeGtfFeature(String[] gtfFields) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check that gtfFields is the right length using Utils.validateArg()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

baseData.genomicPosition = new SimpleInterval (
gtfFields[CHROMOSOME_NAME_INDEX],
Integer.valueOf( gtfFields[START_LOCATION_INDEX] ),
Integer.valueOf( gtfFields[END_LOCATION_INDEX] )
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all calls to Integer.valueOf(), trap the potential NumberFormatException and throw either a UserException or a GATKException depending on whether or not it was the user's fault.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

boolean isEqual = that instanceof GencodeGtfFeature;
if (isEqual) {
GencodeGtfFeature thatFeature = (GencodeGtfFeature) that;
Objects.equals(baseData, thatFeature.baseData);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUGBUG: you need to assign the return value of Objects.equals() back to isEqual!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup! Good find!

}

@Override
public int hashCode() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use toString() to generate the hashCode() -- toString() is just for display. Take the hashes of the individual fields and combine them, as you've done elsewhere.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I missed this one!

Fixed!

@droazen droazen assigned jonn-smith and unassigned droazen Aug 24, 2017
@jonn-smith jonn-smith force-pushed the jts_GENCODE_tribble_codec_3277 branch from 1291e8a to 92d6c84 Compare August 28, 2017 16:03
* @return true if {@code other} is contained within the bounds of this {@link GencodeGtfFeature}, false otherwise.
*/
public boolean contains(final GencodeGtfFeature other) {
public <T extends Locatable> boolean contains(final T other) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a generic? I think that using public boolean contains(Locatable other) might work...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be right - I'll try some tests with it to see.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup - I'll pull it back to being a pure Locatable.

@jonn-smith jonn-smith merged commit 4ac5b36 into master Sep 18, 2017
@jonn-smith jonn-smith deleted the jts_GENCODE_tribble_codec_3277 branch September 18, 2017 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants