Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refining handling of transcripts with missing sequence info and other fixes. #4817

Merged
merged 2 commits into from
May 29, 2018

Conversation

jonn-smith
Copy link
Collaborator

Fixes #4739
Refactored UTR VariantClassification handling.
Added warning statement when a transcript in the UTR has no sequence info (now is the same behavior as in protein coding regions).
Added tests to prevent regression on data source date comparison bug.
Now can run on large data.
Fixed DNA Repair Genes getter script.
Fixed an issue in COSMIC to make it robust to bad COSMIC data.
Gencode no longer crashes when given an indel that starts just before an exon.
Fixed the SimpleKeyXsvFuncotationFactory to allow any characters to work as delimiters (including characters used in regular expressions, such as pipes).
Modified several methods to allow for negative start positions in
preparation for allowing indels that start outside exons.

Fixes #4739
Refactored UTR VariantClassification handling.
Added warning statement when a transcript in the UTR has no sequence info (now is the same behavior as in protein coding regions).
Added tests to prevent regression on data source date comparison bug.
Now can run on large data.
Fixed DNA Repair Genes getter script.
Fixed an issue in COSMIC to make it robust to bad COSMIC data.
Gencode no longer crashes when given an indel that starts just before an
exon.
Fixed the SimpleKeyXsvFuncotationFactory to allow any characters to work
as delimiters (including characters used in regular expressions, such as
    pipes).
Modified several methods to allow for negative start positions in
preparation for allowing indels that start outside exons.
@codecov-io
Copy link

codecov-io commented May 25, 2018

Codecov Report

Merging #4817 into master will decrease coverage by 0.116%.
The diff coverage is 57.971%.

@@              Coverage Diff               @@
##             master     #4817       +/-   ##
==============================================
- Coverage     80.36%   80.244%   -0.116%     
- Complexity    17606     17611        +5     
==============================================
  Files          1087      1088        +1     
  Lines         63748     63849      +101     
  Branches      10262     10276       +14     
==============================================
+ Hits          51228     51235        +7     
- Misses         8528      8619       +91     
- Partials       3992      3995        +3
Impacted Files Coverage Δ Complexity Δ
...ataSources/xsv/SimpleKeyXsvFuncotationFactory.java 87.097% <100%> (ø) 27 <0> (ø) ⬇️
...r/dataSources/cosmic/CosmicFuncotationFactory.java 64.167% <20.833%> (-12.071%) 21 <0> (ø)
...dataSources/gencode/GencodeFuncotationFactory.java 82.504% <74.286%> (-0.478%) 152 <3> (ø)
...e/hellbender/tools/funcotator/FuncotatorUtils.java 80.029% <88.889%> (-0.058%) 157 <2> (+3)
...spark/sv/utils/SingleSequenceReferenceAligner.java 0% <0%> (ø) 0% <0%> (?)
...utils/smithwaterman/SmithWatermanIntelAligner.java 80% <0%> (+30%) 3% <0%> (+2%) ⬆️

Copy link
Contributor

@LeeTL1220 LeeTL1220 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonn-smith Minor stuff and a couple of questions

@@ -53,23 +53,35 @@
writer = csv.writer(f, delimiter='|', lineterminator="\n")

isFirstRow = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for additional comments.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No prob.

@@ -327,15 +327,20 @@ public static int getAlignedEndPosition(final int alleleEndPosition) {
/**
* Gets the sequence aligned position (1-based, inclusive) for the given coding sequence position.
* This will produce the next lowest position evenly divisible by 3, such that a codon starting at this returned
* position would include the given position.
* @param position A sequence starting coordinate for which to produce an coding-aligned position. Must be > 0.
* position would include the given position. This can be a negative number, in which case the codon would start
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment regarding UTRs and Flanks?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a general utility method I don't think it makes sense to talk much about specific use cases here. I'll add an example for upstream UTRs / Flanks in the negative position section.

@@ -346,10 +351,21 @@ public static int getAlignedPosition(final int position) {
*/
public static boolean isInFrameWithEndOfRegion(final int startPosition, final int regionLength) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The javadoc is no longer correct for the start position.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

}
catch (final IllegalArgumentException ex) {
// If we have poorly bounded genomic positions, we need to warn the user and move on.
// These may occur occasionally in the data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you see this warning come up much on real data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I added this in because of an issue I found in the COSMIC data set. I logged this as #4812

}
}
return null;
}

/**
* Print the given {@link ResultSet} to stdout.
* @param resultSet The {@link ResultSet} to print.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this method? Typically, we should log it (as a debug if necessary)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a debug method that I created to help track down issue #4812.

I'd prefer to leave it in since I've already created it. It may be useful later to debug other issues.

@@ -597,16 +597,27 @@ private GencodeFuncotation createGencodeFuncotationOnTranscript(final VariantCon
// Find the sub-feature of transcript that contains our variant:
final GencodeGtfFeature containingSubfeature = getContainingGtfSubfeature(variant, transcript);

// Make sure the start and end of the variant are both in the transcript:
// Make sure the sub-regions in the transcript actually contain the variant:
// TODO: this is slow, and repeats work that is done later in the process (we call getSortedCdsAndStartStopPositions when creating the sequence comparison)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you file an issue for this TODO?

if ( startPosInTranscript == -1 ) {
// we overlap an exon but we don't start in one. Right now this case cannot be handled.
// Bubble up an exception and let the caller handle this case.
// TODO: fix this case, issue #4804 (https://github.com/broadinstitute/gatk/issues/4804)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad that #4804 is a beta blocker. No further action needed for this PR.

else {
gencodeFuncotationBuilder.setVariantClassification(GencodeFuncotation.VariantClassification.FIVE_PRIME_UTR);
logger.warn("Attempted to process transcript information for transcript WITHOUT sequence data. Ignoring sequence information for Gencode Transcript ID: " + transcript.getTranscriptId());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would this happen? Not sure if any action is needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would happen when the Gencode data source does not have a sequence for a given transcript. The information in the resulting funcotation will be correct, but it will be a subset of what a user may expect. This does happen in practice, so I wanted to make sure it was logged as a warning.

@@ -1441,8 +1456,9 @@ static SequenceComparison createSequenceComparison(final VariantContext variant,
if ( processSequenceInformation ) {
if ( transcriptIdMap.containsKey(transcript.getTranscriptId()) ) {

final String transcriptSequence;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't these two lines be merged back into one?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

@jonn-smith jonn-smith merged commit c77101a into master May 29, 2018
@jonn-smith jonn-smith deleted the jts_missing_transcript_fix_4739 branch May 29, 2018 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants