-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refining handling of transcripts with missing sequence info and other fixes. #4817
Conversation
Fixes #4739 Refactored UTR VariantClassification handling. Added warning statement when a transcript in the UTR has no sequence info (now is the same behavior as in protein coding regions). Added tests to prevent regression on data source date comparison bug. Now can run on large data. Fixed DNA Repair Genes getter script. Fixed an issue in COSMIC to make it robust to bad COSMIC data. Gencode no longer crashes when given an indel that starts just before an exon. Fixed the SimpleKeyXsvFuncotationFactory to allow any characters to work as delimiters (including characters used in regular expressions, such as pipes). Modified several methods to allow for negative start positions in preparation for allowing indels that start outside exons.
Codecov Report
@@ Coverage Diff @@
## master #4817 +/- ##
==============================================
- Coverage 80.36% 80.244% -0.116%
- Complexity 17606 17611 +5
==============================================
Files 1087 1088 +1
Lines 63748 63849 +101
Branches 10262 10276 +14
==============================================
+ Hits 51228 51235 +7
- Misses 8528 8619 +91
- Partials 3992 3995 +3
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonn-smith Minor stuff and a couple of questions
@@ -53,23 +53,35 @@ | |||
writer = csv.writer(f, delimiter='|', lineterminator="\n") | |||
|
|||
isFirstRow = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for additional comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No prob.
@@ -327,15 +327,20 @@ public static int getAlignedEndPosition(final int alleleEndPosition) { | |||
/** | |||
* Gets the sequence aligned position (1-based, inclusive) for the given coding sequence position. | |||
* This will produce the next lowest position evenly divisible by 3, such that a codon starting at this returned | |||
* position would include the given position. | |||
* @param position A sequence starting coordinate for which to produce an coding-aligned position. Must be > 0. | |||
* position would include the given position. This can be a negative number, in which case the codon would start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment regarding UTRs and Flanks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a general utility method I don't think it makes sense to talk much about specific use cases here. I'll add an example for upstream UTRs / Flanks in the negative position section.
@@ -346,10 +351,21 @@ public static int getAlignedPosition(final int position) { | |||
*/ | |||
public static boolean isInFrameWithEndOfRegion(final int startPosition, final int regionLength) { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The javadoc is no longer correct for the start position.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
} | ||
catch (final IllegalArgumentException ex) { | ||
// If we have poorly bounded genomic positions, we need to warn the user and move on. | ||
// These may occur occasionally in the data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you see this warning come up much on real data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I added this in because of an issue I found in the COSMIC data set. I logged this as #4812
} | ||
} | ||
return null; | ||
} | ||
|
||
/** | ||
* Print the given {@link ResultSet} to stdout. | ||
* @param resultSet The {@link ResultSet} to print. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this method? Typically, we should log it (as a debug if necessary)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a debug method that I created to help track down issue #4812.
I'd prefer to leave it in since I've already created it. It may be useful later to debug other issues.
@@ -597,16 +597,27 @@ private GencodeFuncotation createGencodeFuncotationOnTranscript(final VariantCon | |||
// Find the sub-feature of transcript that contains our variant: | |||
final GencodeGtfFeature containingSubfeature = getContainingGtfSubfeature(variant, transcript); | |||
|
|||
// Make sure the start and end of the variant are both in the transcript: | |||
// Make sure the sub-regions in the transcript actually contain the variant: | |||
// TODO: this is slow, and repeats work that is done later in the process (we call getSortedCdsAndStartStopPositions when creating the sequence comparison) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you file an issue for this TODO
?
if ( startPosInTranscript == -1 ) { | ||
// we overlap an exon but we don't start in one. Right now this case cannot be handled. | ||
// Bubble up an exception and let the caller handle this case. | ||
// TODO: fix this case, issue #4804 (https://github.com/broadinstitute/gatk/issues/4804) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad that #4804 is a beta blocker. No further action needed for this PR.
else { | ||
gencodeFuncotationBuilder.setVariantClassification(GencodeFuncotation.VariantClassification.FIVE_PRIME_UTR); | ||
logger.warn("Attempted to process transcript information for transcript WITHOUT sequence data. Ignoring sequence information for Gencode Transcript ID: " + transcript.getTranscriptId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When would this happen? Not sure if any action is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would happen when the Gencode data source does not have a sequence for a given transcript. The information in the resulting funcotation will be correct, but it will be a subset of what a user may expect. This does happen in practice, so I wanted to make sure it was logged as a warning.
@@ -1441,8 +1456,9 @@ static SequenceComparison createSequenceComparison(final VariantContext variant, | |||
if ( processSequenceInformation ) { | |||
if ( transcriptIdMap.containsKey(transcript.getTranscriptId()) ) { | |||
|
|||
final String transcriptSequence; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't these two lines be merged back into one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
Fixes #4739
Refactored UTR VariantClassification handling.
Added warning statement when a transcript in the UTR has no sequence info (now is the same behavior as in protein coding regions).
Added tests to prevent regression on data source date comparison bug.
Now can run on large data.
Fixed DNA Repair Genes getter script.
Fixed an issue in COSMIC to make it robust to bad COSMIC data.
Gencode no longer crashes when given an indel that starts just before an exon.
Fixed the SimpleKeyXsvFuncotationFactory to allow any characters to work as delimiters (including characters used in regular expressions, such as pipes).
Modified several methods to allow for negative start positions in
preparation for allowing indels that start outside exons.