-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refining handling of transcripts with missing sequence info and other fixes. #4817
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -327,15 +327,20 @@ public static int getAlignedEndPosition(final int alleleEndPosition) { | |
/** | ||
* Gets the sequence aligned position (1-based, inclusive) for the given coding sequence position. | ||
* This will produce the next lowest position evenly divisible by 3, such that a codon starting at this returned | ||
* position would include the given position. | ||
* @param position A sequence starting coordinate for which to produce an coding-aligned position. Must be > 0. | ||
* position would include the given position. This can be a negative number, in which case the codon would start | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add a comment regarding UTRs and Flanks? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since this is a general utility method I don't think it makes sense to talk much about specific use cases here. I'll add an example for upstream UTRs / Flanks in the negative position section. |
||
* at the given position relative to the normal starting position (1). | ||
* @param position A sequence starting coordinate for which to produce an coding-aligned position. | ||
* @return A coding-aligned position (1-based, inclusive) corresponding to the given {@code position}. | ||
*/ | ||
public static int getAlignedPosition(final int position) { | ||
|
||
ParamUtils.isPositive( position, "Genomic positions must be > 0." ); | ||
|
||
return position - ((position - 1) % 3); | ||
if ( position > 0 ) { | ||
return position - ((position - 1) % 3); | ||
} | ||
else { | ||
final int adjustedPos = 1 - position; | ||
return -(adjustedPos - ((adjustedPos - 1) % 3) + 1); | ||
} | ||
} | ||
|
||
/** | ||
|
@@ -346,10 +351,21 @@ public static int getAlignedPosition(final int position) { | |
*/ | ||
public static boolean isInFrameWithEndOfRegion(final int startPosition, final int regionLength) { | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The javadoc is no longer correct for the start position. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fixed! |
||
ParamUtils.isPositive( startPosition, "Genomic positions must be > 0." ); | ||
ParamUtils.isPositiveOrZero( regionLength, "Region length must be >= 0." ); | ||
|
||
return (((regionLength - startPosition + 1) % 3) == 0); | ||
// Micro-optimization to split the return statements based on | ||
// the if statement so we don't have to do unnecessary math in the "normal" case: | ||
if ( startPosition > 0 ) { | ||
return (((regionLength - startPosition + 1) % 3) == 0); | ||
} | ||
else { | ||
// If we have a position before our region starts, we must calculate an offset | ||
// between the position and the actual start. | ||
// We can then add this offset to the start and region length to simplify the calculation. | ||
final int preFlankOffset = 1 - startPosition; | ||
|
||
return ((((regionLength + preFlankOffset) - (startPosition + preFlankOffset) + 1) % 3) == 0); | ||
} | ||
} | ||
|
||
/** | ||
|
@@ -1387,11 +1403,12 @@ public static String getAlignedCodingSequenceAllele(final String codingSequence, | |
|
||
/** | ||
* Get the aligned coding sequence for the given reference allele. | ||
* NOTE: alignedRefAlleleStart must be less than or equal to codingSequenceRefAlleleStart. | ||
* @param referenceSnippet {@link String} containing a short excerpt of the reference sequence around the given reference allele. Must not be {@code null}. | ||
* @param referencePadding Number of bases in {@code referenceSnippet} before the reference allele starts. This padding exists at the end as well (plus some other bases to account for the length of the alternate allele if it is longer than the reference). Must be >= 0. | ||
* @param refAllele The reference {@link Allele}. Must not be {@code null}. | ||
* @param codingSequenceRefAlleleStart The position (1-based, inclusive) in the coding sequence where the {@code refAllele} starts. Must be > 0. | ||
* @param alignedRefAlleleStart The in-frame position (1-based, inclusive) of the first base of the codon containing the reference allele. Must be > 0. | ||
* @param codingSequenceRefAlleleStart The position (1-based, inclusive) in the coding sequence where the {@code refAllele} starts. | ||
* @param alignedRefAlleleStart The in-frame position (1-based, inclusive) of the first base of the codon containing the reference allele. | ||
* @return A {@link String} of in-frame codons that contain the entire reference allele. | ||
*/ | ||
public static String getAlignedRefAllele(final String referenceSnippet, | ||
|
@@ -1402,14 +1419,14 @@ public static String getAlignedRefAllele(final String referenceSnippet, | |
|
||
Utils.nonNull(referenceSnippet); | ||
Utils.nonNull(refAllele); | ||
|
||
ParamUtils.isPositiveOrZero( referencePadding, "Padding must be >= 0." ); | ||
ParamUtils.isPositive( codingSequenceRefAlleleStart, "Genome positions must be > 0." ); | ||
ParamUtils.isPositive( alignedRefAlleleStart, "Genome positions must be > 0." ); | ||
Utils.validate( alignedRefAlleleStart <= codingSequenceRefAlleleStart, "The alignedRefAlleleStart must be less than or equal to codingSequenceRefAlleleStart!" ); | ||
|
||
|
||
final int extraBasesNeeded = (codingSequenceRefAlleleStart - alignedRefAlleleStart); | ||
int refStartPos = referencePadding - extraBasesNeeded; | ||
|
||
// TODO: This should probably be an error condition: | ||
if ( refStartPos < 0 ) { | ||
refStartPos = 0; | ||
} | ||
|
@@ -1489,6 +1506,7 @@ public static int getProteinChangePosition(final Integer alignedCodingSequenceAl | |
|
||
Utils.nonNull(alignedCodingSequenceAlleleStart); | ||
ParamUtils.isPositive( alignedCodingSequenceAlleleStart, "Genome positions must be > 0." ); | ||
|
||
return ((alignedCodingSequenceAlleleStart-1) / 3) + 1; // Add 1 because we're 1-based. | ||
} | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -252,9 +252,9 @@ protected List<Funcotation> createFuncotationsOnVariant(final VariantContext var | |
try ( final ResultSet resultSet = statement.executeQuery(RESULT_QUERY_TEMPLATE + "\"" + geneName + "\";") ) { | ||
// iterate through our results: | ||
while ( resultSet.next() ) { | ||
// Get the genome position and protein position: | ||
|
||
// Get the genome position: | ||
final SimpleInterval cosmicGenomePosition = getGenomePositionFromResults(resultSet); | ||
final SimpleInterval cosmicProteinPosition = getProteinPositionFromResults(resultSet); | ||
|
||
// Try to match on genome position first: | ||
if ( cosmicGenomePosition != null ) { | ||
|
@@ -265,6 +265,9 @@ protected List<Funcotation> createFuncotationsOnVariant(final VariantContext var | |
} | ||
} | ||
|
||
// Get the protein position: | ||
final SimpleInterval cosmicProteinPosition = getProteinPositionFromResults(resultSet); | ||
|
||
// Now try to match on protein position: | ||
if ( proteinPosition != null ) { | ||
// If we overlap the records, we update the counter: | ||
|
@@ -332,7 +335,15 @@ private final SimpleInterval getGenomePositionFromResults(final ResultSet result | |
final int start = Integer.valueOf(matcher.group(2)); | ||
final int end = Integer.valueOf(matcher.group(3)); | ||
|
||
return new SimpleInterval(contig, start, end); | ||
try { | ||
return new SimpleInterval(contig, start, end); | ||
} | ||
catch (final IllegalArgumentException ex) { | ||
// If we have poorly bounded genomic positions, we need to warn the user and move on. | ||
// These may occur occasionally in the data. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you see this warning come up much on real data? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes. I added this in because of an issue I found in the COSMIC data set. I logged this as #4812 |
||
logger.warn("Warning - unable to parse genome position string due to invalid position information. Ignoring potential COSMIC match with genome position: " + rawPosition); | ||
return null; | ||
} | ||
} | ||
} | ||
catch (final SQLException ex) { | ||
|
@@ -369,18 +380,51 @@ private final SimpleInterval parseProteinString(final String proteinPositionStri | |
|
||
final Matcher matcher = PROTEIN_POSITION_REGEX.matcher(proteinPositionString); | ||
if ( matcher.matches() ) { | ||
// We have a position, so we should parse it. | ||
// If we have an end position, we should use it: | ||
if ( matcher.group(2) != null ) { | ||
return new SimpleInterval(PROTEIN_CONTIG, Integer.valueOf(matcher.group(1)), Integer.valueOf(matcher.group(2))); | ||
try { | ||
|
||
// We have a position, so we should parse it. | ||
// If we have an end position, we should use it: | ||
if ( matcher.group(2) != null ) { | ||
return new SimpleInterval(PROTEIN_CONTIG, Integer.valueOf(matcher.group(1)), Integer.valueOf(matcher.group(2))); | ||
} | ||
else { | ||
return new SimpleInterval(PROTEIN_CONTIG, Integer.valueOf(matcher.group(1)), Integer.valueOf(matcher.group(1))); | ||
} | ||
} | ||
else { | ||
return new SimpleInterval(PROTEIN_CONTIG, Integer.valueOf(matcher.group(1)), Integer.valueOf(matcher.group(1))); | ||
catch (final IllegalArgumentException ex) { | ||
// If we have poorly bounded protein changes, we need to warn the user and move on. | ||
// These occur occasionally in the data. | ||
logger.warn("Warning - unable to parse protein string due to invalid position information. Ignoring potential COSMIC match with protein sequence: " + proteinPositionString); | ||
return null; | ||
} | ||
} | ||
return null; | ||
} | ||
|
||
/** | ||
* Print the given {@link ResultSet} to stdout. | ||
* @param resultSet The {@link ResultSet} to print. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do we need this method? Typically, we should log it (as a debug if necessary) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a debug method that I created to help track down issue #4812. I'd prefer to leave it in since I've already created it. It may be useful later to debug other issues. |
||
*/ | ||
private final void printResultSet(final ResultSet resultSet) { | ||
try { | ||
final ResultSetMetaData metadata = resultSet.getMetaData(); | ||
final int columnCount = metadata.getColumnCount(); | ||
for (int i = 1; i <= columnCount; i++) { | ||
System.out.print(metadata.getColumnName(i) + ", "); | ||
} | ||
System.out.println(); | ||
while ( resultSet.next() ) { | ||
for ( int i = 1; i <= columnCount; i++ ) { | ||
System.out.print(resultSet.getString(i) + ", "); | ||
} | ||
System.out.println(); | ||
} | ||
} | ||
catch (final SQLException ex) { | ||
throw new GATKException("Cannot print resultSet!"); | ||
} | ||
} | ||
|
||
//================================================================================================================== | ||
// Helper Data Types: | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for additional comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No prob.