-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More funcotator fixes #4783
More funcotator fixes #4783
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4783 +/- ##
===============================================
+ Coverage 80.111% 80.385% +0.275%
- Complexity 17442 17860 +418
===============================================
Files 1083 1087 +4
Lines 63165 64360 +1195
Branches 10193 10540 +347
===============================================
+ Hits 50602 51736 +1134
- Misses 8574 8581 +7
- Partials 3989 4043 +54
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments.
Most importantly, I don't think the new VCF file you added (0816201804HC0_R01C01.pik3ca.vcf
) is publicly releasable. Louis said that it wasn't and that we'd need to sanitize it first.
gencodeFuncotationFactories.add( (GencodeFuncotationFactory) factory ); | ||
} | ||
} | ||
dataSourceFactories.sort(DataSourceUtils::datasourceComparator); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add to this comment that gencode factories will always be first as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -471,8 +470,13 @@ private void enqueueAndHandleVariant(final VariantContext variant, final Referen | |||
for ( final FeatureInput<? extends Feature> featureInput : manualLocatableFeatureInputs ) { | |||
@SuppressWarnings("unchecked") | |||
final List<Feature> featureList = (List<Feature>)featureContext.getValues(featureInput, hg19Interval); | |||
|
|||
// TODO: This is a little sloppy, since it checks every datasource twice. Once for hg19 contig names and once for b37 contig names. | |||
featureList.addAll(featureContext.getValues(featureInput)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not detect the reference and only add in the features with the correct names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Make sure we don't double up on the Gencodes: | ||
if ( funcotationFactory.getType().equals(FuncotatorArgumentDefinitions.DataSourceType.GENCODE) ) { | ||
continue; | ||
// Perform the actual annotation. Note that we leverage the ordering of datasources here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add another comment here explicitly calling out that the comparator guarantees that gencode data sources are all first in the list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
private static String retrieveSanitizedFuncotation(final Funcotation funcotation, final String manualAnnotationSerializedString) { | ||
final String initialString = funcotation.serializeToVcfString(manualAnnotationSerializedString); | ||
return StringUtils.replaceEach(initialString, new String[]{",", ";", "=", "\t", "|"}, new String[]{"_%2C_", "_%3B_", "_%3D_", "_%09_", "_%7C_"}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure the HTML escape sequences are what we want to put in here? Do we need to preserve all the original values for the escape chars or can we replace them all with some other char and move on (e.g. mapping most restricted chars to _
and another to something else)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonn-smith Previous experience has taught me that the mapping should be invertible. I'm open to suggestions for changing this...
No action (for now).
@@ -63,7 +63,7 @@ | |||
static final Logger logger = LogManager.getLogger(GencodeGtfCodec.class); | |||
|
|||
public static final int GENCODE_GTF_MIN_VERSION_NUM_INCLUSIVE = 19; | |||
public static final int GENCODE_GTF_MAX_VERSION_NUM_INCLUSIVE = 27; | |||
public static final int GENCODE_GTF_MAX_VERSION_NUM_INCLUSIVE = 28; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on our conversation earlier today, I think it makes sense to remove the GENCODE_GTF_MAX_VERSION_NUM_INCLUSIVE
field altogether and just attempt to use the specified gencode.
We can add in a warning / exception when an error occurs to investigate the Gencode version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
private static final boolean doDebugTests = false; | ||
private static final String LARGE_DATASOURCES_FOLDER = "funcotator_dataSources_latest"; | ||
|
||
private static final String PIK3CA_VCF_HG19 = toolsTestDir + "funcotator/0816201804HC0_R01C01.pik3ca.vcf"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this file is publicly releasable - I believe it needs to be sanitized before inclusion in our tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is publicly releasable...
No action.
variantContexts.add(vc); | ||
} | ||
|
||
int NUM_VARIANTS = 21; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final variables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
variantContexts.add(vc); | ||
} | ||
|
||
int NUM_VARIANTS = 100; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final variables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few minor comments. Is it really necessary to add ~400mb of additional references for this? Can we work with the existing reference files that we have or shrink them somehow?
* | ||
* @return The {@link FeatureInput} used as the key for this data source. | ||
*/ | ||
protected FeatureInput<? extends Feature> addFeatureInputsAfterInitialization(final String filePath, final String name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is unused, lets delete it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -63,6 +63,7 @@ default void setFieldSerializationOverrideValues(final Map<String,String> fieldS | |||
} | |||
|
|||
/** | |||
* TODO: This should not be specific to a VCF. That should be the job of the VCFOutputRenderer to sanitize any strings. A Funcotation should not care whether it is being rendered to a VCF or MAF. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this a todo for the future or meant to be addressed in this pr? If it's for the future could we create a ticket and refer to it in the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just forgot to file the issue. Done...
#4797
) | ||
protected boolean includeFilteredVariants = false; | ||
protected boolean ignoreFilteredVariants = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would removeFilteredVariants be a better name? From this it's unclear if it means don't funcotate them or don't emit them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Done
|
||
private static String retrieveSanitizedFuncotation(final Funcotation funcotation, final String manualAnnotationSerializedString) { | ||
final String initialString = funcotation.serializeToVcfString(manualAnnotationSerializedString); | ||
return StringUtils.replaceEach(initialString, new String[]{",", ";", "=", "\t", "|"}, new String[]{"_%2C_", "_%3B_", "_%3D_", "_%09_", "_%7C_"}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't actually support percent encoding in vcf yet, it's a 4.3 feature and we're only up to 4.2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming that this comment was for our information. I just wanted a reversible transform. We will probably have to adjust for VCF 4.3. @lbergelson Should I take any more steps?
No action (for now).
- Ordering specified in the header did not match the variants and hg19/b37 flexibility was being applied to VCF datasources and inducing a lot of missed annotations. - Sanitizing funcotations in VCF. - Implemented lookahead caching and merged with master. - Added Funcotator test for clinvar and Gencode v28 in hg38 - Added funcotator test to make sure that mixed chr/no-chr GENCODE and datasources would still annotate properly for hg19. - Eased restrictions so that Gencode v28 would be recognized as a valid gtf - Added warning so that future versions of Gencode would not fail just based on the version number and warning would be emitted instead (@jonn-smith) - Altered the lookahead default value to specific for backend.
872cd63
to
7602872
Compare
@jonn-smith @lbergelson Why wouldn't it be publicly releasable? It's NA12878. I also confirmed with one of the engineers that there is no issue. |
@lbergelson @jonn-smith Regarding references: I am open to suggestions. These made life much easier. We would have to do some work to get rid of them. Can we file an issue and address later? Since these are in the large file repo, we do not need to worry about history bloat, correct? |
@jonn-smith @lbergelson Back to you guys. I have a couple of questions for y'all. |
@lbergelson @jonn-smith As per offline discussion, I will targz the references and untar as needed. And fix tests. |
- Created a class to minimize the number of copies we make of each untarred test reference.
@jonn-smith @lbergelson Back to you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had one quick comment, but I think after that's addressed you should be good to go.
ArchiveEntry entry = null; | ||
while ((entry = i.getNextEntry()) != null) { | ||
if (!i.canReadEntryData(entry)) { | ||
// log something? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - we should log something in this case. The tests could be affected by this, but may not be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
This reverts commit c8f53cd.
These fixes are addressing issues for annotating germline VCFs.