New tools for annotation-based filtering. #7724
ExtractVariantAnnotations: This tool extracts annotations, labels, and other relevant metadata from variants (or alleles, in allele-specific mode) that do or do not overlap with specified resources. The former are considered labeled and each variant/allele can have multiple labels. The latter are considered unlabeled and can be randomly downsampled using reservoir sampling; extraction of these is optional. The outputs of the tool are HDF5 files containing the extracted data for labeled and (optional) unlabeled variant sets, as well as a sites-only VCF containing the labeled variants. This VCF can be used in ScoreVariantAnnotations to in turn specify an additional "extracted" label, which can be useful for indicating those sites that were actually extracted from the provided resources (since we may only extract over a subset of the genome). TODOs:
Minor TODOs:
Future work:
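For readers unfamiliar with the labeling and downsampling described for ExtractVariantAnnotations, here is a minimal Python sketch of the idea. It is not the tool's implementation (the GATK tool is written in Java and works on VCFs and HDF5 files); the names `label_sites` and `reservoir_sample`, the `(contig, position)` site key, and the toy resources are all hypothetical. Sites that overlap at least one resource are treated as labeled (possibly with several labels), and the remaining unlabeled sites are downsampled in a single pass with the classic reservoir sampling algorithm (Algorithm R).

```python
import random
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical site key: (contig, position). Real variants/alleles carry more fields.
Site = Tuple[str, int]


def label_sites(sites: Iterable[Site], resources: Dict[str, Set[Site]]) -> Dict[Site, Set[str]]:
    """Map each site to the set of resource labels it overlaps (empty set if none)."""
    return {
        site: {label for label, members in resources.items() if site in members}
        for site in sites
    }


def reservoir_sample(items: Iterable, k: int, rng: random.Random) -> List:
    """Algorithm R: uniform random sample of up to k items from a stream, in one pass."""
    reservoir: List = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Toy data (assumed for illustration only).
sites = [("chr1", pos) for pos in range(100, 120)]
resources = {
    "training": {("chr1", 101), ("chr1", 105)},
    "calibration": {("chr1", 105), ("chr1", 110)},
}

labels = label_sites(sites, resources)
labeled = [s for s in sites if labels[s]]        # overlap at least one resource
unlabeled = [s for s in sites if not labels[s]]  # overlap no resource

# Downsample the unlabeled set to at most 5 sites; the seed is fixed for reproducibility.
sampled_unlabeled = reservoir_sample(unlabeled, 5, random.Random(0))
```

Reservoir sampling is a natural fit here because the number of unlabeled sites is not known until the whole input has been traversed, and the reservoir keeps memory bounded by the requested sample size.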
TrainVariantAnnotationsModel: Trains a model for scoring variant calls based on site-level annotations. TODOs:
Minor TODOs:
Future work:
ScoreVariantAnnotations: Scores variant calls in a VCF file based on site-level annotations using a previously trained model. TODOs:
Minor TODOs:
Future work:
Bayesian GMM: This is essentially an exact port of the sklearn implementation, but only allowing for full covariance matrices. I think it might be good for those in the Bishop reading group to take a look during review. I decided to split this off into its own branch (just updated the existing branch https://github.com/broadinstitute/gatk/tree/sl_sklearn_bgmm_port) and only include stubs for the BGMM backend in the above tools. This is so we can prioritize merging the IsolationForest implementation for @meganshand. We can easily add this module back when it's been reviewed separately. TODOs:
Future work:
@samuelklee Some of our collaborators are currently working on updating
Thanks for the question @droazen. No, these tools are meant more as an update to VQSR, i.e., they do not assume that the BAM/reads will be available and use only the annotations. I think such tools will remain useful going forward, especially for joint genotyping. We can probably eventually push CNN/etc.-based generation of additional features/annotations from the BAM/reads upstream of filtering, so that they’re generated at the same time as our traditional “handcrafted” annotations, after which we can run everything through the annotation-based filtering tools here.
Rebasing, squashing, and reorganizing files into new commits to prep for the PR, but here's a copy of the commit messages for posterity:
PR Punts:
Next steps:
A few minor issues:
This is a meta issue to track remaining and future work for the new tools for annotation-based filtering, which will hopefully replace VQSR. Internal developers may want to see further discussion at https://github.com/broadinstitute/dsp-methods-model-prototyping/discussions/9.