feat: introduces support for specific reference genomes
Previously, the tool relied on a hardcoded set of `PRIMARY_CHROMOSOMES` that matched the hg38 primary chromosomes. Now, the tool has a supported set of reference genomes, namely (to start):

* GRCh38NoAlt (from the NCBI)
* hs37d5 (from the 1000 Genomes Project)

These two genomes were selected simply because (a) GRCh38NoAlt is probably the most popular GRCh38 genome and (b) hs37d5 is the genome used for phase 2 and phase 3 of the 1000 Genomes Project, a fairly popular publicly available resource and the subject of many QC papers.

Introducing a reference genome into the code required multiple QC facets to be updated to use this functionality. For each of these, I chose to simply pass the reference genome to the facet's initialization function: it's up to the facet to take what it needs from the reference genome and store it for later use (as opposed to adding a lifecycle hook that injects it).

Other notable, related changes:

* I now include a check at the beginning of the `qc` command to ensure that the sequences in the header of the file match the reference genome the user specified on the command line. In the future, I also plan to add checks that the actual FASTA file matches the specified reference genome (if provided) _and_ that the GFF file matches the specified reference genome (if provided).

Some other changes introduced in this changeset don't, at first, appear directly related:

* We've now moved away from using `async`/`await` for the `qc` subcommand, as there is an obscure bug that doesn't allow two generic lifetimes and one static lifetime with an `async` function. Thus, I decided to move away from `async`/`await` altogether, as I had been considering that regardless (we already moved away from using the lazy evaluation facilities in noodles). See issues rust-lang/rust#63033 and rust-lang/rust#99190 for more details.
* In testing this code, I was running into an error where a record fell outside of the valid range of a sequence. This was annoying, so I just decided to fix it as part of this changeset.

There is no other deep reason why those changes are included here.
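For readers unfamiliar with the design, here is a minimal sketch of the reference genome handling described above. It is not the code from this changeset: the `ReferenceGenome` struct, the `CoverageFacet` type, and every method name are hypothetical, the sequence tables are abbreviated to two entries each, and the matching rule (same name and same length) and the 1-based inclusive bounds check are assumptions about how such checks could work.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified model of a supported reference genome.
pub struct ReferenceGenome {
    pub name: &'static str,
    /// Sequence name -> length in bases (abbreviated for illustration).
    pub sequences: HashMap<&'static str, usize>,
    /// Names of the primary chromosomes (also abbreviated).
    pub primary_chromosomes: HashSet<&'static str>,
}

impl ReferenceGenome {
    /// Looks up a supported genome by the name given on the command line.
    pub fn from_name(name: &str) -> Option<Self> {
        match name {
            "GRCh38NoAlt" => Some(Self {
                name: "GRCh38NoAlt",
                sequences: HashMap::from([("chr1", 248_956_422), ("chrM", 16_569)]),
                primary_chromosomes: HashSet::from(["chr1"]),
            }),
            "hs37d5" => Some(Self {
                name: "hs37d5",
                sequences: HashMap::from([("1", 249_250_621), ("MT", 16_569)]),
                primary_chromosomes: HashSet::from(["1"]),
            }),
            _ => None,
        }
    }

    /// Assumed matching rule: every header sequence must exist in the
    /// reference genome under the same name and with the same length.
    pub fn check_header(&self, header: &[(&str, usize)]) -> Result<(), String> {
        for (name, length) in header {
            match self.sequences.get(*name) {
                Some(expected) if expected == length => {}
                Some(expected) => {
                    return Err(format!(
                        "{name}: header length {length} != {expected} in {}",
                        self.name
                    ))
                }
                None => return Err(format!("{name}: not a sequence in {}", self.name)),
            }
        }
        Ok(())
    }

    /// Assumed guard for records that fall outside a sequence
    /// (1-based, inclusive coordinates).
    pub fn record_in_bounds(&self, name: &str, start: usize, end: usize) -> bool {
        self.sequences
            .get(name)
            .map_or(false, |&len| start >= 1 && start <= end && end <= len)
    }
}

/// Hypothetical QC facet: it receives the reference genome at
/// initialization and keeps only what it needs for later use.
pub struct CoverageFacet {
    primary_chromosomes: HashSet<&'static str>,
}

impl CoverageFacet {
    pub fn new(reference: &ReferenceGenome) -> Self {
        Self {
            primary_chromosomes: reference.primary_chromosomes.clone(),
        }
    }

    pub fn is_primary(&self, name: &str) -> bool {
        self.primary_chromosomes.contains(name)
    }
}
```

Passing the genome to `new` keeps each facet responsible for its own state, which is the trade-off the message describes: no shared lifecycle hook, at the cost of each facet copying the pieces it uses.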
1 parent b644e47, commit 0f8e9e4
Showing 8 changed files with 820 additions and 48 deletions.