feat: introduces support for specific reference genomes
Previously, the tool had a single set of PRIMARY_CHROMOSOMES that was
hardcoded to the hg38 primary chromosomes. Now, the tool has a set of
supported reference genomes, namely (to start):

* GRCh38NoAlt (from the NCBI)
* hs37d5      (from the 1000 Genomes Project)

These two genomes were selected because (a) GRCh38NoAlt is probably the most
popular GRCh38 genome and (b) hs37d5 is the genome used for phase 2 and
phase 3 of the 1000 Genomes Project, a fairly popular publicly available
resource and the subject of many QC papers.

Introducing a reference genome into the code required multiple QC facets to be
updated to use this functionality. For each of these, I chose to simply pass
the reference genome to the initialization function for the facet: it's up to
the facet to take what it needs from the reference genome and store it for
later use (as opposed to adding a lifecycle hook injecting it).
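As a rough illustration of that design (not the code in this changeset), the sketch below shows a facet copying what it needs out of the reference genome at construction time. The `ReferenceGenome` trait here is a simplified, hypothetical stand-in, and `ToyGenome`, `toy_genome`, and `CoverageLikeFacet` are invented names.

```rust
use std::rc::Rc;

/// Hypothetical, simplified stand-in for the `ReferenceGenome` trait.
trait ReferenceGenome {
    fn name(&self) -> &'static str;
    fn primary_assembly(&self) -> Vec<String>;
}

/// Invented toy genome with two primary sequences.
struct ToyGenome;

impl ReferenceGenome for ToyGenome {
    fn name(&self) -> &'static str {
        "ToyGenome"
    }

    fn primary_assembly(&self) -> Vec<String> {
        vec!["chr1".to_string(), "chr2".to_string()]
    }
}

/// Builds the shared handle that is passed to each facet.
fn toy_genome() -> Rc<Box<dyn ReferenceGenome>> {
    let boxed: Box<dyn ReferenceGenome> = Box::new(ToyGenome);
    Rc::new(boxed)
}

/// Invented facet: it keeps only the primary assembly names and drops
/// the genome handle once construction is done.
struct CoverageLikeFacet {
    primary_assembly: Vec<String>,
}

impl CoverageLikeFacet {
    fn new(reference_genome: Rc<Box<dyn ReferenceGenome>>) -> Self {
        Self {
            primary_assembly: reference_genome.primary_assembly(),
        }
    }

    fn supports_sequence_name(&self, name: &str) -> bool {
        self.primary_assembly.iter().any(|s| s == name)
    }
}
```

Each facet gets its own `Rc::clone` of the handle, so no lifecycle hook is needed to deliver the genome later.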

Other notable, related changes:

* I now include a check at the beginning of the `qc` command to ensure that the
  sequences in the header of the file match the reference genome the user
  specified on the command line. In the future, I also plan to add checks that
  the actual FASTA file matches the specified reference genome (if provided)
  _and_ that the GFF file matches the specified reference genome (if provided).
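The header check can be sketched in isolation as follows. This is a minimal sketch that assumes sequence names are plain strings; `check_concordance` is a hypothetical helper name, and the real code works on the BAM reader's reference sequences and the `ReferenceGenome` trait rather than string slices.

```rust
/// Hypothetical helper: returns an error naming the first header sequence
/// that the selected reference genome does not contain.
fn check_concordance(
    header_sequences: &[&str],
    supported_sequences: &[&str],
) -> Result<(), String> {
    for sequence in header_sequences {
        if !supported_sequences.iter().any(|s| s == sequence) {
            return Err(format!(
                "Sequence \"{}\" not found in specified reference genome.",
                sequence
            ));
        }
    }

    Ok(())
}
```

Under these assumptions, a header that uses `MT` checked against a genome that names the same sequence `chrM` fails on `MT`, which is the kind of mismatch the check is meant to catch before any facet runs.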

This changeset also introduces some other changes that, at first glance, don't
appear directly related:

* We've now moved away from using `async`/`await` for the `qc` subcommand, as
  there is an obscure bug that doesn't allow two generic lifetimes and one
  static lifetime with an `async` function. Thus, I decided to just move away
  from using `async`/`await` altogether, as I had been considering that
  regardless (we already moved away from using the lazy evaluation facilities
  in noodles). See issues rust-lang/rust#63033 and
  rust-lang/rust#99190 for more details.
* In testing this code, I was running into an error where a record fell outside
  of the valid range of a sequence. This was annoying, so I just decided to fix
  it as part of this changeset. There is no other deep reason why those changes
  are included here.
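For the out-of-range fix, the idea is to count the offending record instead of panicking on `unwrap()`. The sketch below assumes a 1-based histogram whose `increment` returns an error past the end of the sequence. `ToyHistogram` and `add_record` are invented names, and the sketch stops at the first bad position and counts the record once, which may differ from the behavior in this changeset.

```rust
/// Invented 1-based histogram over the positions of a single sequence.
struct ToyHistogram {
    counts: Vec<usize>,
}

impl ToyHistogram {
    fn with_length(len: usize) -> Self {
        Self {
            counts: vec![0; len],
        }
    }

    /// Fails when `position` falls outside `1..=len`.
    fn increment(&mut self, position: usize) -> Result<(), String> {
        if position == 0 || position > self.counts.len() {
            return Err(format!("position {} is out of range", position));
        }

        self.counts[position - 1] += 1;
        Ok(())
    }
}

/// Adds coverage for one record, tallying it in `nonsensical_records`
/// (rather than panicking) when it runs past the sequence boundary.
fn add_record(
    histogram: &mut ToyHistogram,
    start: usize,
    end: usize,
    nonsensical_records: &mut usize,
) {
    for i in start..=end {
        if histogram.increment(i).is_err() {
            *nonsensical_records += 1;
            break;
        }
    }
}
```

Note that positions visited before the failure stay incremented in this sketch; validating the record's range up front would avoid that partial update.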
claymcleod committed Sep 27, 2022
1 parent b644e47 commit 0f8e9e4
Showing 8 changed files with 820 additions and 48 deletions.
115 changes: 83 additions & 32 deletions src/commands/qc.rs
@@ -1,11 +1,9 @@
use bam::bai;
use futures::TryStreamExt;
use noodles_sam::Header;
use tokio::fs::File;

use std::path::PathBuf;
use std::{fs::File, path::PathBuf, rc::Rc};

use anyhow::Context;
use anyhow::{bail, Context};
use clap::{value_parser, Arg, ArgMatches, Command};
use noodles_bam as bam;
use noodles_core::{Position, Region};
@@ -19,7 +17,12 @@ use crate::lib::{
quality_scores::QualityScoreFacet, results::Results, template_length::TemplateLengthFacet,
RecordBasedQualityCheckFacet, SequenceBasedQualityCheckFacet,
},
utils::formats::sam::parse_header,
utils::{
formats::sam::parse_header,
genome::{
get_all_reference_genomes, get_all_sequences, get_reference_genome, ReferenceGenome,
},
},
};

/// A utility struct for passing feature name arguments from the command line
@@ -68,6 +71,7 @@ pub fn get_record_based_qc_facets<'a>(
features_gff: Option<&str>,
feature_names: &'a FeatureNames,
header: &'a Header,
reference_genome: Rc<Box<dyn ReferenceGenome>>,
) -> anyhow::Result<Vec<Box<dyn RecordBasedQualityCheckFacet + 'a>>> {
// Default facets that are loaded within the qc subcommand.
let mut facets: Vec<Box<dyn RecordBasedQualityCheckFacet>> = vec![
@@ -83,6 +87,7 @@ pub fn get_record_based_qc_facets<'a>(
s,
feature_names,
header,
reference_genome,
)?));
}

@@ -92,10 +97,11 @@ pub fn get_record_based_qc_facets<'a>(
pub fn get_sequence_based_qc_facets<'a>(
reference_fasta: Option<&PathBuf>,
header: &'a Header,
reference_genome: Rc<Box<dyn ReferenceGenome>>,
) -> anyhow::Result<Vec<Box<dyn SequenceBasedQualityCheckFacet<'a> + 'a>>> {
// Default facets that are loaded within the qc subcommand.
let mut facets: Vec<Box<dyn SequenceBasedQualityCheckFacet<'_>>> =
vec![Box::new(CoverageFacet::default())];
vec![Box::new(CoverageFacet::new(reference_genome))];

if let Some(fasta) = reference_fasta {
facets.push(Box::new(EditsFacet::try_from(fasta, header)?))
@@ -122,6 +128,13 @@ pub fn get_command<'a>() -> Command<'a> {
.value_parser(value_parser!(PathBuf))
.takes_value(true),
)
.arg(
Arg::new("reference-genome")
.long("--reference-genome")
.help("Reference genome used as the basis for the file.")
.takes_value(true)
.required(true),
)
.arg(
Arg::new("features-gff")
.long("--features-gff")
@@ -213,10 +226,29 @@ pub fn get_command<'a>() -> Command<'a> {
/// Prepares the arguments for running the main `qc` subcommand.
pub fn qc(matches: &ArgMatches) -> anyhow::Result<()> {
info!("Starting qc command...");

let src: &PathBuf = matches
.get_one("src")
.expect("Could not parse the arguments that were passed in for src.");

let provided_reference_genome = matches
.get_one::<String>("reference-genome")
.expect("Did not receive a reference genome.");

let reference_genome = match get_reference_genome(provided_reference_genome) {
Some(s) => Rc::new(s),
None => bail!(
"reference genome is not supported: {}. List of supported reference \
genomes is: {}.",
provided_reference_genome,
get_all_reference_genomes()
.iter()
.map(|s| s.name())
.collect::<Vec<&str>>()
.join(", ")
),
};

let reference_fasta = matches.get_one("reference-fasta");
let features_gff = matches.value_of("features-gff");

@@ -272,22 +304,16 @@ pub fn qc(matches: &ArgMatches) -> anyhow::Result<()> {
.expect("Could not create output directory.");
}

let rt = tokio::runtime::Builder::new_multi_thread()
.enable_all()
.build()
.unwrap();

let app = app(
app(
src,
reference_fasta,
features_gff,
reference_genome,
output_prefix,
output_directory,
num_records,
feature_names,
);

rt.block_on(app)
)
}

/// Runs the `qc` subcommand.
@@ -297,6 +323,7 @@ pub fn qc(matches: &ArgMatches) -> anyhow::Result<()> {
/// * `src` — The filepath to the NGS file to run QC on.
/// * `reference_fasta` — Optionally, the path to a
/// when you want to run the Genomic Features facet.
/// * `reference_genome` — The reference genome to be used.
/// * `features_gff` — Optionally, the path to a GFF gene model file. Useful
/// when you want to run the Genomic Features facet.
/// * `output_prefix` — Output prefix for all files generated by this
@@ -305,31 +332,54 @@ pub fn qc(matches: &ArgMatches) -> anyhow::Result<()> {
/// * `num_records` — Maximum number of records to process. Anything less than 0
/// is considered infinite.
/// * `feature_names` — Feature names for lookup within the GFF file.
async fn app(
#[allow(clippy::too_many_arguments)]
fn app(
src: &PathBuf,
reference_fasta: Option<&PathBuf>,
features_gff: Option<&str>,
reference_genome: Rc<Box<dyn ReferenceGenome>>,
output_prefix: &str,
output_directory: PathBuf,
num_records: i64,
feature_names: FeatureNames,
) -> anyhow::Result<()> {
//==================================================//
// First pass: set up file handles and prepare file //
//==================================================//
//=====================================================//
// Preprocessing: set up file handles and prepare file //
//=====================================================//

let mut reader = File::open(src).await.map(bam::AsyncReader::new)?;
let mut reader = File::open(src).map(bam::Reader::new)?;

let ht = reader.read_header().await?;
let ht = reader.read_header()?;
let header = parse_header(ht);

reader.read_reference_sequences().await?;
let reference_sequences = reader.read_reference_sequences()?;

//=====================================================//
// Preprocessing: reference sequence concordance check //
//=====================================================//

let supported_sequences = get_all_sequences(Rc::clone(&reference_genome));

for (sequence, _) in reference_sequences {
if !supported_sequences
.iter()
.map(|s| s.name())
.any(|x| x == sequence)
{
bail!("Sequence \"{}\" not found in specified reference genome. Did you set the right reference genome?", sequence);
}
}

//===========================================================//
// First pass: print out which facets we're going to analyze //
//===========================================================//

let mut record_facets = get_record_based_qc_facets(features_gff, &feature_names, &header)?;
let mut record_facets = get_record_based_qc_facets(
features_gff,
&feature_names,
&header,
Rc::clone(&reference_genome),
)?;
info!("");
info!("First pass with the following facets enabled:");
info!("");
@@ -344,9 +394,10 @@ async fn app(

debug!("Starting first pass for QC stats.");
let mut record_count = 0;
let mut records = reader.records();

while let Some(record) = records.try_next().await? {
for result in reader.records() {
let record = result?;

for facet in &mut record_facets {
match facet.process(&record) {
Ok(_) => {}
@@ -387,7 +438,8 @@ async fn app(
// Second pass: print out which facets we're going to analyze //
//============================================================//

let mut sequence_facets = get_sequence_based_qc_facets(reference_fasta, &header)?;
let mut sequence_facets =
get_sequence_based_qc_facets(reference_fasta, &header, Rc::clone(&reference_genome))?;
info!("");
info!("Second pass with the following facets enabled:");
info!("");
@@ -400,10 +452,8 @@ async fn app(
// Second pass: set up file handles and prepare file //
//===================================================//

let mut reader = File::open(src).await.map(bam::AsyncReader::new)?;
let index = bai::r#async::read(src.with_extension("bam.bai"))
.await
.with_context(|| "bam index")?;
let mut reader = File::open(src).map(bam::Reader::new)?;
let index = bai::read(src.with_extension("bam.bai")).with_context(|| "bam index")?;

for (name, seq) in header.reference_sequences() {
let start = Position::MIN;
@@ -419,7 +469,7 @@ async fn app(
}
}

let mut query = reader
let query = reader
.query(
header.reference_sequences(),
&index,
@@ -428,7 +478,8 @@ async fn app(
.unwrap();

debug!(" [*] Processing records from sequence.");
while let Some(record) = query.try_next().await.unwrap() {
for result in query {
let record = result?;
for facet in &mut sequence_facets {
if facet.supports_sequence_name(name) {
facet.process_record(seq, &record).unwrap();
53 changes: 46 additions & 7 deletions src/lib/qc/coverage.rs
@@ -1,18 +1,28 @@
use std::collections::HashMap;
use std::{collections::HashMap, rc::Rc};

use noodles_sam::header::ReferenceSequence;
use serde::Serialize;
use tracing::error;

use crate::lib::utils::{genome::PRIMARY_CHROMOSOMES, histogram::SimpleHistogram};
use crate::lib::utils::{
genome::{get_primary_assembly, ReferenceGenome, Sequence},
histogram::SimpleHistogram,
};

use super::SequenceBasedQualityCheckFacet;

#[derive(Clone, Default, Serialize)]
pub struct IgnoredMetrics {
nonsensical_records: usize,
pileup_too_large_positions: HashMap<String, usize>,
}

#[derive(Clone, Default, Serialize)]
pub struct CoverageMetrics {
mean_coverage: HashMap<String, f64>,
median_coverage: HashMap<String, f64>,
median_over_mean_coverage: HashMap<String, f64>,
ignored: HashMap<String, usize>,
ignored: IgnoredMetrics,
histograms: HashMap<String, SimpleHistogram>,
}

@@ -21,10 +31,20 @@ pub struct CoverageHistograms<'a> {
storage: HashMap<&'a str, SimpleHistogram>,
}

#[derive(Default)]
pub struct CoverageFacet<'a> {
by_position: CoverageHistograms<'a>,
metrics: CoverageMetrics,
primary_assembly: Vec<Sequence>,
}

impl<'a> CoverageFacet<'a> {
pub fn new(reference_genome: Rc<Box<dyn ReferenceGenome>>) -> Self {
Self {
by_position: CoverageHistograms::default(),
metrics: CoverageMetrics::default(),
primary_assembly: get_primary_assembly(reference_genome),
}
}
}

impl<'a> SequenceBasedQualityCheckFacet<'a> for CoverageFacet<'a> {
@@ -37,7 +57,10 @@ impl<'a> SequenceBasedQualityCheckFacet<'a> for CoverageFacet<'a> {
}

fn supports_sequence_name(&self, name: &str) -> bool {
PRIMARY_CHROMOSOMES.contains(&name)
self.primary_assembly
.iter()
.map(|s| s.name())
.any(|x| x == name)
}

fn setup_sequence(&mut self, _: &ReferenceSequence) -> anyhow::Result<()> {
@@ -62,7 +85,20 @@ impl<'a> SequenceBasedQualityCheckFacet<'a> for CoverageFacet<'a> {
let record_end = usize::from(record.alignment_end().unwrap());

for i in record_start..=record_end {
h.increment(i).unwrap();
if h.increment(i).is_err() {
error!(
"Record crosses the sequence boundaries in an unexpected way. \
This usually means that the record is malformed. Please examine \
the record closely to ensure it fits within the sequence. \
Ignoring record. Read name: {}, Start Alignment: {}, End \
Alignment: {}, Cigar: {}",
record.read_name().unwrap(),
record.alignment_start().unwrap(),
record.alignment_end().unwrap(),
record.cigar()
);
self.metrics.ignored.nonsensical_records += 1;
}
}

Ok(())
@@ -99,7 +135,10 @@ impl<'a> SequenceBasedQualityCheckFacet<'a> for CoverageFacet<'a> {
self.metrics
.histograms
.insert(seq.name().to_string(), coverages);
self.metrics.ignored.insert(seq.name().to_string(), ignored);
self.metrics
.ignored
.pileup_too_large_positions
.insert(seq.name().to_string(), ignored);

Ok(())
}