Disentangling Cobionts and Contamination in Long-Read Genomic Data using Sequence Composition

This repository contains tools to separate sequences from different sources by composition, as described here: https://cobiontid.github.io

In many cases, samples of target organisms collected in the wild contain sequences from additional organisms. Identifying the source of a given sequence can be challenging if there are few reference datasets available from sufficiently closely related species. However, differences in sequence composition can nevertheless be used to separate different components of a sample.

Learning two-dimensional embeddings of sequence composition (in this case tetranucleotide counts) with a Variational Autoencoder (VAE) provides a framework to visually explore long-read datasets and detect contaminants or organisms interacting with the target. Sequence characteristics, such as estimated coding density and approximate read coverage, provide additional clues about the contents of the sample. For example, even without taxonomic labels, a microbe could be distinguished from an insect based on its higher density of coding sequences.

A preprint describing the approach in detail is available here: https://www.biorxiv.org/content/10.1101/2024.05.30.596622v1. In addition to the VAE-based workflow for reads, the repository includes tools to assess sequence assemblies. The documentation in this repository is still under construction.

Workflows

Read k-mer decomposition and visualisation

Tallies k-mers in a read set (tetranucleotides by default), reduces the counts to two dimensions and visualises the resulting read clusters. Annotates read plots with additional sequence features, such as estimated coding density, approximate k-mer coverage and sequence k-mer diversity.

Example data set: Erannis defoliaria

Decomposed read tetranucleotides from Erannis defoliaria indicate the presence of bacteria in the sample (top). In this static plot, the reads are coloured by estimated coding density. The resulting data can also be explored interactively.

Contig/scaffold k-mer decomposition and visualisation

As with the reads, tallies tetranucleotide composition, reduces it to two dimensions and plots the result with annotations. In addition to estimated coding density and k-mer diversity, FastK provides a measure of repetitiveness, and coverage for primary Hifiasm assemblies can be extracted and used to annotate the plots. A selection tool allows sequences of interest to be selected and downloaded with their annotations. Where Hi-C data are available, a SALSA or YaHS pair file may also be provided to annotate plots with scaffold connectivity information. Take a look at an interactive version of the plot here.

Example data set: Hylocomiadelphus triquetrus

Tools

Variational Autoencoder for k-mer decomposition

vae.py

Read k-mer counts are reduced to two dimensions following the method of Kingma and Welling (2013). Outputs a two-dimensional representation of the read set and a basic plot.
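The two ingredients that distinguish a VAE from a plain autoencoder can be sketched in a few lines. This is an illustration only, not the code in vae.py: the encoder outputs (`mu`, `logvar`) are hypothetical values, and the function names are chosen for this example. It shows the reparameterisation step that samples a latent coordinate, and the closed-form KL term that pulls the latent distribution towards a standard normal.

```python
import math
import random

def reparameterise(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def kl_divergence(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

rng = random.Random(0)

# Hypothetical encoder output for a single read in a two-dimensional latent space.
mu, logvar = [0.5, -1.0], [0.0, -2.0]

z = reparameterise(mu, logvar, rng)   # sampled 2D coordinate for this read
kl = kl_divergence(mu, logvar)        # regularisation term added to the reconstruction loss
```

In a full model, the encoder and decoder would be neural networks trained on the per-read k-mer count vectors, and the training loss would be the reconstruction error plus this KL term.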

Plotting tools for reads

Generate colour-coded plots of 2D representations learned by the VAE.

Interactive read visualisations

Interactively filter and query annotated 2D representations of read data.

Visualisations for contigs

Workflow and utilities to generate an interactive HTML file of decomposed tetranucleotide plots with binned annotations.

Standalone tools used in workflows

kmer-counter

Counts the number of occurrences of each k-mer of size k for each record in a FASTA file of nucleotide sequences (canonicalised or non-canonicalised). Implemented in Rust; runs approximately ten times faster than the equivalent code in Python.
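For readers unfamiliar with canonicalised counting, the idea can be sketched in Python. This is not the Rust implementation; the function names, the handling of ambiguous bases (windows containing anything other than A, C, G or T are skipped) and the example records are assumptions made for illustration. A canonical k-mer is taken here to be the lexicographically smaller of a k-mer and its reverse complement.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)

def count_kmers(seq, k=4, canonicalise=True):
    """Count k-mers in one sequence, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer) if canonicalise else kmer] += 1
    return counts

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

# Hypothetical two-record input; a real run would iterate over an open file instead.
fasta = [">read1", "ACGTACGT", ">read2", "AAAAN"]
tables = {name: count_kmers(seq) for name, seq in read_fasta(fasta)}
```

Each per-record table can then be flattened into a fixed-order vector of counts, which is the kind of input a dimensionality-reduction step expects.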

unique-kmer-counts

Counts the number of distinct k-mers of size k for each record in a FASTA file of nucleotide sequences, and divides by sequence length. Implemented in Rust.
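The diversity measure described above amounts to a one-line calculation per sequence. The following Python sketch is illustrative rather than a port of the Rust tool: the function name is hypothetical, and k-mers are treated as non-canonical for brevity.

```python
def unique_kmer_fraction(seq, k):
    """Number of distinct k-mers in seq divided by the sequence length."""
    seq = seq.upper()
    if len(seq) < k:
        return 0.0
    distinct = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return len(distinct) / len(seq)

low = unique_kmer_fraction("AAAAAAAA", 4)    # homopolymer: a single distinct 4-mer
high = unique_kmer_fraction("ACGTTGCA", 4)   # every 4-mer window is different
```

Low-complexity or highly repetitive sequences score close to zero, while sequences with few repeated k-mers approach one.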

hexamer

Estimates coding density as the summed length of putative coding sequences divided by sequence length. The cobiont pipelines previously used a modified version of the old hexamer code. The relevant functionality is now available in an updated version of hexamer from https://github.com/richarddurbin/hexamer (to extract the estimated density, pipe stdout to awk '{ print $3/$2}').
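Where the density is needed inside a Python script rather than a shell pipeline, the awk one-liner can be mirrored as below. This is a sketch under assumptions: it takes whitespace-separated lines and divides the third field by the second, exactly as the awk command does, but treating the first field as a sequence identifier is a guess, and the example lines are invented rather than real hexamer output.

```python
def coding_density(lines):
    """Map field 1 to field3 / field2 for each whitespace-separated line."""
    densities = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        length = float(fields[1])
        if length > 0:
            densities[fields[0]] = float(fields[2]) / length
    return densities

# Invented example lines (assumed layout: name, sequence length, summed coding length).
example = ["seq1 1000 850", "seq2 2000 300"]
densities = coding_density(example)
```

Check the column layout against the output of the hexamer version in use before relying on this parser.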

fastk-medians

Calculates the median number of times each k-mer of size k (in this case k = 31) occurs across the whole set of sequences. Provides an approximation of coverage for reads (provided they are not highly repetitive), or repetitiveness for assembled contigs or scaffolds.
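The principle behind the median-based estimate can be shown with a small pure-Python sketch. It is not the FastK-based implementation: the function names are hypothetical, k is set to 3 instead of 31 so the toy sequences stay readable, k-mers are not canonicalised, and reporting one median per sequence is an assumed reading of the description above.

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    """Yield every k-mer window of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def median_kmer_counts(seqs, k):
    """For each sequence, the median dataset-wide count of the k-mers it contains."""
    totals = Counter()
    for seq in seqs.values():
        totals.update(kmers(seq, k))
    return {
        name: median(totals[km] for km in kmers(seq, k))
        for name, seq in seqs.items()
        if len(seq) >= k
    }

# Toy data: r1 and r2 overlap completely, r3 is a homopolymer.
seqs = {"r1": "ACGTAC", "r2": "ACGTAC", "r3": "TTTTTT"}
medians = median_kmer_counts(seqs, k=3)
```

Here r1 and r2 each receive a median of 2, matching the two-fold sampling of that region, whereas the repetitive r3 receives 4 even though it occurs once. This illustrates why the value approximates coverage only for reads that are not highly repetitive.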

Citation

If you use any of the code in this repository, please cite https://www.biorxiv.org/content/10.1101/2024.05.30.596622v1

