The MGEfinder workflow is described by the schematic below:
All that is required is a reference genome and short-read (less than 300 bp) sequencing data of the organism under study.
To make full use of the software, assemblies of the short-read sequencing data are also necessary.
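A quick way to confirm that a read set fits the short-read requirement is to scan the FASTQ file for its longest read. The sketch below is not part of MGEfinder; it assumes a plain four-lines-per-record FASTQ file (optionally gzip-compressed), and the function names are made up for illustration.

```python
import gzip
from typing import Iterator


def read_lengths(fastq_path: str) -> Iterator[int]:
    """Yield the length of every read in a 4-line-per-record FASTQ file."""
    # Assumption: files ending in ".gz" are gzip-compressed, all others are plain text.
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record holds the sequence
                yield len(line.strip())


def longest_read(fastq_path: str) -> int:
    """Return the longest read length in the file, or 0 if the file is empty."""
    return max(read_lengths(fastq_path), default=0)
```

For example, `longest_read("sample_R1.fastq.gz") < 300` (with a hypothetical file name) would indicate that the reads fall within the length range described above.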
The MGEfinder software suite provides tools to perform most of the analyses depicted above, including:
- Identifies candidate insertion sites and generates consensus sequences of insertion termini (see the `find` command).
- Pairs candidate insertion termini with each other (see the `pair` command).
- Infers sequence identity from a sequence assembly (see the `inferseq-assembly` command).
- Infers sequence identity from a reference genome (see the `inferseq-reference` command).
- Infers sequence identity from a database of known inserted elements (see the `inferseq-database` command).
- Infers sequence identity by attempting to merge termini when they are long enough to overlap and span the entire insertion (see the `inferseq-overlap` command).
- Clusters identified sequences by similarity.
- Genotypes insertions with respect to a reference genome.
- Produces FASTA files of all identified elements.
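The idea of merging two termini that overlap can be illustrated with a toy example. The sketch below is not MGEfinder's implementation: it assumes both termini are on the same strand, requires an exact match between the end of one terminus and the start of the other, and uses a hypothetical `min_overlap` parameter. A real tool would likely also need to handle mismatches and orientation.

```python
from typing import Optional


def merge_termini(left: str, right: str, min_overlap: int = 10) -> Optional[str]:
    """Merge two termini if a suffix of `left` exactly matches a prefix of `right`.

    Returns the merged sequence, or None when no exact overlap of at least
    `min_overlap` bases exists.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    max_size = min(len(left), len(right))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_size, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None


# Toy termini sharing the 5-base overlap "CCGGG".
merged = merge_termini("AAACCCGGG", "CCGGGTTT", min_overlap=3)
print(merged)  # AAACCCGGGTTT
```

When the termini are too short to meet in the middle of the insertion, no overlap is found and the function returns `None`; in that situation the other inference approaches listed above (assembly, reference, or database) are the ones that can recover the full sequence.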
It does NOT perform the following steps:
- Alignment of isolate sequencing reads to the reference genome (We recommend BWA MEM for this step.)
- Assembly of short-read sequencing reads (We recommend SPAdes for this step.)
- (We recommend CD-HIT-EST).
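The external alignment and assembly steps can be scripted before running MGEfinder. The following is a minimal sketch, assuming paired-end reads and that `bwa`, `samtools`, and `spades.py` are on the `PATH`. The use of samtools to sort and index the alignment, the helper function names, and all file names are assumptions made for illustration; consult the MGEfinder manual for the exact input requirements and any recommended options.

```python
import subprocess
from typing import List


def bwa_mem_command(reference: str, reads_1: str, reads_2: str) -> List[str]:
    """Command that aligns paired-end reads to the reference with BWA MEM."""
    return ["bwa", "mem", reference, reads_1, reads_2]


def samtools_sort_command(output_bam: str) -> List[str]:
    """Command that sorts alignments read from stdin ("-") into a BAM file."""
    return ["samtools", "sort", "-o", output_bam, "-"]


def spades_command(reads_1: str, reads_2: str, out_dir: str) -> List[str]:
    """Command that assembles paired-end reads with SPAdes."""
    return ["spades.py", "-1", reads_1, "-2", reads_2, "-o", out_dir]


def align_and_sort(reference: str, reads_1: str, reads_2: str, output_bam: str) -> None:
    """Index the reference, pipe BWA MEM output into samtools sort, then index the BAM."""
    subprocess.run(["bwa", "index", reference], check=True)
    bwa = subprocess.Popen(bwa_mem_command(reference, reads_1, reads_2),
                           stdout=subprocess.PIPE)
    subprocess.run(samtools_sort_command(output_bam), stdin=bwa.stdout, check=True)
    bwa.stdout.close()
    if bwa.wait() != 0:
        raise RuntimeError("bwa mem failed")
    subprocess.run(["samtools", "index", output_bam], check=True)


def assemble(reads_1: str, reads_2: str, out_dir: str) -> None:
    """Run SPAdes on the same read pair used for alignment."""
    subprocess.run(spades_command(reads_1, reads_2, out_dir), check=True)
```

Keeping the command construction in small functions makes it easy to print and review the exact commands before anything is executed.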
Identifying large structural variants from short-read sequencing data is always an inferential task. The insertions identified in this pipeline are inferred, but we try to increase our sensitivity by combining several different inference approaches. However, there are limitations that should be taken into account with this analysis, such as:
- We only identify those insertions that occur within the reference genome used.
- More complicated insertions, such as insertions within insertions, are likely to be missed, especially if they include repetitive elements and cannot be assembled within the expected sequence context.
- Assembly errors can be mistaken for insertions, so inspect results carefully before drawing broad conclusions.
For more detailed information on how MGEfinder works, see the tutorial and the manual.