Back to main page

How it works

The MGEfinder workflow is described by the schematic below:

[Figure: schematic of the MGEfinder workflow]

All that is required is a reference genome and short-read (less than 300 bp) sequencing data of the organism under study.
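As a quick sanity check on the read-length requirement, the following minimal sketch counts reads that are not shorter than 300 bp. It is not part of MGEfinder; the function names are hypothetical, and it assumes plain, unwrapped four-line FASTQ records.

```python
from typing import Iterable, Iterator

MAX_READ_LENGTH = 300  # threshold taken from the requirement stated above


def fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each four-line FASTQ record (assumes no line wrapping)."""
    for index, line in enumerate(lines):
        if index % 4 == 1:  # the second line of each record holds the bases
            yield line.strip()


def reads_over_limit(lines: Iterable[str], limit: int = MAX_READ_LENGTH) -> int:
    """Count reads whose length is `limit` or more, i.e. not 'less than 300 bp'."""
    return sum(1 for seq in fastq_sequences(lines) if len(seq) >= limit)


# Hypothetical in-memory example: one 100 bp read and one 320 bp read.
example = [
    "@read1", "ACGT" * 25, "+", "I" * 100,
    "@read2", "ACGT" * 80, "+", "I" * 320,
]
print(reads_over_limit(example))
```

For a real file, the same function could be given an open file handle, for example `reads_over_limit(open("reads.fastq"))`, where the file name is a placeholder.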

To make full use of the software, assemblies of the short-read sequencing data are necessary.

The MGEfinder software suite provides tools to perform most of the analyses depicted above, including:

  • Identifies candidate insertion sites and generates consensus sequences of insertion termini (see find command)
  • Pairs candidate insertion termini with each other (see pair command)
  • Infers sequence identity from a sequence assembly (see inferseq-assembly command)
  • Infers sequence identity from a reference genome (see inferseq-reference command)
  • Infers sequence identity from a database of known inserted elements (see inferseq-database command)
  • Infers sequence identity by attempting to merge termini in the event that they are long enough to overlap and span the entire insertion (see inferseq-overlap command)
  • Clusters identified sequences by similarity
  • Genotypes insertions with respect to a reference genome
  • Produces FASTA files of all identified elements
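To illustrate the idea behind merging overlapping termini, here is a toy sketch that joins two terminus sequences when a suffix of one exactly matches a prefix of the other. It is an assumption-laden illustration, not the algorithm used by the inferseq-overlap command, which may handle mismatches, orientation, and other details differently. The function name, the `min_overlap` parameter, and the example sequences are all hypothetical.

```python
from typing import Optional


def merge_termini(left: str, right: str, min_overlap: int = 10) -> Optional[str]:
    """Merge two termini if a suffix of `left` exactly equals a prefix of `right`.

    Returns the merged sequence, or None when no exact overlap of at least
    `min_overlap` bases exists. Toy illustration only.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest = min(len(left), len(right))
    # Try the longest candidate overlap first and stop at the first exact match.
    for size in range(longest, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None


# Hypothetical termini that share a 9-base overlap ("CCGATTGCA").
left_terminus = "GGATCCTTAGCAGTACCGATTGCA"
right_terminus = "CCGATTGCATTCGGAACTGGTAAC"
print(merge_termini(left_terminus, right_terminus, min_overlap=8))
```

When the termini are too short to overlap, the function returns `None`, which corresponds to the case where the full insertion has to be inferred by one of the other approaches listed above.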

It does NOT perform the following steps:

  • Alignment of isolate sequencing reads to the reference genome (We recommend BWA MEM for this step.)
  • Assembly of short sequencing reads (We recommend SPAdes for this step.)
  • (We recommend CD-HIT-EST).
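The alignment and assembly steps that fall outside MGEfinder could be prepared along the following lines. This is a minimal sketch that only builds argument lists for BWA and SPAdes and does not run them. The file and directory names are placeholders, the helper function names are hypothetical, and paired-end reads are assumed. Note that `bwa mem` writes SAM to standard output, so any conversion, sorting, or indexing of the alignment is left out here.

```python
from typing import List


def bwa_mem_commands(reference: str, reads_1: str, reads_2: str) -> List[List[str]]:
    """Return the BWA commands to index a reference and align paired-end reads."""
    return [
        ["bwa", "index", reference],                  # build the BWA index
        ["bwa", "mem", reference, reads_1, reads_2],  # SAM output goes to stdout
    ]


def spades_command(reads_1: str, reads_2: str, out_dir: str) -> List[str]:
    """Return the SPAdes command to assemble paired-end reads into `out_dir`."""
    return ["spades.py", "-1", reads_1, "-2", reads_2, "-o", out_dir]


# Placeholder file names; substitute your own paths.
for command in bwa_mem_commands("reference.fna", "reads_R1.fastq", "reads_R2.fastq"):
    print(" ".join(command))
print(" ".join(spades_command("reads_R1.fastq", "reads_R2.fastq", "assembly_out")))
```

Each list can be passed to `subprocess.run` once the tools are installed; building the commands as lists rather than shell strings avoids quoting problems with unusual file names.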

Limitations

Identifying large structural variants from short-read sequencing data is always an inferential task. The insertions identified in this pipeline are inferred, but we try to increase our sensitivity by combining several different inference approaches. However, there are limitations that should be taken into account with this analysis, such as:

  1. We only identify insertions at sites that are present in the reference genome used.
  2. More complicated insertions, such as insertions within insertions, are likely to be missed, especially if they include repetitive elements and cannot be assembled within the expected sequence context.
  3. Assembly errors can be mistaken for insertions, so inspect results carefully before drawing broad conclusions.

For more detailed information on how MGEfinder works, see the tutorial and the manual.

NEXT: Install or update software