This repository has been archived by the owner on Aug 26, 2023. It is now read-only.

core features

Paulo Roberto de Oliveira Castro edited this page Apr 3, 2015 · 4 revisions

IO

We plan to explore using Ragel to generate parsers for as many file types as possible (see this thread). Writers will still have to be written by hand.

Sequence formats

  • FASTA
  • FASTQ
  • GenBank
  • EMBL

Many more formats are listed at http://www.bioperl.org/wiki/HOWTO:SeqIO, but a lot of them are antiquated. I think we should prioritise by popularity: the sooner BioJulia is useful, the better for the community.
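To make the reader/writer split concrete, here is a rough sketch of what a FASTA round trip might look like. It is hand-written purely for illustration and is not Ragel-generated; `FastaRecord`, `readfasta` and `writefasta` are hypothetical names, not an existing BioJulia API, and the writer assumes ASCII sequence data.

```julia
# Hypothetical record type, for illustration only.
struct FastaRecord
    name::String   # header line without the leading '>'
    seq::String    # sequence with line breaks removed
end

# Read every record from an IO stream. Blank lines are skipped.
function readfasta(io::IO)
    records = FastaRecord[]
    name = nothing
    buf = IOBuffer()
    for line in eachline(io)
        line = strip(line)
        isempty(line) && continue
        if startswith(line, ">")
            # A new header closes the previous record, if there was one.
            name === nothing || push!(records, FastaRecord(name, String(take!(buf))))
            name = String(line[2:end])
        else
            name === nothing && error("sequence data before first header")
            print(buf, line)
        end
    end
    name === nothing || push!(records, FastaRecord(name, String(take!(buf))))
    return records
end

# Write one record, wrapping the sequence at `width` characters.
# Assumes an ASCII sequence, so byte indices equal character indices.
function writefasta(io::IO, rec::FastaRecord; width::Int=60)
    println(io, ">", rec.name)
    n = ncodeunits(rec.seq)
    for i in 1:width:n
        println(io, rec.seq[i:min(i + width - 1, n)])
    end
end

records = readfasta(IOBuffer(">seq1 example\nACGT\nACGT\n\n>seq2\nGGCC\n"))
out = IOBuffer()
writefasta(out, records[1]; width=4)
roundtrip = String(take!(out))
```

A real implementation would probably stream records one at a time rather than collect them into a vector, and would use a proper sequence type instead of `String`.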

Annotation formats

  • GFF & GTF (this is messy in most languages - it would be great if we could cleanly handle all the quirks)
  • BED
  • VCF

Alignment formats

  • BLAST (tabular/long form)
  • MultiFASTA aligned
  • CLUSTAL
  • BAM/SAM
  • Phylip
  • PFAM

Tree formats

  • Newick (can be ported from Phylogenetics.jl)
  • Nexus
  • PhyloXML

Also database connectors, e.g. for BioSQL.

Data structures

We'll want to have representations of:

  • DNA, RNA and amino acid sequences
  • ranges and features of sequences (where the sequence may or may not be present)
  • alignments - pairwise and multiple
  • graph-derivative structures like phylogenetic trees, genetic networks and biochemical pathways
  • probabilistic models of sequences (e.g. motifs - perhaps this isn't a high priority)

Having a solid interval tree implementation would enable a lot of common operations on genome annotations: counting, intersecting, extending, etc. We should also look at what parts of Diego's BioSeq.jl we can incorporate. We'll have to extend those sequence representations to attach metadata to sequences, but that shouldn't be too hard.
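As a rough sketch of the kind of query an interval tree makes cheap, here is a minimal static version: a balanced binary search tree keyed on interval start, where each node also records the largest end coordinate in its subtree so that whole subtrees can be skipped. The names `Feature`, `Node`, `buildtree` and `overlaps` are hypothetical, intervals are assumed to be closed and 1-based, and insertion and deletion are left out.

```julia
# Hypothetical annotation: a closed interval [start, stop] with a name.
struct Feature
    start::Int
    stop::Int
    name::String
end

# Tree node: BST ordered by `start`, augmented with the maximum `stop`
# found anywhere in the subtree rooted at this node.
struct Node
    feature::Feature
    maxstop::Int
    left::Union{Node,Nothing}
    right::Union{Node,Nothing}
end

# Build a balanced tree from a vector that is already sorted by `start`.
function build(sorted::Vector{Feature}, lo::Int=1, hi::Int=length(sorted))
    lo > hi && return nothing
    mid = (lo + hi) ÷ 2
    left = build(sorted, lo, mid - 1)
    right = build(sorted, mid + 1, hi)
    maxstop = sorted[mid].stop
    left === nothing || (maxstop = max(maxstop, left.maxstop))
    right === nothing || (maxstop = max(maxstop, right.maxstop))
    return Node(sorted[mid], maxstop, left, right)
end

buildtree(features::Vector{Feature}) = build(sort(features, by = f -> f.start))

# Collect every feature overlapping [qstart, qstop], in order of `start`.
function overlaps!(out::Vector{Feature}, node::Union{Node,Nothing}, qstart::Int, qstop::Int)
    node === nothing && return out
    # Nothing in this subtree extends far enough right to reach the query.
    node.maxstop < qstart && return out
    overlaps!(out, node.left, qstart, qstop)
    f = node.feature
    if f.start <= qstop
        f.stop >= qstart && push!(out, f)
        # Features in the right subtree start at or after f.start, so they
        # only need checking when f itself starts before the query ends.
        overlaps!(out, node.right, qstart, qstop)
    end
    return out
end

overlaps(tree, qstart::Int, qstop::Int) = overlaps!(Feature[], tree, qstart, qstop)

features = [Feature(100, 200, "geneA"), Feature(150, 300, "geneB"),
            Feature(400, 500, "geneC"), Feature(50, 450, "geneD")]
tree = buildtree(features)
hits = overlaps(tree, 250, 410)   # intersecting
nhits = length(hits)              # counting
```

Counting and intersecting both fall out of the same query; extending a feature would mean rebuilding or rebalancing, which a production implementation would have to handle.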

Tool wrappers

  • BLAST
  • Blat
  • bowtie/2
  • bwa
  • HMMER
  • Primer3
  • Phylogenetic tools (clustal, mafft, PAML, phylip)
  • samtools (unless we can do something faster in our own sam/bam implementation)
  • signalP/targetP
  • assemblers: velvet/oases, trinity, soapdenovo

Service APIs

  • BioMart
  • Ensembl
  • EMBL
  • NCBI
  • SRA

Datasets

  • genome sequences
  • genome annotations
  • gene ontologies