A python library for creating simulated regulatory DNA sequences

kundajelab/simdna

Installation

git clone https://github.com/kundajelab/simdna.git
cd simdna
python setup.py develop

Overview

This is a tool for generating simulated regulatory sequences for use in experiments and analyses. SimDNA facilitates two "phases" of design. The first is generating realistic background sequences. The second is embedding elements of interest in an existing sequence or set of sequences. These phases are captured by the two core classes underlying the package's basic functioning: backgrounds are generated by a BackgroundGenerator, and elements are then embedded by calling Embedders on the generated sequences.

Backgrounds

SimDNA affords several ways of generating background sequences to embed elements into. The simplest is generating a set of completely randomized background sequences according to a set of probabilities for each individual nucleotide. See Background Sequence for more information on different ways to generate a background.

Embedders

An Embedder often consists of an EmbeddableGenerator and a PositionGenerator. The EmbeddableGenerator produces the motif/grammar instance to be embedded, and the PositionGenerator determines the placement of the motif/grammar. A single Embedder may insert multiple motifs into a sequence. The RepeatedEmbedder class (which is itself a kind of Embedder) may be used to call any Embedder multiple times, where the number of calls is returned by a QuantityGenerator. This number can be fixed or sampled from some distribution. See Embedders for more information on different ways to construct embedders.

Pipeline

At a high level, SimDNA sequence generators are assembled in a modular fashion, where the high-level classes build upon the outputs of lower-level classes. The most high-level class is the SequenceSetGenerator, which can be supplied to the printSequences function to generate a collection of sequences.

  • An example of a SequenceSetGenerator is GenerateSequenceNTimes, which takes two arguments: a SingleSequenceGenerator (which generates individual sequences) and a number N that determines how many times to call the SingleSequenceGenerator.
  • An example of a SingleSequenceGenerator is EmbedInABackground. The EmbedInABackground class takes two arguments: a BackgroundGenerator and a list of Embedder objects. The BackgroundGenerator generates the background sequence, and then the Embedder objects are called successively to insert patterns into the background sequence.
  • An example of an Embedder object is the SubstringEmbedder. A SubstringEmbedder consists of two parts: a SubstringGenerator and a PositionGenerator. The SubstringGenerator produces the DNA string to be embedded, and the PositionGenerator determines the position at which the DNA string will be embedded.
  • An example of a SubstringGenerator is a PwmSampler, which samples from a PWM (or, more accurately, a PFM a.k.a. a Position Frequency Matrix). A PwmSampler is instantiated using a Pwm object that is defined by specifying the matrix of letter frequencies.

A PwmSampler can optionally be wrapped in a ReverseComplementWrapper. The ReverseComplementWrapper is itself a type of SubstringGenerator that contains an inner SubstringGenerator. The ReverseComplementWrapper will call the inner SubstringGenerator and reverse-complement the resulting DNA string with 50% probability. Similarly, the SubstringEmbedder can be wrapped in a RepeatedEmbedder. The RepeatedEmbedder is itself a type of Embedder that has two parts: an inner Embedder object and a QuantityGenerator. The RepeatedEmbedder will call the inner Embedder a number of times that is determined by the QuantityGenerator. Building up the sequence generator in this modular way makes it easy to mix and match functionality. For example, it is possible to define a new type of quantity generator that acts as a wrapper for other quantity generators (such as the ZeroInflater or MinMaxWrapper) while still having access to all the other simdna classes.
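
The wrapper pattern described above can be illustrated with a small standalone sketch. It does not use simdna; the class names FixedSubstringGenerator and ReverseComplementSketch and the generate method are hypothetical stand-ins chosen for illustration, and the only behavior taken from the text is the 50% reverse-complement step.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


class FixedSubstringGenerator:
    """Hypothetical inner generator that always returns the same string."""

    def __init__(self, substring):
        self.substring = substring

    def generate(self, rng):
        return self.substring


class ReverseComplementSketch:
    """Hypothetical wrapper: reverse-complements the inner output with probability prob."""

    def __init__(self, inner, prob=0.5):
        self.inner = inner
        self.prob = prob

    def generate(self, rng):
        seq = self.inner.generate(rng)
        return reverse_complement(seq) if rng.random() < self.prob else seq


rng = random.Random(0)
wrapped = ReverseComplementSketch(FixedSubstringGenerator("GATTACA"))
# Over many draws, both orientations of the motif show up.
samples = {wrapped.generate(rng) for _ in range(200)}
print(samples)
```

Because the wrapper exposes the same generate method as the object it wraps, it can be passed anywhere the inner generator could be, which is the property that makes the mix-and-match composition possible.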

A simple example

Here SimDNA is used to construct 1000 sequences, each 200 nucleotides long, with backgrounds sampled from a specified distribution. Each of these sequences will have the TAL4 motif (in either the forward or reverse-complement orientation) embedded between one and three times, at random positions.

The embedding pipeline:

import simdna
from simdna import synthetic
# create a PWM object to represent the motif
thepwm = simdna.pwm.PWM('TAL4').addRows(matrix_of_letter_probabilities).finalise(pseudocountProb=0.001)
# the class that samples from the pwm
pwmsampler = synthetic.PwmSampler(thepwm)
# a wrapper that will randomly take the reverse complement of an embeddable string returned by pwm sampler;
# this allows embedding motifs in both orientations
rc_pwmsampler = synthetic.ReverseComplementWrapper(pwmsampler)
# an embedder that places a returned element at a sampled position; SubstringEmbedder samples the positionGenerator
# each time it embeds, so the positionGenerator can return the same or different positions each time; here it draws
# a random position from a uniform distribution; SubstringEmbedder will not overwrite a previously embedded element
mult_rc_pwmembedder = synthetic.SubstringEmbedder(rc_pwmsampler, positionGenerator=synthetic.UniformPositionGenerator())
# a wrapper that embeds a returned element multiple times; RepeatedEmbedder samples the quantityGenerator each
# time it is called and embeds that many elements in the given sequence
repeatedpwmembedder = synthetic.RepeatedEmbedder(
                          mult_rc_pwmembedder, 
                          quantityGenerator=synthetic.UniformIntegerGenerator(minVal=1, maxVal=3)
                      )

The background generator:

# the background generator (“zero order” refers to the order of the Markov model; this randomly samples each
# nucleotide independently)
bggen = synthetic.ZeroOrderBackgroundGenerator(seqLength=200, 
                                               discreteDistribution={'A': 0.27, 'C': 0.23, 'G': 0.23, 'T': 0.27})

Putting it all together:

# this combines the background and the pipeline for generation
seq_sim = synthetic.EmbedInABackground(backgroundGenerator=bggen, embedders=[repeatedpwmembedder])
# create a generator to run the pipeline, from generating a background through the embedding pipeline N times
sequence_set = synthetic.GenerateSequenceNTimes(seq_sim, 1000)
# actually generate and save the sequences from the pipeline
synthetic.printSequences("sequences.simdata", sequence_set, 
                         includeFasta=True, includeEmbeddings=True, prefix="myprefix")

Reading a simdata file

The simdata file encodes all the sequences, as well as all of the embedded motifs in each sequence. These can be read using the read_simdata_file function:

import simdna.synthetic
data = simdna.synthetic.read_simdata_file("sim.simdata")  

for sequence, embeddings in zip(data.sequences, data.embeddings):
    ...

This code allows iterating over all generated sequences and the motifs embedded in those sequences. This is helpful in actually using the simulated sequences to perform computational experiments.

Examples

Please see the scripts folder for example scripts generating simulations and the scripts_test folder for example arguments.

  • densityMotifSimulation.py generates a simulated dataset where multiple instances of motifs are present per sequence, as determined by a Poisson distribution that can optionally be subject to zero-inflation.
  • motifGrammarSimulation.py illustrates how to set up a simulation where two motifs have a fixed-spacing or variable-spacing grammar (set --generationSetting to twoMotifsFixedSpacing or twoMotifsVariableSpacing as desired).
  • emptyBackground.py just generates a background sequence with no motifs embedded.

Creating Custom Simulations

Loading motifs

The SimDNA package comes with the ENCODE and HOCOMOCO databases of motifs, so any motif in those databases can be loaded as a PWM with the LoadedMotifs class (as in the examples). SimDNA can also load motifs in the ENCODE, Homer, and Jaspar formats. To load your own motifs, create a single file containing multiple motifs in any of these formats and load it accordingly.

Encode:

loadedMotifs = synthetic.LoadedEncodeMotifs({path})

Homer:

loadedMotifs = synthetic.LoadedHomerMotifs({path})

Jaspar:

loadedMotifs = synthetic.LoadedJasparRawPMFMotifs({path})

Background Sequence

SimDNA affords several ways of generating background sequences to embed elements into.

The first and simplest is generating a set of completely randomized background sequences according to a set of probabilities for each individual nucleotide.

background_gen = synthetic.ZeroOrderBackgroundGenerator(seqLength={seqLength}, 
                                                        discreteDistribution={nucleotide_distribution})
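
As a rough illustration of what zero-order sampling means, the following standalone sketch draws each nucleotide independently from a given distribution. It does not call simdna; the function name zero_order_background is hypothetical, and the distribution is the one used in the simple example earlier in this document.

```python
import random


def zero_order_background(seq_length, distribution, rng):
    """Sample seq_length letters independently from a {letter: probability} dict."""
    letters = list(distribution)
    weights = [distribution[letter] for letter in letters]
    return "".join(rng.choices(letters, weights=weights, k=seq_length))


rng = random.Random(1)
background = zero_order_background(
    200, {"A": 0.27, "C": 0.23, "G": 0.23, "T": 0.27}, rng
)
print(background[:40])
```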

The second, slightly more sophisticated randomized generator uses a first-order Markov chain so that the generated sequences follow a given dinucleotide frequency distribution.

background_gen = synthetic.FirstOrderBackgroundGenerator(seqLength={seqLength},
                                                         priorFrequencies={nuc_distribution},
                                                         dinucFrequencies={dinuc_distribution})
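
The idea behind a first-order background can be sketched without simdna as follows. This is an illustration under stated assumptions, not the library's implementation: the function name first_order_background is hypothetical, the dinucleotide table is assumed to be keyed by two-letter strings, and each next letter is drawn with weight proportional to the frequency of the dinucleotide it would form with the previous letter.

```python
import random


def first_order_background(seq_length, prior, dinuc, rng):
    """Sample a sequence (seq_length >= 1) from a first-order Markov chain.

    prior: {letter: weight} for the first position.
    dinuc: {two-letter string: weight} for adjacent pairs (assumed layout).
    """
    letters = sorted(prior)
    # Weights for P(next | prev) are proportional to the dinucleotide weight.
    transitions = {
        prev: [dinuc[prev + nxt] for nxt in letters] for prev in letters
    }
    seq = rng.choices(letters, weights=[prior[l] for l in letters], k=1)
    while len(seq) < seq_length:
        seq.append(rng.choices(letters, weights=transitions[seq[-1]], k=1)[0])
    return "".join(seq)


# Illustrative input: all dinucleotides equally weighted except CG, which is forbidden.
prior = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
dinuc = {a + b: 1.0 for a in "ACGT" for b in "ACGT"}
dinuc["CG"] = 0.0

sequence = first_order_background(500, prior, dinuc, random.Random(2))
print("CG" in sequence)
```

Setting the CG weight to zero is an extreme case chosen so the effect is easy to check; realistic tables would only deplete CG rather than remove it.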

A third method is to supply a sequence and dinucleotide-shuffle it.

PWM Sampler

PwmSampler includes three ways to sample from a PWM: the first is sampling randomly from the PWM; the second is taking the best hit of the PWM; the third is sampling only motifs that achieve some minimum log-odds score relative to a background, via the minScore argument.
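
The three sampling modes can be sketched in plain Python as below. This is not simdna code: the function names, the example matrix, the use of natural logarithms, and the rejection loop used for the minimum-score mode are all assumptions made for illustration.

```python
import math
import random

LETTERS = "ACGT"
# Illustrative 3-position frequency matrix; columns follow LETTERS order.
pfm = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.7, 0.1, 0.1],
]
background = {letter: 0.25 for letter in LETTERS}


def sample_from_pfm(pfm, rng):
    """Mode 1: draw one letter per position according to that row's frequencies."""
    return "".join(rng.choices(LETTERS, weights=row, k=1)[0] for row in pfm)


def best_hit(pfm):
    """Mode 2: take the most frequent letter at every position."""
    return "".join(
        LETTERS[max(range(len(LETTERS)), key=lambda i: row[i])] for row in pfm
    )


def log_odds(seq, pfm, background):
    """Sum over positions of log(motif frequency / background frequency)."""
    return sum(
        math.log(row[LETTERS.index(base)] / background[base])
        for row, base in zip(pfm, seq)
    )


def sample_with_min_score(pfm, background, min_score, rng, max_tries=10000):
    """Mode 3 (assumed strategy): resample until the score threshold is met."""
    for _ in range(max_tries):
        seq = sample_from_pfm(pfm, rng)
        if log_odds(seq, pfm, background) >= min_score:
            return seq
    raise ValueError("no sample reached min_score within max_tries")


rng = random.Random(3)
random_site = sample_from_pfm(pfm, rng)
filtered_site = sample_with_min_score(pfm, background, 1.0, rng)
print(random_site, best_hit(pfm), filtered_site)
```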

Embeddables

In addition to the basic string embeddable, there is also a PairEmbeddable, which allows embedding two embeddables with a given separation. PairEmbeddables can be nested to allow embedding any number of motifs with fixed spacing. Being able to set a given spacing is useful when you want to simulate motif-motif interactions and other more complex elements of cis-regulatory grammars.
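
The nesting idea can be shown with a minimal standalone sketch. The classes StringPiece and PairPiece and their write method are hypothetical and only mimic the concept; they are not simdna's classes or method names.

```python
class StringPiece:
    """Hypothetical embeddable holding a single DNA string."""

    def __init__(self, text):
        self.text = text

    def __len__(self):
        return len(self.text)

    def write(self, seq, start):
        # seq is a list of characters that is modified in place.
        seq[start:start + len(self.text)] = self.text


class PairPiece:
    """Hypothetical pair: two embeddables separated by a fixed gap."""

    def __init__(self, left, right, separation):
        self.left = left
        self.right = right
        self.separation = separation

    def __len__(self):
        return len(self.left) + self.separation + len(self.right)

    def write(self, seq, start):
        self.left.write(seq, start)
        self.right.write(seq, start + len(self.left) + self.separation)


# Nest a pair inside a pair to place three motifs with fixed spacing.
pair = PairPiece(StringPiece("AAA"), StringPiece("CC"), separation=2)
triple = PairPiece(pair, StringPiece("GGG"), separation=1)

sequence = list("N" * 20)
triple.write(sequence, 4)
print("".join(sequence))  # NNNNAAANNCCNGGGNNNNN
```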

Embedders

An embedder is an object that embeds an embeddable. Embedders take an EmbeddableGenerator and a PositionGenerator and draw a position at which to embed the embeddable in the background string passed into the _embed function. The most common embedder is the SubstringEmbedder, which will be the main embedder used to add motifs to background sequences. A RepeatedEmbedder takes an embedder and calls it a number of times drawn from a QuantityGenerator. An XOREmbedder takes two embedders and, with a given probability, embeds one or the other. RandomSubsetOfEmbedders takes a list of embedders, draws a number from a quantity generator, and then selects that many embedders from the list at random. Finally, AllEmbedders takes a list of embedders and embeds them all.
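
Elsewhere this document notes that SubstringEmbedder will not overwrite a previously embedded element. One way such behavior could work is sketched below; the retry-on-collision strategy, the function name embed_without_overlap, and the occupied list are assumptions for illustration and are not taken from simdna's source.

```python
import random


def embed_without_overlap(seq, occupied, motif, rng, max_tries=100):
    """Write motif into seq (a list of chars) at a free uniform-random position.

    occupied is a parallel list of booleans marking already-embedded bases.
    Returns the chosen start index.
    """
    for _ in range(max_tries):
        start = rng.randint(0, len(seq) - len(motif))
        span = range(start, start + len(motif))
        if not any(occupied[i] for i in span):
            seq[start:start + len(motif)] = motif
            for i in span:
                occupied[i] = True
            return start
    raise RuntimeError("could not find a free position")


rng = random.Random(4)
motif = "TTGACA"
seq = list("N" * 60)
occupied = [False] * len(seq)

# Repeated embedding: call the same embedding step three times.
starts = [embed_without_overlap(seq, occupied, motif, rng) for _ in range(3)]
print("".join(seq))
```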

PositionGenerators

The main types of PositionGenerator are the FixedPositionGenerator and the UniformPositionGenerator, but SimDNA also supports a NormalDistributionPositionGenerator (which acts as a truncated normal centered in the sequence, with optional offsets) and InsideCentralBP and OutsideCentralBP generators, which can be used to sample uniformly from subsets of the sequence.
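
The following standalone sketch shows one plausible reading of two of these position strategies. The function names, the rejection-based truncation, and the convention that the embedded element must lie entirely inside the central window are assumptions, not a description of simdna's implementation.

```python
import random


def central_normal_position(seq_len, motif_len, std, rng, offset=0):
    """Truncated normal start position, centered so the motif sits mid-sequence."""
    center = (seq_len - motif_len) / 2 + offset
    while True:
        start = int(round(rng.gauss(center, std)))
        # Reject draws that would place the motif outside the sequence.
        if 0 <= start <= seq_len - motif_len:
            return start


def inside_central_bp(seq_len, motif_len, central_bp, rng):
    """Uniform start position with the motif fully inside the central window.

    Assumes central_bp >= motif_len.
    """
    left = (seq_len - central_bp) // 2
    return rng.randint(left, left + central_bp - motif_len)


rng = random.Random(5)
normal_positions = [central_normal_position(200, 10, 30.0, rng) for _ in range(500)]
central_positions = [inside_central_bp(200, 10, 50, rng) for _ in range(500)]
print(min(normal_positions), max(normal_positions))
print(min(central_positions), max(central_positions))
```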

QuantityGenerators

SimDNA supports FixedQuantityGenerator, which returns a fixed value; ChooseValueFromASet, which randomly samples from a given set with given probabilities; and UniformIntegerGenerator, PoissonQuantityGenerator, and BernoulliQuantityGenerator, each of which samples from the specified distribution. There are also two important wrapper generators that manipulate the other generators: MinMaxWrapper, which forces a generated sample to be within a set range, and ZeroInflater, which with a given probability returns 0 instead of the quantity generated.
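
The wrapper generators can be illustrated with plain functions, as in the sketch below. None of these names come from simdna; in particular, enforcing the range by resampling (rather than clamping) is an assumption, and the Poisson sampler uses Knuth's multiplication method because the Python standard library has no built-in one.

```python
import math
import random


def uniform_integer(min_val, max_val):
    """Quantity generator: uniform integer in [min_val, max_val]."""
    return lambda rng: rng.randint(min_val, max_val)


def poisson(mean):
    """Quantity generator: Poisson draw via Knuth's multiplication method."""
    def generate(rng):
        threshold = math.exp(-mean)
        count, product = 0, rng.random()
        while product > threshold:
            count += 1
            product *= rng.random()
        return count
    return generate


def zero_inflate(inner, zero_prob):
    """Wrapper: return 0 with probability zero_prob, else defer to inner."""
    def generate(rng):
        return 0 if rng.random() < zero_prob else inner(rng)
    return generate


def min_max_wrap(inner, min_val, max_val):
    """Wrapper (assumed strategy): resample inner until the value is in range."""
    def generate(rng):
        while True:
            value = inner(rng)
            if min_val <= value <= max_val:
                return value
    return generate


rng = random.Random(6)
bounded_poisson = min_max_wrap(poisson(2.0), 1, 4)
counts = [bounded_poisson(rng) for _ in range(500)]
sparse = zero_inflate(uniform_integer(1, 3), zero_prob=0.5)
print(counts[:10], [sparse(rng) for _ in range(10)])
```

Since every generator here is just a callable that takes a random source, wrappers can be stacked in any order, for example zero-inflating a range-limited Poisson generator.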