rnn: recurrent neural networks

This is a Recurrent Neural Network library that extends Torch's nn. You can use it to build RNNs, LSTMs, BRNNs, BLSTMs, and so forth and so on. This library includes documentation for the following objects:

Modules that consider successive calls to forward different time-steps in a sequence :

AbstractRecurrent : an abstract class inherited by Recurrent and LSTM;
Recurrent : a generalized recurrent neural network container;
LSTM : a vanilla Long-Short Term Memory module;
FastLSTM : a faster LSTM;
Recursor : decorates a module to make it conform to the AbstractRecurrent interface;
Recurrence : decorates a module that outputs output(t) given {input(t), output(t-1)};

Modules that forward entire sequences through a decorated AbstractRecurrent instance :

AbstractSequencer : an abstract class inherited by Sequencer, Repeater, RecurrentAttention, etc.;
Sequencer : applies an encapsulated module to all elements in an input sequence;
BiSequencer : used for implementing Bidirectional RNNs and LSTMs;
BiSequencerLM : used for implementing Bidirectional RNNs and LSTMs for language models;
Repeater : repeatedly applies the same input to an AbstractRecurrent instance;
RecurrentAttention : a generalized attention model for REINFORCE modules;

Miscellaneous modules :

MaskZero : zeroes the output rows of the decorated module for commensurate input rows which are tensors of zeros.
LookupTableMaskZero : extends nn.LookupTable to support zero indexes for padding. Zero indexes are forwarded as tensors of zeros.

Criterions used for handling sequential inputs and targets :

SequencerCriterion : sequentially applies the same criterion to a sequence of inputs and targets;
RepeaterCriterion : repeatedly applies the same criterion with the same target on a sequence;

Examples

The following are example training scripts using this package :

RNN/LSTM for Penn Tree Bank dataset;
Recurrent Model for Visual Attention for the MNIST dataset;
RNN/LSTM/BRNN/BLSTM training for Penn Tree Bank or Google Billion Words datasets;
A brief (1 hours) overview of Torch7, which includes some details about the rnn packages (at the end), is available via this NVIDIA GTC Webinar video. In any case, this presentation gives a nice overview of Logistic Regression, Multi-Layer Perceptrons, Convolutional Neural Networks and Recurrent Neural Networks using Torch7.

AbstractRecurrent

An abstract class inherited by Recurrent and LSTM. The constructor takes a single argument :

rnn = nn.AbstractRecurrent(rho)

Argument rho is the maximum number of steps to backpropagate through time (BPTT). Sub-classes can set this to a large number like 99999 if they want to backpropagate through the entire sequence whatever its length. Setting lower values of rho are useful when long sequences are forward propagated, but we only whish to backpropagate through the last rho steps, which means that the remainder of the sequence doesn't need to be stored (so no additional cost).

[recurrentModule] getStepModule(step)

Returns a module for time-step step. This is used internally by sub-classes to obtain copies of the internal recurrentModule. These copies share parameters and gradParameters but each have their own output, gradInput and any other intermediate states.

maskZero(nInputDim)

Decorates the internal recurrentModule with MaskZero. The output Tensor (or table thereof) of the recurrentModule will have each row (samples) zeroed when the commensurate row of the input is a tensor of zeros.

The nInputDim argument must specify the number of non-batch dims in the first Tensor of the input. In the case of an input table, the first Tensor is the first one encountered when doing a depth-first search.

Calling this method makes it possible to pad sequences with different lengths in the same batch with zero vectors. Warning: padding must come before any real data in the input sequence (padding after the real data is not supported and will yield unpredictable results without failing).

[output] updateOutput(input)

Forward propagates the input for the current step. The outputs or intermediate states of the previous steps are used recurrently. This is transparent to the caller as the previous outputs and intermediate states are memorized. This method also increments the step attribute by 1.

updateGradInput(input, gradOutput)

It is important to understand that the actual BPTT happens in the updateParameters, backwardThroughTime or backwardUpdateThroughTime methods. So this method just keeps a copy of the gradOutput. These are stored in a table in the order that they were provided.

accGradParameters(input, gradOutput, scale)

Again, the actual BPTT happens in the updateParameters, backwardThroughTime or backwardUpdateThroughTime methods. So this method just keeps a copy of the scales for later.

backwardThroughTime([step, rho])

This method calls updateGradInputThroughTime followed by accGradParametersThroughTime. This is where the actual BPTT happens.

Argument step specifies that the BPTT should only happen starting from time-step step (which defaults to self.step, i.e. the current time-step). Argument rho specifies for how many time-steps the BPTT should happen (which defaults to self.rho). For example, supposing we called updageOutput 5 times (so self.step=6), if we want to backpropagate through step 5 only, we can call :

rnn:backwardThroughTime(6, 1)

updateGradInputThroughTime([step, rho])

Iteratively calls updateGradInput for all time-steps in reverse order (from the end to the start of the sequence). Returns the gradInput of the first time-step.

See backwardThroughTime for an explanation of optional arguments step and rho.

accGradParametersThroughTime([step, rho])

Iteratively calls accGradParameters for all time-steps in reverse order (from the end to the start of the sequence).

See backwardThroughTime for an explanation of optional arguments step and rho.

accUpdateGradParametersThroughTime(learningRate)

Iteratively calls accUpdateGradParameters for all time-steps in reverse order (from the end to the start of the sequence).

backwardUpdateThroughTime(learningRate)

This method calls updateGradInputThroughTime and accUpdateGradParametersThroughTime(learningRate) and returns the gradInput of the first step.

updateParameters(learningRate)

Unless backwardThroughTime or accGradParameters where called since the last call to updateOutput, backwardUpdateThroughTime is called. Otherwise, it calls updateParameters on all encapsulated Modules.

recycle(offset)

This method goes hand in hand with forget. It is useful when the current time-step is greater than rho, at which point it starts recycling the oldest recurrentModule sharedClones, such that they can be reused for storing the next step. This offset is used for modules like nn.Recurrent that use a different module for the first step. Default offset is 0.

forget(offset)

This method brings back all states to the start of the sequence buffers, i.e. it forgets the current sequence. It also resets the step attribute to 1. It is highly recommended to call forget after each parameter update. Otherwise, the previous state will be used to activate the next, which will often lead to instability. This is caused by the previous state being the result of now changed parameters. It is also good practice to call forget at the start of each new sequence.

maxBPTTstep(rho)

This method sets the maximum number of time-steps for which to perform backpropagation through time (BPTT). So say you set this to rho = 3 time-steps, feed-forward for 4 steps, and then backpropgate, only the last 3 steps will be used for the backpropagation. If your AbstractRecurrent instance is wrapped by a Sequencer, this will be handled auto-magically by the Sequencer. Otherwise, setting this value to a large value (i.e. 9999999), is good for most, if not all, cases.

backwardOnline([online])

Call this method with online=true (the default) to make calls to backward (including updateGradInput and accGradParameters) perform backpropagation through time. This requires that calls to these backward methods be performed in the opposite order of the forward calls.

So for example, given the following data and rnn :

-- backpropagate every 5 time steps
rho = 5

-- generate some dummy inputs and gradOutputs sequences
inputs, gradOutputs = {}, {}
for step=1,rho do
   inputs[step] = torch.randn(3,10)
   gradOutputs[step] = torch.randn(3,10)
end

-- an AbstractRecurrent instance
rnn = nn.LSTM(10,10)

We could feed-forward and backpropagate through time like this :

for step=1,rho do
   rnn:forward(inputs[step])
   rnn:backward(inputs[step], gradOutputs[step])
end
rnn:backwardThroughTime()
rnn:updateParameters(0.1)
rnn:forget()

In the above example, each call to backward only saves the sequence of gradOutput Tensors. It is the call to backwardThroughTime() that actually does the backpropagation through time.

Alternatively, we could backpropagate through time online. To do so, we need to activate this feature by calling the backwardOnline method (once, at the start of training). Then we will make calls to backward in reverse order of the calls to forward. Each such call will backpropagate through a time-step, begining at the last time-step, ending at the first. So the above example can be implemented like this instead :

rnn:backwardOnline()
-- forward
for step=1,rho do
   rnn:forward(inputs[step])
end

-- backward (in reverse order of forward calls)
gradInputs = {}
for step=rho,1,-1 do
   gradInputs[step] = rnn:backward(inputs[step], gradOutputs[step])
end

rnn:updateParameters(0.1)
rnn:forget()

Also notice that backwardOnline makes the calls to backward generate a gradInput for every time-step. Whereas without this, these would only be made available via the rnn.gradInputs table after the call to backwardThroughTime().

training()

In training mode, the network remembers all previous rho (number of time-steps) states. This is necessary for BPTT.

evaluate()

During evaluation, since their is no need to perform BPTT at a later time, only the previous step is remembered. This is very efficient memory-wise, such that evaluation can be performed using potentially infinite-length sequence.

Recurrent

References :

A composite Module for implementing Recurrent Neural Networks (RNN), excluding the output layer.

The nn.Recurrent(start, input, feedback, [transfer, rho, merge]) constructor takes 5 arguments:

start : the size of the output (excluding the batch dimension), or a Module that will be inserted between the input Module and transfer module during the first step of the propagation. When start is a size (a number or torch.LongTensor), then this start Module will be initialized as nn.Add(start) (see Ref. A).
input : a Module that processes input Tensors (or Tables). Output must be of same size as start (or its output in the case of a start Module), and same size as the output of the feedback Module.
feedback : a Module that feedbacks the previous output Tensor (or Tables) up to the transfer Module.
transfer : a non-linear Module used to process the element-wise sum of the input and feedback module outputs, or in the case of the first step, the output of the start Module.
rho : the maximum amount of backpropagation steps to take back in time. Limits the number of previous steps kept in memory. Due to the vanishing gradients effect, references A and B recommend rho = 5 (or lower). Defaults to 5.
merge : a table Module that merges the outputs of the input and feedback Module before being forwarded through the transfer Module.

An RNN is used to process a sequence of inputs. Each step in the sequence should be propagated by its own forward (and backward), one input (and gradOutput) at a time. Each call to forward keeps a log of the intermediate states (the input and many Module.outputs) and increments the step attribute by 1. A call to backward doesn't result in a gradInput. It only keeps a log of the current gradOutput and scale. Back-Propagation Through Time (BPTT) is done when the updateParameters or backwardThroughTime method is called. The step attribute is only reset to 1 when a call to the forget method is made. In which case, the Module is ready to process the next sequence (or batch thereof). Note that the longer the sequence, the more memory will be required to store all the output and gradInput states (one for each time step).

To use this module with batches, we suggest using different sequences of the same size within a batch and calling updateParameters every rho steps and forget at the end of the Sequence.

Note that calling the evaluate method turns off long-term memory; the RNN will only remember the previous output. This allows the RNN to handle long sequences without allocating any additional memory.

Example :

require 'rnn'

batchSize = 8
rho = 5
hiddenSize = 10
nIndex = 10000
-- RNN
r = nn.Recurrent(
   hiddenSize, nn.LookupTable(nIndex, hiddenSize), 
   nn.Linear(hiddenSize, hiddenSize), nn.Sigmoid(), 
   rho
)

rnn = nn.Sequential()
rnn:add(r)
rnn:add(nn.Linear(hiddenSize, nIndex))
rnn:add(nn.LogSoftMax())

criterion = nn.ClassNLLCriterion()

-- dummy dataset (task is to predict next item, given previous)
sequence = torch.randperm(nIndex)

offsets = {}
for i=1,batchSize do
   table.insert(offsets, math.ceil(math.random()*batchSize))
end
offsets = torch.LongTensor(offsets)

lr = 0.1
updateInterval = 4
i = 1
while true do
   -- a batch of inputs
   local input = sequence:index(1, offsets)
   local output = rnn:forward(input)
   -- incement indices
   offsets:add(1)
   for j=1,batchSize do
      if offsets[j] > nIndex then
         offsets[j] = 1
      end
   end
   local target = sequence:index(1, offsets)
   local err = criterion:forward(output, target)
   print(err)
   local gradOutput = criterion:backward(output, target)
   -- the Recurrent layer is memorizing its gradOutputs (up to memSize)
   rnn:backward(input, gradOutput)
   
   i = i + 1
   -- note that updateInterval < rho
   if i % updateInterval == 0 then
      -- backpropagates through time (BPTT) :
      -- 1. backward through feedback and input layers,
      rnn:backwardThroughTime()
      -- 2. updates parameters
      rnn:updateParameters(lr)
      rnn:zeroGradParameters()
      -- 3. reset the internal time-step counter
      rnn:forget()
   end
end

Another option, is to perform the backpropagation through time using the normal Module interface. The only requirement is that you wrap your rnn into a Recursor and call backwardOnline() and then call backward in reverse order of the forward calls:

rnn = nn.Recursor(rnn, rho)
rnn:backwardOnline()

i=1
inputs, outputs, targets = {}, {}
while true do
   -- a batch of inputs
   local input = sequence:index(1, offsets)
   local output = rnn:forward(input)
   -- incement indices
   offsets:add(1)
   for j=1,batchSize do
      if offsets[j] > nIndex then
         offsets[j] = 1
      end
   end
   local target = sequence:index(1, offsets)
   local err = criterion:forward(output, target)
   
   -- save these for the BPTT
   table.insert(inputs, input)
   table.insert(outputs, output)
   table.insert(targets, target)
   
   i = i + 1
   -- note that updateInterval < rho
   if i % updateInterval == 0 then
      for step=updateInterval,1,-1 do
         local gradOutput = criterion:backward(outputs[step], targets[step])
         rnn:backward(inputs[step], gradOutput)
      end
      rnn:updateParameters(lr)
      rnn:zeroGradParameters()
      
      inputs, outputs, targets = {}, {}, {}
   end
end

This is basically what a Sequencer does internally.

Decorate it with a Sequencer

Note that any AbstractRecurrent instance can be decorated with a Sequencer such that an entire sequence (a table) can be presented with a single forward/backward call. This is actually the recommended approach as it allows RNNs to be stacked and makes the rnn conform to the Module interface, i.e. a forward, backward and updateParameters are all that is required ( Sequencer handles the backwardThroughTime internally ).

seq = nn.Sequencer(module)

The following example is similar to the previous one, except that

updateInterval=rho (a Sequencer constraint);
the mean of the previous rho errors err is printed every rho time-steps (instead of printing the err of every time-step); and
the model uses Sequencers to decorate each module such that rho=5 time-steps can be forward,backward and updated for each batch (i.e. training loop):

require 'rnn'

batchSize = 8
rho = 5
hiddenSize = 10
nIndex = 10000

mlp = nn.Sequential()
   :add(nn.Recurrent(
      hiddenSize, nn.LookupTable(nIndex, hiddenSize), 
      nn.Linear(hiddenSize, hiddenSize), nn.Sigmoid(), 
      rho
   )
   :add(nn.Linear(hiddenSize, nIndex))
   :add(nn.LogSoftMax())

rnn = nn.Sequencer(mlp)

criterion = nn.SequencerCriterion(nn.ClassNLLCriterion())

-- dummy dataset (task is to predict next item, given previous)
sequence = torch.randperm(nIndex)

offsets = {}
for i=1,batchSize do
   table.insert(offsets, math.ceil(math.random()*batchSize))
end
offsets = torch.LongTensor(offsets)

lr = 0.1
i = 1
while true do
   -- prepare inputs and targets
   local inputs, targets = {},{}
   for step=1,rho do
      -- a batch of inputs
      table.insert(inputs, sequence:index(1, offsets))
      -- incement indices
      offsets:add(1)
      for j=1,batchSize do
         if offsets[j] > nIndex then
            offsets[j] = 1
         end
      end
      -- a batch of targets
      table.insert(targets, sequence:index(1, offsets))
   end
   
   local outputs = rnn:forward(inputs)
   local err = criterion:forward(outputs, targets)
   print(i, err/rho)
   i = i + 1
   local gradOutputs = criterion:backward(outputs, targets)
   rnn:backward(inputs, gradOutputs)
   rnn:updateParameters(lr)
   rnn:zeroGradParameters()
end

You should only think about using the AbstractRecurrent modules without a Sequencer if you intend to use it for real-time prediction. Actually, you can even use an AbstractRecurrent instance decorated by a Sequencer for real time prediction by calling Sequencer:remember() and presenting each time-step input as {input}.

Other decorators can be used such as the Repeater or RecurrentAttention. The Sequencer is only the most common one.

LSTM

References :

This is an implementation of a vanilla Long-Short Term Memory module. We used Ref. A's LSTM as a blueprint for this module as it was the most concise. Yet it is also the vanilla LSTM described in Ref. C.

The nn.LSTM(inputSize, outputSize, [rho]) constructor takes 3 arguments:

inputSize : a number specifying the size of the input;
outputSize : a number specifying the size of the output;
rho : the maximum amount of backpropagation steps to take back in time. Limits the number of previous steps kept in memory. Defaults to 9999.

The actual implementation corresponds to the following algorithm:

i[t] = σ(W[x->i]x[t] + W[h->i]h[t−1] + W[c->i]c[t−1] + b[1->i])      (1)
f[t] = σ(W[x->f]x[t] + W[h->f]h[t−1] + W[c->f]c[t−1] + b[1->f])      (2)
z[t] = tanh(W[x->c]x[t] + W[h->c]h[t−1] + b[1->c])                   (3)
c[t] = f[t]c[t−1] + i[t]z(t)                                         (4)
o[t] = σ(W[x->o]x[t] + W[h->o]h[t−1] + W[c->o]c[t] + b[1->o])        (5)
h[t] = o[t]tanh(c[t])                                                (6)

where W[s->q] is the weight matrix from s to q, t indexes the time-step, b[1->q] are the biases leading into q, σ() is Sigmoid, x[t] is the input, i[t] is the input gate (eq. 1), f[t] is the forget gate (eq. 2), z[t] is the input to the cell (which we call the hidden) (eq. 3), c[t] is the cell (eq. 4), o[t] is the output gate (eq. 5), and h[t] is the output of this module (eq. 6). Also note that the weight matrices from cell to gate vectors are diagonal W[c->s], where s is i,f, or o.

As you can see, unlike Recurrent, this implementation isn't generic enough that it can take arbitrary component Module definitions at construction. However, the LSTM module can easily be adapted through inheritance by overriding the different factory methods :

buildGate : builds generic gate that is used to implement the input, forget and output gates;
buildInputGate : builds the input gate (eq. 1). Currently calls buildGate;
buildForgetGate : builds the forget gate (eq. 2). Currently calls buildGate;
buildHidden : builds the hidden (eq. 3);
buildCell : builds the cell (eq. 4);
buildOutputGate : builds the output gate (eq. 5). Currently calls buildGate;
buildModel : builds the actual LSTM model which is used internally (eq. 6).

Note that we recommend decorating the LSTM with a Sequencer (refer to this for details).

FastLSTM

A faster version of the LSTM. Basically, the input, forget and output gates, as well as the hidden state are computed at one fell swoop.

Recursor

This module recorates a module to be used within an AbstractSequencer instance. It does this by making the decorated module conform to the AbstractRecurrent interface, which like the LSTM and Recurrent classes, this class inherits.

rec = nn.Recursor(module[, rho])

For each successive call to updateOutput (i.e. forward), this decorator will create a stepClone() of the decorated module. So for each time-step, it clones the module. Both the clone and original share parameters and gradients w.r.t. parameters. However, for modules that already conform to the AbstractRecurrent interface, the clone and original module are one in the same (i.e. no clone).

Examples :

Let's assume I want to stack two LSTMs. I could use two sequencers :

lstm = nn.Sequential()
   :add(nn.Sequencer(nn.LSTM(100,100)))
   :add(nn.Sequencer(nn.LSTM(100,100)))

Using a Recursor, I make the same model with a single Sequencer :

lstm = nn.Sequencer(
   nn.Recursor(
      nn.Sequential()
         :add(nn.LSTM(100,100))
         :add(nn.LSTM(100,100))
      )
   )

Actually, the Sequencer will wrap any non-AbstractRecurrent module automatically, so I could simplify this further to :

lstm = nn.Sequencer(
   nn.Sequential()
      :add(nn.LSTM(100,100))
      :add(nn.LSTM(100,100))
   )

I can also add a Linear between the two LSTMs. In this case, a Linear will be cloned (and have its parameters shared) for each time-step, while the LSTMs will do whatever cloning internally :

lstm = nn.Sequencer(
   nn.Sequential()
      :add(nn.LSTM(100,100))
      :add(nn.Linear(100,100))
      :add(nn.LSTM(100,100))
   )

AbstractRecurrent instances like Recursor, Recurrent and LSTM are expcted to manage time-steps internally. Non-AbstractRecurrent instances can be wrapped by a Recursor to have the same behavior.

Every call to forward on an AbstractRecurrent instance like Recursor will increment the self.step attribute by 1, using a shared parameter clone for each successive time-step (for a maximum of rho time-steps, which defaults to 9999999). In this way, with the help of backwardOnline we can then call backward in reverse order of the forward calls to perform backpropagation through time (BPTT). Which is exactly what AbstractSequencer instances do internally. The backward call, which is actually divided into calls to updateGradInput and accGradParameters, decrements by 1 the self.udpateGradInputStep and self.accGradParametersStep respectively, starting at self.step. Successive calls to backward will decrement these counters and use them to backpropagate through the appropriate internall step-wise shared-parameter clones.

Anyway, in most cases, you will not have to deal with the Recursor object directly as AbstractSequencer instances automatically decorate non-AbstractRecurrent instances with a Recursor in their constructors.

Recurrence

A extremely general container for implementing pretty much any type of recurrence.

rnn = nn.Recurrence(recurrentModule, outputSize, nInputDim, [rho])

Unlike Recurrent, this module doesn't manage a separate modules like inputModule, startModule, mergeModule and the like. Instead, it only manages a single recurrentModule, which should output a Tensor or table : output(t) given an input table : {input(t), output(t-1)}. Using a mix of Recursor (say, via Sequencer) with Recurrence, one can implement pretty much any type of recurrent neural network, including LSTMs and RNNs.

For the first step, the Recurrence forwards a Tensor (or table thereof) of zeros through the recurrent layer (like LSTM, unlike Recurrent). So it needs to know the outputSize, which is either a number or torch.LongStorage, or table thereof. The batch dimension should be excluded from the outputSize. Instead, the size of the batch dimension (i.e. number of samples) will be extrapolated from the input using the nInputDim argument. For example, say that our input is a Tensor of size 4 x 3 where 4 is the number of samples, then nInputDim should be 1. As another example, if our input is a table of table [...] of tensors where the first tensor (depth first) is the same as in the previous example, then our nInputDim is also 1.

As an example, let's use Sequencer and Recurrence to build a Simple RNN for language modeling :

rho = 5
hiddenSize = 10
outputSize = 5 -- num classes
nIndex = 10000

-- recurrent module
rm = nn.Sequential()
   :add(nn.ParallelTable()
      :add(nn.LookupTable(nIndex, hiddenSize))
      :add(nn.Linear(hiddenSize, hiddenSize)))
   :add(nn.CAddTable())
   :add(nn.Sigmoid())

rnn = nn.Sequencer(
   nn.Sequential()
      :add(nn.Recurrence(rm, hiddenSize, 1))
      :add(nn.Linear(hiddenSize, outputSize))
      :add(nn.LogSoftMax())
)

Note : We could very well reimplement the LSTM module using the newer Recursor and Recurrent modules, but that would mean breaking backwards compatibility for existing models saved on disk.

AbstractSequencer

This abastract class implements a light interface shared by subclasses like : Sequencer, Repeater, RecurrentAttention, BiSequencer and so on.

Sequencer

The nn.Sequencer(module) constructor takes a single argument, module, which is the module to be applied from left to right, on each element of the input sequence.

seq = nn.Sequencer(module)

This Module is a kind of decorator used to abstract away the intricacies of AbstractRecurrent modules. While an AbstractRecurrent instance requires that a sequence to be presented one input at a time, each with its own call to forward (and backward), the Sequencer forwards an input sequence (a table) into an output sequence (a table of the same length). It also takes care of calling forget, backwardOnline and other such AbstractRecurrent-specific methods.

For example, rnn : an instance of nn.AbstractRecurrent, can forward an input sequence one forward at a time:

input = {torch.randn(3,4), torch.randn(3,4), torch.randn(3,4)}
rnn:forward(input[1])
rnn:forward(input[2])
rnn:forward(input[3])

Equivalently, we can use a Sequencer to forward the entire input sequence at once:

seq = nn.Sequencer(rnn)
seq:forward(input)

The Sequencer can also take non-recurrent Modules (i.e. non-AbstractRecurrent instances) and apply it to each input to produce an output table of the same length. This is especially useful for processing variable length sequences (tables).

Internally, the Sequencer expects the decorated module to be an AbstractRecurrent instance. When this is not the case, the module is automatically decorated with a Recursor module, which makes it conform to the AbstractRecurrent interface.

Note : this is due a recent update (27 Oct 2015), as before this AbstractRecurrent and and non-AbstractRecurrent instances needed to be decorated by their own Sequencer. The recent update, which introduced the Recursor decorator, allows a single Sequencer to wrap any type of module, AbstractRecurrent, non-AbstractRecurrent or a composite structure of both types. Nevertheless, existing code shouldn't be affected by the change.

remember([mode])

When mode='both' (the default), the Sequencer will not call forget at the start of each call to forward, which is the default behavior of the class. This behavior is only applicable to decorated AbstractRecurrent modules. Accepted values for argument mode are as follows :

'eval' only affects evaluation (recommended for RNNs)
'train' only affects training
'neither' affects neither training nor evaluation (default behavior of the class)
'both' affects both training and evaluation (recommended for LSTMs)

forget()

Calls the decorated AbstractRecurrent module's forget method.

BiSequencer

Applies encapsulated fwd and bwd rnns to an input sequence in forward and reverse order. It is used for implementing Bidirectional RNNs and LSTMs.

brnn = nn.BiSequencer(fwd, [bwd, merge])

The input to the module is a sequence (a table) of tensors and the output is a sequence (a table) of tensors of the same length. Applies a fwd rnn (an AbstractRecurrent instance to each element in the sequence in forward order and applies the bwd rnn in reverse order (from last element to first element). The bwd rnn defaults to:

bwd = fwd:clone()
bwd:reset()

For each step (in the original sequence), the outputs of both rnns are merged together using the merge module (defaults to nn.JoinTable(1,1)). If merge is a number, it specifies the JoinTable constructor's nInputDim argument. Such that the merge module is then initialized as :

merge = nn.JoinTable(1,merge)

Internally, the BiSequencer is implemented by decorating a structure of modules that makes use of 3 Sequencers for the forward, backward and merge modules.

Similarly to a Sequencer, the sequences in a batch must have the same size. But the sequence length of each batch can vary.

BiSequencerLM

Applies encapsulated fwd and bwd rnns to an input sequence in forward and reverse order. It is used for implementing Bidirectional RNNs and LSTMs for Language Models (LM).

brnn = nn.BiSequencerLM(fwd, [bwd, merge])

The input to the module is a sequence (a table) of tensors and the output is a sequence (a table) of tensors of the same length. Applies a fwd rnn (an AbstractRecurrent instance to the first N-1 elements in the sequence in forward order. Applies the bwd rnn in reverse order to the last N-1 elements (from second-to-last element to first element). This is the main difference of this module with the BiSequencer. The latter cannot be used for language modeling because the bwd rnn would be trained to predict the input it had just be fed as input.

The bwd rnn defaults to:

bwd = fwd:clone()
bwd:reset()

While the fwd rnn will output representations for the last N-1 steps, the bwd rnn will output representations for the first N-1 steps. The missing outputs for each rnn ( the first step for the fwd, the last step for the bwd) will be filled with zero Tensors of the same size the commensure rnn's outputs. This way they can be merged. If nn.JoinTable is used (the default), then the first and last output elements will be padded with zeros for the missing fwd and bwd rnn outputs, respectively.

For each step (in the original sequence), the outputs of both rnns are merged together using the merge module (defaults to nn.JoinTable(1,1)). If merge is a number, it specifies the JoinTable constructor's nInputDim argument. Such that the merge module is then initialized as :

merge = nn.JoinTable(1,merge)

Similarly to a Sequencer, the sequences in a batch must have the same size. But the sequence length of each batch can vary.

Note that LMs implemented with this module will not be classical LMs as they won't measure the probability of a word given the previous words. Instead, they measure the probabiliy of a word given the surrounding words, i.e. context. While for mathematical reasons you may not be able to use this to measure the probability of a sequence of words (like a sentence), you can still measure the pseudo-likeliness of such a sequence (see this for a discussion).

Repeater

This Module is a decorator similar to [Sequencer]. It differs in that the sequence length is fixed before hand and the input is repeatedly forwarded through the wrapped module to produce an output table of length nStep:

r = nn.Repeater(module, nStep)

Argument module should be an AbstractRecurrent instance. This is useful for implementing models like RCNNs, which are repeatedly presented with the same input.

RecurrentAttention

References :

This module can be used to implement the Recurrent Attention Model (RAM) presented in Ref. A :

ram = nn.RecurrentAttention(rnn, action, nStep, hiddenSize)

rnn is an AbstractRecurrent instance. Its input is {x, z} where x is the input to the ram and z is an action sampled from the action module. The output size of the rnn must be equal to hiddenSize.

action is a Module that uses a REINFORCE module (ref. B) like ReinforceNormal, ReinforceCategorical, or ReinforceBernoulli to sample actions given the previous time-step's output of the rnn. During the first time-step, the action module is fed with a Tensor of zeros of size input:size(1) x hiddenSize. It is important to understand that the sampled actions do not receive gradients backpropagated from the training criterion. Instead, a reward is broadcast from a Reward Criterion like VRClassReward Criterion to the action's REINFORCE module, which will backprogate graidents computed from the output samples and the reward. Therefore, the action module's outputs are only used internally, within the RecurrentAttention module.

nStep is the number of actions to sample, i.e. the number of elements in the output table.

hiddenSize is the output size of the rnn. This variable is necessary to generate the zero Tensor to sample an action for the first step (see above).

A complete implementation of Ref. A is available here.

MaskZero

This module zeroes the output rows of the decorated module for commensurate input rows which are tensors of zeros.

mz = nn.MaskZero(module, nInputDim)

The output Tensor (or table thereof) of the decorated module will have each row (samples) zeroed when the commensurate row of the input is a tensor of zeros.

The nInputDim argument must specify the number of non-batch dims in the first Tensor of the input. In the case of an input table, the first Tensor is the first one encountered when doing a depth-first search.

This decorator makes it possible to pad sequences with different lengths in the same batch with zero vectors.

LookupTableMaskZero

This module extends nn.LookupTable to support zero indexes. Zero indexes are forwarded as zero tensors.

lt = nn.LookupTableMaskZero(nIndex, nOutput)

The output Tensor will have each row zeroed when the commensurate row of the input is a zero index.

This lookup table makes it possible to pad sequences with different lengths in the same batch with zero vectors.

SequencerCriterion

This Criterion is a decorator:

c = nn.SequencerCriterion(criterion)

Both the input and target are expected to be a sequence (a table). For each step in the sequence, the corresponding elements of the input and target tables will be applied to the criterion. The output of forward is the sum of all individual losses in the sequence. This is useful when used in conjuction with a Sequencer.

WARNING : assumes that the decorated criterion is stateless, i.e. a backward shouldn't need to be preceded by a commensurate forward.

RepeaterCriterion

This Criterion is a decorator:

c = nn.RepeaterCriterion(criterion)

The input is expected to be a sequence (a table). A single target is repeatedly applied using the same criterion to each element in the input sequence. The output of forward is the sum of all individual losses in the sequence. This is useful for implementing models like RCNNs, which are repeatedly presented with the same target.

Name		Name	Last commit message	Last commit date
Latest commit History 190 Commits
doc/image		doc/image
examples		examples
rocks		rocks
scripts		scripts
test		test
AbstractRecurrent.lua		AbstractRecurrent.lua
AbstractSequencer.lua		AbstractSequencer.lua
BiSequencer.lua		BiSequencer.lua
BiSequencerLM.lua		BiSequencerLM.lua
CMakeLists.txt		CMakeLists.txt
FastLSTM.lua		FastLSTM.lua
LICENSE.txt		LICENSE.txt
LSTM.lua		LSTM.lua
LinearNoBias.lua		LinearNoBias.lua
LookupTableMaskZero.lua		LookupTableMaskZero.lua
MaskZero.lua		MaskZero.lua
Module.lua		Module.lua
Padding.lua		Padding.lua
README.md		README.md
Recurrence.lua		Recurrence.lua
Recurrent.lua		Recurrent.lua
RecurrentAttention.lua		RecurrentAttention.lua
Recursor.lua		Recursor.lua
Repeater.lua		Repeater.lua
RepeaterCriterion.lua		RepeaterCriterion.lua
Sequencer.lua		Sequencer.lua
SequencerCriterion.lua		SequencerCriterion.lua
ZeroGrad.lua		ZeroGrad.lua
init.lua		init.lua
recursiveUtils.lua		recursiveUtils.lua

License

douwekiela/rnn

Folders and files

Latest commit

History

Repository files navigation

rnn: recurrent neural networks

Examples

AbstractRecurrent

[recurrentModule] getStepModule(step)

maskZero(nInputDim)

[output] updateOutput(input)

updateGradInput(input, gradOutput)

accGradParameters(input, gradOutput, scale)

backwardThroughTime([step, rho])

updateGradInputThroughTime([step, rho])

accGradParametersThroughTime([step, rho])

accUpdateGradParametersThroughTime(learningRate)

backwardUpdateThroughTime(learningRate)

updateParameters(learningRate)

recycle(offset)

forget(offset)

maxBPTTstep(rho)

backwardOnline([online])

training()

evaluate()

Recurrent

Decorate it with a Sequencer

LSTM

FastLSTM

Recursor

Recurrence

AbstractSequencer

Sequencer

remember([mode])

forget()

BiSequencer

BiSequencerLM

Repeater

RecurrentAttention

MaskZero

LookupTableMaskZero

SequencerCriterion

RepeaterCriterion

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages