This repo is an implementation of an Attentive RNN model for the task of language modelling.
Language modelling is done on both the PennTreeBank and Wikitext-02 datasets. The files are parsed such that each training example consists of one sentence from the corpus, padded to a max batch length of 35. Longer sentences are clipped. This is done in order to manage the attention and attend to only words in the sentence (before timestep t if at timestep t).
The A-RNN-LM (Attention based Recurrent Neural Network for Language Modelling) was originally proposed in Coherent Dialogue with Attention-based Language Models (Hongyuan Mei et al. 2016, link), and in Attentive Language Models (Salton et al. 2017, link).
The model consists of running a traditional attention mechanism on the previous hidden states of the encoder RNN layer(s) to encode a context vector which is then combined with the last encoded hidden state in order to predict the next word in the sequence.
Dependencies:
python=3.7
torch>=1.0.0
nltk
matplotlib
tensorboardX
Install all depedencies and run python main.py
.
The datasets will be downloaded and pre-processed automatically.
Multiple options for running are possible run python main.py --help
for full list.
usage: main.py [-h] [--batch-size N] [--epochs N] [--lr LR] [--patience P]
[--seed S] [--log-interval N] [--dataset [{wiki-02,ptb}]]
[--embedding-size N] [--n-layers N] [--hidden-size N]
[--positioning-embedding N] [--input-dropout D]
[--rnn-dropout D] [--decoder-dropout D] [--clip N]
[--optim [{sgd,adam,asgd}]] [--salton-lr-schedule]
[--early-stopping-patience P] [--attention]
[--no-positional-attention] [--tie-weights]
[--file-name FILE_NAME] [--parallel]
PyTorch Attentive RNN Language Modeling
optional arguments:
-h, --help show this help message and exit
--batch-size N input batch size for training (default: 64)
--epochs N number of epochs to train (default: 40)
--lr LR learning rate (default: 30.0)
--patience P patience for lr decrease (default: 5)
--seed S random seed (default: 123)
--log-interval N how many batches to wait before logging training
status (default 10)
--dataset [{wiki-02,ptb}]
Select which dataset (default: ptb)
--embedding-size N embedding size for embedding layer (default: 20)
--n-layers N layer size for RNN encoder (default: 1)
--hidden-size N hidden size for RNN encoder (default: 20)
--positioning-embedding N
hidden size for positioning generator (default: 20)
--input-dropout D input dropout (default: 0.5)
--rnn-dropout D rnn dropout (default: 0.0)
--decoder-dropout D decoder dropout (default: 0.5)
--clip N value at which to clip the norm of gradients (default:
0.25)
--optim [{sgd,adam,asgd}]
Select which optimizer (default: sgd)
--salton-lr-schedule Enables same training schedule as Salton et al. 2017
(default: False)
--early-stopping-patience P
early stopping patience (default: 25)
--attention Enable standard attention (default: False)
--no-positional-attention
Disable positional attention (default: False)
--tie-weights Tie embedding and decoder weights (default: False)
--file-name FILE_NAME
Specific filename to save under (default: uses params
to generate)
--parallel Enable using GPUs in parallel (default: False)
Model | Number of parameters | Validation Perplexity | Test Perplexity |
---|---|---|---|
LSTM Baseline (Merity et al., 2017) | 7.86M | 66.77 | 64.96 |
Attentive LM (Salton et al. 2017) | 7.06M | 79.09 | 76.56 |
Positional Attentive LM | 6.9M | 72.69 | 70.92 |
Model | Number of parameters | Validation Perplexity | Test Perplexity |
---|---|---|---|
LSTM Baseline (Merity et al., 2017) | 7.86M | 72.43 | 68.50 |
Attentive LM (Salton et al. 2017) | 7.06M | 78.43 | 74.37 |
Positional Attentive LM | 6.9M | 74.39 | 70.73 |
You can rerun all the models which generated the tables above by simply running:
python test.py
However please note that some of these models take upwards of 8hours to converge on a single 1080 GPU, so the total run-time of the experiment could be approximately 2 days.
Multi-GPU support is disabled by default as it was shown to have a negative impact on results. On top of that since batches are small in practice it is not actually much faster since a lot of time is spent sending the tensors to the respective GPUs.
Here are shown side by side comparaisons of the two attention distributions on an example:
The words in the X-axis are the inputs at each time step and the words in the Y-axis are the targets. Both models were trained on the Wikitext-02 dataset until convergence.