This repository provides a toolkit for abstractive summarization. It helps researchers implement the common baseline, the attention-based sequence-to-sequence model, as well as three models recently proposed by our group, LancoPKU. These models achieve improved performance and generate summaries of higher quality. By modifying the '.yaml' configuration file or the command-line options, one can easily apply the models to one's own work. The models and their corresponding papers are listed as follows:
- Global Encoding for Abstractive Summarization [pdf]
- Word Embedding Attention Network (WEAN) [pdf]
- SuperAE [pdf]
- Ubuntu 16.04
- Python 3.5
- PyTorch 0.3.1
- pyrouge
- matplotlib (for the visualization of attention heatmaps)
- Tensorflow (>=1.5.0) and TensorboardX (for data visualization on Tensorboard)
Install PyTorch
Clone the LancoSum repository:
git clone https://github.com/lancopku/LancoSum.git
cd LancoSum
In order to use pyrouge, set the ROUGE path with the lines below:
pip install pyrouge
pyrouge_set_rouge_path script/RELEASE-1.5.5
python3 preprocess.py -load_data path_to_data -save_data path_to_store_data
Remember to put the data (plain text files) into a folder, name them train.src, train.tgt, valid.src, valid.tgt, test.src, and test.tgt, and create a new folder named data inside it.
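As a sanity check, here is a minimal sketch (this helper is ours, not part of the toolkit) that verifies a folder matches this layout before preprocessing:

```python
# Illustrative helper (not part of LancoSum): check that the raw-data
# folder contains the six files preprocess.py expects, and create the
# inner "data" folder.
import os

EXPECTED = ["{}.{}".format(split, ext)
            for split in ("train", "valid", "test")
            for ext in ("src", "tgt")]

def check_data_dir(path_to_data):
    missing = [name for name in EXPECTED
               if not os.path.isfile(os.path.join(path_to_data, name))]
    if missing:
        raise FileNotFoundError("missing data files: {}".format(missing))
    os.makedirs(os.path.join(path_to_data, "data"), exist_ok=True)
```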
python3 train.py -log log_name -config config_yaml -gpus id
python3 train.py -log log_name -config config_yaml -gpus id -restore checkpoint -mode eval
The conventional attention-based seq2seq model for abstractive summarization suffers from repetition and semantic irrelevance. We therefore propose a model containing a convolutional neural network (CNN) that filters the encoder outputs so that they carry information about the global context. A self-attention mechanism is applied as well to mine the correlations among these new representations of the encoder outputs.
python3 train.py -log log_name -config config_yaml -gpus id -swish -selfatt
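For intuition, a minimal PyTorch sketch of the idea (module and variable names are ours; see the paper and the repository source for the actual implementation):

```python
# Illustrative sketch of global encoding (our simplification, not the
# repository's code): a CNN produces a gate that filters the encoder
# outputs, and self-attention relates the gated representations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalEncoder(nn.Module):
    def __init__(self, hidden_size):
        super(GlobalEncoder, self).__init__()
        # 1-D convolution over time yields one gate vector per position.
        self.conv = nn.Conv1d(hidden_size, hidden_size,
                              kernel_size=3, padding=1)

    def forward(self, enc_outputs):              # (batch, time, hidden)
        g = self.conv(enc_outputs.transpose(1, 2)).transpose(1, 2)
        gated = enc_outputs * torch.sigmoid(g)   # convolutional gating
        # Scaled dot-product self-attention over the gated outputs.
        scores = torch.bmm(gated, gated.transpose(1, 2))
        scores = scores / (gated.size(-1) ** 0.5)
        attn = F.softmax(scores, dim=-1)
        return torch.bmm(attn, gated)            # globally encoded outputs
```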
In the decoding process, conventional seq2seq models typically use a dense vector at each time step to generate a distribution over the vocabulary and choose the correct output word. However, such a method takes no account of the relationships between words in the vocabulary, and it also requires a large number of parameters (hidden_size * vocab_size). In this model, we instead use a query system: the decoder output is a query, the candidate words are the values, and the corresponding word representations are the keys. By referring to the word embeddings, our model is able to capture the semantic meanings of words.
python3 train.py -log log_name -config config_yaml -gpus id -score_fn function_name('general', 'dot', 'concat')
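For intuition, a minimal sketch of the 'dot' scoring variant (our simplification; 'general' would insert a learned matrix between query and keys, and 'concat' would score with an MLP over their concatenation):

```python
# Illustrative sketch of the query mechanism (not the repository's code):
# the decoder state is projected into embedding space and scored against
# every word embedding, replacing the usual hidden_size * vocab_size
# output layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingQueryOutput(nn.Module):
    def __init__(self, hidden_size, vocab_size, embed_dim):
        super(EmbeddingQueryOutput, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # keys
        self.proj = nn.Linear(hidden_size, embed_dim)         # query map

    def forward(self, decoder_state):             # (batch, hidden)
        query = self.proj(decoder_state)          # (batch, embed_dim)
        # 'dot' score: match the query against all word embeddings.
        scores = torch.matmul(query, self.embedding.weight.t())
        return F.log_softmax(scores, dim=-1)      # (batch, vocab)
```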
Text from social media is generally long and noisy, and a conventional seq2seq model fails to compress a long sentence into an accurate representation. We therefore use the representation of the summary (which is shorter and easier to encode) to supervise the encoder during training, helping it generate better semantic representations of the source content. Moreover, ideas from adversarial networks are used to dynamically determine the strength of this supervision.
python3 train.py -log log_name -config config_yaml -gpus id -sae -loss_reg ('l2', 'l1', 'cos')
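For intuition, a minimal sketch of the supervision term (our simplification; lambda_weight stands in for the adversarially determined strength):

```python
# Illustrative sketch (not the repository's code): pull the source
# encoding toward the summary autoencoder's representation, with the
# distance chosen by loss_reg ('l2', 'l1', or 'cos').
import torch
import torch.nn.functional as F

def supervision_loss(src_repr, summary_repr, loss_reg="l2"):
    if loss_reg == "l2":
        return F.mse_loss(src_repr, summary_repr)
    if loss_reg == "l1":
        return F.l1_loss(src_repr, summary_repr)
    if loss_reg == "cos":
        return (1 - F.cosine_similarity(src_repr, summary_repr, dim=-1)).mean()
    raise ValueError("unknown loss_reg: " + loss_reg)

# total = ce_loss + lambda_weight * supervision_loss(src_repr, sum_repr, "l2")
```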
Please cite these papers when using the relevant models in your research.
@inproceedings{globalencoding,
title = {Global Encoding for Abstractive Summarization},
author = {Junyang Lin and Xu Sun and Shuming Ma and Qi Su},
booktitle = {{ACL} 2018},
year = {2018}
}
@inproceedings{wean,
author = {Shuming Ma and Xu Sun and Wei Li and Sujian Li and Wenjie Li and Xuancheng Ren},
title = {Query and Output: Generating Words by Querying Distributed Word
Representations for Paraphrase Generation},
booktitle = {{NAACL} {HLT} 2018, The 2018 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies},
year = {2018}
}
@inproceedings{Ma2018superAE,
title = {Autoencoder as Assistant Supervisor: Improving Text Representation for Chinese Social Media Text Summarization},
author = {Shuming Ma and Xu Sun and Junyang Lin and Houfeng Wang},
booktitle = {{ACL} 2018},
year = {2018}
}