LancoPKU Summarization

This repository provides a toolkit for abstractive summarization. It helps researchers implement the common baseline, the attention-based sequence-to-sequence model, as well as three models recently proposed by our group, LancoPKU. These models achieve improved performance and generate summaries of higher quality. By modifying the .yaml configuration file or the command-line options, one can easily apply the models to one's own work. The models and their corresponding papers are listed as follows:

  1. Global Encoding for Abstractive Summarization [pdf]
  2. Word Embedding Attention Network (WEAN) [pdf]
  3. SuperAE [pdf]


1 How to Use

--- 1.1 Requirements

  • Ubuntu 16.04
  • Python 3.5
  • PyTorch 0.3.1
  • pyrouge
  • matplotlib (for the visualization of attention heatmaps)
  • TensorFlow (>=1.5.0) and tensorboardX (for data visualization on TensorBoard)

--- 1.2 Configuration

Install PyTorch

Clone the LancoSum repository:

git clone https://github.com/lancopku/LancoSum.git
cd LancoSum

To use pyrouge, install it and set the ROUGE path with the lines below:

pip install pyrouge
pyrouge_set_rouge_path script/RELEASE-1.5.5
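
As a quick sanity check, pyrouge ships with a self-test that should pass once the path is set (this command comes from pyrouge itself, not from this repository):

python -m pyrouge.test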

--- 1.3 Preprocessing

python3 preprocess.py -load_data path_to_data -save_data path_to_store_data

Remember to put the data (plain text files) into a folder, name the files train.src, train.tgt, valid.src, valid.tgt, test.src, and test.tgt, and create a new folder named data inside it, as in the sketch below.
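
One layout consistent with the note above (path_to_data is a placeholder, passed via -load_data):

path_to_data/
├── train.src
├── train.tgt
├── valid.src
├── valid.tgt
├── test.src
├── test.tgt
└── data/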


--- 1.4 Training

python3 train.py -log log_name -config config_yaml -gpus id
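
For instance, with hypothetical file names (config.yaml, the log name, and the GPU id below are placeholders):

python3 train.py -log summ_log -config config.yaml -gpus 0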

--- 1.5 Evaluation

python3 train.py -log log_name -config config_yaml -gpus id -restore checkpoint -mode eval
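
For instance, restoring a saved checkpoint (all file names below are placeholders):

python3 train.py -log summ_log -config config.yaml -gpus 0 -restore checkpoint.pt -mode eval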

2 Introduction to Models

--- 2.1 Global Encoding

Motivation & Idea

Conventional attention-based seq2seq models for abstractive summarization suffer from repetition and semantic irrelevance. We therefore propose a model containing a convolutional neural network (CNN) that filters the encoder outputs so that they carry information about the global context. A self-attention mechanism is implemented as well, in order to dig out the correlations among these new representations of the encoder outputs.

Options
python3 train.py -log log_name -config config_yaml -gpus id -swish -selfatt
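
The gating idea can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed shapes, not the repository's implementation; the class name, kernel size, and attention scaling are ours:

import torch
import torch.nn as nn

class GlobalEncodingGate(nn.Module):
    def __init__(self, hidden_size, kernel_size=3):
        super().__init__()
        # CNN that filters encoder outputs with local context
        self.conv = nn.Conv1d(hidden_size, hidden_size,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, enc_outputs):
        # enc_outputs: (batch, seq_len, hidden)
        g = self.conv(enc_outputs.transpose(1, 2)).transpose(1, 2)
        # self-attention digs out correlations among the filtered features
        attn = torch.softmax(g @ g.transpose(1, 2) / g.size(-1) ** 0.5, dim=-1)
        g = attn @ g
        # gate the original encoder outputs with global-context information
        return enc_outputs * torch.sigmoid(g)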

--- 2.2 WEAN

Motivation & Idea

In the decoding process, conventional seq2seq models typically use a dense vector at each time step to generate a distribution over the vocabulary and select the output word. However, this method takes no account of the relationships between words in the vocabulary, and it requires a large number of parameters (hidden_size * vocab_size). In this model, we instead use a query system: the decoder output is a query, the candidate words are the values, and the corresponding word representations are the keys. By referring to the word embeddings, our model is able to capture the semantic meaning of the words.

Options
python3 train.py -log log_name -config config_yaml -gpus id -score_fn function_name('general', 'dot', 'concat')
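
The three score functions named by -score_fn can be sketched as follows, assuming the query is the decoder state and the keys are the vocabulary's word embeddings (class name and shapes are illustrative, not the repository's):

import torch
import torch.nn as nn

class WordScorer(nn.Module):
    def __init__(self, hidden_size, embed_size, score_fn='general'):
        super().__init__()
        self.score_fn = score_fn
        if score_fn == 'general':
            self.W = nn.Linear(hidden_size, embed_size, bias=False)
        elif score_fn == 'concat':
            self.W = nn.Linear(hidden_size + embed_size, 1)

    def forward(self, query, keys):
        # query: decoder state (batch, hidden); keys: embeddings (vocab, embed)
        if self.score_fn == 'dot':
            return query @ keys.t()  # requires hidden == embed
        if self.score_fn == 'general':
            return self.W(query) @ keys.t()
        # 'concat': score each (query, key) pair with a learned layer
        b, v = query.size(0), keys.size(0)
        q = query.unsqueeze(1).expand(b, v, -1)
        k = keys.unsqueeze(0).expand(b, v, -1)
        return self.W(torch.cat([q, k], dim=-1)).squeeze(-1)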

--- 2.3 SuperAE

Motivation & Idea

Text from social media is generally long and contains many errors, and a conventional seq2seq model fails to compress a long sentence into an accurate representation. We therefore use the representation of the summary (which is shorter and easier to encode) to supervise the encoder during training, helping it generate better semantic representations of the source content. Moreover, ideas from adversarial networks are used to dynamically determine the strength of this supervision.

Options
python3 train.py -log log_name -config config_yaml -gpus id -sae -loss_reg ('l2', 'l1', 'cos')
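
The supervision term selected by -loss_reg can be sketched as below, assuming src_repr is the seq2seq encoder's representation of the source and sum_repr is the autoencoder's representation of the gold summary (function and variable names are ours, not the repository's):

import torch.nn.functional as F

def supervision_loss(src_repr, sum_repr, loss_reg='l2'):
    # penalize the distance between the two representations
    if loss_reg == 'l2':
        return F.mse_loss(src_repr, sum_repr)
    if loss_reg == 'l1':
        return F.l1_loss(src_repr, sum_repr)
    # 'cos': push the representations toward cosine similarity 1
    return (1 - F.cosine_similarity(src_repr, sum_repr, dim=-1)).mean()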

3 Citation

Please cite these papers when using the relevant models in your research.

Global Encoding:

@inproceedings{globalencoding,
  title     = {Global Encoding for Abstractive Summarization},
  author    = {Junyang Lin and Xu Sun and Shuming Ma and Qi Su},
  booktitle = {{ACL} 2018},
  year      = {2018}
}

WEAN:

@inproceedings{wean,
  author    = {Shuming Ma and Xu Sun and Wei Li and Sujian Li and Wenjie Li and Xuancheng Ren},
  title     = {Query and Output: Generating Words by Querying Distributed Word
	       Representations for Paraphrase Generation},
  booktitle = {{NAACL} {HLT} 2018, The 2018 Conference of the North American Chapter
	       of the Association for Computational Linguistics: Human Language Technologies},
  year      = {2018}
}

SuperAE:

@inproceedings{Ma2016superAE,
  title   = {Autoencoder as Assistant Supervisor: Improving Text Representation for Chinese Social Media Text Summarization},
  author  = {Shuming Ma and Xu Sun and Junyang Lin and Houfeng Wang},
  booktitle = {{ACL} 2018},
  year      = {2018}
}