
Abstractive Summarization Results #340

Closed
mataney opened this issue Oct 15, 2017 · 82 comments


mataney commented Oct 15, 2017

Hey guys, looking at recent pull requests and issues, it looks like a common interest of contributors (on top of NMT, obviously) is abstractive summarization.

Any suggestions on how to train a model that will get close to the results of recent papers on the CNN-Daily Mail dataset? Any additional preprocessing?

Thanks!

@mataney mataney changed the title Summarization Abstractive Summarization Results Oct 15, 2017

srush commented Nov 5, 2017

Hey, so we are getting close to these results, but still a little bit below.

Summarization Experiment Description

This document describes how to replicate summarization experiments on the CNNDM and Gigaword datasets using OpenNMT-py.
In the following, we assume access to a tokenized form of the corpus split into train/valid/test sets.

An example article-title pair from Gigaword should look like this:

Input
australia 's current account deficit shrunk by a record #.## billion dollars -lrb- #.## billion us -rrb- in the june quarter due to soaring commodity prices , figures released monday showed .

Output
australian current account deficit narrows sharply

Preprocessing the data

Since we are using copy-attention [1] in the model, we need to preprocess the dataset such that source and target are aligned and use the same dictionary. This is achieved by using the options dynamic_dict and share_vocab.
We additionally turn off the default truncation of the source, so that inputs longer than 50 words are not cut off.
For CNNDM we follow See et al. [2] and additionally truncate the source length at 400 tokens and the target at 100.
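To make the alignment idea concrete, here is a rough sketch of what a per-example "dynamic dictionary" does; this is illustrative Python, not OpenNMT-py's implementation.

def build_dynamic_dict(src_tokens, shared_vocab):
    # Map each source token to an id, extending the shared vocab per example so the
    # decoder can point at source words even if they are out of vocabulary.
    ext_vocab = dict(shared_vocab)           # copy of the shared word -> id table
    src_map = []
    for tok in src_tokens:
        if tok not in ext_vocab:
            ext_vocab[tok] = len(ext_vocab)  # OOV source word gets a temporary id
        src_map.append(ext_vocab[tok])
    return src_map, ext_vocab

shared_vocab = {"<unk>": 0, "the": 1, "deficit": 2}
src = "australia 's current account deficit shrunk".split()
src_map, ext_vocab = build_dynamic_dict(src, shared_vocab)
# Target tokens are looked up in the same extended table, so a copied word and a
# generated word resolve to the same id -- which is why source and target must share one dictionary.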

command used:

(1) CNNDM

python preprocess.py -train_src data/cnndm/train.txt.src -train_tgt data/cnn-no-sent-tag/train.txt.tgt -valid_src data/cnndm/val.txt.src -valid_tgt data/cnn-no-sent-tag/val.txt.tgt -save_data data/cnn-no-sent-tag/cnndm -src_seq_length 10000 -tgt_seq_length 10000 -src_seq_length_trunc 400 -tgt_seq_length_trunc 100 -dynamic_dict -share_vocab

(2) Gigaword

python preprocess.py -train_src data/giga/train.article.txt -train_tgt data/giga/train.title.txt -valid_src data/giga/valid.article.txt -valid_tgt data/giga/valid.title.txt -save_data data/giga/giga -src_seq_length 10000 -dynamic_dict -share_vocab

Training

The training procedure described in this section for the most part follows the parameter choices and implementation of See et al. [2]. As mentioned above, we use copy attention as a mechanism for the model to decide whether to generate a new word or to copy it from the source (copy_attn).
A notable difference from See's model is that we are using the attention mechanism introduced by Bahdanau et al. [3] (global_attention mlp) instead of that by Luong et al. [4] (global_attention dot). Both options typically perform very similarly, with Luong attention often having a slight advantage.
We use 128-dimensional word embeddings and a 512-dimensional one-layer LSTM. On the encoder side, we use a bidirectional LSTM (brnn), which means that the 512 dimensions are split into 256 dimensions per direction.
We also share the word embeddings between encoder and decoder (share_embeddings). This option drastically reduces the number of parameters the model has to learn. However, we found that a model without this option suffers only a minimal impact on performance.
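For reference, a minimal sketch of the two score functions, with illustrative shapes and parameter names rather than OpenNMT-py's actual GlobalAttention module:

import torch

# h_t: decoder state (batch, dim); H_s: encoder states (batch, src_len, dim).

def dot_score(h_t, H_s):
    # Luong "dot": score(h_t, h_s) = h_t . h_s
    return torch.bmm(H_s, h_t.unsqueeze(2)).squeeze(2)      # (batch, src_len)

def mlp_score(h_t, H_s, W_t, W_s, v):
    # Bahdanau "mlp": score(h_t, h_s) = v^T tanh(W_t h_t + W_s h_s)
    q = W_t(h_t).unsqueeze(1) + W_s(H_s)                    # (batch, src_len, dim)
    return torch.tanh(q).matmul(v)                          # (batch, src_len)

dim, batch, src_len = 512, 2, 7
W_t = torch.nn.Linear(dim, dim, bias=False)
W_s = torch.nn.Linear(dim, dim, bias=False)
v = torch.randn(dim)
h_t, H_s = torch.randn(batch, dim), torch.randn(batch, src_len, dim)
align = torch.softmax(mlp_score(h_t, H_s, W_t, W_s, v), dim=-1)   # attention weights over the source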

For the training procedure, we are using SGD with an initial learning rate of 1 for a total of 16 epochs. In most cases, the lowest validation perplexity is achieved around epoch 10-12. We also use OpenNMT's default learning rate decay, which halves the learning rate after every epoch once the validation perplexity has increased (or after epoch 8).
Alternative training procedures such as Adam with an initial learning rate of 0.001 converge faster than SGD, but achieve slightly worse final results. We additionally set the maximum norm of the gradient to 2, and renormalize if the gradient norm exceeds this value.
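As an illustration of that schedule and the gradient renormalization, here is a minimal PyTorch sketch; the function and variable names are assumptions, not OpenNMT-py's actual Optim code.

import torch

def maybe_decay_lr(lr, epoch, val_ppl, prev_val_ppl, start_decay_at=8, decay=0.5):
    # Halve the learning rate once validation perplexity goes up, or after epoch 8.
    if epoch > start_decay_at or (prev_val_ppl is not None and val_ppl > prev_val_ppl):
        lr *= decay
    return lr

# Before each parameter update, renormalize gradients whose norm exceeds 2:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2)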

commands used:

(1) CNNDM

python train.py -save_model logs/notag_sgd3 -data data/cnn-no-sent-tag/CNNDM -copy_attn -global_attention mlp -word_vec_size 128 -rnn_size 256 -layers 1 -brnn -epochs 16 -seed 777 -batch_size 32 -max_grad_norm 2 -share_embeddings -gpuid 0 -start_checkpoint_at 9

(2) Gigaword

python train.py -save_model logs/giga_sgd3_512 -data data/giga/giga -copy_attn -global_attention mlp -word_vec_size 128 -rnn_size 512 -layers 1 -brnn -epochs 16 -seed 777 -batch_size 32 -max_grad_norm 2 -share_embeddings -gpuid 0 -start_checkpoint_at 9

Inference

During inference, we use beam-search with a beam-size of 10.
We additionally use the replace_unk option, which replaces generated <UNK> tokens with the source token that received the highest attention. This acts as a safety net in case the copy attention, which should learn to copy such words, fails.
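For intuition, a minimal sketch of what replace_unk does conceptually; this is illustrative only, translate.py handles it internally.

def replace_unk(pred_tokens, src_tokens, attn):
    # attn: one list of attention weights over src_tokens per decoded token
    out = []
    for t, tok in enumerate(pred_tokens):
        if tok == "<unk>":
            best_src = max(range(len(src_tokens)), key=lambda j: attn[t][j])
            tok = src_tokens[best_src]       # copy the most-attended source word
        out.append(tok)
    return out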

commands used:

(1) CNNDM

python translate.py -gpu 2 -batch_size 1 -model logs/notag_try3_acc_49.29_ppl_14.62_e16.pt -src data/cnndm/test.txt.src -output sgd3_out.txt -beam_size 10 -replace_unk

(2) Gigaword

python translate.py -gpu 2 -batch_size 1 -model logs/giga_sgd3_512_acc_51.10_ppl_12.04_e16.pt -src data/giga/test.article.txt -output giga_sgd3.out.txt -beam_size 10 -replace_unk

Evaluation

CNNDM

To evaluate the ROUGE scores on CNNDM, we extended the pyrouge wrapper with additional evaluations such as the number of repeated n-grams (a failure mode typically found in models with copy attention), found here.

It can be run with the following command:

python baseline.py -s sgd3_out.txt -t ~/datasets/cnn-dailymail/sent-tagged/test.txt.tgt -m no_sent_tag -r

Note that the no_sent_tag option strips tags around sentences - when a sentence previously was <s> w w w w . </s>, it becomes w w w w ..
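For illustration, a small sketch of the two checks mentioned above, i.e. stripping the sentence tags and detecting repeated n-grams; this is not the actual baseline.py code.

from collections import Counter

def strip_sent_tags(line):
    return " ".join(tok for tok in line.split() if tok not in ("<s>", "</s>"))

def has_repeated_ngram(tokens, n=3):
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return any(c > 1 for c in counts.values())

line = "<s> the cat sat . </s> <s> the cat sat . </s>"
clean = strip_sent_tags(line)                # "the cat sat . the cat sat ."
print(has_repeated_ngram(clean.split()))     # True -- a sign of copy-attention looping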

Gigaword

For evaluation of large test sets such as Gigaword, we use a parallel Python wrapper around ROUGE, found here.

command used:
files2rouge giga_sgd3.out.txt test.title.txt --verbose

Running the commands above should yield the following scores:

ROUGE-1 (F): 0.352127
ROUGE-2 (F): 0.173109
ROUGE-3 (F): 0.098244
ROUGE-L (F): 0.327742
ROUGE-S4 (F): 0.155524

References

[1] Vinyals, O., Fortunato, M. and Jaitly, N., 2015. Pointer Networks. NIPS.

[2] See, A., Liu, P.J. and Manning, C.D., 2017. Get To The Point: Summarization with Pointer-Generator Networks. ACL.

[3] Bahdanau, D., Cho, K. and Bengio, Y., 2014. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR.

[4] Luong, M.T., Pham, H. and Manning, C.D., 2015. Effective Approaches to Attention-based Neural Machine Translation. EMNLP.


mataney commented Nov 8, 2017

This is massive!
Absolutely massive! Thank you very much.

By the way, I found using See's tokenized dataset (can be downloaded here) to work better.

What data do you pass to preprocess.py?


srush commented Nov 9, 2017

Cool. Can you let us know what results you got? When you say "better", do you mean compared to what?


mataney commented Nov 10, 2017

Hey, you wrote:

python baseline.py ...

Can't seem to find this file. Can you link me to the project?

By "better" I meant comparing the accuracy results of runs on the original dataset to runs on See's preprocessed data.


sebastianGehrmann commented Nov 10, 2017

The script we've been using is this one: https://github.com/falcondai/pyrouge/
This is a slightly modified version of the script described here: http://forum.opennmt.net/t/text-summarization-on-gigaword-and-rouge-scoring/85

Thanks for the note about See's dataset. I will try and compare models with the different datasets


mataney commented Nov 12, 2017

Still not sure where this baseline.py file is.
I can run the script as in https://github.com/falcondai/pyrouge/
But I believe using baseline.py with its no_sent_tag option would be smarter.


pltrdy commented Nov 13, 2017

Interesting discussion.

@srush your example shows the -brnn flag, which is now deprecated. You may want to replace it with -encoder_type brnn.


sebastianGehrmann commented Nov 13, 2017

@mataney I linked the wrong repo - https://github.com/falcondai/rouge-baselines is what we use (that in turn uses pyrouge)
One question, how do you use the preprocessed data you linked above? From my understanding, the download link has the individual documents instead of one large file. Do you just concatenate them? If so, do you have a script that I can use to reproduce your findings?

@pltrdy You're absolutely right, I copied the commands from a time before the brnn switch. We should definitely change that.


mataney commented Nov 15, 2017

@sebastianGehrmann
Cool, will run rouge-baselines on my model soon.

And in order to get just the big files, I ran some of See's code (because I wanted to extract something beyond just the article and the abstract).
So the following code is just the gist of See's preprocessing.

https://gist.github.com/mataney/67cfb05b0b84e88da3e0fe04fb80cfc8

So you can do something like this, or you can just concatenate them (the latter will be shorter)
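For anyone who just wants the concatenation route, here is a rough sketch; the directory layout and file extensions are hypothetical and need to be adapted to however the per-article files are stored.

import glob

def concat(pattern, out_path):
    # Write one flattened article per line into a single file.
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path, encoding="utf-8") as f:
                out.write(" ".join(f.read().split()) + "\n")

concat("data/cnndm/train/*.article", "data/cnndm/train.txt.src")
concat("data/cnndm/train/*.abstract", "data/cnndm/train.txt.tgt")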

@sebastianGehrmann

Thanks, I'll check it out. To make sure we use the same exact files, could you upload yours and send me a download link via email? That'd be great! (gehrmann (at) seas.harvard.edu)


srush commented Nov 15, 2017

Huh, this is the code I ran to make the dataset, it was forked from hers. https://github.com/OpenNMT/cnn-dailymail

I wonder if she changed anything...


srush commented Nov 15, 2017

Oh I see, this is after the files are created. Huh, so the only thing I see that could be different is that she drops blank lines and does some unicode encoding. @mataney Could you run "sdiff" and confirm that? I don't see anything else in this gist, but I could be missing something.


mataney commented Nov 15, 2017

@srush These files should be the same (sdiff won't work directly, as I have more data about each article than just the article and abstract; I deleted this from my gist).

I can conclude it was a false alarm, as I didn't know you were using See's preprocessing, but you are :)
So our tokenization etc. is the same.


mataney commented Nov 15, 2017

Another question: after training and translating, I only get one-sentence summaries. This seems strange.
@srush, are the translations you passed to baseline.py one-sentence summaries as well?


srush commented Nov 15, 2017

Oh, shoot. I forgot to mention this. See uses </s> as her sentence-end token, which is unfortunately what we use in translate as well :( For our experiments we replaced hers with </t>. You can either do that, or change the end condition in translate to two repeated </s> tokens.
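A minimal sketch of the first option (re-tagging the target files before running preprocess.py); the paths here are illustrative.

def retag(in_path, out_path):
    # Swap See's sentence tags for <t>/</t> so </s> no longer collides with the decoder's end-of-sequence token.
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line.replace("<s>", "<t>").replace("</s>", "</t>"))

retag("data/cnndm/train.txt.tgt", "data/cnn-tagged/train.txt.tgt")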


pltrdy commented Nov 22, 2017

Why not just replace </s> with  . ?

BTW, it seems that there is no -m no_sent_tag option in falcondai repo. I guess you are using a modified version?!


mataney commented Jan 1, 2018

Hey guys,
Any feature ideas/fixes you can think of that would get us closer to See's results (seq2seq + attn + pointer, then coverage)?


srush commented Jan 1, 2018

I think we are basically there. What scores are you getting?

@srush srush reopened this Jan 1, 2018

srush commented Jan 1, 2018

@sebastianGehrmann (when he gets back from vacation)


pltrdy commented Jan 2, 2018

Using the hyperparameters you mentioned above, @srush, I get the following ROUGE scores on CNN/DM (after 16 epochs):

ROUGE-1 (F): 0.323996
ROUGE-2 (F): 0.140015
ROUGE-L (F): 0.244148

ROUGE-3 (F): 0.081449
ROUGE-S4 (F): 0.105728


mataney commented Jan 3, 2018

Getting about the same, although I'm getting better results when embedding and hidden sizes are 500.
This is still rather different from what See reports: ROUGE-1/2/L of 39.53/17.28/36.38, respectively.

(Obviously this is said without taking anything away from the brilliant work that has been done here! 😄 )


srush commented Jan 3, 2018

Okay, let me post our model, we're doing a lot better. Think we need to update the docs.

(Although it is worrisome that you are getting different results with the same args. I will look into that.)


srush commented Jan 3, 2018

Okay, here are his args:

python train.py -save_model /scratch/cnndm/ada4 -data data/cnn-no-sent-tag/CNNDM -copy_attn -global_attention mlp -word_vec_size 128 -rnn_size 512 -layers 1 -encoder_type brnn -epochs 16 -seed 777 -batch_size 16 -max_grad_norm 2 -share_embeddings -dropout 0. -gpuid 3 -optim adagrad -learning_rate 0.15 -adagrad_accumulator_init 0.1

(See's RNN is split 512/256 which we don't support at the moment.)

And then during translation use Wu style coverage with -alpha 0.9 -beta 0.25

We're seeing train ppl of 12.84, val ppl of 11.98 and ROUGE-1/2 of 0.38 | 0.168
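For reference, a sketch of the Wu et al. (2016) length and coverage penalties that -alpha/-beta control; the formulas follow the GNMT paper, and OpenNMT-py's exact implementation may differ in details.

import math

def length_penalty(length, alpha=0.9):
    # lp(Y) = ((5 + |Y|) / 6) ** alpha          (Wu et al., 2016)
    return ((5.0 + length) / 6.0) ** alpha

def coverage_penalty(attn, beta=0.25):
    # cp(X;Y) = beta * sum_i log(min(sum_t a_{t,i}, 1.0))
    # attn: one row of attention weights over the source per decoded step
    totals = [sum(step[i] for step in attn) for i in range(len(attn[0]))]
    return beta * sum(math.log(min(max(c, 1e-10), 1.0)) for c in totals)

# Finished beam hypotheses are then ranked by  log_prob / length_penalty(len) + coverage_penalty(attn)
# instead of the raw log-probability.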


mataney commented Jan 7, 2018

Hey :)
Tried to run this and it appears to be stuck around 4% accuracy.
Just pulled from master, didn't change a thing.

So the only thing that might be different is the data that is being passed to preprocess.py.
Something special about it?

M.

@ratishsp

Thanks @LeenaShekhar for the details. I agree with your point about batch_size not making any difference during inference, but I was curious why we did not use a batch size of 16 during inference, as it could have made inference faster.

@ratishsp

Hi,
Another difference from See et al.'s implementation is that they reuse the attention distribution as the copy distribution. Though I believe the resulting accuracy difference should not be too significant.

@ratishsp

Hi,
I have one query regarding the training command. I see that the option 'copy_loss_by_seqlength' was introduced in the first week of February. Prior to that, the loss in the CopyGenerator module was normalized by length by default. So, to reproduce the accuracies reported above, should the training command have the option 'copy_loss_by_seqlength' set?

@sebastianGehrmann

I updated the document describing the summarization experiments here.

To directly answer some questions by @ratishsp from above:

  1. Inference batch size 1 above was because larger batch sizes were not implemented when I wrote the original document. They work now, so feel free to use batch_size 20 or more.
  2. You can reuse the attention for copying with the option reuse_copy_attn.
  3. Please use the new loss (copy_loss_by_seqlength) when using BRNN models; a rough sketch of what this normalization does follows below.
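As a sketch of what normalizing the copy loss by target length amounts to (illustrative tensor shapes, not the CopyGenerator code itself):

import torch

def length_normalized_loss(token_nll, tgt_lengths):
    # token_nll: (batch, tgt_len) per-token negative log-likelihoods, zero-padded
    # tgt_lengths: (batch,) true target lengths
    per_example = token_nll.sum(dim=1)                # sum over time steps
    return (per_example / tgt_lengths.float()).sum()  # normalize each example by its length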

@LeenaShekhar

Thank you so much for updating the document.

@ratishsp

Thanks @sebastianGehrmann for updating the document.
One further query: you mentioned above that copy_loss_by_seqlength is to be used with BRNN models. Do you mean that it is recommended only for BRNN models and not for, say, rnn or other encoder types?
In addition, I see that you haven't used any dropout. Were the results worse when dropout was used?


LeenaShekhar commented Mar 30, 2018

@ratishsp To answer your second question: a dropout of 0.3 is the default.

group.add_argument('-dropout', type=float, default=0.3,
                   help="Dropout probability; applied in LSTM stacks.")


jingxil commented Apr 5, 2018

Hi, everyone. Very nice discussion! I ran the baseline model nocopy_acc_51.33_ppl_12.74_e20 on the Gigaword test set with the report_rouge param and got ROUGE (1/2/3/L/SU4): 33.49/16.30/9.04/31.32/18.19, which is a bit lower than the claimed R1: 33.60, R2: 16.29, RL: 31.45. Do you guys have any idea why this difference happened?

@boya-song

Nice work! Thank you all! After reading the thread, I still have one question though. Is the coverage layer introduced by See not suggested during training? @sebastianGehrmann


Maggione commented May 3, 2018

Is there something wrong with the script "python baseline.py -m no_sent_tag ..."? I tried it but got a low score. I ran "python baseline.py -m sent_tag_verbatim ...", and the result seems more normal.

@sebastianGehrmann

Hi @Maggione, thanks for the question. The baseline.py script supports multiple formats for your src and tgt data. In the one I describe in the tutorial, we have <t> and </t> tags as sentence boundaries for the gold, but remove them from the prediction. Depending on your format, you might have to use a different one. I'll put it on my list to better format the different modes.


vince62s commented Sep 3, 2018

Closing this thread; it's fully documented in the FAQ now.
I reproduced very close results on CNNDM on 2 GPUs with the Transformer.

Test set @ 200k steps, averaged over the last 10 checkpoints:

Running ROUGE...

1 ROUGE-1 Average_R: 0.37251 (95%-conf.int. 0.37023 - 0.37478)
1 ROUGE-1 Average_P: 0.42895 (95%-conf.int. 0.42621 - 0.43191)
1 ROUGE-1 Average_F: 0.38517 (95%-conf.int. 0.38297 - 0.38722)

1 ROUGE-2 Average_R: 0.16027 (95%-conf.int. 0.15808 - 0.16247)
1 ROUGE-2 Average_P: 0.18765 (95%-conf.int. 0.18514 - 0.19027)
1 ROUGE-2 Average_F: 0.16689 (95%-conf.int. 0.16484 - 0.16908)

1 ROUGE-L Average_R: 0.34430 (95%-conf.int. 0.34202 - 0.34653)
1 ROUGE-L Average_P: 0.39706 (95%-conf.int. 0.39436 - 0.39988)
1 ROUGE-L Average_F: 0.35630 (95%-conf.int. 0.35418 - 0.35835)

@vince62s vince62s closed this as completed Sep 3, 2018
@evasharma

@vince62s I am also trying to run the Transformer on CNNDM (on 2 GPUs). Could you share the set of train parameters you used? Are they the same as the parameters reported for the Transformer here: http://opennmt.net/OpenNMT-py/Summarization.html?

@vince62s

yes same as there.

@evasharma

Thanks. I just noticed that the results reported for CNN in this thread and in the summarization.md were different, which is why I asked. Also, you used -copy_attn, which differs from the Transformer paper setting. Was that to improve the score?


jsbaan commented Jan 30, 2019

Great work on all these summarization implementations, thanks a bunch! The results presented in the paper Bottom-up Abstractive Summarization are based on this implementation, is that correct?

When I follow the summarization example, given the hyperparameters used in this example, I would expect my results to be equivalent to the "Pointer-Generator + Coverage Penalty (our implementation)" entry in table 1. However, I obtain a drop of ~2.5 ROUGE points, as shown in the evaluation output below. Am I missing something, or did the current implementation diverge from the one used in the paper?

1 ROUGE-1 Average_R: 0.37577 (95%-conf.int. 0.37294 - 0.37851)
1 ROUGE-1 Average_P: 0.39376 (95%-conf.int. 0.39136 - 0.39638)
1 ROUGE-1 Average_F: 0.36855 (95%-conf.int. 0.36631 - 0.37067)

1 ROUGE-2 Average_R: 0.16530 (95%-conf.int. 0.16269 - 0.16789)
1 ROUGE-2 Average_P: 0.16870 (95%-conf.int. 0.16632 - 0.17102)
1 ROUGE-2 Average_F: 0.15969 (95%-conf.int. 0.15743 - 0.16189)

1 ROUGE-L Average_R: 0.34268 (95%-conf.int. 0.33975 - 0.34540)
1 ROUGE-L Average_P: 0.35900 (95%-conf.int. 0.35660 - 0.36144)
1 ROUGE-L Average_F: 0.33603 (95%-conf.int. 0.33376 - 0.33817)

@vince62s

which one did you run ? rnn or transformer ?


jsbaan commented Jan 30, 2019

I ran the rnn on cnndm and evaluated using files2rouge with the predictions and targets stripped of tags.

@sebastianGehrmann

@AIJoris Yes, the results in the paper are all from OpenNMT-py and the summarization example provides the exact commands I ran.
As a first step, could you try running the inference on the pretrained model you can download? We need to make sure to have the same translate.py and ROUGE eval setup first.


jsbaan commented Jan 31, 2019

@sebastianGehrmann Thanks a lot for your quick response. I ran the inference overnight with model ada6_bridge_oldcopy_tagged_acc_54.17_ppl_11.17_e20.pt and the results are as follows:

1 ROUGE-1 Average_R: 0.37917 (95%-conf.int. 0.37638 - 0.38201)
1 ROUGE-1 Average_P: 0.40190 (95%-conf.int. 0.39943 - 0.40441)
1 ROUGE-1 Average_F: 0.37572 (95%-conf.int. 0.37351 - 0.37803)

1 ROUGE-2 Average_R: 0.16934 (95%-conf.int. 0.16684 - 0.17189)
1 ROUGE-2 Average_P: 0.17506 (95%-conf.int. 0.17269 - 0.17743)
1 ROUGE-2 Average_F: 0.16544 (95%-conf.int. 0.16335 - 0.16777)

1 ROUGE-L Average_R: 0.34806 (95%-conf.int. 0.34531 - 0.35070)
1 ROUGE-L Average_P: 0.36861 (95%-conf.int. 0.36622 - 0.37117)
1 ROUGE-L Average_F: 0.34474 (95%-conf.int. 0.34269 - 0.34697)

I used the following command for inference:

python OpenNMT-py/translate.py -gpu 0
-batch_size 20
-beam_size 10
-model models/ada6_bridge_oldcopy_tagged_acc_54.17_ppl_11.17_e20.pt
-src data/cnndm/txt/test.txt.src
-output testout/cnndm.out
-min_length 35
-stepwise_penalty
-coverage_penalty summary
-beta 5
-length_penalty wu
-alpha 0.9
-block_ngram_repeat 3
-ignore_when_blocking "." "</t>" "<t>"


jsbaan commented Feb 6, 2019

@sebastianGehrmann After some more testing I noticed that when loading the pretrained model, the input documents are not truncated to 400 tokens. This is the only difference I have been able to find between the pre-trained model and my own.

Apart from that, it looks like the inference and/or test procedure is different, as the reported results above are different from the paper.
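One quick way to test that hypothesis is to truncate the test sources to the same 400 tokens used at preprocessing time before running translate.py; the output path below is an assumption.

with open("data/cnndm/txt/test.txt.src", encoding="utf-8") as fin, \
     open("data/cnndm/txt/test.txt.src.trunc400", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(" ".join(line.split()[:400]) + "\n")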

@priyanks179

Sir, I am trying to implement a pointer-generator network from scratch, but after 10k iterations it starts overfitting, even though I am not using the coverage mechanism at all. I have tried dropout as well as batchnorm, but they were not able to prevent overfitting. I am just setting max_enc_size to 200 and max_dec_size to 35. Can you suggest what I should do?
(attached image: training curve, "adagrad 90k")


jsbaan commented Apr 3, 2019

I still haven't managed to obtain the reported results. I have tested both the pre-trained Transformer model as well as a Transformer model trained from scratch using the parameters from the documentation. I am using the following parameters for inference:

 python OpenNMT-py/translate.py -gpu 0\
                     -batch_size 5 \
                     -beam_size 10 \
                     -model models/transformer_pretrained.pt \
                     -src data/cnndm/txt/test.txt.src \
                     -output testout/cnndm.out \
                     -min_length 35 \
                     -stepwise_penalty \
                     -coverage_penalty summary \
                     -beta 5 \
                     -length_penalty wu \
                     -alpha 0.9 \
                     -block_ngram_repeat 3 \
                     -ignore_when_blocking "." "</t>" "<t>"

When evaluating with rouge-baselines using python baseline.py -s testout/cnndm.out -t data/cnndm/test.txt.tgt.tagged -m sent_tag_verbatim -r, I get the following results:

1 ROUGE-1 Average_R: 0.36856 (95%-conf.int. 0.36612 - 0.37105)
1 ROUGE-1 Average_P: 0.42605 (95%-conf.int. 0.42308 - 0.42912)
1 ROUGE-1 Average_F: 0.38206 (95%-conf.int. 0.37991 - 0.38434)

1 ROUGE-2 Average_R: 0.16113 (95%-conf.int. 0.15894 - 0.16335)
1 ROUGE-2 Average_P: 0.18784 (95%-conf.int. 0.18524 - 0.19053)
1 ROUGE-2 Average_F: 0.16757 (95%-conf.int. 0.16535 - 0.16985)

1 ROUGE-L Average_R: 0.34212 (95%-conf.int. 0.33971 - 0.34460)
1 ROUGE-L Average_P: 0.39589 (95%-conf.int. 0.39309 - 0.39898)
1 ROUGE-L Average_F: 0.35485 (95%-conf.int. 0.35269 - 0.35719)

* method sent_tag_verbatim
['rouge_1_recall', 'rouge_1_precision', 'rouge_1_f_score', 'rouge_2_recall', 'rouge_2_precision', 'rouge_2_f_score', 'rouge_l_recall', 'rouge_l_precision', 'rouge_l_f_score']
36.86	42.60	38.21	16.11	18.78	16.76	34.21	39.59	35.48	
* evaluated 11490 samples, took 259.233s, averaging 0.023s/sample
* portion of samples that contains self-repetitions
full-sent,32-gram,16-gram,8-gram,4-gram,2-gram
0.00%,	0.00%,	0.00%,	0.00%,	0.00%,	34.21%,	
* evaluated 11490 samples, took 1.243s, averaging 0.000s/sample

Just to be clear, the above results are from the pre-trained model located here. Do you have any idea what can cause this difference in performance and how to improve it?


lauhaide commented Oct 8, 2019

Hi all, Thanks for the code and all replies to this discussion!
I trained the CopyTransformer from scratch on CNN/DM (train and inference commands as in the documentation, only with a different shard size) and get the same scores as AIJoris in the post above (with rouge-baselines.py). I cannot get the reported values: 39.25/17.54/36.45. Any ideas?
Results are close to those reported above by @vince62s on Sep 3rd.

I noticed that the command line for training has no '-seed' setting, which means the default value is used. However, I observed different results across launches when I did not set the seed myself in the train command line.
Thanks!


pltrdy commented Oct 9, 2019

@lauhaide could you report how long you trained the model, and the final scores?

In fact, I can't reproduce it either, neither by training from scratch nor by just running inference (I get the exact same results as @AIJoris).

@sebastianGehrmann could you help us with this? I even tried running inference on an old commit (since the repo has been moving): using the commit from 2018-04-27 (bde7f83), I get slightly different results, but still not the same as yours (old F1: 0.38689 / 0.17099 / 0.35935).

@lauhaide

Thanks for your prompt reply, @pltrdy!
The scores at 42.5k steps (save_checkpoint_steps=2500) are: 38.95 | 17.04 | 35.89.
If you get lower results at inference with the same pre-trained model, something in the code or some decoding parameter might be different (e.g. penalties, beam size?).


pltrdy commented Oct 11, 2019

@lauhaide could you provide the checkpoint file?

@lauhaide

Yes, can be downloaded from here:
https://drive.google.com/open?id=1eOlsCGJVdCgm6t6gY7ZF3T53LLvBJXY1
Thanks!
