
Abstractive Summarization Results #340

Closed
mataney opened this issue Oct 15, 2017 · 82 comments


mataney commented Oct 15, 2017

Hey guys, looking at recent pull requests and issues, it looks like a common interest of contributors (on top of NMT, obviously) is abstractive summarization.

Any suggestions on how to train a model that will get close to the results of recent papers on the CNN-Daily Mail dataset? Any additional preprocessing?

Thanks!

@mataney mataney changed the title Summarization Abstractive Summarization Results Oct 15, 2017

srush commented Nov 5, 2017

Hey, so we are getting close to these results, but still a little bit below.

Summarization Experiment Description

This document describes how to replicate summarization experiments on the CNNDM and Gigaword datasets using OpenNMT-py.
In the following, we assume access to a tokenized form of the corpus split into train/valid/test sets.

An example article-title pair from Gigaword should look like this:

Input
australia 's current account deficit shrunk by a record #.## billion dollars -lrb- #.## billion us -rrb- in the june quarter due to soaring commodity prices , figures released monday showed .

Output
australian current account deficit narrows sharply

Preprocessing the data

Since we are using copy-attention [1] in the model, we need to preprocess the dataset such that source and target are aligned and use the same dictionary. This is achieved by using the options dynamic_dict and share_vocab.
We additionally turn off the default truncation of the source, so that inputs longer than 50 words are not cut off.
For CNNDM we follow See et al. [2] and additionally truncate the source length at 400 tokens and the target at 100.
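To make the alignment idea concrete, here is a rough sketch of what a per-example "dynamic dictionary" does; this is illustrative Python, not OpenNMT-py's implementation.

def build_dynamic_dict(src_tokens, shared_vocab):
    # Map each source token to an id, extending the shared vocab per example so the
    # decoder can point at source words even if they are out of vocabulary.
    ext_vocab = dict(shared_vocab)           # copy of the shared word -> id table
    src_map = []
    for tok in src_tokens:
        if tok not in ext_vocab:
            ext_vocab[tok] = len(ext_vocab)  # OOV source word gets a temporary id
        src_map.append(ext_vocab[tok])
    return src_map, ext_vocab

shared_vocab = {"<unk>": 0, "the": 1, "deficit": 2}
src = "australia 's current account deficit shrunk".split()
src_map, ext_vocab = build_dynamic_dict(src, shared_vocab)
# Target tokens are looked up in the same extended table, so a copied word and a
# generated word resolve to the same id -- which is why source and target must share one dictionary.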

command used:

(1) CNNDM

python preprocess.py -train_src data/cnndm/train.txt.src -train_tgt data/cnn-no-sent-tag/train.txt.tgt -valid_src data/cnndm/val.txt.src -valid_tgt data/cnn-no-sent-tag/val.txt.tgt -save_data data/cnn-no-sent-tag/cnndm -src_seq_length 10000 -tgt_seq_length 10000 -src_seq_length_trunc 400 -tgt_seq_length_trunc 100 -dynamic_dict -share_vocab

(2) Gigaword

python preprocess.py -train_src data/giga/train.article.txt -train_tgt data/giga/train.title.txt -valid_src data/giga/valid.article.txt -valid_tgt data/giga/valid.title.txt -save_data data/giga/giga -src_seq_length 10000 -dynamic_dict -share_vocab

Training

The training procedure described in this section for the most part follows the parameter choices and implementation of See et al. [2]. As mentioned above, we use copy attention as a mechanism for the model to decide whether to generate a new word or to copy it from the source (copy_attn).
A notable difference from See's model is that we are using the attention mechanism introduced by Bahdanau et al. [3] (global_attention mlp) instead of that by Luong et al. [4] (global_attention dot). Both options typically perform very similarly, with Luong attention often having a slight advantage.
We use 128-dimensional word embeddings and a 512-dimensional one-layer LSTM. On the encoder side, we use a bidirectional LSTM (brnn), which means that the 512 dimensions are split into 256 dimensions per direction.
We also share the word embeddings between encoder and decoder (share_embeddings). This option drastically reduces the number of parameters the model has to learn. However, we found that a model without this option suffers only a minimal impact on performance.
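For reference, a minimal sketch of the two score functions, with illustrative shapes and parameter names rather than OpenNMT-py's actual GlobalAttention module:

import torch

# h_t: decoder state (batch, dim); H_s: encoder states (batch, src_len, dim).

def dot_score(h_t, H_s):
    # Luong "dot": score(h_t, h_s) = h_t . h_s
    return torch.bmm(H_s, h_t.unsqueeze(2)).squeeze(2)      # (batch, src_len)

def mlp_score(h_t, H_s, W_t, W_s, v):
    # Bahdanau "mlp": score(h_t, h_s) = v^T tanh(W_t h_t + W_s h_s)
    q = W_t(h_t).unsqueeze(1) + W_s(H_s)                    # (batch, src_len, dim)
    return torch.tanh(q).matmul(v)                          # (batch, src_len)

dim, batch, src_len = 512, 2, 7
W_t = torch.nn.Linear(dim, dim, bias=False)
W_s = torch.nn.Linear(dim, dim, bias=False)
v = torch.randn(dim)
h_t, H_s = torch.randn(batch, dim), torch.randn(batch, src_len, dim)
align = torch.softmax(mlp_score(h_t, H_s, W_t, W_s, v), dim=-1)   # attention weights over the source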

For the training procedure, we are using SGD with an initial learning rate of 1 for a total of 16 epochs. In most cases, the lowest validation perplexity is achieved around epoch 10-12. We also use OpenNMT's default learning rate decay, which halves the learning rate after every epoch once the validation perplexity has increased (or after epoch 8).
Alternative training procedures such as Adam with an initial learning rate of 0.001 converge faster than SGD, but achieve slightly worse final results. We additionally set the maximum norm of the gradient to 2, and renormalize if the gradient norm exceeds this value.
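As an illustration of that schedule and the gradient renormalization, here is a minimal PyTorch sketch; the function and variable names are assumptions, not OpenNMT-py's actual Optim code.

import torch

def maybe_decay_lr(lr, epoch, val_ppl, prev_val_ppl, start_decay_at=8, decay=0.5):
    # Halve the learning rate once validation perplexity goes up, or after epoch 8.
    if epoch > start_decay_at or (prev_val_ppl is not None and val_ppl > prev_val_ppl):
        lr *= decay
    return lr

# Before each parameter update, renormalize gradients whose norm exceeds 2:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2)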

commands used:

(1) CNNDM

python train.py -save_model logs/notag_sgd3 -data data/cnn-no-sent-tag/CNNDM -copy_attn -global_attention mlp -word_vec_size 128 -rnn_size 256 -layers 1 -brnn -epochs 16 -seed 777 -batch_size 32 -max_grad_norm 2 -share_embeddings -gpuid 0 -start_checkpoint_at 9

(2) Gigaword

python train.py -save_model logs/giga_sgd3_512 -data data/giga/giga -copy_attn -global_attention mlp -word_vec_size 128 -rnn_size 512 -layers 1 -brnn -epochs 16 -seed 777 -batch_size 32 -max_grad_norm 2 -share_embeddings -gpuid 0 -start_checkpoint_at 9

Inference

During inference, we use beam-search with a beam-size of 10.
We additionally use the replace_unk option, which replaces generated <UNK> tokens with the source token that received the highest attention. This acts as a safety net in case the copy attention, which should learn to copy such words, fails.
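For intuition, a minimal sketch of what replace_unk does conceptually; this is illustrative only, translate.py handles it internally.

def replace_unk(pred_tokens, src_tokens, attn):
    # attn: one list of attention weights over src_tokens per decoded token
    out = []
    for t, tok in enumerate(pred_tokens):
        if tok == "<unk>":
            best_src = max(range(len(src_tokens)), key=lambda j: attn[t][j])
            tok = src_tokens[best_src]       # copy the most-attended source word
        out.append(tok)
    return out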

commands used:

(1) CNNDM

python translate.py -gpu 2 -batch_size 1 -model logs/notag_try3_acc_49.29_ppl_14.62_e16.pt -src data/cnndm/test.txt.src -output sgd3_out.txt -beam_size 10 -replace_unk

(2) Gigaword

python translate.py -gpu 2 -batch_size 1 -model logs/giga_sgd3_512_acc_51.10_ppl_12.04_e16.pt -src data/giga/test.article.txt -output giga_sgd3.out.txt -beam_size 10 -replace_unk

Evaluation

CNNDM

To evaluate the ROUGE scores on CNNDM, we extended the pyrouge wrapper with additional evaluations such as the number of repeated n-grams (a failure mode typically found in models with copy attention), found here.

It can be run with the following command:

python baseline.py -s sgd3_out.txt -t ~/datasets/cnn-dailymail/sent-tagged/test.txt.tgt -m no_sent_tag -r

Note that the no_sent_tag option strips tags around sentences - when a sentence previously was <s> w w w w . </s>, it becomes w w w w ..
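For illustration, a small sketch of the two checks mentioned above, i.e. stripping the sentence tags and detecting repeated n-grams; this is not the actual baseline.py code.

from collections import Counter

def strip_sent_tags(line):
    return " ".join(tok for tok in line.split() if tok not in ("<s>", "</s>"))

def has_repeated_ngram(tokens, n=3):
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return any(c > 1 for c in counts.values())

line = "<s> the cat sat . </s> <s> the cat sat . </s>"
clean = strip_sent_tags(line)                # "the cat sat . the cat sat ."
print(has_repeated_ngram(clean.split()))     # True -- a sign of copy-attention looping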

Gigaword

For evaluation of large test sets such as Gigaword, we use a parallel Python wrapper around ROUGE, found here.

command used:
files2rouge giga_sgd3.out.txt test.title.txt --verbose

Running the commands above should yield the following scores:

ROUGE-1 (F): 0.352127
ROUGE-2 (F): 0.173109
ROUGE-3 (F): 0.098244
ROUGE-L (F): 0.327742
ROUGE-S4 (F): 0.155524

References

[1] Vinyals, O., Fortunato, M. and Jaitly, N., 2015. Pointer Networks. NIPS.

[2] See, A., Liu, P.J. and Manning, C.D., 2017. Get To The Point: Summarization with Pointer-Generator Networks. ACL.

[3] Bahdanau, D., Cho, K. and Bengio, Y., 2014. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR.

[4] Luong, M.T., Pham, H. and Manning, C.D., 2015. Effective Approaches to Attention-based Neural Machine Translation. EMNLP.


mataney commented Nov 8, 2017

This is massive!
Absolutely massive! Thank you very much.

By the way, I found using See's tokenized dataset (can be downloaded here) to work better.

What data do you pass to preprocess.py?


srush commented Nov 9, 2017

Cool. Can you let us know what results you got? When you say "better", do you mean compared to what?


mataney commented Nov 10, 2017

Hey, you wrote:

python baseline.py ...

Can't seem to find this file. Can you link me to the project?

By "better" I meant comparing the accuracy results of runs on the original dataset to runs on See's preprocessed data.


sebastianGehrmann commented Nov 10, 2017

The script we've been using is this one: https://github.com/falcondai/pyrouge/
This is a slightly modified version of the script described here: http://forum.opennmt.net/t/text-summarization-on-gigaword-and-rouge-scoring/85

Thanks for the note about See's dataset. I will try and compare models with the different datasets


mataney commented Nov 12, 2017

Still not sure where this baseline.py file is.
I can run the script as in https://github.com/falcondai/pyrouge/
But I believe using baseline.py with its no_sent_tag option would be smarter.


pltrdy commented Nov 13, 2017

Interesting discussion.

@srush your example shows the -brnn flag, which is now deprecated. You may want to replace it with -encoder_type brnn.


sebastianGehrmann commented Nov 13, 2017

@mataney I linked the wrong repo - https://github.com/falcondai/rouge-baselines is what we use (that in turn uses pyrouge)
One question, how do you use the preprocessed data you linked above? From my understanding, the download link has the individual documents instead of one large file. Do you just concatenate them? If so, do you have a script that I can use to reproduce your findings?

@pltrdy You're absolutely right, I copied the commands from a time before the brnn switch. We should definitely change that.


mataney commented Nov 15, 2017

@sebastianGehrmann
Cool, will run rouge-baselines on my model soon.

And in order to get just the big files, I ran some of See's code (because I wanted to extract something beyond just the article and the abstract).
So the following code is just the gist of See's preprocessing.

https://gist.github.com/mataney/67cfb05b0b84e88da3e0fe04fb80cfc8

So you can do something like this, or you can just concatenate them (the latter will be shorter)
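For anyone who just wants the concatenation route, here is a rough sketch; the directory layout and file extensions are hypothetical and need to be adapted to however the per-article files are stored.

import glob

def concat(pattern, out_path):
    # Write one flattened article per line into a single file.
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path, encoding="utf-8") as f:
                out.write(" ".join(f.read().split()) + "\n")

concat("data/cnndm/train/*.article", "data/cnndm/train.txt.src")
concat("data/cnndm/train/*.abstract", "data/cnndm/train.txt.tgt")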

@sebastianGehrmann

Thanks, I'll check it out. To make sure we use the same exact files, could you upload yours and send me a download link via email? That'd be great! (gehrmann (at) seas.harvard.edu)


srush commented Nov 15, 2017

Huh, this is the code I ran to make the dataset, it was forked from hers. https://github.com/OpenNMT/cnn-dailymail

I wonder if she changed anything...


srush commented Nov 15, 2017

Oh I see, this is after the files are created. Huh, so the only thing I see that could be different is that she drops blank lines and does some unicode encoding. @mataney Could you run "sdiff" and confirm that? I don't see anything else in this gist, but I could be missing something.


mataney commented Nov 15, 2017

@srush These files should be the same (sdiff won't work directly, as I have more data about each article than just the article and abstract; I deleted this from my gist).

I can conclude it was a false alarm, as I didn't know you were using See's preprocessing, but you are :)
So our tokenization etc. is the same.


mataney commented Nov 15, 2017

Another question: after training and translating, I only get one-sentence summaries. This seems strange.
@srush, are the translations you passed to baseline.py one-sentence summaries as well?


srush commented Nov 15, 2017

Oh, shoot. I forgot to mention this. See uses </s> as her sentence-end token, which is unfortunately what we use in translate as well :( For our experiments we replaced hers with </t>. You can either do that, or change the end condition in translate to two repeated </s> tokens.
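A minimal sketch of the first option (re-tagging the target files before running preprocess.py); the paths here are illustrative.

def retag(in_path, out_path):
    # Swap See's sentence tags for <t>/</t> so </s> no longer collides with the decoder's end-of-sequence token.
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line.replace("<s>", "<t>").replace("</s>", "</t>"))

retag("data/cnndm/train.txt.tgt", "data/cnn-tagged/train.txt.tgt")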


pltrdy commented Nov 22, 2017

Why not just replace </s> with  . ?

BTW, it seems that there is no -m no_sent_tag option in falcondai repo. I guess you are using a modified version?!


mataney commented Jan 1, 2018

Hey guys,
Any feature ideas/fixes you can think of that would get us closer to See's results (seq2seq + attn + pointer, then coverage)?


srush commented Jan 1, 2018

I think we are basically there. What scores are you getting?

@srush srush reopened this Jan 1, 2018

srush commented Jan 1, 2018

@sebastianGehrmann (when he gets back from vacation)


pltrdy commented Jan 2, 2018

Using the hyperparameters you mentioned above, @srush, I get the following ROUGE scores on CNN/DM (after 16 epochs):

ROUGE-1 (F): 0.323996
ROUGE-2 (F): 0.140015
ROUGE-L (F): 0.244148

ROUGE-3 (F): 0.081449
ROUGE-S4 (F): 0.105728


mataney commented Jan 3, 2018

Getting about the same, although I'm getting better results when embedding and hidden sizes are 500.
This is still rather different from what See reports: ROUGE-1/2/L of 39.53/17.28/36.38, respectively.

(Obviously this is said without taking anything away from the brilliant work that has been done here! 😄 )


srush commented Jan 3, 2018

Okay, let me post our model, we're doing a lot better. Think we need to update the docs.

(Although it is worrisome that you are getting different results with the same args. I will look into that.)


srush commented Jan 3, 2018

Okay, here are his args:

python train.py -save_model /scratch/cnndm/ada4 -data data/cnn-no-sent-tag/CNNDM -copy_attn -global_attention mlp -word_vec_size 128 -rnn_size 512 -layers 1 -encoder_type brnn -epochs 16 -seed 777 -batch_size 16 -max_grad_norm 2 -share_embeddings -dropout 0. -gpuid 3 -optim adagrad -learning_rate 0.15 -adagrad_accumulator_init 0.1

(See's RNN is split 512/256 which we don't support at the moment.)

And then during translation use Wu style coverage with -alpha 0.9 -beta 0.25

We're seeing train ppl of 12.84, val ppl of 11.98 and ROUGE-1/2 of 0.38 | 0.168
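For reference, a sketch of the Wu et al. (2016) length and coverage penalties that -alpha/-beta control; the formulas follow the GNMT paper, and OpenNMT-py's exact implementation may differ in details.

import math

def length_penalty(length, alpha=0.9):
    # lp(Y) = ((5 + |Y|) / 6) ** alpha          (Wu et al., 2016)
    return ((5.0 + length) / 6.0) ** alpha

def coverage_penalty(attn, beta=0.25):
    # cp(X;Y) = beta * sum_i log(min(sum_t a_{t,i}, 1.0))
    # attn: one row of attention weights over the source per decoded step
    totals = [sum(step[i] for step in attn) for i in range(len(attn[0]))]
    return beta * sum(math.log(min(max(c, 1e-10), 1.0)) for c in totals)

# Finished beam hypotheses are then ranked by  log_prob / length_penalty(len) + coverage_penalty(attn)
# instead of the raw log-probability.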


mataney commented Jan 7, 2018

Hey :)
Tried to run this and it appears to be stuck around 4% accuracy.
Just pulled from master, didn't change a thing.

So the only thing that might be different is the data that is being passed to preprocess.py.
Something special about it?

M.

@ratishsp

Thanks @LeenaShekhar for the details. I agree with your point about batch_size not making any difference during inference, but I was curious why we did not use a batch size of 16 during inference, as it could have made inference faster.

@ratishsp

Hi,
Another difference from See et al.'s implementation is that they reuse the attention distribution as the copy distribution. Though I believe the resulting accuracy difference should not be too significant.

@ratishsp

Hi,
I have one query regarding the training command. I see that the option 'copy_loss_by_seqlength' was introduced in the first week of February. Prior to that, the loss in the CopyGenerator module was normalized by length by default. So, to reproduce the accuracies reported above, should the training command have the option 'copy_loss_by_seqlength' set?

@sebastianGehrmann

I updated the document describing the summarization experiments here.

To directly answer some questions by @ratishsp from above:

  1. Inference batch size 1 above was because larger batch sizes were not implemented when I wrote the original document. They work now, so feel free to use batch_size 20 or more.
  2. You can reuse the attention for copying with the option reuse_copy_attn.
  3. Please use the new loss (copy_loss_by_seqlength) when using BRNN models; a rough sketch of what this normalization does follows below.
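As a sketch of what normalizing the copy loss by target length amounts to (illustrative tensor shapes, not the CopyGenerator code itself):

import torch

def length_normalized_loss(token_nll, tgt_lengths):
    # token_nll: (batch, tgt_len) per-token negative log-likelihoods, zero-padded
    # tgt_lengths: (batch,) true target lengths
    per_example = token_nll.sum(dim=1)                # sum over time steps
    return (per_example / tgt_lengths.float()).sum()  # normalize each example by its length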

@LeenaShekhar

Thank you so much for updating the document.

@ratishsp

Thanks @sebastianGehrmann for updating the document.
One further query: you mentioned above that copy_loss_by_seqlength is to be used with BRNN models. Do you mean that it is recommended only for BRNN models and not for, say, rnn or other encoder types?
In addition, I see that you haven't used any dropout. Were the results worse when dropout was used?


LeenaShekhar commented Mar 30, 2018

@ratishsp To answer your second question: a dropout of 0.3 is the default.

group.add_argument('-dropout', type=float, default=0.3,
                   help="Dropout probability; applied in LSTM stacks.")


jingxil commented Apr 5, 2018

Hi, everyone. Very nice discussion! I ran the baseline model nocopy_acc_51.33_ppl_12.74_e20 on the Gigaword test set with the report_rouge param and got ROUGE (1/2/3/L/SU4): 33.49/16.30/9.04/31.32/18.19, which is a bit lower than the claimed R1: 33.60, R2: 16.29, RL: 31.45. Do you guys have any idea why this difference happened?

@boya-song

Nice work! Thank you all! After reading the thread, I still have one question though. Is the coverage layer introduced by See not suggested during training? @sebastianGehrmann


Maggione commented May 3, 2018

Is there something wrong with the script "python baseline.py -m no_sent_tag ..."? I tried it but got a low score. I ran "python baseline.py -m sent_tag_verbatim ...", and the result seems more normal.

@sebastianGehrmann

Hi @Maggione, thanks for the question. The baseline.py script supports multiple formats for your src and tgt data. In the one I describe in the tutorial, we have <t> and </t> tags as sentence boundaries for the gold, but remove them from the prediction. Depending on your format, you might have to use a different one. I'll put it on my list to better format the different modes.


vince62s commented Sep 3, 2018

Closing this thread; it's fully documented in the FAQ now.
I reproduced very close results on CNNDM on 2 GPUs with the Transformer.

Test set @ 200k steps, averaged over the last 10 checkpoints:

Running ROUGE...

1 ROUGE-1 Average_R: 0.37251 (95%-conf.int. 0.37023 - 0.37478)
1 ROUGE-1 Average_P: 0.42895 (95%-conf.int. 0.42621 - 0.43191)
1 ROUGE-1 Average_F: 0.38517 (95%-conf.int. 0.38297 - 0.38722)

1 ROUGE-2 Average_R: 0.16027 (95%-conf.int. 0.15808 - 0.16247)
1 ROUGE-2 Average_P: 0.18765 (95%-conf.int. 0.18514 - 0.19027)
1 ROUGE-2 Average_F: 0.16689 (95%-conf.int. 0.16484 - 0.16908)

1 ROUGE-L Average_R: 0.34430 (95%-conf.int. 0.34202 - 0.34653)
1 ROUGE-L Average_P: 0.39706 (95%-conf.int. 0.39436 - 0.39988)
1 ROUGE-L Average_F: 0.35630 (95%-conf.int. 0.35418 - 0.35835)

@vince62s vince62s closed this as completed Sep 3, 2018
@evasharma

@vince62s I am also trying to run the Transformer on CNNDM (on 2 GPUs). Could you share the set of train parameters you used? Are they the same as the parameters reported for the Transformer here: http://opennmt.net/OpenNMT-py/Summarization.html?

@vince62s

yes same as there.

@evasharma

Thanks. I just noticed that the results reported for CNN in this thread and in the summarization.md were different, which is why I asked. Also, you used -copy_attn, which differs from the Transformer paper setting. Was that to improve the score?


jsbaan commented Jan 30, 2019

Great work on all these summarization implementations, thanks a bunch! The results presented in the paper Bottom-up Abstractive Summarization are based on this implementation, is that correct?

When I follow the summarization example, given the hyperparameters used in this example, I would expect my results to be equivalent to the "Pointer-Generator + Coverage Penalty (our implementation)" entry in table 1. However, I obtain a drop of ~2.5 ROUGE points, as shown in the evaluation output below. Am I missing something, or did the current implementation diverge from the one used in the paper?

1 ROUGE-1 Average_R: 0.37577 (95%-conf.int. 0.37294 - 0.37851)
1 ROUGE-1 Average_P: 0.39376 (95%-conf.int. 0.39136 - 0.39638)
1 ROUGE-1 Average_F: 0.36855 (95%-conf.int. 0.36631 - 0.37067)

1 ROUGE-2 Average_R: 0.16530 (95%-conf.int. 0.16269 - 0.16789)
1 ROUGE-2 Average_P: 0.16870 (95%-conf.int. 0.16632 - 0.17102)
1 ROUGE-2 Average_F: 0.15969 (95%-conf.int. 0.15743 - 0.16189)

1 ROUGE-L Average_R: 0.34268 (95%-conf.int. 0.33975 - 0.34540)
1 ROUGE-L Average_P: 0.35900 (95%-conf.int. 0.35660 - 0.36144)
1 ROUGE-L Average_F: 0.33603 (95%-conf.int. 0.33376 - 0.33817)

@vince62s

which one did you run ? rnn or transformer ?


jsbaan commented Jan 30, 2019

I ran the rnn on cnndm and evaluated using files2rouge with the predictions and targets stripped of tags.

@sebastianGehrmann

@AIJoris Yes, the results in the paper are all from OpenNMT-py and the summarization example provides the exact commands I ran.
As a first step, could you try running the inference on the pretrained model you can download? We need to make sure to have the same translate.py and ROUGE eval setup first.


jsbaan commented Jan 31, 2019

@sebastianGehrmann Thanks a lot for your quick response. I ran the inference overnight with model ada6_bridge_oldcopy_tagged_acc_54.17_ppl_11.17_e20.pt and the results are as follows:

1 ROUGE-1 Average_R: 0.37917 (95%-conf.int. 0.37638 - 0.38201)
1 ROUGE-1 Average_P: 0.40190 (95%-conf.int. 0.39943 - 0.40441)
1 ROUGE-1 Average_F: 0.37572 (95%-conf.int. 0.37351 - 0.37803)

1 ROUGE-2 Average_R: 0.16934 (95%-conf.int. 0.16684 - 0.17189)
1 ROUGE-2 Average_P: 0.17506 (95%-conf.int. 0.17269 - 0.17743)
1 ROUGE-2 Average_F: 0.16544 (95%-conf.int. 0.16335 - 0.16777)

1 ROUGE-L Average_R: 0.34806 (95%-conf.int. 0.34531 - 0.35070)
1 ROUGE-L Average_P: 0.36861 (95%-conf.int. 0.36622 - 0.37117)
1 ROUGE-L Average_F: 0.34474 (95%-conf.int. 0.34269 - 0.34697)

I used the following command for inference:

python OpenNMT-py/translate.py -gpu 0
-batch_size 20
-beam_size 10
-model models/ada6_bridge_oldcopy_tagged_acc_54.17_ppl_11.17_e20.pt
-src data/cnndm/txt/test.txt.src
-output testout/cnndm.out
-min_length 35
-stepwise_penalty
-coverage_penalty summary
-beta 5
-length_penalty wu
-alpha 0.9
-block_ngram_repeat 3
-ignore_when_blocking "." "</t>" "<t>"


jsbaan commented Feb 6, 2019

@sebastianGehrmann After some more testing I noticed that when loading the pretrained model, the input documents are not truncated to 400 tokens. This is the only difference I have been able to find between the pre-trained model and my own.

Apart from that, it looks like the inference and/or test procedure is different, as the reported results above are different from the paper.
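One quick way to test that hypothesis is to truncate the test sources to the same 400 tokens used at preprocessing time before running translate.py; the output path below is an assumption.

with open("data/cnndm/txt/test.txt.src", encoding="utf-8") as fin, \
     open("data/cnndm/txt/test.txt.src.trunc400", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(" ".join(line.split()[:400]) + "\n")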

@priyanks179

Sir, I am trying to implement a pointer-generator network from scratch, but after 10k iterations it starts overfitting, even though I am not using the coverage mechanism at all. I have tried dropout as well as batchnorm, but they were not able to prevent overfitting. I am just setting max_enc_size to 200 and max_dec_size to 35. Can you suggest what I should do?
(attached image: training curve, "adagrad 90k")


jsbaan commented Apr 3, 2019

I still haven't managed to obtain the reported results. I have tested both the pre-trained Transformer model as well as a Transformer model trained from scratch using the parameters from the documentation. I am using the following parameters for inference:

 python OpenNMT-py/translate.py -gpu 0\
                     -batch_size 5 \
                     -beam_size 10 \
                     -model models/transformer_pretrained.pt \
                     -src data/cnndm/txt/test.txt.src \
                     -output testout/cnndm.out \
                     -min_length 35 \
                     -stepwise_penalty \
                     -coverage_penalty summary \
                     -beta 5 \
                     -length_penalty wu \
                     -alpha 0.9 \
                     -block_ngram_repeat 3 \
                     -ignore_when_blocking "." "</t>" "<t>"

When evaluating with rouge-baselines using python baseline.py -s testout/cnndm.out -t data/cnndm/test.txt.tgt.tagged -m sent_tag_verbatim -r, I get the following results:

1 ROUGE-1 Average_R: 0.36856 (95%-conf.int. 0.36612 - 0.37105)
1 ROUGE-1 Average_P: 0.42605 (95%-conf.int. 0.42308 - 0.42912)
1 ROUGE-1 Average_F: 0.38206 (95%-conf.int. 0.37991 - 0.38434)

1 ROUGE-2 Average_R: 0.16113 (95%-conf.int. 0.15894 - 0.16335)
1 ROUGE-2 Average_P: 0.18784 (95%-conf.int. 0.18524 - 0.19053)
1 ROUGE-2 Average_F: 0.16757 (95%-conf.int. 0.16535 - 0.16985)

1 ROUGE-L Average_R: 0.34212 (95%-conf.int. 0.33971 - 0.34460)
1 ROUGE-L Average_P: 0.39589 (95%-conf.int. 0.39309 - 0.39898)
1 ROUGE-L Average_F: 0.35485 (95%-conf.int. 0.35269 - 0.35719)

* method sent_tag_verbatim
['rouge_1_recall', 'rouge_1_precision', 'rouge_1_f_score', 'rouge_2_recall', 'rouge_2_precision', 'rouge_2_f_score', 'rouge_l_recall', 'rouge_l_precision', 'rouge_l_f_score']
36.86	42.60	38.21	16.11	18.78	16.76	34.21	39.59	35.48	
* evaluated 11490 samples, took 259.233s, averaging 0.023s/sample
* portion of samples that contains self-repetitions
full-sent,32-gram,16-gram,8-gram,4-gram,2-gram
0.00%,	0.00%,	0.00%,	0.00%,	0.00%,	34.21%,	
* evaluated 11490 samples, took 1.243s, averaging 0.000s/sample

Just to be clear, the above results are from the pre-trained model located here. Do you have any idea what can cause this difference in performance and how to improve it?


lauhaide commented Oct 8, 2019

Hi all, Thanks for the code and all replies to this discussion!
I trained the CopyTransformer from scratch on CNN/DM (train and inference commands as in the documentation, only with a different shard size) and get the same scores as AIJoris in the post above (with rouge-baselines.py). I cannot get the reported values: 39.25/17.54/36.45. Any ideas?
Results are close to those reported above by @vince62s on Sep 3rd.

I noticed that the command line for training has no '-seed' setting, which means the default value is used. However, I observed different results across launches when I did not set the seed myself in the train command line.
Thanks!


pltrdy commented Oct 9, 2019

@lauhaide could you report how long you trained the model, and the final scores?

In fact, I can't reproduce it either, neither by training from scratch nor by just running inference (I get the exact same results as @AIJoris).

@sebastianGehrmann could you help us with this? I even tried running inference on an old commit (since the repo has been moving): using the commit from 2018-04-27 (bde7f83), I get slightly different results, but still not the same as yours (old F1: 0.38689 / 0.17099 / 0.35935).

@lauhaide

Thanks for your prompt reply, @pltrdy!
The scores at 42.5k steps (save_checkpoint_steps=2500) are: 38.95 | 17.04 | 35.89.
If you get lower results at inference with the same pre-trained model, something in the code or some decoding parameter might be different (e.g. penalties, beam size?).


pltrdy commented Oct 11, 2019

@lauhaide could you provide the checkpoint file?

@lauhaide

Yes, can be downloaded from here:
https://drive.google.com/open?id=1eOlsCGJVdCgm6t6gY7ZF3T53LLvBJXY1
Thanks!
