Implement a ROUGE metric that faithfully reproduces the official metric written in perl. #5153
So I have run some extra experiments to locate the potential issue; the caveat is that the following experiments were not conducted on CNN/DM, but on another, smaller dataset.
Now my main focus is on the fact that fairseq's BART allows 2048 tokens of input, but I believe the huggingface version of BART used by AllenNLP only allows 1024; this could make a big difference.
Have you discovered anything else with your extra experiments? Assuming that the validation loss converges fine and doesn't immediately spike due to overfitting, warm-up shouldn't make a huge difference. The difference in the number of allowed tokens could have an effect.
@ArjunSubramonian I haven't got more updates than that on the experiments. But I've found that people usually extend the positional embedding of BART from 1024 to 2048 to adapt to longer input (which I believe they did for CNN/DM). I haven't found the corresponding code in fairseq, but I've found this code in the longformer repo. Hope we can do the same for the BART in AllenNLP. Also, just out of curiosity, when do we typically need warmup steps?
@niansong1996 would you be interested in implementing the sparse attention mechanism used by Longformer? I think it would be a great addition to our library. There are a variety of reasons that people use LR warmup. You can read some of the responses here. People tend to use warmup when working with models that have attention mechanisms (to prevent your attention weights from overfitting or collapsing to just a few relations early in the training process) and when using adaptive optimizers (to improve gradient statistics).
I am not particularly familiar with Longformer and the sparse attention mechanism it uses. However, I can try to implement the positional embedding extension as an option for the BART model in AllenNLP, if such a thing hasn't been done yet. Can you or someone else confirm that this is needed? Thanks a lot for the pointer and explanation for LR warmup!
@dirkgr thoughts on implementing the positional embedding extension as an option for the BART model in AllenNLP?
The extended positional embeddings would be great. I assume it would be relatively self-contained, just resizing the embedding matrix during model initialization?
@dirkgr Basically, yes. But it's a little trickier than that: some say that if we want to extend from 1024 to 2048, we need to copy the first 1024 embeddings and initialize the rest, while others have found that copying the original embeddings twice works better. Also, BART has positions 0 and 1 reserved, so we need to account for that as well. I will experiment a bit with different implementations and see what works.
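As a reference point, here is a minimal sketch of what that extension could look like, assuming the HuggingFace BART layout (the encoder and decoder each own an embed_positions matrix with two reserved rows); the function name and the tiling strategy are illustrative, not the final AllenNLP change:

```python
import torch
from transformers import BartForConditionalGeneration


def extend_bart_positions(model: BartForConditionalGeneration, new_max_positions: int = 2048):
    """Copy BART's learned positional embeddings into a larger matrix.

    Assumes the HuggingFace layout, where embed_positions.weight has
    max_position_embeddings + 2 rows because positions 0 and 1 are reserved
    (the module adds a fixed offset of 2 to every position id).
    """
    for module in (model.model.encoder, model.model.decoder):
        embed = module.embed_positions
        old_weight = embed.weight.data
        old_rows, dim = old_weight.shape          # e.g. 1026 x 1024 for bart-large
        new_rows = new_max_positions + 2

        new_weight = old_weight.new_empty(new_rows, dim)
        new_weight[:old_rows] = old_weight        # keep the reserved rows and the first 1024 positions
        filled = old_rows
        while filled < new_rows:                  # tile copies of the original (non-reserved) rows
            span = min(old_rows - 2, new_rows - filled)
            new_weight[filled:filled + span] = old_weight[2:2 + span]
            filled += span

        embed.weight = torch.nn.Parameter(new_weight)
        embed.num_embeddings = new_rows

    model.config.max_position_embeddings = new_max_positions
    return model
```

Whether to tile copies of the old embeddings (as in the loop above) or to randomly initialize the new rows is exactly the open question here, so this is only one of the options being discussed.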
@dirkgr So I managed to extend the positional embedding for BART by looking at how Longformer did it. Now I am trying to adjust the tokenizer's max-length settings as well. I am simply changing those params after the tokenizer is actually initialized, but I don't know if this is right, since I still get warnings that the input is longer than the model's maximum sequence length.
Is this just about the warning? I think the tokenizers will tokenize just fine, but they want to print that warning. If you know of a way to silence the warning, I'd love to put it in.
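If it really is just the length warning, one possible workaround looks roughly like this; a sketch assuming the HuggingFace tokenizer's model_max_length attribute and the transformers logging helpers, not the final AllenNLP fix:

```python
from transformers import AutoTokenizer
from transformers.utils import logging as hf_logging

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")

# Raise the tokenizer's advertised limit so it matches the extended positional embeddings.
tokenizer.model_max_length = 2048

# Blunt fallback: lower the transformers log level so the remaining length warning
# is not printed (note this silences the library's other warnings as well).
hf_logging.set_verbosity_error()
```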
This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇
@niansong1996, any updates?
@dirkgr Sorry for the delay. It seems to me that the model is now accepting 2048 tokens after the positional embedding extension, despite the warnings. However, I still cannot match AllenNLP's BART to the fairseq version (on my task, there is a ~2 point difference in ROUGE-2 and ROUGE-L, which is quite large). I am a little bit tied up with some paper deadlines, but I will make a PR for the positional embedding early next week, since it should be useful for other purposes as well.
@niansong1996 Check out the changes I made in #5207 and #5208. I was able to match the BART paper's scores using these hyperparameters for min/max sequence length and the length penalties. I actually didn't need to implement the trigram blocking 🤷‍♂️
Also, to reproduce the ROUGE-L result, you actually need to sentence-split the summaries before running ROUGE.
@danieldeutsch This is so cool! Thanks a lot! About ROUGE-L: currently, the AllenNLP version directly takes the token ids in tensor form (see here). Are you using other packages for ROUGE computation, or do you have a workaround?
The AllenNLP ROUGE is fine for training, but you should really use the original package for calculating the final scores on the test set. I think the AllenNLP version operates on the token ids, which could be problematic if your model uses subtokens for its vocabulary. ROUGE is also typically calculated using a Porter stemmer on the tokens, which also isn't done here. Shameless plug: I use a library that I developed to calculate ROUGE (https://github.com/danieldeutsch/sacrerouge). It has a wrapper around the original perl code. See here
Is this something we could adopt as a metric in AllenNLP? The metric API would require that we pass in two strings (or two lists of strings), one with the actual results, and one with the target results. Inside the metric we'd have to tokenize, porter-stem, and calculate ROUGE. "Wrapper around perl" sounds a bit scary to me though. That's going to generate a lot of questions on Stack Overflow and GitHub from people whose perl installation is broken. Is there a faithful reimplementation in Python that we could use?
This one is popular: https://github.com/google-research/google-research/tree/master/rouge. It's included in this pip package https://pypi.org/project/rouge-score/ and used by Huggingface: https://github.com/huggingface/datasets/blob/67574a8d74796bc065a8b9b49ec02f7b1200c172/metrics/rouge/rouge.py
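For reference, a minimal sketch of calling the rouge-score package directly; the example strings are made up, use_stemmer turns on the Porter stemmer, and rougeLsum is the summary-level ROUGE-L that treats newlines as sentence boundaries (which is why sentence-splitting the summaries matters):

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# use_stemmer=True applies the Porter stemmer, as the original perl script does.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

# rougeLsum expects one sentence per line, so the summaries are newline-split here.
reference = "The cat sat on the mat.\nIt was happy."
prediction = "A cat was sitting on the mat.\nIt seemed happy."

scores = scorer.score(reference, prediction)
print({name: round(score.fmeasure, 4) for name, score in scores.items()})
```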
Yes, I can confirm that the AllenNLP version of ROUGE somewhat differs from the original perl version; I observe a ~1 point difference in ROUGE-1 and ROUGE-L on my task. But I wouldn't recommend using a version that wraps the perl script, at least not for real-time evaluation, and I agree with @danieldeutsch on that. Should we open another issue to talk about the ROUGE score, though? On this issue, I am going to test #5207 and #5208 this weekend and see if they fix the performance mismatch problem.
There is no point in worrying about differences in the implementation of ROUGE while one method uses stemmed tokens and another uses word pieces. Let's get a version running that also uses stemmed tokens, and then see if the numbers are still different.
Oh, I see that https://pypi.org/project/rouge-score/ does all of it: tokenization, stemming, the lot. Then it should be easy to use it in an AllenNLP metric.
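A rough sketch of what such a metric could look like, assuming the rouge-score package and AllenNLP's Metric API; the class name, registered name, and simple averaging are made up for illustration:

```python
from typing import Dict, List

from allennlp.training.metrics import Metric
from rouge_score import rouge_scorer


@Metric.register("python_rouge")  # hypothetical name
class PythonRouge(Metric):
    """Accumulates ROUGE F1 over decoded prediction/reference string pairs."""

    def __init__(self) -> None:
        self._scorer = rouge_scorer.RougeScorer(
            ["rouge1", "rouge2", "rougeLsum"], use_stemmer=True
        )
        self.reset()

    def __call__(self, predictions: List[str], references: List[str]) -> None:
        # Both inputs are plain strings, so tokenization and stemming happen inside rouge-score.
        for prediction, reference in zip(predictions, references):
            for name, score in self._scorer.score(reference, prediction).items():
                self._totals[name] = self._totals.get(name, 0.0) + score.fmeasure
            self._count += 1

    def get_metric(self, reset: bool = False) -> Dict[str, float]:
        metrics = {name: total / max(self._count, 1) for name, total in self._totals.items()}
        if reset:
            self.reset()
        return metrics

    def reset(self) -> None:
        self._totals: Dict[str, float] = {}
        self._count = 0
```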
@danieldeutsch I have just tried the latest AllenNLP version with #5207 and #5208 merged. I am using …

Does this look similar to the setting you were running? Also, you mentioned that a sentence split is needed to run ROUGE-L, but the … Currently, I am simply measuring the ROUGE scores by comparing … I also noticed that there is a …
Yes, you need to use the length penalty of 2.0 via the LengthNormalizedSequenceLogProbabilityScorer. I think that should make the difference. I have never used the Google ROUGE package myself, but the note in its README explains how to reproduce the perl ROUGE-L.
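For concreteness, a hedged sketch of wiring that scorer into AllenNLP's beam search from Python (the jsonnet config would express the same thing); the numeric values below are placeholders rather than the CNN/DM settings:

```python
from allennlp.nn.beam_search import (
    BeamSearch,
    LengthNormalizedSequenceLogProbabilityScorer,
)

# Placeholder values: min_steps, max_steps, and beam_size should come from the
# actual CNN/DM config; end_index is assumed to be BART's eos token id.
beam_search = BeamSearch(
    end_index=2,
    max_steps=140,
    beam_size=4,
    min_steps=55,
    final_sequence_scorer=LengthNormalizedSequenceLogProbabilityScorer(length_penalty=2.0),
)
```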
@niansong1996, is your implementation of ROUGE using the Google rouge-score package?
Yes, this is using https://pypi.org/project/rouge-score/. So what I was doing in my code is to have a …
Checklist

- I have verified that the issue exists on the main branch of AllenNLP.
- I have included in the Environment section below the output of pip freeze.

Description

I was using allennlp-models==2.3.0 and training with the script allennlp-models/training_scripts/generation/bart_cnn_dm.jsonnet, and I got the following performance output:

However, according to the implementation from fairseq (and what is reported in the paper), the ROUGE-1/2/L scores should be 44.16/21.28/40.90, so the ROUGE-L score is 9.7 points below the reference value, while the ROUGE-1/2 scores show some improvement. I am wondering whether this is expected and why the ROUGE-L score is significantly worse. Is this an issue with how BART models are implemented in allennlp-models, or with how the ROUGE-L score is computed in allennlp.training.metrics?

Related issues or possible duplicates

Environment

OS: Linux
Python version: 3.8.8
Output of pip freeze: (I've created an environment with only allennlp==2.3.0 and no other pkgs installed)

Steps to reproduce

1. Create an environment with python==3.8 and allennlp==2.3.0
2. Clone the allennlp-models repo
3. Run allennlp train training_scripts/generation/bart_cnn_dm.jsonnet -s tmp-save-dir --include-package allennlp_models
Example source: