
Implement a ROUGE metric that faithfully reproduces the official metric written in perl. #5153

Open
10 tasks done
niansong1996 opened this issue Apr 24, 2021 · 26 comments


@niansong1996

niansong1996 commented Apr 24, 2021

Checklist

  • I have verified that the issue exists against the main branch of AllenNLP.
  • I have read the relevant section in the contribution guide on reporting bugs.
  • I have checked the issues list for similar or identical bug reports.
  • I have checked the pull requests list for existing proposed fixes.
  • I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the main branch.
  • I have included in the "Description" section below a traceback from any exceptions related to this bug.
  • I have included in the "Related issues or possible duplicates" section below all related issues and possible duplicate issues (If there are none, check this box anyway).
  • I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
  • I have included in the "Environment" section below the output of pip freeze.
  • I have included in the "Steps to reproduce" section below a minimally reproducible example.

Description

I was using allennlp-models==2.3.0 and training with the script allennlp-models/training_scripts/generation/bart_cnn_dm.jsonnet, and I got the following performance output:

{
  "best_epoch": 0,
  "peak_worker_0_memory_MB": 77679.125,
  "peak_gpu_0_memory_MB": 19497.5625,
  "training_duration": "1 day, 9:30:58.908029",
  "training_start_epoch": 0,
  "training_epochs": 2,
  "epoch": 2,
  "training_loss": 2.7070548462225283,
  "training_worker_0_memory_MB": 77679.125,
  "training_gpu_0_memory_MB": 19497.5625,
  "validation_ROUGE-1_R": 0.4871779537805129,
  "validation_ROUGE-2_R": 0.26309701739882685,
  "validation_ROUGE-1_P": 0.3966578995429105,
  "validation_ROUGE-2_P": 0.21290430088784706,
  "validation_ROUGE-1_F1": 0.4283963905120849,
  "validation_ROUGE-2_F1": 0.23045514136364303,
  "validation_ROUGE-L": 0.3206116030616199,
  "validation_BLEU": 0.18484394329002954,
  "validation_loss": 0.0,
  "best_validation_ROUGE-1_R": 0.47620558012437575,
  "best_validation_ROUGE-2_R": 0.25229075181929206,
  "best_validation_ROUGE-1_P": 0.38737318484205874,
  "best_validation_ROUGE-2_P": 0.20447094269175353,
  "best_validation_ROUGE-1_F1": 0.41917399613391276,
  "best_validation_ROUGE-2_F1": 0.22154245158723443,
  "best_validation_ROUGE-L": 0.31225680111602044,
  "best_validation_BLEU": 0.17805890029860716,
  "best_validation_loss": 0.0
}

However, according to the fairseq implementation (and what is reported in the paper), the ROUGE-1/2/L scores should be 44.16/21.28/40.90, so the ROUGE-L score is roughly 9.7 points below the reference value, while the ROUGE-1/2 scores show some improvement.

I am wondering whether this is expected and why the ROUGE-L score is significantly worse. Is this an issue with how BART models are implemented in allennlp-models, or with how the ROUGE-L score is computed in allennlp.training.metrics?

Related issues or possible duplicates

  • None

Environment

OS: Linux

Python version: 3.8.8

Output of pip freeze:

(I've created an environment with only allennlp==2.3.0 and no other packages installed)

allennlp==2.3.0
attrs==20.3.0
blis==0.7.4
boto3==1.17.53
botocore==1.20.53
catalogue==2.0.3
certifi==2020.12.5
chardet==4.0.0
click==7.1.2
configparser==5.0.2
conllu==4.4
cymem==2.0.5
docker-pycreds==0.4.0
filelock==3.0.12
ftfy==6.0.1
gitdb==4.0.7
GitPython==3.1.14
h5py==3.2.1
idna==2.10
iniconfig==1.1.1
Jinja2==2.11.3
jmespath==0.10.0
joblib==1.0.1
jsonnet==0.17.0
lmdb==1.2.0
MarkupSafe==1.1.1
more-itertools==8.7.0
murmurhash==1.0.5
nltk==3.6.1
numpy==1.20.2
overrides==3.1.0
packaging==20.9
pathtools==0.1.2
pathy==0.4.0
Pillow==8.2.0
pluggy==0.13.1
preshed==3.0.5
promise==2.3
protobuf==3.15.8
psutil==5.8.0
py==1.10.0
pydantic==1.7.3
pyparsing==2.4.7
py-rouge==1.1
pytest==6.2.3
python-dateutil==2.8.1
PyYAML==5.4.1
regex==2021.4.4
requests==2.25.1
s3transfer==0.3.7
sacremoses==0.0.45
scikit-learn==0.24.1
scipy==1.6.2
sentencepiece==0.1.95
sentry-sdk==1.0.0
shortuuid==1.0.1
six==1.15.0
smart-open==3.0.0
smmap==4.0.0
spacy==3.0.5
spacy-legacy==3.0.2
srsly==2.4.1
subprocess32==3.5.4
tensorboardX==2.2
thinc==8.0.2
threadpoolctl==2.1.0
tokenizers==0.10.2
toml==0.10.2
torch==1.8.1+cu111
torchvision==0.9.1
tqdm==4.60.0
transformers==4.5.1
typer==0.3.2
typing-extensions==3.7.4.3
urllib3==1.26.4
wandb==0.10.26
wasabi==0.8.2
wcwidth==0.2.5
word2number==1.1

Steps to reproduce

  1. Create an environment with python==3.8 and allennlp==2.3.0
  2. Clone the allennlp-models repo
  3. Run allennlp train training_scripts/generation/bart_cnn_dm.jsonnet -s tmp-save-dir --include-package allennlp_models
Example source:

@niansong1996 niansong1996 added bug Models Issues related to the allennlp-models repo labels Apr 24, 2021
@niansong1996 niansong1996 changed the title BART model performance not matching the fairseq implementation BART model performance not matching the paper or fairseq implementation Apr 24, 2021
@niansong1996
Author

I have run some extra experiments to locate the potential issue. The caveat is that the following experiments were not conducted on CNN/DM, but on another, smaller dataset.

  1. The ROUGE evaluation is unlikely to be the issue: I evaluated the same output with AllenNLP's ROUGE and with other ROUGE packages, including the one wrapping the original Perl implementation, and the difference was minimal;
  2. I've noticed a few differences in the hyperparameters; for example, grad_norm seems to be set to 0.1 in fairseq, while it is 1.0 in AllenNLP's training config. However, the performance didn't change much when I changed it to 0.1;
  3. There is a warm-up phase in the fairseq implementation, but I haven't found a clear description of how it is done.

Now my main focus is on the fact that fairseq's BART allows 2048 input tokens, while I believe the Hugging Face version of BART used by AllenNLP only allows 1024; this could make a big difference.

@ArjunSubramonian
Contributor

Have you discovered anything else with your extra experiments? Assuming that the validation loss converges fine and doesn't immediately spike due to overfitting, warm-up shouldn't make a huge difference. The difference in number of allowed tokens could have an effect.

@niansong1996
Author

@ArjunSubramonian I don't have more updates on the experiments than that. However, I've found that people usually extend BART's positional embeddings from 1024 to 2048 to handle longer inputs (which I believe is what was done for CNN/DM). I haven't found the corresponding code in fairseq, but I've found this code in the Longformer repo. I hope we can do the same for BART in AllenNLP.

Also, just out of curiosity, when do we typically need warmup steps?

@ArjunSubramonian
Contributor

@niansong1996 would you be interested in implementing the sparse attention mechanism used by Longformer? I think it would be a great addition to our library.

There are a variety of reasons that people use LR warmup. You can read some of the responses here. People tend to use warmup when working with models that have attention mechanisms (to prevent your attention weights from overfitting or collapsing to just a few relations early in the training process) and when using adaptive optimizers (to improve gradient statistics).

@niansong1996
Author

I am not particularly familiar with Longformer and the sparse attention mechanism it uses. However, I can try to implement the positional embedding extension as an option for the BART model in AllenNLP, if such a thing hasn't been done yet. Can you or someone else confirm that this is needed?

Thanks a lot for the pointer and explanation for LR warmup!

@ArjunSubramonian
Contributor

@dirkgr thoughts on implementing the positional embedding extension as an option for the BART model in AllenNLP?

@dirkgr
Member

dirkgr commented Apr 27, 2021

The extended positional embeddings would be great. I assume it would be relatively self-contained, just resizing the embedding matrix during model initialization?

@niansong1996
Author

@dirkgr Basically, yes, but it's a little trickier than that: some say that to extend from 1024 to 2048 we should copy the first 1024 embeddings and initialize the rest, while others found that copying them twice works better. Also, BART has positions 0 and 1 reserved, so we need to account for that as well.

I will experiment a bit with different implementations and see what works.
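
To make the idea concrete, here is a rough sketch of the copy-based extension (not the eventual PR code; it assumes the Hugging Face BART module layout, i.e. model.model.encoder.embed_positions, and the offset of 2 reserved positions):

import torch
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
NEW_MAX_POS = 2048
OFFSET = 2  # BART reserves positions 0 and 1

def extend_position_embeddings(embed: torch.nn.Embedding) -> None:
    # embed.weight has shape (1024 + OFFSET, hidden_dim) for bart-large.
    old_weight = embed.weight.data
    learned = old_weight[OFFSET:]                      # the 1024 learned position vectors
    new_weight = old_weight.new_empty(NEW_MAX_POS + OFFSET, old_weight.size(1))
    new_weight[:OFFSET] = old_weight[:OFFSET]          # keep the reserved rows
    k = OFFSET
    while k < new_weight.size(0):                      # tile the learned vectors to fill the rest
        n = min(learned.size(0), new_weight.size(0) - k)
        new_weight[k : k + n] = learned[:n]
        k += n
    embed.weight = torch.nn.Parameter(new_weight)

# Extend both encoder and decoder; the decoder could stay at 1024 if targets are short.
extend_position_embeddings(model.model.encoder.embed_positions)
extend_position_embeddings(model.model.decoder.embed_positions)
model.config.max_position_embeddings = NEW_MAX_POS

The other variant mentioned above would randomly initialize the tail rows instead of tiling the learned vectors.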

@niansong1996
Author

@dirkgr So I managed to extend the positional embeddings for BART by looking at how Longformer did it. Now I am working on the tokenizer, and I can't seem to find a way to configure it in the .jsonnet file so that max_len and model_max_len are larger than 1024; I think they will always be reset to whatever is configured here.

So I am simply changing those params after the tokenizer is actually initialized, but I don't know if this is right, since I still get warnings like Token indices sequence length is longer than the specified maximum sequence length for this model (1780 > 1024). Running this sequence through the model will result in indexing errors. Or maybe none of this matters, since the tokenizer doesn't have an internal cutoff mechanism?
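
For reference, this is roughly what I'm doing (a minimal sketch; the .tokenizer attribute on AllenNLP's PretrainedTransformerTokenizer and the transformers logging call are assumptions about the versions in use):

from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from transformers import logging as hf_logging

source_tokenizer = PretrainedTransformerTokenizer("facebook/bart-large")

# model_max_length is what triggers the length warning inside transformers;
# raising it does not truncate or pad anything by itself.
source_tokenizer.tokenizer.model_max_length = 2048

# Alternatively, silence transformers warnings globally.
hf_logging.set_verbosity_error()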

@dirkgr
Member

dirkgr commented May 3, 2021

Is this just about the warning? I think the tokenizers will tokenize just fine, but they want to print that warning. If you know of a way to silence the warning, I'd love to put it in.

@github-actions

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇

@dirkgr
Member

dirkgr commented May 12, 2021

@niansong1996, any updates?

@dirkgr dirkgr reopened this May 12, 2021
@niansong1996
Author

@dirkgr Sorry for the delay. It seems that the model now accepts 2048 tokens after the positional embedding extension, despite the warnings. However, I still cannot match AllenNLP's BART to the fairseq version (on my task, there is a ~2-point ROUGE-2 and ROUGE-L difference, which is quite large).

I am a little bit tied up with some paper deadlines, but I will make a PR about the positional embedding early next week since it should be useful for other purposes as well.

@danieldeutsch
Contributor

@niansong1996 Check out the changes I made in #5207 and #5208. I was able to match the BART paper's scores using these hyperparameters for min/max sequence length and the length penalties. I actually didn't need to implement the trigram blocking 🤷‍♂️

@danieldeutsch
Contributor

Also, to reproduce the ROUGE-L result, you actually need to sentence-split the summaries before running ROUGE.

@niansong1996
Author

@danieldeutsch This is so cool! Thanks a lot!

About ROUGE-L: currently, the AllenNLP version directly takes the token ids in tensor form (see here). Are you using other packages for ROUGE computation, or do you have a workaround?

@danieldeutsch
Contributor

The AllenNLP ROUGE is fine for training, but you should really use the original package for calculating the final scores on the test set. I think the AllenNLP version operates on the token ids, which could be problematic if your model uses subtokens for its vocabulary. ROUGE is also typically calculated using a Porter Stemmer on the tokens, which also isn't done here.

Shameless plug: I use a library that I developed to calculate ROUGE (https://github.com/danieldeutsch/sacrerouge). It has a wrapper around the original Perl code. See here

@dirkgr
Member

dirkgr commented May 26, 2021

Is this something we could adopt as a metric in AllenNLP? The metric API would require that we pass in two strings (or two lists of strings), one with the actual results, and one with the target results. Inside the metric we'd have to tokenize, porter-stem, and calculate ROUGE.

"Wrapper around perl" sounds a bit scary to me though. That's going to generate a lot of questions on Stack Overflow and GitHub from people whose perl installation is broken. Is there a faithful reimplementation in Python that we could use?

@niansong1996
Author

niansong1996 commented May 28, 2021

Yes, I can confirm that the AllenNLP version of ROUGE somewhat differs from the original Perl version; I observe a ~1-point difference in ROUGE-1 and ROUGE-L on my task. But I wouldn't recommend using a version with a wrapper around Perl, at least not for real-time evaluation. And I agree with @danieldeutsch on the rouge-score package, which yields only a ~0.1 difference from the Perl version in my experiments.

Should we open another issue to talk about ROUGE score though?

On this issue, I am going to test #5207 and #5208 this weekend and see whether they fix the performance mismatch.

@dirkgr
Member

dirkgr commented May 28, 2021

There is no point in worrying about differences in the implementation of ROUGE while one method uses stemmed tokens, and another uses word pieces. Let's get a version running that also uses stemmed tokens, and then see if the numbers are still different.

@dirkgr
Member

dirkgr commented May 28, 2021

Oh, I see that https://pypi.org/project/rouge-score/ does all of it, tokenization, stemming, the lot. Then it should be easy to use it in an AllenNLP metric.

@dirkgr dirkgr removed Models Issues related to the allennlp-models repo bug labels May 28, 2021
@dirkgr dirkgr changed the title BART model performance not matching the paper or fairseq implementation Implement a ROUGE metric that faithfully reproduces the official metric written in perl. May 28, 2021
@niansong1996
Author

@danieldeutsch I have just tried to use the latest AllenNLP version with #5207 and #5208 merged. I am using rouge-score as the package for computing ROUGE scores, but I still can't get the numbers reported by fairseq. My current ROUGE-1/2/L scores are 43.2/20.6/30.4. My configuration is as follows:

local model_name = "facebook/bart-large";
local data_base_url = "https://storage.googleapis.com/allennlp-public-data/cnndm-combined-data-2020.07.13.tar.gz";
local train_data = data_base_url + "!cnndm-combined-data-2020.07.13/url_lists/all_train.txt";
local dev_data = data_base_url + "!cnndm-combined-data-2020.07.13/url_lists/all_val.txt";

{
    "train_data_path": train_data,
    "validation_data_path": dev_data,
    "dataset_reader": {
        "type": "cnn_dm",
        "source_tokenizer": {
            "type": "pretrained_transformer",
            "model_name": model_name
        },
        "source_token_indexers": {
            "tokens": {
                "type": "pretrained_transformer",
                "model_name": model_name,
                "namespace": "tokens"
            }
        },
        "source_max_tokens": 1022,
        "target_max_tokens": 140,
        // "max_instances": 1000 // DEBUG setting
    },
    "model": {
        "type": "bart_new_rouge",
        "model_name": model_name,
        "max_decoding_steps": 140,
        "min_decoding_steps": 55,
        "beam_size": 4,
    },
    "data_loader": {
        "batch_size": 2,
        "shuffle": true
    },
    "trainer": {
        "num_epochs": 3,
        "optimizer": {
            "type": "huggingface_adamw",
            "lr": 3e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "correct_bias": true
        },
        "learning_rate_scheduler": {
            "type": "polynomial_decay",
        },
        "grad_norm": 1.0,
        "cuda_device": 0,
        "run_confidence_checks": false,
        "validation_metric": "+NEW_ROUGE-L"
    }
}

Does this look similar to the setting you were running? Also, you mentioned that sentence splitting is needed to run ROUGE-L, but the rouge-score package only seems to provide the API scorer.score(tgt: str, src: str), so I am wondering what the correct way of running it is.

Currently, I am simply measuring the ROUGE scores by comparing outputs["predicted_text"] and self._indexer._tokenizer.batch_decode(target_ids, skip_special_tokens=True) at the end of the forward function of the BART model.

I also noticed that there is a lenpen in the fairseq version here. Did you also implement this?

@danieldeutsch
Contributor

Yes, you need to use a length penalty of 2.0 via the LengthNormalizedSequenceLogProbabilityScorer. I think that should make up the difference.
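
If it helps, this is roughly how it plugs in (a sketch against the post-#5208 beam search API; the exact constructor arguments are my assumptions):

from allennlp.nn.beam_search import BeamSearch, LengthNormalizedSequenceLogProbabilityScorer

eos_token_id = 2  # BART's </s> token id

# Beam search with a length penalty of 2.0, mirroring fairseq's --lenpen 2.0.
beam_search = BeamSearch(
    end_index=eos_token_id,
    max_steps=140,
    beam_size=4,
    final_sequence_scorer=LengthNormalizedSequenceLogProbabilityScorer(length_penalty=2.0),
)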

I have never used the Google ROUGE package myself, but the note in the Readme explains how to reproduce the Perl ROUGE-L.
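
For anyone following along, my reading of that README note is roughly the following (a sketch; the NLTK sentence splitter is just an example and needs the punkt data):

import nltk
from rouge_score import rouge_scorer

# "rougeLsum" treats newlines as sentence boundaries, which is what the Perl
# script's summary-level ROUGE-L expects.
scorer = rouge_scorer.RougeScorer(["rougeLsum"], use_stemmer=True)

def rouge_l_sum(target_summary: str, predicted_summary: str) -> float:
    # Put each sentence on its own line before scoring.
    target = "\n".join(nltk.sent_tokenize(target_summary))
    prediction = "\n".join(nltk.sent_tokenize(predicted_summary))
    return scorer.score(target, prediction)["rougeLsum"].fmeasure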

@dirkgr
Member

dirkgr commented Jun 7, 2021

@niansong1996, does your implementation of NEW_ROUGE use the Google package? Can you point me to that code? I want to see how hard it would be to make it an official AllenNLP metric.

@niansong1996
Author

@niansong1996, does your implementation of NEW_ROUGE use the Google package? Can you point me to that code? I want to see how hard it would be to make it an official AllenNLP metric.

Yes, it uses https://pypi.org/project/rouge-score/. What I was doing in my code was to use an Average metric as a wrapper and call the rouge-score package in the forward function. I think it should be super easy to write an AllenNLP metric out of it.
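
Roughly what I had in mind, as a sketch (names like RougeScoreWrapper are mine for illustration, not actual AllenNLP code):

from allennlp.training.metrics import Average
from rouge_score import rouge_scorer

class RougeScoreWrapper:
    """Tracks average ROUGE F1 by wrapping rouge-score in Average metrics."""

    def __init__(self):
        self._rouge_types = ["rouge1", "rouge2", "rougeLsum"]
        self._scorer = rouge_scorer.RougeScorer(self._rouge_types, use_stemmer=True)
        self._averages = {name: Average() for name in self._rouge_types}

    def __call__(self, predictions, targets):
        # predictions and targets are lists of decoded strings; for rougeLsum,
        # sentences within each string should be separated by newlines.
        for pred, tgt in zip(predictions, targets):
            scores = self._scorer.score(tgt, pred)
            for name in self._rouge_types:
                self._averages[name](scores[name].fmeasure)

    def get_metrics(self, reset: bool = False):
        return {f"NEW_ROUGE-{name}": avg.get_metric(reset) for name, avg in self._averages.items()}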
