Implement a ROUGE metric that faithfully reproduces the official metric written in perl. #5153
So I have run some extra experiments to locate the potential issue; the caveat is that the following experiments were not conducted on CNN/DM, but on another, smaller dataset.
Now my main focus is on the fact that fairseq's BART allows 2048 tokens of input, but I believe the huggingface version of BART used by AllenNLP only allows 1024; this could make a big difference.
Have you discovered anything else with your extra experiments? Assuming that the validation loss converges fine and doesn't immediately spike due to overfitting, warm-up shouldn't make a huge difference. The difference in the number of allowed tokens could have an effect.
@ArjunSubramonian I haven't got more updates than that on the experiments. But I've found that people usually extend the positional embedding of BART from 1024 to 2048 to adapt to longer input (which I believe they did for CNN/DM). I haven't found the corresponding code in fairseq, but I've found this code in the longformer repo. Hope we can do the same for the BART in AllenNLP. Also, just out of curiosity, when do we typically need warmup steps?
@niansong1996 would you be interested in implementing the sparse attention mechanism used by Longformer? I think it would be a great addition to our library. There are a variety of reasons that people use LR warmup. You can read some of the responses here. People tend to use warmup when working with models that have attention mechanisms (to prevent your attention weights from overfitting or collapsing to just a few relations early in the training process) and when using adaptive optimizers (to improve gradient statistics).
I am not particularly familiar with Longformer and the sparse attention mechanism it uses. However, I can try to implement the positional embedding extension as an option for the BART model in AllenNLP, if such a thing hasn't been done yet. Can you or someone else confirm that this is needed? Thanks a lot for the pointer and explanation for LR warmup!
@dirkgr thoughts on implementing the positional embedding extension as an option for the BART model in AllenNLP?
The extended positional embeddings would be great. I assume it would be relatively self-contained, just resizing the embedding matrix during model initialization?
@dirkgr Basically, yes. But it's a little trickier than that: some say that if we want to extend from 1024 to 2048, we need to copy the first 1024 embeddings and initialize the rest, while others have found that copying the original embeddings twice works better. Also, BART has positions 0 and 1 reserved, so we need to account for that as well. I will experiment a bit with different implementations and see what works.
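As a reference point, here is a minimal sketch of what that extension could look like, assuming the HuggingFace BART layout (the encoder and decoder each own an embed_positions matrix with two reserved rows); the function name and the tiling strategy are illustrative, not the final AllenNLP change:

```python
import torch
from transformers import BartForConditionalGeneration


def extend_bart_positions(model: BartForConditionalGeneration, new_max_positions: int = 2048):
    """Copy BART's learned positional embeddings into a larger matrix.

    Assumes the HuggingFace layout, where embed_positions.weight has
    max_position_embeddings + 2 rows because positions 0 and 1 are reserved
    (the module adds a fixed offset of 2 to every position id).
    """
    for module in (model.model.encoder, model.model.decoder):
        embed = module.embed_positions
        old_weight = embed.weight.data
        old_rows, dim = old_weight.shape          # e.g. 1026 x 1024 for bart-large
        new_rows = new_max_positions + 2

        new_weight = old_weight.new_empty(new_rows, dim)
        new_weight[:old_rows] = old_weight        # keep the reserved rows and the first 1024 positions
        filled = old_rows
        while filled < new_rows:                  # tile copies of the original (non-reserved) rows
            span = min(old_rows - 2, new_rows - filled)
            new_weight[filled:filled + span] = old_weight[2:2 + span]
            filled += span

        embed.weight = torch.nn.Parameter(new_weight)
        embed.num_embeddings = new_rows

    model.config.max_position_embeddings = new_max_positions
    return model
```

Whether to tile copies of the old embeddings (as in the loop above) or to randomly initialize the new rows is exactly the open question here, so this is only one of the options being discussed.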
@dirkgr So I managed to extend the positional embedding for BART by looking at how Longformer did it. Now I am trying to adjust the tokenizer's max-length settings as well. I am simply changing those params after the tokenizer is actually initialized, but I don't know if this is right, since I still get warnings that the input is longer than the model's maximum sequence length.
Is this just about the warning? I think the tokenizers will tokenize just fine, but they want to print that warning. If you know of a way to silence the warning, I'd love to put it in.
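If it really is just the length warning, one possible workaround looks roughly like this; a sketch assuming the HuggingFace tokenizer's model_max_length attribute and the transformers logging helpers, not the final AllenNLP fix:

```python
from transformers import AutoTokenizer
from transformers.utils import logging as hf_logging

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")

# Raise the tokenizer's advertised limit so it matches the extended positional embeddings.
tokenizer.model_max_length = 2048

# Blunt fallback: lower the transformers log level so the remaining length warning
# is not printed (note this silences the library's other warnings as well).
hf_logging.set_verbosity_error()
```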
This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇
@niansong1996, any updates?
@dirkgr Sorry for the delay. It seems to me that the model is now accepting 2048 tokens after the positional embedding extension, despite the warnings. However, I still cannot match AllenNLP's BART to the fairseq version (on my task, there is a ~2 point difference in ROUGE-2 and ROUGE-L, which is quite large). I am a little bit tied up with some paper deadlines, but I will make a PR for the positional embedding early next week, since it should be useful for other purposes as well.
@niansong1996 Check out the changes I made in #5207 and #5208. I was able to match the BART paper's scores using these hyperparameters for min/max sequence length and the length penalties. I actually didn't need to implement the trigram blocking 🤷‍♂️
Also, to reproduce the ROUGE-L result, you actually need to sentence-split the summaries before running ROUGE.
@danieldeutsch This is so cool! Thanks a lot! About ROUGE-L: currently, the AllenNLP version directly takes the token ids in tensor form (see here). Are you using other packages for ROUGE computation, or do you have a workaround?
The AllenNLP ROUGE is fine for training, but you should really use the original package for calculating the final scores on the test set. I think the AllenNLP version operates on the token ids, which could be problematic if your model uses subtokens for its vocabulary. ROUGE is also typically calculated using a Porter stemmer on the tokens, which also isn't done here. Shameless plug: I use a library that I developed to calculate ROUGE (https://github.com/danieldeutsch/sacrerouge). It has a wrapper around the original perl code. See here
Is this something we could adopt as a metric in AllenNLP? The metric API would require that we pass in two strings (or two lists of strings), one with the actual results, and one with the target results. Inside the metric we'd have to tokenize, porter-stem, and calculate ROUGE. "Wrapper around perl" sounds a bit scary to me though. That's going to generate a lot of questions on Stack Overflow and GitHub from people whose perl installation is broken. Is there a faithful reimplementation in Python that we could use?
This one is popular: https://github.com/google-research/google-research/tree/master/rouge. It's included in this pip package https://pypi.org/project/rouge-score/ and used by Huggingface: https://github.com/huggingface/datasets/blob/67574a8d74796bc065a8b9b49ec02f7b1200c172/metrics/rouge/rouge.py
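For reference, a minimal sketch of calling the rouge-score package directly; the example strings are made up, use_stemmer turns on the Porter stemmer, and rougeLsum is the summary-level ROUGE-L that treats newlines as sentence boundaries (which is why sentence-splitting the summaries matters):

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# use_stemmer=True applies the Porter stemmer, as the original perl script does.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

# rougeLsum expects one sentence per line, so the summaries are newline-split here.
reference = "The cat sat on the mat.\nIt was happy."
prediction = "A cat was sitting on the mat.\nIt seemed happy."

scores = scorer.score(reference, prediction)
print({name: round(score.fmeasure, 4) for name, score in scores.items()})
```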
Yes, I can confirm that the AllenNLP version of ROUGE somewhat differs from the original perl version; I observe a ~1 point difference in ROUGE-1 and ROUGE-L on my task. But I wouldn't recommend using a version that wraps the perl script, at least not for real-time evaluation, and I agree with @danieldeutsch on that. Should we open another issue to talk about the ROUGE score, though? On this issue, I am going to test #5207 and #5208 this weekend and see if they fix the performance mismatch problem.
There is no point in worrying about differences in the implementation of ROUGE while one method uses stemmed tokens and another uses word pieces. Let's get a version running that also uses stemmed tokens, and then see if the numbers are still different.
Oh, I see that https://pypi.org/project/rouge-score/ does all of it: tokenization, stemming, the lot. Then it should be easy to use it in an AllenNLP metric.
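A rough sketch of what such a metric could look like, assuming the rouge-score package and AllenNLP's Metric API; the class name, registered name, and simple averaging are made up for illustration:

```python
from typing import Dict, List

from allennlp.training.metrics import Metric
from rouge_score import rouge_scorer


@Metric.register("python_rouge")  # hypothetical name
class PythonRouge(Metric):
    """Accumulates ROUGE F1 over decoded prediction/reference string pairs."""

    def __init__(self) -> None:
        self._scorer = rouge_scorer.RougeScorer(
            ["rouge1", "rouge2", "rougeLsum"], use_stemmer=True
        )
        self.reset()

    def __call__(self, predictions: List[str], references: List[str]) -> None:
        # Both inputs are plain strings, so tokenization and stemming happen inside rouge-score.
        for prediction, reference in zip(predictions, references):
            for name, score in self._scorer.score(reference, prediction).items():
                self._totals[name] = self._totals.get(name, 0.0) + score.fmeasure
            self._count += 1

    def get_metric(self, reset: bool = False) -> Dict[str, float]:
        metrics = {name: total / max(self._count, 1) for name, total in self._totals.items()}
        if reset:
            self.reset()
        return metrics

    def reset(self) -> None:
        self._totals: Dict[str, float] = {}
        self._count = 0
```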
@danieldeutsch I have just tried the latest AllenNLP version with #5207 and #5208 merged. I am using …

Does this look similar to the setting you were running? Also, you mentioned that a sentence split is needed to run ROUGE-L, but the … Currently, I am simply measuring the ROUGE scores by comparing … I also noticed that there is a …
Yes, you need to use the length penalty of 2.0 via the LengthNormalizedSequenceLogProbabilityScorer. I think that should make the difference. I have never used the Google ROUGE package myself, but the note in its README explains how to reproduce the perl ROUGE-L.
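For concreteness, a hedged sketch of wiring that scorer into AllenNLP's beam search from Python (the jsonnet config would express the same thing); the numeric values below are placeholders rather than the CNN/DM settings:

```python
from allennlp.nn.beam_search import (
    BeamSearch,
    LengthNormalizedSequenceLogProbabilityScorer,
)

# Placeholder values: min_steps, max_steps, and beam_size should come from the
# actual CNN/DM config; end_index is assumed to be BART's eos token id.
beam_search = BeamSearch(
    end_index=2,
    max_steps=140,
    beam_size=4,
    min_steps=55,
    final_sequence_scorer=LengthNormalizedSequenceLogProbabilityScorer(length_penalty=2.0),
)
```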
@niansong1996, is your implementation of ROUGE using the Google rouge-score package?
Yes, this is using https://pypi.org/project/rouge-score/. So what I was doing in my code is to have a …
Checklist

- I have verified that the issue exists on the main branch of AllenNLP.
- I have included in the Environment section below the output of pip freeze.

Description

I was using allennlp-models==2.3.0 and training with the script allennlp-models/training_scripts/generation/bart_cnn_dm.jsonnet, and I got the following performance output:

However, according to the implementation from fairseq (and what is reported in the paper), the ROUGE-1/2/L scores should be 44.16/21.28/40.90, so the ROUGE-L score is 9.7 points below the reference value, while the ROUGE-1/2 scores show some improvement. I am wondering whether this is expected and why the ROUGE-L score is significantly worse. Is this an issue with how BART models are implemented in allennlp-models, or with how the ROUGE-L score is computed in allennlp.training.metrics?

Related issues or possible duplicates

Environment

OS: Linux
Python version: 3.8.8
Output of pip freeze: (I've created an environment with only allennlp==2.3.0 and no other pkgs installed)

Steps to reproduce

1. Create an environment with python==3.8 and allennlp==2.3.0
2. Clone the allennlp-models repo
3. Run allennlp train training_scripts/generation/bart_cnn_dm.jsonnet -s tmp-save-dir --include-package allennlp_models
Example source: