
[many questions] allennlp train config.json > python train.py & others #130

Closed
NicolasAG opened this issue Jul 7, 2020 · 7 comments


NicolasAG commented Jul 7, 2020

Hi,

(1) Update the guide to support the newest version

I'm going through the "Next Steps" chapter, section "Switching to pre-trained contextualizers".
First, the config file shown in the guide uses:

"iterator": {
        "type": "basic",
        "batch_size": 8
    }

instead of

"data_loader": {
        "batch_size": 8,
        "shuffle": true
    },

but if I try to use the config file as shown in the guide, I get an error saying that the key "data_loader" is required in the config file.

(2) bert-base-uncased not as good as the guide's baseline model?

Secondly, when I replace the tokenizer, token_indexers, embedder, and encoder with the BERT versions, as in the new config file proposed at https://guide.allennlp.org/next-steps#1, it looks like the model is not training: training accuracy remains at 0.50 after 5 epochs.
This is my config file:

local bert_model = "bert-base-uncased";
{
    "dataset_reader" : {
        "type": "classification-tsv",
        "tokenizer": {
            "type": "pretrained_transformer",
            "model_name": bert_model,
        },
        "token_indexers": {
            "bert": {
                "type": "pretrained_transformer",
                "model_name": bert_model,
            }
        },
        "max_tokens": 512
    },
    "train_data_path": "/allennlp/data/train.tsv",
    "validation_data_path": "/allennlp/data/dev.tsv",
    "model": {
        "type": "simple_classifier",
        "embedder": {
            "token_embedders": {
                "bert": {
                    "type": "pretrained_transformer",
                    "model_name": bert_model,
                    "train_parameters": true,
                }
            }
        },
        "encoder": {
            "type": "bert_pooler",
            "pretrained_model": bert_model,
            "requires_grad": false,
        }
    },
    "data_loader": {
        "batch_size": 8,
        "shuffle": true
    },
    "trainer": {
        "optimizer": {
            "type": "huggingface_adamw",
            "lr": 1.0e-5
        },
        "num_epochs": 5,
        "cuda_device": 0,
    }
}

and these are the last few lines printed at the end of allennlp train ...:

2020-07-07 20:37:52,415 - INFO - allennlp.common.util - Metrics: {
  "best_epoch": 0,
  "peak_worker_0_memory_MB": 3499.544,
  "peak_gpu_0_memory_MB": 9283,
  "training_duration": "0:05:03.516101",
  "training_start_epoch": 0,
  "training_epochs": 4,
  "epoch": 4,
  "training_accuracy": 0.508125,  // <----------- 😞 
  "training_loss": 0.693260959982872,
  "training_reg_loss": 0.0,
  "training_worker_0_memory_MB": 3499.544,
  "training_gpu_0_memory_MB": 9283,
  "validation_accuracy": 0.5,
  "validation_loss": 0.6932091021537781,
  "validation_reg_loss": 0.0,
  "best_validation_accuracy": 0.61,  // <----------- 👎 
  "best_validation_loss": 0.6674252915382385,
  "best_validation_reg_loss": 0.0
}

I played around with the config a little and noticed that if I replace

"encoder": {
            "type": "bert_pooler",
            "pretrained_model": bert_model,
            "requires_grad": false,
        }

by

"encoder": {
            "type": "bert_pooler",
            "pretrained_model": bert_model,
            "requires_grad": true,
        }

I get better performance:

2020-07-07 20:55:19,186 - INFO - allennlp.common.util - Metrics: {
  "best_epoch": 4,
  "peak_worker_0_memory_MB": 3643.86,
  "peak_gpu_0_memory_MB": 9287,
  "training_duration": "0:05:23.475209",
  "training_start_epoch": 0,
  "training_epochs": 4,
  "epoch": 4,
  "training_accuracy": 0.910625,  // <---------------- 👍 
  "training_loss": 0.2304620276018977,
  "training_reg_loss": 0.0,
  "training_worker_0_memory_MB": 3643.86,
  "training_gpu_0_memory_MB": 9287,
  "validation_accuracy": 0.805,  // <---------------- 👍 
  "validation_loss": 0.4564117732644081,
  "validation_reg_loss": 0.0,
  "best_validation_accuracy": 0.805,
  "best_validation_loss": 0.4564117732644081,
  "best_validation_reg_loss": 0.0
}

but it is still not as good as with the original config:

{
    "dataset_reader" : {
        "type": "classification-tsv",
        "token_indexers": {
            "tokens": {
                "type": "single_id"
            }
        }
    },
    "train_data_path": "/allennlp/data/train.tsv",
    "validation_data_path": "/allennlp/data/dev.tsv",
    "model": {
        "type": "simple_classifier",
        "embedder": {
            "token_embedders": {
                "tokens": {
                    "type": "embedding",
                    "embedding_dim": 10
                }
            }
        },
        "encoder": {
            "type": "bag_of_embeddings",
            "embedding_dim": 10
        }
    },
    "data_loader": {
        "batch_size": 8,
        "shuffle": true
    },
    "trainer": {
        "optimizer": "adam",
        "num_epochs": 5
    }
}

which gets 100% training accuracy and ~82% validation accuracy:

2020-07-07 20:04:36,909 - INFO - allennlp.common.util - Metrics: {
  "best_epoch": 3,
  "peak_worker_0_memory_MB": 575.508,
  "peak_gpu_0_memory_MB": 12,
  "training_duration": "0:00:22.934259",
  "training_start_epoch": 0,
  "training_epochs": 4,
  "epoch": 4,
  "training_accuracy": 1.0,  // <----------------- ✔️ 
  "training_loss": 0.002026954392204061,
  "training_reg_loss": 0.0,
  "training_worker_0_memory_MB": 575.508,
  "training_gpu_0_memory_MB": 12,
  "validation_accuracy": 0.82,
  "validation_loss": 0.39791114151477813,
  "validation_reg_loss": 0.0,
  "best_validation_accuracy": 0.825,  // <----------------- 👍 
  "best_validation_loss": 0.3861449569836259,
  "best_validation_reg_loss": 0.0
}

What could cause this behavior? Maybe the huge number of parameters in the BERT model? Or maybe the BERT vocab doesn't cover most of the tokens in the training file? 🤔
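As a quick sanity check on that last hypothesis, here is a rough sketch one could run; it uses the Hugging Face tokenizer directly, and the assumption that the text sits in the first tab-separated column of the file is mine:

# Hedged sketch: count how many WordPiece tokens from the training file map to [UNK].
# The path comes from the config above; the column layout is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
unk, total = 0, 0
with open("/allennlp/data/train.tsv") as f:
    for line in f:
        text = line.split("\t")[0]
        ids = tokenizer.encode(text, add_special_tokens=False)
        unk += sum(1 for i in ids if i == tokenizer.unk_token_id)
        total += len(ids)
print(f"{unk} / {total} tokens map to [UNK]")

In practice WordPiece falls back to subword pieces, so the [UNK] count is usually tiny, which makes me think coverage is probably not the culprit.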

(3) allennlp train config > python train.py

I also noticed that running allennlp train config.json yields good performance (~90% training accuracy and ~80% validation accuracy), while running my own training script with python train.py doesn't seem to learn (training accuracy stays at 50% after 5 epochs), even though I specifically made sure that my config and custom script are as similar as possible:
config:

local bert_model = "bert-base-uncased";
{
    "dataset_reader" : {
        "type": "classification-tsv",
        "tokenizer": {
            "type": "pretrained_transformer",
            "model_name": bert_model,
        },
        "token_indexers": {
            "bert": {
                "type": "pretrained_transformer",
                "model_name": bert_model,
                "namespace": "tags",
            }
        },
        "max_tokens": 512
    },
    "train_data_path": "/allennlp/data/train.tsv",
    "validation_data_path": "/allennlp/data/dev.tsv",
    "model": {
        "type": "simple_classifier",
        "embedder": {
            "token_embedders": {
                "bert": {
                    "type": "pretrained_transformer",
                    "model_name": bert_model,
                    "train_parameters": true,
                }
            }
        },
        "encoder": {
            "type": "bert_pooler",
            "pretrained_model": bert_model,
            "requires_grad": true,
        }
    },
    "data_loader": {
        "batch_sampler": {
            "type": "bucket",
            "batch_size": 8,
            "sorting_keys": ['text']
        }
    },
    "validation_data_loader": {
        "batch_sampler": {
            "type": "bucket",
            "batch_size": 8,
            "sorting_keys": ['text']
        }
    },
    "trainer": {
        "type": "gradient_descent",
        "serialization_dir": "/allennlp/models/tmp",
        "num_epochs": 5,
        "optimizer": {
            "type": "huggingface_adamw",
            "lr": 1.0e-5,
        },
        "cuda_device": 0,
    }
}

-vs- train.py:

# NOTE: the imports below are a hedged reconstruction for the allennlp 1.0-style API used here.
# ClassificationTsvReader and SimpleClassifier are the custom classes from the guide chapters
# and are assumed to be defined or imported elsewhere in this script.
import logging

import torch

from allennlp.data import DataLoader, Vocabulary
from allennlp.data.samplers import BucketBatchSampler
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.models import Model
from allennlp.modules.seq2vec_encoders import BertPooler
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import PretrainedTransformerEmbedder
from allennlp.training.optimizers import AdamOptimizer, HuggingfaceAdamWOptimizer
from allennlp.training.trainer import GradientDescentTrainer, Trainer


def build_data_reader(bert_model: str = None):
    tokenizer = PretrainedTransformerTokenizer(model_name=bert_model)
    token_indexers = {"bert": PretrainedTransformerIndexer(model_name=bert_model, namespace='tags')}
    max_tokens = 512
    return ClassificationTsvReader(tokenizer, token_indexers, max_tokens)

def build_model(vocab: Vocabulary, bert_model: str = None) -> Model:
    embedder = BasicTextFieldEmbedder({"bert": PretrainedTransformerEmbedder(model_name=bert_model, train_parameters=True)})
    encoder = BertPooler(pretrained_model=bert_model, requires_grad=True)
    return SimpleClassifier(vocab, embedder, encoder)

def build_trainer(
            model: Model, ser_dir: str, train_loader: DataLoader, valid_loader: DataLoader,
            hugging_optim: bool, cuda_device: int) -> Trainer:
    params = [ [n, p] for n, p in model.named_parameters() if p.requires_grad ]
    logging.info(f"{len(params)} parameters requiring grad updates")
    if hugging_optim:
        optim = HuggingfaceAdamWOptimizer(params, lr=1.0e-5)
    else:
        optim = AdamOptimizer(params)
    return GradientDescentTrainer(
        model=model,
        serialization_dir=ser_dir,
        data_loader=train_loader,
        validation_data_loader=valid_loader,
        num_epochs=5,
        optimizer=optim,
        cuda_device=cuda_device
    )

def run_training_loop(bert_model=None):
    logging.info("Building data reader...")
    dataset_reader = build_data_reader(bert_model)

    logging.info("Reading data...")
    train_instances = dataset_reader.read("/allennlp/data/train.tsv")
    logging.info(f"got {len(train_instances)} train instances")
    valid_instances = dataset_reader.read("/allennlp/data/dev.tsv")
    logging.info(f"got {len(valid_instances)} valid instances")

    logging.info("Building vocabulary...")
    vocab = Vocabulary.from_instances(train_instances + valid_instances, min_count={'text': 1})
    logging.info(vocab)

    logging.info("Building model...")
    model = build_model(vocab, bert_model)

    if torch.cuda.is_available():
        cuda_device = 0
        model = model.cuda(cuda_device)
    else:
        cuda_device = -1

    logging.info(model)

    logging.info("Building data loaders...")
    train_instances.index_with(vocab)
    valid_instances.index_with(vocab)
    train_batch_sampler = BucketBatchSampler(train_instances, batch_size=8, sorting_keys=['text'])
    valid_batch_sampler = BucketBatchSampler(valid_instances, batch_size=8, sorting_keys=['text'])
    train_loader = DataLoader(train_instances, batch_sampler=train_batch_sampler)
    valid_loader = DataLoader(valid_instances, batch_sampler=valid_batch_sampler)

    logging.info("Building trainer...")
    trainer = build_trainer(
        model, "/allennlp/models/tmp", train_loader, valid_loader,
        hugging_optim=bert_model is not None, cuda_device=cuda_device)
    logging.info("Start training...")
    trainer.train()
    logging.info("done.")

if __name__ == '__main__':
    run_training_loop(bert_model="bert-base-uncased")

Any idea why running the config file and running this custom script don't yield similar performance?

Thanks a lot for your help :)

@NicolasAG NicolasAG changed the title pre-trained contextualizers not performing as good as BoW pre-trained contextualizers not performing as good as BoW & allen train > python train.py Jul 9, 2020
@NicolasAG NicolasAG changed the title pre-trained contextualizers not performing as good as BoW & allen train > python train.py [many questions] allennlp train config.json > python train.py & others Jul 9, 2020
NicolasAG (Author) commented:

I would be happy to move these questions somewhere else if there is a dedicated forum for that?

matt-gardner (Contributor) commented:

This venue is fine; I've just been distracted with ACL going on right now. I'll respond soon.

matt-gardner (Contributor) commented:

On (1): thanks for the catch! That should be fixed now.

On (2): yeah, I'm not sure why the pooler was set to not be trainable, but I've now fixed that in the example as well. Thanks again for the catch. On why it doesn't do as well: I'm not certain; it could be a learning rate issue, or just a stability issue; BERT is known to have high variance between runs, and this is a small dataset. Masato originally wrote this section for an older version of allennlp, and I apparently didn't update it completely for the 1.0 release. It's possible that some of the learning rate / optimizer settings should also have changed slightly to be optimal. But the point of the guide is to show you how to use the code, not to provide optimal hyperparameters, so as long as it runs and gives reasonably close performance, I'm not too concerned.

On (3): it doesn't look like you're shuffling the data. Do you agree? If you're not shuffling the data, that would definitely explain the difference.

matt-gardner (Contributor) commented:

If you want more standard hyperparameters, you might look at some of the examples in our model library, e.g.: https://github.com/allenai/allennlp-models/blob/09395d233161859db4c11af3689a3e0bc62169d8/training_config/rc/transformer_qa.jsonnet#L28-L40.
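Roughly, a more standard transformer trainer block has the shape sketched below; this is from memory, so treat the linked file as the authoritative reference for the exact keys and values, and note that nothing here is tuned for your dataset:

"trainer": {
    "optimizer": {
        "type": "huggingface_adamw",
        "lr": 2.0e-5,
        "weight_decay": 0.01
    },
    // an allennlp learning rate scheduler with warmup/decay behavior; illustrative only
    "learning_rate_scheduler": {
        "type": "slanted_triangular",
        "num_epochs": 5
    },
    "num_epochs": 5,
    "cuda_device": 0
}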


NicolasAG commented Jul 10, 2020

Re (2): Of course, makes sense 👍

Re (3): Right, but when I add the keyword shuffle=True to my DataLoader I get:
ValueError: batch_sampler option is mutually exclusive with batch_size, shuffle, sampler, and drop_last.
Note that I didn't set shuffle in the config either, so the behavior should have been the same...
I tried again this morning and it looks like it's training correctly now... 🤔 I guess BERT really is unstable on small datasets with basic optimizers...
Maybe it also has to do with random seeds! I'm using a GPU and I didn't set any of the numpy, torch, or cuda seeds... Fixing those would probably reduce the variance between two runs.
Maybe a warning should be added to the guide section "Switching to pre-trained contextualizers", because from one run to the next the model can go from 50% training accuracy to 90%, and new users like me may think there is an issue somewhere in their code.
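For reference, a minimal sketch of what I mean; the seed values and the comments are my own assumptions, not something from the guide:

# Hedged sketch (allennlp 1.0-style API; seed values are arbitrary). Pick ONE of the
# two loader setups below: batch_sampler is mutually exclusive with batch_size/shuffle.
import random
import numpy
import torch
from allennlp.data import DataLoader

# Option A: plain batching, shuffled by the loader itself.
train_loader = DataLoader(train_instances, batch_size=8, shuffle=True)

# Option B: bucketed batching; BucketBatchSampler already shuffles internally,
# so no shuffle argument is needed (or allowed) on the DataLoader.
train_loader = DataLoader(train_instances, batch_sampler=train_batch_sampler)

# Fixing seeds at the top of run_training_loop should reduce run-to-run variance.
# (The allennlp train command does the equivalent via the top-level random_seed /
# numpy_seed / pytorch_seed config keys, if I remember correctly.)
random.seed(0)
numpy.random.seed(0)
torch.manual_seed(0)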

(4) bert vocab -vs- Vocabulary.from_instances

While I'm at it, I'll take this opportunity to ask another question :)
I noticed that when I use the PretrainedTransformerTokenizer and the PretrainedTransformerIndexer in my data reader, the vocab I then create with vocab = Vocabulary.from_instances(train_instances + valid_instances) only has the labels namespace (of size 2) and doesn't have the namespace specified in the PretrainedTransformerIndexer ("tags" by default).
(4.1) I don't quite understand why it is not present in the vocab. Does the loaded pretrained model keep its own vocab somewhere else? If so, how do the model (or I) get access to it? I logged the vocab passed to the model constructor and it also only has the 'labels' namespace.
(4.2) I didn't understand from the documentation why adding padding or UNK tokens would break things:

We use a somewhat confusing default value of tags so that we do not add padding or UNK tokens to this namespace, which would break on loading because we wouldn't find our default OOV token.

(https://docs.allennlp.org/master/api/data/token_indexers/pretrained_transformer_indexer/#pretrainedtransformerindexer-objects)

Thanks a lot!


matt-gardner commented Jul 10, 2020

3: looks like the bucket sampler already shuffles, so, yeah, that wasn't the issue. But yes, lots of papers have pointed out how high BERT's training variance is.

4.1: This is a bit confusing, and we'd like to fix it. Currently, the vocab gets added when you index instances. I think we should probably also add that logic where we count vocab items, which would resolve this issue (PR to fix that welcome!).

4.2: see this method.
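To make 4.1 a bit more concrete, here is a rough sketch using the objects from your script above; the sizes in the comments are what I'd expect, not verified output:

# Hedged sketch: the transformer vocab only shows up in the Vocabulary once instances
# are actually indexed.
vocab = Vocabulary.from_instances(train_instances + valid_instances)
print(vocab)  # expected: only the 'labels' namespace at this point

for instance in train_instances:
    instance.index_fields(vocab)  # the PretrainedTransformerIndexer copies its wordpiece vocab in here

print(vocab.get_vocab_size("tags"))  # expected: ~30522 entries for bert-base-uncased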

NicolasAG (Author) commented:

Wow! Indeed, I can see that you and others have been thinking about a better way to handle HF's transformers vocab for a while now... 😄
I'm still very new to the library, so I'll continue reading the guide (thank you for that, btw, it is REALLY helpful 🙏) and running experiments on my side to familiarize myself before making any PRs.
Closing this for now since I don't have more questions at this time.
