
[many questions] allennlp train config.json > python train.py & others #130

Closed
NicolasAG opened this issue Jul 7, 2020 · 7 comments


NicolasAG commented Jul 7, 2020

Hi,

(1) Update the guide to support the newest version

I'm going through the "Next Steps" chapter, section "Switching to pre-trained contextualizers".
First, the config file shown in the guide uses:

"iterator": {
        "type": "basic",
        "batch_size": 8
    }

instead of

"data_loader": {
        "batch_size": 8,
        "shuffle": true
    },

but if I try to use the config file as shown in the guide, I get an error saying that the key "data_loader" is required in the config file.

(2) bert-base-uncased not as good as the guide's baseline model?

Secondly, when I replace the tokenizer, token_indexers, embedder, and encoder with the BERT versions, as in the new config file proposed at https://guide.allennlp.org/next-steps#1, it looks like the model is not training: training accuracy remains at 0.50 after 5 epochs.
This is my config file:

local bert_model = "bert-base-uncased";
{
    "dataset_reader" : {
        "type": "classification-tsv",
        "tokenizer": {
            "type": "pretrained_transformer",
            "model_name": bert_model,
        },
        "token_indexers": {
            "bert": {
                "type": "pretrained_transformer",
                "model_name": bert_model,
            }
        },
        "max_tokens": 512
    },
    "train_data_path": "/allennlp/data/train.tsv",
    "validation_data_path": "/allennlp/data/dev.tsv",
    "model": {
        "type": "simple_classifier",
        "embedder": {
            "token_embedders": {
                "bert": {
                    "type": "pretrained_transformer",
                    "model_name": bert_model,
                    "train_parameters": true,
                }
            }
        },
        "encoder": {
            "type": "bert_pooler",
            "pretrained_model": bert_model,
            "requires_grad": false,
        }
    },
    "data_loader": {
        "batch_size": 8,
        "shuffle": true
    },
    "trainer": {
        "optimizer": {
            "type": "huggingface_adamw",
            "lr": 1.0e-5
        },
        "num_epochs": 5,
        "cuda_device": 0,
    }
}

and these are the last few lines printed at the end of allennlp train ...:

2020-07-07 20:37:52,415 - INFO - allennlp.common.util - Metrics: {
  "best_epoch": 0,
  "peak_worker_0_memory_MB": 3499.544,
  "peak_gpu_0_memory_MB": 9283,
  "training_duration": "0:05:03.516101",
  "training_start_epoch": 0,
  "training_epochs": 4,
  "epoch": 4,
  "training_accuracy": 0.508125,  // <----------- 😞 
  "training_loss": 0.693260959982872,
  "training_reg_loss": 0.0,
  "training_worker_0_memory_MB": 3499.544,
  "training_gpu_0_memory_MB": 9283,
  "validation_accuracy": 0.5,
  "validation_loss": 0.6932091021537781,
  "validation_reg_loss": 0.0,
  "best_validation_accuracy": 0.61,  // <----------- 👎 
  "best_validation_loss": 0.6674252915382385,
  "best_validation_reg_loss": 0.0
}

I played around with the config a little and noticed that if I replace

"encoder": {
            "type": "bert_pooler",
            "pretrained_model": bert_model,
            "requires_grad": false,
        }

by

"encoder": {
            "type": "bert_pooler",
            "pretrained_model": bert_model,
            "requires_grad": true,
        }

I get better performance:

2020-07-07 20:55:19,186 - INFO - allennlp.common.util - Metrics: {
  "best_epoch": 4,
  "peak_worker_0_memory_MB": 3643.86,
  "peak_gpu_0_memory_MB": 9287,
  "training_duration": "0:05:23.475209",
  "training_start_epoch": 0,
  "training_epochs": 4,
  "epoch": 4,
  "training_accuracy": 0.910625,  // <---------------- 👍 
  "training_loss": 0.2304620276018977,
  "training_reg_loss": 0.0,
  "training_worker_0_memory_MB": 3643.86,
  "training_gpu_0_memory_MB": 9287,
  "validation_accuracy": 0.805,  // <---------------- 👍 
  "validation_loss": 0.4564117732644081,
  "validation_reg_loss": 0.0,
  "best_validation_accuracy": 0.805,
  "best_validation_loss": 0.4564117732644081,
  "best_validation_reg_loss": 0.0
}

but it is still not as good as with the original config:

{
    "dataset_reader" : {
        "type": "classification-tsv",
        "token_indexers": {
            "tokens": {
                "type": "single_id"
            }
        }
    },
    "train_data_path": "/allennlp/data/train.tsv",
    "validation_data_path": "/allennlp/data/dev.tsv",
    "model": {
        "type": "simple_classifier",
        "embedder": {
            "token_embedders": {
                "tokens": {
                    "type": "embedding",
                    "embedding_dim": 10
                }
            }
        },
        "encoder": {
            "type": "bag_of_embeddings",
            "embedding_dim": 10
        }
    },
    "data_loader": {
        "batch_size": 8,
        "shuffle": true
    },
    "trainer": {
        "optimizer": "adam",
        "num_epochs": 5
    }
}

which gets 100% training accuracy and ~82% validation accuracy:

2020-07-07 20:04:36,909 - INFO - allennlp.common.util - Metrics: {
  "best_epoch": 3,
  "peak_worker_0_memory_MB": 575.508,
  "peak_gpu_0_memory_MB": 12,
  "training_duration": "0:00:22.934259",
  "training_start_epoch": 0,
  "training_epochs": 4,
  "epoch": 4,
  "training_accuracy": 1.0,  // <----------------- ✔️ 
  "training_loss": 0.002026954392204061,
  "training_reg_loss": 0.0,
  "training_worker_0_memory_MB": 575.508,
  "training_gpu_0_memory_MB": 12,
  "validation_accuracy": 0.82,
  "validation_loss": 0.39791114151477813,
  "validation_reg_loss": 0.0,
  "best_validation_accuracy": 0.825,  // <----------------- 👍 
  "best_validation_loss": 0.3861449569836259,
  "best_validation_reg_loss": 0.0
}

What could cause this behavior? Maybe the huge number of parameters in the BERT model? Or maybe the BERT vocab doesn't cover most of the tokens in the training file? 🤔
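As a quick sanity check on that last hypothesis, here is a rough sketch one could run; it uses the Hugging Face tokenizer directly, and the assumption that the text sits in the first tab-separated column of the file is mine:

# Hedged sketch: count how many WordPiece tokens from the training file map to [UNK].
# The path comes from the config above; the column layout is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
unk, total = 0, 0
with open("/allennlp/data/train.tsv") as f:
    for line in f:
        text = line.split("\t")[0]
        ids = tokenizer.encode(text, add_special_tokens=False)
        unk += sum(1 for i in ids if i == tokenizer.unk_token_id)
        total += len(ids)
print(f"{unk} / {total} tokens map to [UNK]")

In practice WordPiece falls back to subword pieces, so the [UNK] count is usually tiny, which makes me think coverage is probably not the culprit.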

(3) allennlp train config > python train.py

I also noticed that running allennlp train config.json yields good performance (~90% training accuracy and ~80% validation accuracy), while running my own training script with python train.py doesn't seem to learn (training accuracy stays at 50% after 5 epochs), even though I specifically made sure that my config and custom script are as similar as possible:
config:

local bert_model = "bert-base-uncased";
{
    "dataset_reader" : {
        "type": "classification-tsv",
        "tokenizer": {
            "type": "pretrained_transformer",
            "model_name": bert_model,
        },
        "token_indexers": {
            "bert": {
                "type": "pretrained_transformer",
                "model_name": bert_model,
                "namespace": "tags",
            }
        },
        "max_tokens": 512
    },
    "train_data_path": "/allennlp/data/train.tsv",
    "validation_data_path": "/allennlp/data/dev.tsv",
    "model": {
        "type": "simple_classifier",
        "embedder": {
            "token_embedders": {
                "bert": {
                    "type": "pretrained_transformer",
                    "model_name": bert_model,
                    "train_parameters": true,
                }
            }
        },
        "encoder": {
            "type": "bert_pooler",
            "pretrained_model": bert_model,
            "requires_grad": true,
        }
    },
    "data_loader": {
        "batch_sampler": {
            "type": "bucket",
            "batch_size": 8,
            "sorting_keys": ['text']
        }
    },
    "validation_data_loader": {
        "batch_sampler": {
            "type": "bucket",
            "batch_size": 8,
            "sorting_keys": ['text']
        }
    },
    "trainer": {
        "type": "gradient_descent",
        "serialization_dir": "/allennlp/models/tmp",
        "num_epochs": 5,
        "optimizer": {
            "type": "huggingface_adamw",
            "lr": 1.0e-5,
        },
        "cuda_device": 0,
    }
}

-vs- train.py:

# NOTE: the imports below are a hedged reconstruction for the allennlp 1.0-style API used here.
# ClassificationTsvReader and SimpleClassifier are the custom classes from the guide chapters
# and are assumed to be defined or imported elsewhere in this script.
import logging

import torch

from allennlp.data import DataLoader, Vocabulary
from allennlp.data.samplers import BucketBatchSampler
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.models import Model
from allennlp.modules.seq2vec_encoders import BertPooler
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import PretrainedTransformerEmbedder
from allennlp.training.optimizers import AdamOptimizer, HuggingfaceAdamWOptimizer
from allennlp.training.trainer import GradientDescentTrainer, Trainer


def build_data_reader(bert_model: str = None):
    tokenizer = PretrainedTransformerTokenizer(model_name=bert_model)
    token_indexers = {"bert": PretrainedTransformerIndexer(model_name=bert_model, namespace='tags')}
    max_tokens = 512
    return ClassificationTsvReader(tokenizer, token_indexers, max_tokens)

def build_model(vocab: Vocabulary, bert_model: str = None) -> Model:
    embedder = BasicTextFieldEmbedder({"bert": PretrainedTransformerEmbedder(model_name=bert_model, train_parameters=True)})
    encoder = BertPooler(pretrained_model=bert_model, requires_grad=True)
    return SimpleClassifier(vocab, embedder, encoder)

def build_trainer(
            model: Model, ser_dir: str, train_loader: DataLoader, valid_loader: DataLoader,
            hugging_optim: bool, cuda_device: int) -> Trainer:
    params = [ [n, p] for n, p in model.named_parameters() if p.requires_grad ]
    logging.info(f"{len(params)} parameters requiring grad updates")
    if hugging_optim:
        optim = HuggingfaceAdamWOptimizer(params, lr=1.0e-5)
    else:
        optim = AdamOptimizer(params)
    return GradientDescentTrainer(
        model=model,
        serialization_dir=ser_dir,
        data_loader=train_loader,
        validation_data_loader=valid_loader,
        num_epochs=5,
        optimizer=optim,
        cuda_device=cuda_device
    )

def run_training_loop(bert_model=None):
    logging.info("Building data reader...")
    dataset_reader = build_data_reader(bert_model)

    logging.info("Reading data...")
    train_instances = dataset_reader.read("/allennlp/data/train.tsv")
    logging.info(f"got {len(train_instances)} train instances")
    valid_instances = dataset_reader.read("/allennlp/data/dev.tsv")
    logging.info(f"got {len(valid_instances)} valid instances")

    logging.info("Building vocabulary...")
    vocab = Vocabulary.from_instances(train_instances + valid_instances, min_count={'text': 1})
    logging.info(vocab)

    logging.info("Building model...")
    model = build_model(vocab, bert_model)

    if torch.cuda.is_available():
        cuda_device = 0
        model = model.cuda(cuda_device)
    else:
        cuda_device = -1

    logging.info(model)

    logging.info("Building data loaders...")
    train_instances.index_with(vocab)
    valid_instances.index_with(vocab)
    train_batch_sampler = BucketBatchSampler(train_instances, batch_size=8, sorting_keys=['text'])
    valid_batch_sampler = BucketBatchSampler(valid_instances, batch_size=8, sorting_keys=['text'])
    train_loader = DataLoader(train_instances, batch_sampler=train_batch_sampler)
    valid_loader = DataLoader(valid_instances, batch_sampler=valid_batch_sampler)

    logging.info("Building trainer...")
    trainer = build_trainer(
        model, "/allennlp/models/tmp", train_loader, valid_loader,
        hugging_optim=bert_model is not None, cuda_device=cuda_device)
    logging.info("Start training...")
    trainer.train()
    logging.info("done.")

if __name__ == '__main__':
    run_training_loop(bert_model="bert-base-uncased")

Any idea why running the config file and running this custom script don't yield similar performance?

Thanks a lot for your help :)

@NicolasAG NicolasAG changed the title pre-trained contextualizers not performing as good as BoW pre-trained contextualizers not performing as good as BoW & allen train > python train.py Jul 9, 2020
@NicolasAG NicolasAG changed the title pre-trained contextualizers not performing as good as BoW & allen train > python train.py [many questions] allennlp train config.json > python train.py & others Jul 9, 2020
NicolasAG (Author) commented:

I would be happy to move these questions somewhere else if there is a dedicated forum for that?

matt-gardner (Contributor) commented:

This venue is fine; I've just been distracted with ACL going on right now. I'll respond soon.

matt-gardner (Contributor) commented:

On (1): thanks for the catch! That should be fixed now.

On (2): yeah, I'm not sure why the pooler was set to not be trainable, but I've now fixed that in the example as well. Thanks again for the catch. On why it doesn't do as well: I'm not certain; it could be a learning rate issue, or just a stability issue; BERT is known to have high variance between runs, and this is a small dataset. Masato originally wrote this section for an older version of allennlp, and I apparently didn't update it completely for the 1.0 release. It's possible that some of the learning rate / optimizer settings should also have changed slightly to be optimal. But the point of the guide is to show you how to use the code, not to provide optimal hyperparameters, so as long as it runs and gives reasonably close performance, I'm not too concerned.

On (3): it doesn't look like you're shuffling the data. Do you agree? If you're not shuffling the data, that would definitely explain the difference.

matt-gardner (Contributor) commented:

If you want more standard hyperparameters, you might look at some of the examples in our model library, e.g.: https://github.com/allenai/allennlp-models/blob/09395d233161859db4c11af3689a3e0bc62169d8/training_config/rc/transformer_qa.jsonnet#L28-L40.
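Roughly, a more standard transformer trainer block has the shape sketched below; this is from memory, so treat the linked file as the authoritative reference for the exact keys and values, and note that nothing here is tuned for your dataset:

"trainer": {
    "optimizer": {
        "type": "huggingface_adamw",
        "lr": 2.0e-5,
        "weight_decay": 0.01
    },
    // an allennlp learning rate scheduler with warmup/decay behavior; illustrative only
    "learning_rate_scheduler": {
        "type": "slanted_triangular",
        "num_epochs": 5
    },
    "num_epochs": 5,
    "cuda_device": 0
}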


NicolasAG commented Jul 10, 2020

Re (2): Of course, makes sense 👍

Re (3): Right, but when I add the keyword shuffle=True to my DataLoader I get:
ValueError: batch_sampler option is mutually exclusive with batch_size, shuffle, sampler, and drop_last.
Note that I didn't set shuffle in the config either, so the behavior should have been the same...
I tried again this morning and it looks like it's training correctly now... 🤔 I guess BERT really is unstable on small datasets with basic optimizers...
Maybe it also has to do with random seeds! I'm using a GPU and I didn't set any of the numpy, torch, or cuda seeds... Fixing those would probably reduce the variance between two runs.
Maybe a warning should be added to the guide section "Switching to pre-trained contextualizers", because from one run to the next the model can go from 50% training accuracy to 90%, and new users like me may think there is an issue somewhere in their code.
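For reference, a minimal sketch of what I mean; the seed values and the comments are my own assumptions, not something from the guide:

# Hedged sketch (allennlp 1.0-style API; seed values are arbitrary). Pick ONE of the
# two loader setups below: batch_sampler is mutually exclusive with batch_size/shuffle.
import random
import numpy
import torch
from allennlp.data import DataLoader

# Option A: plain batching, shuffled by the loader itself.
train_loader = DataLoader(train_instances, batch_size=8, shuffle=True)

# Option B: bucketed batching; BucketBatchSampler already shuffles internally,
# so no shuffle argument is needed (or allowed) on the DataLoader.
train_loader = DataLoader(train_instances, batch_sampler=train_batch_sampler)

# Fixing seeds at the top of run_training_loop should reduce run-to-run variance.
# (The allennlp train command does the equivalent via the top-level random_seed /
# numpy_seed / pytorch_seed config keys, if I remember correctly.)
random.seed(0)
numpy.random.seed(0)
torch.manual_seed(0)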

(4) bert vocab -vs- Vocabulary.from_instances

While I'm at it, I'll take this opportunity to ask another question :)
I noticed that when I use the PretrainedTransformerTokenizer and the PretrainedTransformerIndexer in my data reader, the vocab I then create with vocab = Vocabulary.from_instances(train_instances + valid_instances) only has the labels namespace (of size 2) and doesn't have the namespace specified in the PretrainedTransformerIndexer ("tags" by default).
(4.1) I don't quite understand why it is not present in the vocab. Does the loaded pretrained model keep its own vocab somewhere else? If so, how do the model (or I) get access to it? I logged the vocab passed to the model constructor and it also only has the 'labels' namespace.
(4.2) I didn't understand from the documentation why adding padding or UNK tokens would break things:

We use a somewhat confusing default value of tags so that we do not add padding or UNK tokens to this namespace, which would break on loading because we wouldn't find our default OOV token.

(https://docs.allennlp.org/master/api/data/token_indexers/pretrained_transformer_indexer/#pretrainedtransformerindexer-objects)

Thanks a lot!


matt-gardner commented Jul 10, 2020

3: looks like the bucket sampler already shuffles, so, yeah, that wasn't the issue. But yes, lots of papers have pointed out how high BERT's training variance is.

4.1: This is a bit confusing, and we'd like to fix it. Currently, the vocab gets added when you index instances. I think we should probably also add that logic where we count vocab items, which would resolve this issue (PR to fix that welcome!).

4.2: see this method.
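To make 4.1 a bit more concrete, here is a rough sketch using the objects from your script above; the sizes in the comments are what I'd expect, not verified output:

# Hedged sketch: the transformer vocab only shows up in the Vocabulary once instances
# are actually indexed.
vocab = Vocabulary.from_instances(train_instances + valid_instances)
print(vocab)  # expected: only the 'labels' namespace at this point

for instance in train_instances:
    instance.index_fields(vocab)  # the PretrainedTransformerIndexer copies its wordpiece vocab in here

print(vocab.get_vocab_size("tags"))  # expected: ~30522 entries for bert-base-uncased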

NicolasAG (Author) commented:

Wow! Indeed, I can see that you and others have been thinking about a better way to handle HF's transformers vocab for a while now... 😄
I'm still very new to the library, so I'll continue reading the guide (thank you for that, btw, it is REALLY helpful 🙏) and running experiments on my side to familiarize myself before making any PRs.
Closing this for now since I don't have more questions at this time.
