
Rethink vocabulary API #3097

Closed
matt-gardner opened this issue Jul 25, 2019 · 12 comments · May be fixed by #4023
@matt-gardner (Contributor)

The way we do vocab stuff with BERT / other pretrained wordpiece tokenizers is a little crazy and leads to subtle bugs. We need to get the BERT wordpiece vocab into our Vocabulary object so that we can pass it to the model and use it there if necessary. But if you save a vocab object with this vocabulary, it will give this error on load:

  File "tmp/tmp.py", line 30, in predict
    predictor = Predictor.from_path(path, 'masked_lm_predictor')
  File "/Users/mattg/clone/allennlp/allennlp/predictors/predictor.py", line 253, in from_path
    return Predictor.from_archive(load_archive(archive_path, cuda_device=cuda_device), predictor_name)
  File "/Users/mattg/clone/allennlp/allennlp/models/archival.py", line 230, in load_archive
    cuda_device=cuda_device)
  File "/Users/mattg/clone/allennlp/allennlp/models/model.py", line 327, in load
    return cls.by_name(model_type)._load(config, serialization_dir, weights_file, cuda_device)
  File "/Users/mattg/clone/allennlp/allennlp/models/model.py", line 256, in _load
    vocab = Vocabulary.by_name(vocab_choice).from_files(vocab_dir)
  File "/Users/mattg/clone/allennlp/allennlp/data/vocabulary.py", line 324, in from_files
    vocab.set_from_file(filename, is_padded, namespace=namespace)
  File "/Users/mattg/clone/allennlp/allennlp/data/vocabulary.py", line 378, in set_from_file
    assert self._oov_token in self._token_to_index[namespace], "OOV token not found!"
AssertionError: OOV token not found!
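
For context, the assertion fires in set_from_file: a padded namespace must end up containing AllenNLP's default OOV token ("@@UNKNOWN@@"), which a BERT wordpiece file doesn't have. A minimal sketch of the failure (the filename is illustrative):

from allennlp.data import Vocabulary

vocab = Vocabulary()
# A BERT wordpiece file only contains "[UNK]", not "@@UNKNOWN@@",
# so loading it into a padded namespace trips the assertion:
vocab.set_from_file("bert-base-uncased-vocab.txt", is_padded=True, namespace="bert")
# AssertionError: OOV token not found!

# set_from_file does accept the transformer's own OOV spelling (it gets
# mapped to the internal OOV token), but the from_files() load path in
# the traceback above never passes it:
vocab.set_from_file("bert-base-uncased-vocab.txt", oov_token="[UNK]", namespace="bert")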

Somehow our existing BERT models are not actually saving the BERT vocabulary, even though it appears in the vocab object after indexing data. If I'm understanding things right, this is because we only count vocab items on our first pass through the data, and don't actually index anything when constructing the vocab. It's only after we've saved the vocab that we index stuff:

# Initializing the model can have side effect of expanding the vocabulary
vocab.save_to_files(os.path.join(serialization_dir, "vocabulary"))
iterator = DataIterator.from_params(params.pop("iterator"))
iterator.index_with(model.vocab)
validation_iterator_params = params.pop("validation_iterator", None)
if validation_iterator_params:
    validation_iterator = DataIterator.from_params(validation_iterator_params)
    validation_iterator.index_with(model.vocab)
else:
    validation_iterator = None

This means that we are basically only avoiding the above bug by an accident of when we save the vocab. We should probably figure out a better solution for this.

@matt-gardner (Contributor, Author)

Oh, actually, this means that the model won't be getting the right vocab sizes if you use our standard training script. I was hacking my own thing, so I got the right vocab size, but I ran into this OOV error...

@matt-gardner (Contributor, Author)

I can fix the error by setting the "bert" namespace as one of the non-padded namespaces (which is appropriate anyway). But this doesn't solve the vocab size issue, if the model needs to know the BERT vocab size.
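For reference, a sketch of that workaround (shown via the constructor; a config's vocabulary block can pass the same parameter):

from allennlp.data import Vocabulary

# Declaring "bert" non-padded means no padding/OOV tokens get injected
# into that namespace, so the "OOV token not found!" assertion can't
# fire for it. The defaults ("*tags", "*labels") are kept alongside.
vocab = Vocabulary(non_padded_namespaces=["*tags", "*labels", "bert"])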

@guoquan (Contributor) commented Dec 22, 2019

The issue is as stated in #3233. Looking forward to any update.

matt-gardner added this to the 1.0.0 milestone Jan 3, 2020
schmmd added the Week label Jan 6, 2020
@matt-gardner (Contributor, Author)

Moving this to the 1.1 milestone, as I don’t think it’s pressing, nor should it be a breaking change.

matt-gardner modified the milestones: 1.0.0, 1.1 Mar 11, 2020
@dirkgr (Member) commented Mar 26, 2020

It breaks the masked-lm model we have in the library.

@MaksymDel (Contributor)

It has also blocked hosting demos for models that use vocabs constructed from huggingface transformers.

@dirkgr (Member) commented Mar 26, 2020

Yeah, masked-lm is one of those.

@dirkgr (Member) commented Mar 27, 2020

I've been banging my head against this all day, because I'm trying to fix masked-lm. The way we do it now only works if we read all the data twice before we start training, once to create a vocab, and then again to index all instances. That's obviously not great, especially if we do it in order to fix an issue that has nothing to do with reading the data.

I think the cleanest way to do this is to specify a Vocabulary class, and have a few Registrable implementations that you can select from in your config. Ideally with some reasonable defaults so nobody has to worry about it in 90% of the cases. I already proposed something like it in #3581 (though in a different context at the time), and we didn't like it then. @matt-gardner, can we think about this a second time?
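
A rough sketch of what that could look like (every name and signature here is hypothetical, not the library's actual API):

from allennlp.data import Vocabulary
from transformers import AutoTokenizer

# Hypothetical sketch of the proposal: a registered Vocabulary subclass
# a config could select by type, filling one namespace straight from a
# pretrained tokenizer instead of from a counting pass over the data.
@Vocabulary.register("from_pretrained_transformer")
class FromPretrainedTransformerVocabulary(Vocabulary):
    def __init__(self, model_name: str, namespace: str = "tokens") -> None:
        super().__init__(non_padded_namespaces=[namespace])
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Copy the wordpiece vocab in id order so our indices line up
        # with the transformer's own embedding matrix.
        for token, _ in sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1]):
            self.add_token_to_namespace(token, namespace)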

@matt-gardner (Contributor, Author)

I'm not opposed if you think that's the best solution to this problem. It's unfortunate that we'll now have four separate places where you need to specify a particular transformer model, but I'm not sure how to get around that. It's an argument for something much more integrated, like we had in the pre-allennlp days. But that has its own problems.

@dirkgr (Member) commented Mar 31, 2020

Maybe the reader gets a function, get_vocab()? It would make sense that the reader decides which vocab we're using.
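Something like this, say (a sketch only; get_vocab() is not part of the actual DatasetReader API, and it reuses the hypothetical subclass sketched above):

from allennlp.data import DatasetReader, Vocabulary

class MyTransformerReader(DatasetReader):
    def __init__(self, model_name: str = "bert-base-uncased") -> None:
        super().__init__()
        self._model_name = model_name

    def get_vocab(self) -> Vocabulary:
        # The reader already knows which pretrained tokenizer it wraps,
        # so it can hand back a matching vocab without a data pass.
        return FromPretrainedTransformerVocabulary(self._model_name, namespace="bert")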

@matt-gardner (Contributor, Author)

The reader can't do that in general; if you don't have a pre-built vocabulary, vocabulary construction happens after the reader has finished its job. Or are you suggesting to change the whole pipeline such that instead of calling Vocabulary.from_instances, we call reader.get_vocab(instances), or something? I'm not really sure what you gain by doing that, except additional unnecessary coupling.

I suppose one option is to just have the default Vocabulary be None, and make sure everything handles that correctly. If you're using a pre-trained transformer, most of the time you don't care about the vocabulary at all.

Well, that breaks any kind of labeling task. So does having a separate transformer-specified vocab object - where does the label vocabulary come from, if you're not iterating over your instances?

One possibility: add a parameter to Vocabulary.from_instances (or a new constructor) that populates one namespace from a transformer. I suppose this is basically what you said in recommending a Vocabulary subclass for transformers. If you really need access to the transformer vocab in your model, you should use this. Well, actually, if you're ever going to make anything human readable, then you need the transformer vocab in your model. And you almost always want that, so you almost always want the transformer vocab in your model.

Looking back at what you were complaining about, we have always read through the data twice before starting training (though the second time is technically during training). The only way around that is to precompute your vocab and load it with from_files. So basically you just want a shortcut to get your vocab from_transformer, similar to from_files, without reading through the data. That's fine, but I don't know that it really saves you much, and it only works when you don't have any labels to assign indices to.
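
Concretely, that shortcut might look like this (from_transformer is hypothetical, and instances stands in for your dataset):

from allennlp.data import Vocabulary

# Build the wordpiece namespace straight from the tokenizer, no data pass:
vocab = Vocabulary.from_transformer("bert-base-uncased", namespace="bert")

# ...but a labeling task still needs a pass over the data for its label
# namespaces, so you'd want some way of combining the two, e.g.:
vocab = Vocabulary.from_instances(instances)  # builds "labels" et al. from the data
# plus filling the "bert" namespace from the tokenizer as above.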

Sorry this is a bit rambling; I'm a bit too tired at the moment to clean it up into a more coherent story. Hopefully the points still make sense.

@dirkgr (Member) commented Feb 22, 2021

I continue to have thoughts about this, but I don't think a GitHub issue is the right place for them right now.

dirkgr closed this as completed Feb 22, 2021