
Rethink vocabulary API #3097

Closed
matt-gardner opened this issue Jul 25, 2019 · 12 comments · May be fixed by #4023
@matt-gardner (Contributor)

The way we do vocab stuff with BERT / other pretrained wordpiece tokenizers is a little crazy and leads to subtle bugs. We need to get the BERT wordpiece vocab into our Vocabulary object so that we can pass it to the model and use it there if necessary. But if you save a vocab object with this vocabulary, it will give this error on load:

  File "tmp/tmp.py", line 30, in predict
    predictor = Predictor.from_path(path, 'masked_lm_predictor')
  File "/Users/mattg/clone/allennlp/allennlp/predictors/predictor.py", line 253, in from_path
    return Predictor.from_archive(load_archive(archive_path, cuda_device=cuda_device), predictor_name)
  File "/Users/mattg/clone/allennlp/allennlp/models/archival.py", line 230, in load_archive
    cuda_device=cuda_device)
  File "/Users/mattg/clone/allennlp/allennlp/models/model.py", line 327, in load
    return cls.by_name(model_type)._load(config, serialization_dir, weights_file, cuda_device)
  File "/Users/mattg/clone/allennlp/allennlp/models/model.py", line 256, in _load
    vocab = Vocabulary.by_name(vocab_choice).from_files(vocab_dir)
  File "/Users/mattg/clone/allennlp/allennlp/data/vocabulary.py", line 324, in from_files
    vocab.set_from_file(filename, is_padded, namespace=namespace)
  File "/Users/mattg/clone/allennlp/allennlp/data/vocabulary.py", line 378, in set_from_file
    assert self._oov_token in self._token_to_index[namespace], "OOV token not found!"
AssertionError: OOV token not found!
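
For context, the assertion fires in set_from_file: a padded namespace must end up containing AllenNLP's default OOV token ("@@UNKNOWN@@"), which a BERT wordpiece file doesn't have. A minimal sketch of the failure (the filename is illustrative):

from allennlp.data import Vocabulary

vocab = Vocabulary()
# A BERT wordpiece file only contains "[UNK]", not "@@UNKNOWN@@",
# so loading it into a padded namespace trips the assertion:
vocab.set_from_file("bert-base-uncased-vocab.txt", is_padded=True, namespace="bert")
# AssertionError: OOV token not found!

# set_from_file does accept the transformer's own OOV spelling (it gets
# mapped to the internal OOV token), but the from_files() load path in
# the traceback above never passes it:
vocab.set_from_file("bert-base-uncased-vocab.txt", oov_token="[UNK]", namespace="bert")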

Somehow our existing BERT models are not actually saving the BERT vocabulary, even though it appears in the vocab object after indexing data. If I'm understanding things right, this is because we only count vocab items on our first pass through the data, and don't actually index anything when constructing the vocab. It's only after we've saved the vocab that we index stuff:

# Initializing the model can have side effect of expanding the vocabulary
vocab.save_to_files(os.path.join(serialization_dir, "vocabulary"))
iterator = DataIterator.from_params(params.pop("iterator"))
iterator.index_with(model.vocab)
validation_iterator_params = params.pop("validation_iterator", None)
if validation_iterator_params:
    validation_iterator = DataIterator.from_params(validation_iterator_params)
    validation_iterator.index_with(model.vocab)
else:
    validation_iterator = None

This means that we are basically only avoiding the above bug by an accident of when we save the vocab. We should probably figure out a better solution for this.

@matt-gardner (Contributor, Author)

Oh, actually, this means that the model won't be getting the right vocab sizes if you use our standard training script. I was hacking my own thing, so I got the right vocab size, but I ran into this OOV error...

@matt-gardner (Contributor, Author)

I can fix the error by setting the "bert" namespace as one of the non-padded namespaces (which is appropriate anyway). But this doesn't solve the vocab size issue, if the model needs to know the BERT vocab size.
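For reference, a sketch of that workaround (shown via the constructor; a config's vocabulary block can pass the same parameter):

from allennlp.data import Vocabulary

# Declaring "bert" non-padded means no padding/OOV tokens get injected
# into that namespace, so the "OOV token not found!" assertion can't
# fire for it. The defaults ("*tags", "*labels") are kept alongside.
vocab = Vocabulary(non_padded_namespaces=["*tags", "*labels", "bert"])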

@guoquan (Contributor) commented Dec 22, 2019

The issue is as stated in #3233. Looking forward to any update.

matt-gardner added this to the 1.0.0 milestone Jan 3, 2020
schmmd added the Week label Jan 6, 2020
@matt-gardner (Contributor, Author)

Moving this to the 1.1 milestone, as I don’t think it’s pressing, nor should it be a breaking change.

matt-gardner modified the milestones: 1.0.0, 1.1 Mar 11, 2020
@dirkgr (Member) commented Mar 26, 2020

It breaks the masked-lm model we have in the library.

@MaksymDel (Contributor)

It has also blocked hosting demos for models that use vocabs constructed from huggingface transformers.

@dirkgr (Member) commented Mar 26, 2020

Yeah, masked-lm is one of those.

@dirkgr (Member) commented Mar 27, 2020

I've been banging my head against this all day, because I'm trying to fix masked-lm. The way we do it now only works if we read all the data twice before we start training, once to create a vocab, and then again to index all instances. That's obviously not great, especially if we do it in order to fix an issue that has nothing to do with reading the data.

I think the cleanest way to do this is to specify a Vocabulary class, and have a few Registrable implementations that you can select from in your config. Ideally with some reasonable defaults so nobody has to worry about it in 90% of the cases. I already proposed something like it in #3581 (though in a different context at the time), and we didn't like it then. @matt-gardner, can we think about this a second time?
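
A rough sketch of what that could look like (every name and signature here is hypothetical, not the library's actual API):

from allennlp.data import Vocabulary
from transformers import AutoTokenizer

# Hypothetical sketch of the proposal: a registered Vocabulary subclass
# a config could select by type, filling one namespace straight from a
# pretrained tokenizer instead of from a counting pass over the data.
@Vocabulary.register("from_pretrained_transformer")
class FromPretrainedTransformerVocabulary(Vocabulary):
    def __init__(self, model_name: str, namespace: str = "tokens") -> None:
        super().__init__(non_padded_namespaces=[namespace])
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Copy the wordpiece vocab in id order so our indices line up
        # with the transformer's own embedding matrix.
        for token, _ in sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1]):
            self.add_token_to_namespace(token, namespace)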

@matt-gardner (Contributor, Author)

I'm not opposed if you think that's the best solution to this problem. It's unfortunate that we'll now have four separate places where you need to specify a particular transformer model, but I'm not sure how to get around that. It's an argument for something much more integrated, like we had in the pre-allennlp days. But that has its own problems.

@dirkgr (Member) commented Mar 31, 2020

Maybe the reader gets a function, get_vocab()? It would make sense that the reader decides which vocab we're using.
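Something like this, say (a sketch only; get_vocab() is not part of the actual DatasetReader API, and it reuses the hypothetical subclass sketched above):

from allennlp.data import DatasetReader, Vocabulary

class MyTransformerReader(DatasetReader):
    def __init__(self, model_name: str = "bert-base-uncased") -> None:
        super().__init__()
        self._model_name = model_name

    def get_vocab(self) -> Vocabulary:
        # The reader already knows which pretrained tokenizer it wraps,
        # so it can hand back a matching vocab without a data pass.
        return FromPretrainedTransformerVocabulary(self._model_name, namespace="bert")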

@matt-gardner (Contributor, Author)

The reader can't do that in general; if you don't have a pre-built vocabulary, vocabulary construction happens after the reader has finished its job. Or are you suggesting to change the whole pipeline such that instead of calling Vocabulary.from_instances, we call reader.get_vocab(instances), or something? I'm not really sure what you gain by doing that, except additional unnecessary coupling.

I suppose one option is to just have the default Vocabulary be None, and make sure everything handles that correctly. If you're using a pre-trained transformer, most of the time you don't care about the vocabulary at all.

Well, that breaks any kind of labeling task. So does having a separate transformer-specified vocab object - where does the label vocabulary come from, if you're not iterating over your instances?

One possibility: add a parameter to Vocabulary.from_instances (or a new constructor) that populates one namespace from a transformer. I suppose this is basically what you said in recommending a Vocabulary subclass for transformers. If you really need access to the transformer vocab in your model, you should use this. Well, actually, if you're ever going to make anything human readable, then you need the transformer vocab in your model. And you almost always want that, so you almost always want the transformer vocab in your model.

Looking back at what you were complaining about, we have always read through the data twice before starting training (though the second time is technically during training). The only way around that is to precompute your vocab and load it with from_files. So basically you just want a shortcut to get your vocab from_transformer, similar to from_files, without reading through the data. That's fine, but I don't know that it really saves you much, and it only works when you don't have any labels to assign indices to.
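
Concretely, that shortcut might look like this (from_transformer is hypothetical, and instances stands in for your dataset):

from allennlp.data import Vocabulary

# Build the wordpiece namespace straight from the tokenizer, no data pass:
vocab = Vocabulary.from_transformer("bert-base-uncased", namespace="bert")

# ...but a labeling task still needs a pass over the data for its label
# namespaces, so you'd want some way of combining the two, e.g.:
vocab = Vocabulary.from_instances(instances)  # builds "labels" et al. from the data
# plus filling the "bert" namespace from the tokenizer as above.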

Sorry this is a bit rambling; I'm a bit too tired at the moment to clean it up into a more coherent story. Hopefully the points still make sense.

@dirkgr (Member) commented Feb 22, 2021

I continue to have thoughts about this, but I don't think a GitHub issue is the right place for them right now.

dirkgr closed this as completed Feb 22, 2021