Folks, I am trying to convert the BioBERT model to PyTorch. Here is what I have done so far:
1. For the vocab: I tried to convert the vocab using the solution from #69:
tokenizer = BartTokenizer.from_pretrained('/content/biobert_v1.1_pubmed/vocab.txt')
I get:
OSError: Model name '/content/biobert_v1.1_pubmed' was not found in tokenizers model name list (bart-large, bart-large-mnli, bart-large-cnn, bart-large-xsum). We assumed '/content/biobert_v1.1_pubmed' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.
I don't have a vocab.json, so how do I convert the vocab for the tokenizer?
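For context, a BERT-style vocab.txt is a WordPiece vocabulary (one token per line, where the line index is the token id), which is why BartTokenizer's BPE loader rejects it: BART expects vocab.json and merges.txt instead. A minimal sketch of how such a file maps tokens to ids (load_vocab here is a hypothetical helper written for illustration, not a transformers API):

```python
# Hypothetical minimal reader for a BERT-style vocab.txt: one token per
# line, and the line index is the token id (a sketch, not the real loader).
def load_vocab(lines):
    return {token: idx for idx, token in enumerate(line.strip() for line in lines)}

# Toy vocabulary standing in for the first lines of a real vocab.txt
vocab = load_vocab(["[PAD]\n", "[UNK]\n", "the\n", "##ing\n"])
print(vocab["##ing"])  # -> 3
```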
2. For the model: Since the out-of-the-box pytorch_pretrained_bert.convert_tf_checkpoint_to_pytorch did not work, I customized it per #2 by adding:
excluded = ['BERTAdam', '_power', 'global_step']
init_vars = [v for v in init_vars if all(e not in v[0] for e in excluded)]
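To illustrate what that filter does, here it is applied to a few dummy checkpoint variable names (the names and shapes below are made up for the sketch; a real checkpoint has many more entries):

```python
# Dummy (name, shape) pairs standing in for the TF checkpoint variable list;
# the names here are illustrative, not taken from a real BioBERT checkpoint.
excluded = ['BERTAdam', '_power', 'global_step']
init_vars = [
    ('bert/encoder/layer_0/output/dense/kernel', [768, 768]),
    ('global_step', []),                  # dropped: optimizer bookkeeping
    ('beta1_power', []),                  # dropped: matches '_power'
    ('bert/embeddings/word_embeddings', [28996, 768]),
]

# Keep only variables whose name contains none of the excluded substrings
kept = [v for v in init_vars if all(e not in v[0] for e in excluded)]
print([name for name, _ in kept])
# -> ['bert/encoder/layer_0/output/dense/kernel', 'bert/embeddings/word_embeddings']
```

The point of the filter is that optimizer state (Adam moments, power accumulators, the step counter) has no counterpart in the PyTorch model, so those variables must be skipped during conversion.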
With this, the model seems to convert fine. But when I load it using:
model = BartForConditionalGeneration.from_pretrained('path/to/model/biobert_v1.1_pubmed_pytorch.model')
I still get:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Can you please help me understand what is going on here?
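On the UnicodeDecodeError itself: legacy PyTorch checkpoints are pickle files, and a protocol-2 pickle starts with byte 0x80, so this traceback suggests the binary weights file is being opened and decoded as UTF-8 text (for example, if the given path is read as if it were a JSON config or a vocab file). A minimal reproduction of just the decode step:

```python
# First bytes of a protocol-2 pickle (the format legacy torch checkpoints
# use); 0x80 is not a valid UTF-8 start byte, so text decoding fails.
data = b"\x80\x02"
try:
    data.decode("utf-8")
    raised = False
except UnicodeDecodeError as e:
    raised = True
    print(e)  # 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

print(raised)  # -> True
```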