What should the model in AutoTokenizer be? Bert? #2

Open
lskunk opened this issue Sep 25, 2023 · 4 comments

Comments

@lskunk

lskunk commented Sep 25, 2023

Since I cannot connect to huggingface, I downloaded the bert model to my local machine and modified the code as follows:

try:
    self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
except:
    self.tokenizer = BertTokenizer.from_pretrained('/home/lsk/python/code/bert-base-uncased/bert-base-uncased-vocab.txt', use_fast=True)

and:

class ASTE(pl.LightningModule):
    def __init__(self, hparams, data_module):
        super().__init__()
        self.save_hyperparameters(hparams)
        self.data_module = data_module

        self.config = BertConfig.from_pretrained(self.hparams.model_name_or_path)
        self.config.table_num_labels = self.data_module.table_num_labels

Why do I get the following error?

Original Traceback (most recent call last):
  File "/home/lsk/miniconda3/envs/de/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/lsk/miniconda3/envs/de/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "/home/lsk/BDTF/code/utils/aste_datamodule.py", line 82, in __call__
    batch = self.tokenizer_function(examples)
  File "/home/lsk/BDTF/code/utils/aste_datamodule.py", line 154, in tokenizer_function
    encoding = batch_encodings[i]
  File "/home/lsk/miniconda3/envs/de/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 247, in __getitem__
    raise KeyError(
KeyError: 'Invalid key. Only three types of key are available: (1) string, (2) integers for backend Encoding, and (3) slices for data subsetting.'

What should the models in AutoTokenizer and AutoConfig be? Could you please answer? Thank you.

@1140310118
Collaborator

Hi, the argument you are passing to BertTokenizer.from_pretrained(model_name_or_path) here is not set correctly. model_name_or_path should either be bert-base-uncased or a directory that contains a file named vocab.txt.
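
For example, a minimal sketch of what is described above, assuming the local directory /home/lsk/python/code/bert-base-uncased/ actually contains a file named vocab.txt (the path is only illustrative):

from transformers import BertTokenizer

# Option 1: load from a local directory that contains vocab.txt (illustrative path).
tokenizer = BertTokenizer.from_pretrained('/home/lsk/python/code/bert-base-uncased/')

# Option 2: load by model name (requires access to huggingface.co).
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

print(tokenizer.tokenize("Hello, world!"))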

@lskunk
Author

lskunk commented Sep 26, 2023

If this argument were set incorrectly, the code should fail on that line rather than somewhere else. If I do it the way you suggested, the error is as follows:

  File "/home/lsk/BDTF/code/utils/aste_datamodule.py", line 208, in __init__
    self.tokenizer = BertTokenizer.from_pretrained('/home/lsk/python/code/bert-base-uncased/', use_fast=True)
  File "/home/lsk/miniconda3/envs/de/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1838, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for '/home/lsk/python/code/bert-base-uncased/'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/home/lsk/python/code/bert-base-uncased/' is the correct path to a directory containing all relevant files for a BertTokenizer tokenizer.

Also, the following code runs fine for me:

from transformers import AutoTokenizer, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('/home/lsk/python/code/bert-base-uncased/bert-base-uncased-vocab.txt')
batch_sentences = ["Hello I'm a single sentence", "And another sentence", "And the very very last one"]

batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")

print(batch)

@1140310118
Collaborator

Hi, I tested this, and the problem is most likely still with how the tokenizer is loaded.

Loading it the following way does raise the error:

tok = BertTokenizer.from_pretrained('/data10T/zhangyice/2023/pretrained_models/bert-base-uncased/', use_fast=True)
tok(['1', '2'])[0]

The error message is as follows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zhangyice/anaconda3/envs/pytorch191/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 247, in __getitem__
    raise KeyError(
KeyError: 'Invalid key. Only three types of key are available: (1) string, (2) integers for backend Encoding, and (3) slices for data subsetting.'

Note that the /data10T/zhangyice/2023/pretrained_models/bert-base-uncased/ directory needs to contain config.json and vocab.txt; also, replacing the path above with /data10T/zhangyice/2023/pretrained_models/bert-base-uncased/vocab.txt raises the same error.

However, loading it the following way does not raise an error:

tok = AutoTokenizer.from_pretrained('/data10T/zhangyice/2023/pretrained_models/bert-base-uncased/', use_fast=True)
tok(['1', '2'])[0]

@evanyfyang
Collaborator

KeyError: 'Invalid key. Only three types of key are available: (1) string, (2) integers for backend Encoding, and (3) slices for data subsetting.'
This error occurs because BertTokenizer does not have a use_fast parameter.
That parameter only exists in AutoTokenizer, where its purpose is to dispatch to the BertTokenizerFast tokenizer.
Changing this to BertTokenizerFast.from_pretrained(<your_model_path>) will fix it.
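
A minimal sketch of that fix (the model path is a placeholder; the directory should contain vocab.txt and config.json):

from transformers import BertTokenizerFast

# BertTokenizerFast uses the fast (Rust) backend, so indexing a batch encoding
# with an integer returns a backend Encoding instead of raising the KeyError above.
model_path = '/path/to/bert-base-uncased'  # placeholder for your local model directory
tokenizer = BertTokenizerFast.from_pretrained(model_path)

batch = tokenizer(['1', '2'])
print(batch[0])  # the Encoding for the first sentence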
