Do we ever use the start and end tokens #77

Open
allanj opened this issue Mar 17, 2023 · 0 comments
allanj commented Mar 17, 2023

I saw this function, but it never seems to be called:

galai/galai/model.py, lines 143 to 172 in 3a724f5:

def _set_tokenizer(self, tokenizer_path: str):
    """
    Configures the tokenizer for the model

    Parameters
    ----------
    tokenizer_path : str
        Path for the tokenizer (str)
    """
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    # setup padding
    tokenizer.pad_token_id = 1
    tokenizer.pad_token = "<pad>"
    tokenizer.padding_side = "left"
    # setup truncation
    tokenizer.truncation_side = "left"
    # setup special tokens
    tokenizer.bos_token_id = 0
    tokenizer.bos_token = "<s>"
    tokenizer.eos_token_id = 2
    tokenizer.eos_token = "</s>"
    tokenizer.unk_token = "<unk>"
    tokenizer.unk_token_id = 3
    self.tokenizer = tokenizer

But I'm quite confused: it seems we never use the bos and eos tokens. Even in transformers, the bos, eos, and pad tokens come out as None if we print them. Do we actually use bos and eos or not?
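
For reference, a minimal sketch of the check described above, loading the tokenizer directly from the Hugging Face Hub. The facebook/galactica-125m checkpoint name is an assumption; any Galactica checkpoint should show the same behaviour.

from transformers import AutoTokenizer

# Checkpoint name is an assumption for illustration.
tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-125m")

# Without the overrides applied in _set_tokenizer above,
# the special tokens are unset on the loaded tokenizer:
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.pad_token)
# Per the report above, this prints: None None None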
