Tokenizer adds an additional space after the added token #28218

Open
kitkhai opened this issue Dec 23, 2023 · 6 comments

kitkhai commented Dec 23, 2023

System Info

  • transformers version: 4.35.2
  • Platform: Linux-6.1.58+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.1
  • Accelerate version: 0.25.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0+cu121 (False)
  • Tensorflow version (GPU?): 2.15.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.7.5 (cpu)
  • Jax version: 0.4.23
  • JaxLib version: 0.4.23
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn", tgt_lang="zho_Hans")
tokenizer.add_tokens(["abcd"])

# Added token immediately followed by other characters (no space after "abcd")
sent = 'I like to walk abcdgym along the beach'
print("tokenizer: ", tokenizer.tokenize(sent))
print("tokenizer: ", tokenizer.decode(tokenizer.encode(sent)[1:-1]))

# Added token immediately preceded by other characters (no space before "abcd")
sent = 'I like to walk gymabcd along the beach'
print("tokenizer: ", tokenizer.tokenize(sent))
print("tokenizer: ", tokenizer.decode(tokenizer.encode(sent)[1:-1]))

Expected behavior

The output from my code:
[screenshot of the printed tokenize/decode output]

The original post where I raised this potential bug and was asked to file an issue is here: https://discuss.huggingface.co/t/tokenizer-shrinking-recipes/8564/5

For context, I am originally trying to add Chinese tokens to the tokenizer, but for illustration purposes I have demonstrated the “bug” in English. Chinese words are not separated by spaces, which is why the example adds a token that is a subword.

Evidently, tokenizer.add_tokens() works well when there is always a space after the added token, but it doesn’t work as intended when the added token is immediately followed by other characters: the tokenizer then introduces an additional space on its own.

I read the docs and figured this is probably because added tokens are isolated (split out) before the tokenization algorithm is applied, so I am not 100% sure whether this behaviour is intended.
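
To make that isolation step visible, here is a minimal sketch (my own illustration, not from the docs) comparing the same sentence with and without the added token:

from transformers import AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"

plain = AutoTokenizer.from_pretrained(checkpoint)
print(plain.tokenize("I like to walk abcdgym along the beach"))

extended = AutoTokenizer.from_pretrained(checkpoint)
extended.add_tokens(["abcd"])
# "abcd" is split out before the SentencePiece model runs, and the remainder
# "gym..." is tokenized as if it started a new word, picking up an extra "▁"
# that later decodes to the unwanted space.
print(extended.tokenize("I like to walk abcdgym along the beach"))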

@ArthurZucker
Collaborator

Hey! Thanks for raising the issue. This is pretty much a duplicate of #26318 and will be fixed by #27883!
You can already work around this by following the tutorial here: huggingface/tokenizers#1357

@ArthurZucker
Collaborator

The PR was merged, let me check whether this is fixed!

@ArthurZucker
Collaborator

Okay, not fixed yet. I'll include it in #27717.

@kitkhai
Author

kitkhai commented Feb 1, 2024

Hi @ArthurZucker, excited to hear about the progress! :)

Also, I realised that multiple added tokens (abcd and gyma) can overlap and both match within the same word gymabcd. The current implementation seems to just go from left to right and take whichever added token appears first.
Instead of matching left to right, is there a way to control which added token takes precedence and is segmented first?

Just a reminder that I am thinking of the Chinese language, where words are not separated by spaces, hence my seemingly weird example of gymabcd.

@ArthurZucker
Collaborator

There is no real way to do that yet; I think we check the longest match first.
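
A quick way to probe which overlapping token wins in practice (just an illustrative check, not documented behaviour):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
# Two added tokens that overlap inside "gymabcd": "gyma" and "abcd"
tokenizer.add_tokens(["abcd", "gyma"])

# Whichever token the added-token matcher picks shows up intact in the split;
# the other candidate is broken into ordinary subwords.
print(tokenizer.tokenize("I like to walk gymabcd along the beach"))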

@ArthurZucker
Collaborator

ArthurZucker commented Feb 28, 2024

That is not fixed yet, but it can be fixed somewhat manually if we follow what was done for SpmConverter:

    def pre_tokenizer(self, replacement, add_prefix_space):
        prepend_scheme = "always"
        if hasattr(self.original_tokenizer, "legacy") and not self.original_tokenizer.legacy:
            prepend_scheme = "first"
        return pre_tokenizers.Metaspace(
            replacement=replacement, add_prefix_space=add_prefix_space, prepend_scheme=prepend_scheme
        )

That is, setting legacy to False for the fast tokenizer.
I'll add this as a good difficult issue, as this should be similar to #26678
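
Until then, a minimal sketch of what applying this manually could look like (assuming the NLLB fast tokenizer's pre-tokenizer is a single Metaspace as in SpmConverter above, and that your installed tokenizers version accepts the prepend_scheme keyword; this is a workaround sketch, not the final fix):

from transformers import AutoTokenizer
from tokenizers import pre_tokenizers

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn", tgt_lang="zho_Hans")
tokenizer.add_tokens(["abcd"])

# Prepend the "▁" marker only to the very first piece ("first") instead of to every
# piece produced after splitting on added tokens ("always", the legacy behaviour).
tokenizer.backend_tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
    replacement="▁", prepend_scheme="first"
)

sent = "I like to walk abcdgym along the beach"
print(tokenizer.tokenize(sent))
print(tokenizer.decode(tokenizer.encode(sent)[1:-1]))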
