Tokenizer adds an additional space after the added token #28218

Open
kitkhai opened this issue Dec 23, 2023 · 6 comments

kitkhai commented Dec 23, 2023

System Info

  • transformers version: 4.35.2
  • Platform: Linux-6.1.58+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.1
  • Accelerate version: 0.25.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0+cu121 (False)
  • Tensorflow version (GPU?): 2.15.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.7.5 (cpu)
  • Jax version: 0.4.23
  • JaxLib version: 0.4.23
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn", tgt_lang="zho_Hans")
tokenizer.add_tokens(["abcd"])

# Added token immediately followed by other characters (no space after "abcd")
sent = 'I like to walk abcdgym along the beach'
print("tokenizer: ", tokenizer.tokenize(sent))
print("tokenizer: ", tokenizer.decode(tokenizer.encode(sent)[1:-1]))

# Added token immediately preceded by other characters (no space before "abcd")
sent = 'I like to walk gymabcd along the beach'
print("tokenizer: ", tokenizer.tokenize(sent))
print("tokenizer: ", tokenizer.decode(tokenizer.encode(sent)[1:-1]))

Expected behavior

The output from my code:
[screenshot of the printed tokenize/decode output]

The original post where I raised this potential bug and was asked to file an issue is here: https://discuss.huggingface.co/t/tokenizer-shrinking-recipes/8564/5

For context, I am originally trying to add Chinese tokens to the tokenizer, but for illustration purposes I have demonstrated the “bug” in English. Chinese words are not separated by spaces, which is why the example adds a token that is a subword.

Evidently, tokenizer.add_tokens() works well when there is always a space after the added token, but it doesn’t work as intended when the added token is immediately followed by other characters: the tokenizer then introduces an additional space on its own.

I read the docs and figured this is probably because added tokens are isolated (split out) before the tokenization algorithm is applied, so I am not 100% sure whether this behaviour is intended.
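
To make that isolation step visible, here is a minimal sketch (my own illustration, not from the docs) comparing the same sentence with and without the added token:

from transformers import AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"

plain = AutoTokenizer.from_pretrained(checkpoint)
print(plain.tokenize("I like to walk abcdgym along the beach"))

extended = AutoTokenizer.from_pretrained(checkpoint)
extended.add_tokens(["abcd"])
# "abcd" is split out before the SentencePiece model runs, and the remainder
# "gym..." is tokenized as if it started a new word, picking up an extra "▁"
# that later decodes to the unwanted space.
print(extended.tokenize("I like to walk abcdgym along the beach"))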

@ArthurZucker
Collaborator

Hey! Thanks for raising the issue. This is pretty much a duplicate of #26318 and will be fixed by #27883!
You can already work around this by following the tutorial here: huggingface/tokenizers#1357

@ArthurZucker
Collaborator

The PR was merged, let me check whether this is fixed!

@ArthurZucker
Collaborator

Okay, not fixed yet. I'll include it in #27717.

@kitkhai
Author

kitkhai commented Feb 1, 2024

Hi @ArthurZucker, excited to hear about the progress! :)

Also, I realised that multiple added tokens (abcd and gyma) can overlap and both match within the same word gymabcd. The current implementation seems to just go from left to right and take whichever added token appears first.
Instead of matching left to right, is there a way to control which added token takes precedence and is segmented first?

Just a reminder that I am thinking of the Chinese language, where words are not separated by spaces, hence my seemingly weird example of gymabcd.

@ArthurZucker
Collaborator

There is no real way to do that yet; I think we check the longest match first.
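
A quick way to probe which overlapping token wins in practice (just an illustrative check, not documented behaviour):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
# Two added tokens that overlap inside "gymabcd": "gyma" and "abcd"
tokenizer.add_tokens(["abcd", "gyma"])

# Whichever token the added-token matcher picks shows up intact in the split;
# the other candidate is broken into ordinary subwords.
print(tokenizer.tokenize("I like to walk gymabcd along the beach"))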

@ArthurZucker
Collaborator

ArthurZucker commented Feb 28, 2024

That is not fixed yet, but it can be fixed somewhat manually if we follow what was done for SpmConverter:

    def pre_tokenizer(self, replacement, add_prefix_space):
        prepend_scheme = "always"
        if hasattr(self.original_tokenizer, "legacy") and not self.original_tokenizer.legacy:
            prepend_scheme = "first"
        return pre_tokenizers.Metaspace(
            replacement=replacement, add_prefix_space=add_prefix_space, prepend_scheme=prepend_scheme
        )

That is, setting legacy to False for the fast tokenizer.
I'll add this as a good difficult issue, as this should be similar to #26678
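
Until then, a minimal sketch of what applying this manually could look like (assuming the NLLB fast tokenizer's pre-tokenizer is a single Metaspace as in SpmConverter above, and that your installed tokenizers version accepts the prepend_scheme keyword; this is a workaround sketch, not the final fix):

from transformers import AutoTokenizer
from tokenizers import pre_tokenizers

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn", tgt_lang="zho_Hans")
tokenizer.add_tokens(["abcd"])

# Prepend the "▁" marker only to the very first piece ("first") instead of to every
# piece produced after splitting on added tokens ("always", the legacy behaviour).
tokenizer.backend_tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
    replacement="▁", prepend_scheme="first"
)

sent = "I like to walk abcdgym along the beach"
print(tokenizer.tokenize(sent))
print(tokenizer.decode(tokenizer.encode(sent)[1:-1]))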
