FastTokenizer not using the user_defined_symbols defined in the SentencePiece Model #28324

Closed
kitkhai opened this issue Jan 3, 2024 · 2 comments
kitkhai commented Jan 3, 2024

System Info

  • transformers version: 4.35.2
  • Platform: Linux-6.1.58+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.20.1
  • Safetensors version: 0.4.1
  • Accelerate version: 0.25.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0+cu121 (False)
  • Tensorflow version (GPU?): 2.15.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.7.5 (cpu)
  • Jax version: 0.4.23
  • JaxLib version: 0.4.23
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers.convert_slow_tokenizer import import_protobuf
from transformers import AutoTokenizer
from transformers import NllbTokenizer, NllbTokenizerFast

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.save_pretrained("old_tokenizer")

model_pb2 = import_protobuf()
m = model_pb2.ModelProto()
m.ParseFromString(open("./old_tokenizer/sentencepiece.bpe.model", 'rb').read())

piece = m.SentencePiece()
piece.piece = "superlongword"
piece.score = -10
piece.type = 4  # USER_DEFINED

m.pieces.extend([piece])
with open("temp_sentencepiece.bpe.model", 'wb') as f:
    f.write(m.SerializeToString())


tokenizer_edited = NllbTokenizer(vocab_file="temp_sentencepiece.bpe.model", src_lang = "zho_Hans", tgt_lang = "eng_Latn")
tokenizer_edited_fast = NllbTokenizerFast(vocab_file="temp_sentencepiece.bpe.model", src_lang = "zho_Hans", tgt_lang = "eng_Latn")

sent = 'Hi there superlongword'
print(sent)
> Hi there superlongword

print("original tokenizer: ", tokenizer.tokenize(sent))
> original tokenizer:  ['▁Hi', '▁there', '▁super', 'long', 'word']

print("tokenizer with tokens: ", tokenizer_edited.tokenize(sent))
> tokenizer with tokens:  ['▁Hi', '▁there', '▁', 'superlongword']

print("tokenizer with tokens (Fast): ", tokenizer_edited_fast.tokenize(sent))
> tokenizer with tokens (Fast):  ['▁Hi', '▁there', '▁super', 'long', 'word']

Expected behavior

> Hi there superlongword
> original tokenizer:  ['▁Hi', '▁there', '▁super', 'long', 'word']
> tokenizer with tokens:  ['▁Hi', '▁there', '▁', 'superlongword']
> tokenizer with tokens (Fast):  ['▁Hi', '▁there', '▁', 'superlongword']

I faced an issue similar to one raised in a question on the HF forum, where the OP trained the tokenizer with user_defined_symbols; in my case I added the symbol to the SentencePiece model file directly, without retraining.

Note that I could just use the add_tokens method to achieve the same outcome, but because of another issue I raised (#28218) I would like to avoid add_tokens if possible.
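For reference, a minimal sketch of that add_tokens route (my own illustration, assuming the same NLLB checkpoint as above); added tokens are matched before the underlying model runs, so the new piece comes out whole:

from transformers import AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"
tok = AutoTokenizer.from_pretrained(checkpoint)   # fast tokenizer by default
tok.add_tokens(["superlongword"])                 # matched before the BPE model runs
print(tok.tokenize("Hi there superlongword"))     # "superlongword" should appear as a single token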

kitkhai commented Jan 3, 2024

Additionally, is there a way to retrieve (and edit) the merge rules from "slow" & "fast" tokenizers respectively?
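A hedged sketch of one way to at least inspect and rewrite them for the fast tokenizer, reusing tokenizer_edited_fast from the reproduction above (the "model"/"merges"/"vocab" layout is an assumption based on the tokenizer.json format, not a documented editing API). The slow NllbTokenizer wraps a SentencePiece model, which stores pieces and scores rather than a BPE merges list:

import json
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

state = json.loads(tokenizer_edited_fast.backend_tokenizer.to_str())  # dump the Rust backend to JSON
merges = state["model"]["merges"]                                     # BPE merge rules
vocab = state["model"]["vocab"]                                       # piece -> id
print(len(merges), merges[:3])

# After editing `state`, it can be rebuilt into a fast tokenizer:
rebuilt = PreTrainedTokenizerFast(tokenizer_object=Tokenizer.from_str(json.dumps(state)))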

ArthurZucker (Collaborator) commented

Hey! A few things here. What you are trying to do is outside the scope of the supported features; adding a token should be done with the tokenizer.add_tokens function.
The fast version is, to me, more correct than what you expect: if there are no merges, there is absolutely no reason for the BPE model to fuse '▁super', 'long', 'word' into superlongword. So the slow version seems more wrong, specifically because sentencepiece does not really allow adding tokens that way.
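To make that concrete, a small check one could run on the reproduction above (the layout of the dumped backend JSON is an assumption based on the tokenizer.json format): the edited piece can sit in the fast tokenizer's BPE vocab, but without a merge rule that builds it, the BPE model will never output it.

import json

state = json.loads(tokenizer_edited_fast.backend_tokenizer.to_str())     # dump the Rust backend to JSON
print("superlongword" in state["model"]["vocab"])                        # is the piece in the BPE vocab?
print(any("superlongword" in str(m) for m in state["model"]["merges"]))  # does any merge rule produce it?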

kitkhai closed this as completed Jan 3, 2024
kitkhai closed this as not planned Jan 3, 2024