Access to pre_tokenizer for PreTrainedTokenizer #26254

Closed
GitMew opened this issue Sep 19, 2023 · 4 comments
@GitMew

GitMew commented Sep 19, 2023

Feature request

Give access to setting a pre_tokenizer for a transformers.PreTrainedTokenizer, similar to how this works for PreTrainedTokenizerFast.

Motivation

As far as I understand from these docs, there are two interfaces for interacting with tokenizers in the HuggingFace ecosystem: PreTrainedTokenizerFast is a wrapper around Rust code, and PreTrainedTokenizer is supposed to be the slow Python equivalent.

PreTrainedTokenizerFast has a property backend_tokenizer which is a tokenizers.Tokenizer object, which has a pre_tokenizer property and is built from a tokenizers.models.Model subclass (the thing that does the tokenization). You can instantiate a PreTrainedTokenizerFast from such a Tokenizer object with the constructor argument tokenizer_object. Meanwhile, none of this is accessible for a PreTrainedTokenizer.
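For concreteness, this is roughly what the fast path allows (a minimal sketch only; the WordLevel model and the toy vocabulary are just placeholders for illustration):

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Build a backend Tokenizer with an explicit pre-tokenizer...
backend = Tokenizer(WordLevel(vocab={"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()

# ...and wrap it in the fast interface, which exposes it as backend_tokenizer.
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=backend, unk_token="[UNK]")
print(fast_tokenizer.tokenize("hello world"))  # ['hello', 'world']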

Here is my use-case: I have a function tokenizeWord(w: str) implemented entirely in Python to segment a single word into subwords. I would now like to

  1. Build a PreTrainedTokenizer from this function, and
  2. pre-tokenize sentences on punctuation and whitespace so that each word is sent to that function separately.

I can do the first as follows (at least I think this is how it's supposed to be done):

from typing import List

from transformers import PreTrainedTokenizer


class CustomTokenizer(PreTrainedTokenizer):

    def __init__(self, custom_tkz_algorithm, **kwargs):
        super().__init__(**kwargs)
        self.algorithm = custom_tkz_algorithm
        self.vocab         = self.algorithm.get_vocab()             # token string -> ID
        self.reverse_vocab = {i: s for s, i in self.vocab.items()}  # Assume that the vocabulary is injective (no duplicate IDs)

    @property
    def vocab_size(self) -> int:
        return len(self.vocab)

    def _convert_token_to_id(self, token):
        return self.vocab[token]

    def _convert_id_to_token(self, index: int) -> str:
        return self.reverse_vocab[index]

    def _tokenize(self, text, **kwargs) -> List[str]:
        """
        Converts a string in a sequence of tokens (string), using the tokenizer. Split in words for word-based
        vocabulary or sub-words for sub-word-based vocabularies (BPE/SentencePieces/WordPieces).

        Do NOT take care of added tokens.
        """
        return tokenizeWord(text)

but where does the pre-tokenizer come in? It doesn't even seem feasible to manually use the pre-tokenizers provided by tokenizers.pre_tokenizers (e.g. Whitespace, to name one), because those are Rust-backed interfaces, and hence the objects they produce don't plug into a simple Python string segmentation function.

Your contribution

None.

@ArthurZucker
Collaborator

Hey! The equivalent of pre-tokenizers is not implemented directly for PreTrainedTokenizer (yet, something might be on its way). The pre-tokenization is usually done in the prepare_for_tokenization function.
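For illustration only, a rough sketch of overriding that hook in a slow tokenizer (not actual library code; the whitespace normalisation is just an example of the string-in/string-out clean-up that fits its shape):

    def prepare_for_tokenization(self, text, is_split_into_words=False, **kwargs):
        # String-to-string clean-up can live here, e.g. collapsing runs of whitespace.
        text = " ".join(text.split())
        return (text, kwargs)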

@GitMew
Author

GitMew commented Sep 29, 2023

@ArthurZucker Thanks for your reply! That's unfortunate. One would expect the two classes to derive from the same base class, and that base class to offer pretokenisation (and postprocessing, while we're at it).

I did see the prepare_for_tokenization function, but as far as I can see, it is supposed to output a string, not e.g. a list of strings to be tokenised separately, unless I violate its type signature. That seems like a bad idea, given that the PreTrainedTokenizer.tokenize function looks something like this, abstracted:

    def tokenize(self, text, **kwargs):
        text, kwargs = self.prepare_for_tokenization(text, **kwargs)
        ...
        tokens = self.tokens_trie.split(text)
        ...
        tokenized_text = []
        for token in tokens:
            ...
            tokenized_text.extend(self._tokenize(token))

...wherein I assume the tokens_trie is only used to isolate a small set of very special tokens, and .split expects a string. Do you have an example of how people would e.g. include Whitespace() in prepare_for_tokenization compatible with this?

@ArthurZucker
Collaborator

Usually the Whitespace() handling is done in this function, which is applied to all the inputs if the input is a batch of strings.
A lot of sentencepiece models do this (see LlamaTokenizer, for example). It is sometimes done in the tokenize method.
I agree with you that the fast and slow tokenizers lack consistency, and I've noted this for future improvements 🤗 Thanks for your input
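One possible sketch, building on the CustomTokenizer above (it assumes tokenizeWord from the original post; Whitespace().pre_tokenize_str returns plain Python strings with offsets, so it can feed a pure-Python segmentation function):

    from tokenizers.pre_tokenizers import Whitespace

    def _tokenize(self, text, **kwargs) -> List[str]:
        # Split on whitespace/punctuation first, then segment each piece separately.
        pieces = Whitespace().pre_tokenize_str(text)  # e.g. [('Hello', (0, 5)), (',', (5, 6)), ...]
        tokens = []
        for word, _offsets in pieces:
            tokens.extend(tokenizeWord(word))
        return tokens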

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Nov 9, 2023