Access to pre_tokenizer for PreTrainedTokenizer #26254

Closed
GitMew opened this issue Sep 19, 2023 · 4 comments
@GitMew

GitMew commented Sep 19, 2023

Feature request

Give access to setting a pre_tokenizer for a transformers.PreTrainedTokenizer, similar to how this works for PreTrainedTokenizerFast.

Motivation

As far as I understand from these docs, there are two interfaces for interacting with tokenizers in the HuggingFace ecosystem: PreTrainedTokenizerFast is a wrapper around Rust code, and PreTrainedTokenizer is supposed to be the slow Python equivalent.

PreTrainedTokenizerFast has a property backend_tokenizer which is a tokenizers.Tokenizer object, which has a pre_tokenizer property and is built from a tokenizers.models.Model subclass (the thing that does the tokenization). You can instantiate a PreTrainedTokenizerFast from such a Tokenizer object with the constructor argument tokenizer_object. Meanwhile, none of this is accessible for a PreTrainedTokenizer.
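For concreteness, this is roughly what the fast path allows (a minimal sketch only; the WordLevel model and the toy vocabulary are just placeholders for illustration):

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Build a backend Tokenizer with an explicit pre-tokenizer...
backend = Tokenizer(WordLevel(vocab={"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()

# ...and wrap it in the fast interface, which exposes it as backend_tokenizer.
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=backend, unk_token="[UNK]")
print(fast_tokenizer.tokenize("hello world"))  # ['hello', 'world']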

Here is my use-case: I have a function tokenizeWord(w: str) implemented entirely in Python to segment a single word into subwords. I would now like to

  1. Build a PreTrainedTokenizer from this function, and
  2. pre-tokenize sentences on punctuation and whitespace so that each word is sent to that function separately.

I can do the first as follows (at least I think this is how it's supposed to be done):

from typing import List

from transformers import PreTrainedTokenizer


class CustomTokenizer(PreTrainedTokenizer):

    def __init__(self, custom_tkz_algorithm, **kwargs):
        super().__init__(**kwargs)
        self.algorithm = custom_tkz_algorithm
        self.vocab         = self.algorithm.get_vocab()             # token string -> ID
        self.reverse_vocab = {i: s for s, i in self.vocab.items()}  # Assume that the vocabulary is injective (no duplicate IDs)

    @property
    def vocab_size(self) -> int:
        return len(self.vocab)

    def _convert_token_to_id(self, token):
        return self.vocab[token]

    def _convert_id_to_token(self, index: int) -> str:
        return self.reverse_vocab[index]

    def _tokenize(self, text, **kwargs) -> List[str]:
        """
        Converts a string in a sequence of tokens (string), using the tokenizer. Split in words for word-based
        vocabulary or sub-words for sub-word-based vocabularies (BPE/SentencePieces/WordPieces).

        Do NOT take care of added tokens.
        """
        return tokenizeWord(text)

but where does the pre-tokenizer come in? It doesn't even seem feasible to manually use the pre-tokenizers provided by tokenizers.pre_tokenizers (e.g. Whitespace, to name one), because those are Rust-backed interfaces, and hence the objects they produce don't plug into a simple Python string segmentation function.

Your contribution

None.

@ArthurZucker
Collaborator

Hey! The equivalent of pre-tokenizers is not implemented directly for PreTrainedTokenizer (yet, something might be on its way). The pre-tokenization is usually done in the prepare_for_tokenization function.
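For illustration only, a rough sketch of overriding that hook in a slow tokenizer (not actual library code; the whitespace normalisation is just an example of the string-in/string-out clean-up that fits its shape):

    def prepare_for_tokenization(self, text, is_split_into_words=False, **kwargs):
        # String-to-string clean-up can live here, e.g. collapsing runs of whitespace.
        text = " ".join(text.split())
        return (text, kwargs)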

@GitMew
Author

GitMew commented Sep 29, 2023

@ArthurZucker Thanks for your reply! That's unfortunate. One would expect the two classes to derive from the same base class, and that base class to offer pretokenisation (and postprocessing, while we're at it).

I did see the prepare_for_tokenization function, but as far as I can see, it is supposed to output a string, not e.g. a list of strings to be tokenised separately, unless I violate its type signature. That seems like a bad idea, given that the PreTrainedTokenizer.tokenize function looks something like this, abstracted:

    def tokenize(self, text, **kwargs):
        text, kwargs = self.prepare_for_tokenization(text, **kwargs)
        ...
        tokens = self.tokens_trie.split(text)
        ...
        tokenized_text = []
        for token in tokens:
            ...
            tokenized_text.extend(self._tokenize(token))

...wherein I assume the tokens_trie is only used to isolate a small set of very special tokens, and .split expects a string. Do you have an example of how people would e.g. include Whitespace() in prepare_for_tokenization compatible with this?

@ArthurZucker
Collaborator

Usually the Whitespace() handling is done in this function, which is applied to all the inputs if the input is a batch of strings.
A lot of sentencepiece models do this (see LlamaTokenizer, for example). It is sometimes done in the tokenize method.
I agree with you that the fast and slow tokenizers lack consistency, and I've noted this for future improvements 🤗 Thanks for your input
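One possible sketch, building on the CustomTokenizer above (it assumes tokenizeWord from the original post; Whitespace().pre_tokenize_str returns plain Python strings with offsets, so it can feed a pure-Python segmentation function):

    from tokenizers.pre_tokenizers import Whitespace

    def _tokenize(self, text, **kwargs) -> List[str]:
        # Split on whitespace/punctuation first, then segment each piece separately.
        pieces = Whitespace().pre_tokenize_str(text)  # e.g. [('Hello', (0, 5)), (',', (5, 6)), ...]
        tokens = []
        for word, _offsets in pieces:
            tokens.extend(tokenizeWord(word))
        return tokens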

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Nov 9, 2023