
[RFC] Model-based Tokenizer for Text Chunking #794

Open

yuye-aws opened this issue Jun 14, 2024 · 7 comments

Labels: enhancement

Comments

yuye-aws (Member) commented Jun 14, 2024

Since OpenSearch 2.13, the fixed token length algorithm has been available in the text chunking processor. With this algorithm, users can specify a token limit for each chunked passage. A common use case for the text chunking processor is to append a text embedding processor, so that users can circumvent the information loss caused by truncation in downstream text embedding models.
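
For reference, such a pipeline simply chains the two processors (a sketch only; the model_id value is a placeholder for a deployed text embedding model):

PUT _ingest/pipeline/chunking-embedding-ingest-pipeline
{
  "description": "A text chunking and text embedding ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 384,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "<model_id of a deployed text embedding model>",
        "field_map": {
          "passage_chunk": "passage_chunk_embedding"
        }
      }
    }
  ]
}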

As of OpenSearch 2.15, the fixed token length algorithm only supports word tokenizers. Text embedding models truncate long texts that exceed the token limit of their own tokenizer. Given the disparity between word tokenizers and model-based tokenizers, it is hard for users to assign an accurate value to the token_limit parameter. We are initiating this RFC to solicit feedback from the community on whether and how to implement a model-based tokenizer for the fixed token length algorithm.

Introduction

Tokenization is the process of segmenting a string into a list of individual tokens. Prior to text embedding, language models perform tokenization on the input texts. Each language model has its own model-based tokenizer.

Tokenization results vary across tokenizers. We showcase the difference between word tokenizers and model-based tokenizers with a simple example: the same input string is tokenized with the standard tokenizer and with the tokenizer from the model sentence-transformers/msmarco-distilbert-base-tas-b.

// input 
"It’s fun to contribute a brand-new PR or 2 to OpenSearch!"

// standard tokenizer
['It’s', 'fun', 'to', 'contribute', 'a', 'brand', 'new', 'PR', 'or', '2', 'to', 'OpenSearch']

// sentence-transformers/msmarco-distilbert-base-tas-b
['[CLS]', 'it', '’', 's', 'fun', 'to', 'contribute', 'a', 'brand', '-', 'new', 'pr', 'or', '2', 'to', 'opens', '##ear', '##ch', '!', '[SEP]']

where [CLS] indicates the beginning of a sequence and [SEP] separates sentences. As we can see from the example above, the tokens returned by the standard tokenizer and the model-based tokenizer are quite different: the standard tokenizer returns 12 tokens, while the model-based tokenizer returns 20.
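
For reference, the model-based tokenization above can be reproduced with the HuggingFace tokenizer that ships with the model (a minimal sketch, assuming the transformers Python package is installed):

# Sketch: reproduce the model-based tokenization shown above.
from transformers import AutoTokenizer

text = "It’s fun to contribute a brand-new PR or 2 to OpenSearch!"

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/msmarco-distilbert-base-tas-b"
)

# Convert the encoded ids back to token strings, including [CLS] and [SEP].
ids = tokenizer(text)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'it', '’', 's', 'fun', 'to', 'contribute', 'a', 'brand', '-', 'new',
#  'pr', 'or', '2', 'to', 'opens', '##ear', '##ch', '!', '[SEP]']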

In our first release, we can start with the tokenizers for OpenSearch-provided pretrained models. These models usually do not share the same vocabulary or tokenizer, so we need to support a dedicated tokenizer for each of the following models.

Sentence transformers

  1. huggingface/sentence-transformers/all-distilroberta-v1
  2. huggingface/sentence-transformers/all-MiniLM-L6-v2
  3. huggingface/sentence-transformers/all-MiniLM-L12-v2
  4. huggingface/sentence-transformers/all-mpnet-base-v2
  5. huggingface/sentence-transformers/msmarco-distilbert-base-tas-b
  6. huggingface/sentence-transformers/multi-qa-MiniLM-L6-cos-v1
  7. huggingface/sentence-transformers/multi-qa-mpnet-base-dot-v1
  8. huggingface/sentence-transformers/paraphrase-MiniLM-L3-v2
  9. huggingface/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  10. huggingface/sentence-transformers/paraphrase-mpnet-base-v2
  11. huggingface/sentence-transformers/distiluse-base-multilingual-cased-v1

Sparse encoding models

  1. amazon/neural-sparse/opensearch-neural-sparse-encoding-v1
  2. amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v1
  3. amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1

Cross-encoder models

  1. huggingface/cross-encoders/ms-marco-MiniLM-L-6-v2
  2. huggingface/cross-encoders/ms-marco-MiniLM-L-12-v2

Pros and cons

Here are the pros and cons for model-based tokenizers:

Pros

  1. Enables users to precisely chunk their documents according to the truncation limit of the downstream text embedding model.
  2. Model-based tokenizers are free from the max token count limit of word tokenizers, which defaults to 10,000.

Cons

  1. Unlike word tokenizers, which return the start and end offsets of every token, a model-based tokenizer only returns a list of tokens. As we can see from the example above, a model-based tokenizer may modify the original input, and users may get confused by the content change.
  2. A model-based tokenizer may generate new characters. For example, the word OpenSearch is tokenized into ['opens', '##ear', '##ch'], and it is unclear how to reformat these tokens into human-readable text (see the sketch after this list).
  3. May confuse users about how to specify a word tokenizer versus a model-based tokenizer in the API.
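
The reconstruction issue can be illustrated with the same HuggingFace tokenizer (a minimal sketch, assuming the transformers Python package is installed; not part of the proposed implementation):

# Sketch: naively re-joining WordPiece tokens does not reproduce the original text.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/msmarco-distilbert-base-tas-b"
)

tokens = tokenizer.tokenize("OpenSearch")
print(tokens)                                       # ['opens', '##ear', '##ch']
print(" ".join(tokens))                             # 'opens ##ear ##ch'
print(tokenizer.convert_tokens_to_string(tokens))   # 'opensearch' -- original casing is lost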

API

There are a few options for using a model-based tokenizer in the fixed token length algorithm. Please note that they are not mutually exclusive: we can implement one option in the first release and then support the others later.

Option 1

Specify the tokenizer by its pretrained model name.

PUT _ingest/pipeline/text-chunking-ingest-pipeline
{
  "description": "A text chunking ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    }
  ]
}

Pros

  1. Users can use a model-based tokenizer without deploying the text embedding model.

Cons

  1. Does not support tokenizers from user-uploaded models.
  2. Hard for users to remember the full model name.
  3. The tokenizer is not reusable.

Option 2

After deploying a text embedding model, users can assign its model id to the tokenizer.

PUT _ingest/pipeline/text-chunking-ingest-pipeline
{
  "description": "A text chunking ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer_model_id": <model_id for pretrained models>
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    }
  ]
}

Pros

  1. Supports tokenization for any deployed model.

Cons

  1. Users need to deploy the text embedding model.
  2. Introduces a new parameter named tokenizer_model_id. We need to consider how it interacts with the existing tokenizer parameter.
  3. Need to handle invalid models, such as text-image embedding models.

Option 3

Unlike text embedding models, a tokenizer only needs files like tokenizer.json, tokenizer_config.json, and vocab.txt. Following the behavior of registering models in the ml-commons plugin, users can register their tokenizer without the model weights.

POST /_plugins/_ml/tokenizers/_register
{
  "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b"
}


PUT _ingest/pipeline/text-chunking-ingest-pipeline
{
  "description": "A text chunking ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer_model_id": <tokenizer_id for pretrained tokenizers>
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    }
  ]
}

Pros

  1. Users can register the tokenizer alone, which saves the time and space required for a full model deployment.
  2. After registering the tokenizer, users can reuse it across different ingest pipelines.

Cons

  1. Introduces a new API for tokenizer registration, which may carry potential security risks.
  2. If a user needs both the tokenizer and the embedding model, there would be some duplication.

Open questions

  1. What other options could make a model-based tokenizer available without deploying the text embedding model?
dblock changed the title from "[RFC] Model-baed Tokenizer for Text Chunking" to "[RFC] Model-based Tokenizer for Text Chunking" on Jul 8, 2024
dblock (Member) commented Jul 8, 2024

[Catch All Triage, attendees 1, 2, 3, 4, 5, 6, 7]

Thanks for opening this!

reuschling commented

The current approach is using the formula terms × 0.75 ≈ token_limit, as documented at https://opensearch.org/docs/latest/ingest-pipelines/processors/text-chunking/. I think this is valid; with a specified overlap it feels to me that there is no relevant information loss. Specifying the real token limit as token_limit (as it is known from the model description) would of course be much easier. If the tokenization rules of the specified model are known, it is clear where the model-specific gaps of the chunks are. So specifying the model or model ID sounds good to me.

But the output of the chunking process should only rely on the mapping specification of the target field. This can be anything; it is the choice of the user. I would say that specifying the model should only be used to determine the real chunk length, relying on the model-specific tokenizer rules, so that the chunk length can be determined with a real token limit.

Regarding the options, I would prefer option 2, because it enables every deployed model (I use LLM models not in the preconfigured OpenSearch list) and frees me from specifying an extra tokenizer. I currently can't think of a use case where someone would want to chunk according to a specific model and not use it afterward.

Question: What about models that are not uploaded directly into OpenSearch but are accessed through an underlying connector (as in the LLM case)? Are the tokenization rules known there as well?

yuye-aws (Member, Author) commented

> Regarding the options, I would prefer option 2, because it enables every deployed model (I use LLM models not in the preconfigured OpenSearch list) and frees me from specifying an extra tokenizer. I currently can't think of a use case where someone would want to chunk according to a specific model and not use it afterward.
>
> Question: What about models that are not uploaded directly into OpenSearch but are accessed through an underlying connector (as in the LLM case)? Are the tokenization rules known there as well?

If OpenSearch is using a remote model downstream, option 2 is no longer valid unless we deploy another kind of connector in ml-commons. For remote models, we could choose option 3 so that users can enable the tokenizer with the tokenizer files.

yuye-aws (Member, Author) commented

> The current approach is using the formula terms × 0.75 ≈ token_limit, as documented at https://opensearch.org/docs/latest/ingest-pipelines/processors/text-chunking/. I think this is valid; with a specified overlap it feels to me that there is no relevant information loss. Specifying the real token limit as token_limit (as it is known from the model description) would of course be much easier. If the tokenization rules of the specified model are known, it is clear where the model-specific gaps of the chunks are. So specifying the model or model ID sounds good to me.

Although it is a valid estimation, the word tokenizer in OpenSearch may still produce chunks longer than the token limit of the text embedding model. We are considering a model-based tokenizer as an approach that introduces neither information loss nor extra chunks.

reuschling commented

The tokenization inside the chunker is only an internal process, right? It is not necessary to specify this tokenizer on the target field. I'm a bit confused by the pros and cons, where one con concerns the output of the tokenizer, but this output is never seen by the user, or am I wrong?

yuye-aws (Member, Author) commented

> The tokenization inside the chunker is only an internal process, right? It is not necessary to specify this tokenizer on the target field.

You are right. The tokenization is orthogonal to the target field. Ideally, the user should be able to specify any tokenizer on any existing target field.

yuye-aws (Member, Author) commented Jul 11, 2024

> I'm a bit confused by the pros and cons, where one con concerns the output of the tokenizer, but this output is never seen by the user, or am I wrong?

Are you referring to this con?
> A model-based tokenizer may generate new characters. For example, the word OpenSearch is tokenized into ['opens', '##ear', '##ch'], and it is unclear how to reformat these tokens into human-readable text.

It is the reformatting problem. Suppose we are using the fixed token length algorithm and the token limit is set to 1. The ideal output should be either ['opens', 'ear', 'ch'] or ['OpenS', 'ear', 'ch']. We may also need to take other special tokens like [CLS] and [SEP] into account.
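
One possible direction (just a sketch assuming a HuggingFace fast tokenizer; not part of the current proposal) is to keep the character offsets reported by the model tokenizer and slice the original text, which yields the second ideal output above:

# Sketch: recover human-readable chunks by slicing the original text with the
# character offsets returned by a HuggingFace fast tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/msmarco-distilbert-base-tas-b"
)

text = "OpenSearch"
encoding = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)

# Each (start, end) pair points back into the original string,
# so slicing preserves the original casing and characters.
print([text[start:end] for start, end in encoding["offset_mapping"]])
# ['OpenS', 'ear', 'ch']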
