
[RFC] Model-based Tokenizer for Text Chunking #794

Open

yuye-aws opened this issue Jun 14, 2024 · 7 comments

Labels: enhancement

Comments

yuye-aws (Member) commented Jun 14, 2024

Since OpenSearch 2.13, the fixed token length algorithm has been available in the text chunking processor. With this algorithm, users can specify a token limit for each chunked passage. A common use case for the text chunking processor is to append a text embedding processor, so that users can circumvent the information loss caused by truncation in downstream text embedding models.
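
For reference, such a pipeline simply chains the two processors (a sketch only; the model_id value is a placeholder for a deployed text embedding model):

PUT _ingest/pipeline/chunking-embedding-ingest-pipeline
{
  "description": "A text chunking and text embedding ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 384,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "<model_id of a deployed text embedding model>",
        "field_map": {
          "passage_chunk": "passage_chunk_embedding"
        }
      }
    }
  ]
}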

As of OpenSearch 2.15, the fixed token length algorithm only supports word tokenizers. Text embedding models truncate long texts that exceed the token limit of their own tokenizer. Given the disparity between word tokenizers and model-based tokenizers, it is hard for users to assign an accurate value to the token_limit parameter. We are initiating this RFC to solicit feedback from the community on whether and how to implement a model-based tokenizer for the fixed token length algorithm.

Introduction

Tokenization is the process of segmenting a string into a list of individual tokens. Prior to text embedding, language models perform tokenization on the input texts. Each language model has its own model-based tokenizer.

Tokenization results vary across tokenizers. We showcase the difference between word tokenizers and model-based tokenizers with a simple example: the same input string is tokenized with the standard tokenizer and with the tokenizer from the model sentence-transformers/msmarco-distilbert-base-tas-b.

// input 
"It’s fun to contribute a brand-new PR or 2 to OpenSearch!"

// standard tokenizer
['It’s', 'fun', 'to', 'contribute', 'a', 'brand', 'new', 'PR', 'or', '2', 'to', 'OpenSearch']

// sentence-transformers/msmarco-distilbert-base-tas-b
['[CLS]', 'it', '’', 's', 'fun', 'to', 'contribute', 'a', 'brand', '-', 'new', 'pr', 'or', '2', 'to', 'opens', '##ear', '##ch', '!', '[SEP]']

where [CLS] indicates the beginning of a sequence and [SEP] separates sentences. As we can see from the example above, the tokens returned by the standard tokenizer and the model-based tokenizer are quite different: the standard tokenizer returns 12 tokens, while the model-based tokenizer returns 20.
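
For reference, the model-based tokenization above can be reproduced with the HuggingFace tokenizer that ships with the model (a minimal sketch, assuming the transformers Python package is installed):

# Sketch: reproduce the model-based tokenization shown above.
from transformers import AutoTokenizer

text = "It’s fun to contribute a brand-new PR or 2 to OpenSearch!"

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/msmarco-distilbert-base-tas-b"
)

# Convert the encoded ids back to token strings, including [CLS] and [SEP].
ids = tokenizer(text)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'it', '’', 's', 'fun', 'to', 'contribute', 'a', 'brand', '-', 'new',
#  'pr', 'or', '2', 'to', 'opens', '##ear', '##ch', '!', '[SEP]']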

In our first release, we can start with the tokenizers for OpenSearch-provided pretrained models. These models usually do not share the same vocabulary or tokenizer, so we need to support a dedicated tokenizer for each of the following models.

Sentence transformers

  1. huggingface/sentence-transformers/all-distilroberta-v1
  2. huggingface/sentence-transformers/all-MiniLM-L6-v2
  3. huggingface/sentence-transformers/all-MiniLM-L12-v2
  4. huggingface/sentence-transformers/all-mpnet-base-v2
  5. huggingface/sentence-transformers/msmarco-distilbert-base-tas-b
  6. huggingface/sentence-transformers/multi-qa-MiniLM-L6-cos-v1
  7. huggingface/sentence-transformers/multi-qa-mpnet-base-dot-v1
  8. huggingface/sentence-transformers/paraphrase-MiniLM-L3-v2
  9. huggingface/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  10. huggingface/sentence-transformers/paraphrase-mpnet-base-v2
  11. huggingface/sentence-transformers/distiluse-base-multilingual-cased-v1

Sparse encoding models

  1. amazon/neural-sparse/opensearch-neural-sparse-encoding-v1
  2. amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v1
  3. amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1

Cross-encoder models

  1. huggingface/cross-encoders/ms-marco-MiniLM-L-6-v2
  2. huggingface/cross-encoders/ms-marco-MiniLM-L-12-v2

Pros and cons

Here are the pros and cons for model-based tokenizers:

Pros

  1. Enables users to precisely chunk their documents according to the truncation limit of the downstream text embedding model.
  2. Model-based tokenizers are free from the max token count limit of word tokenizers, which defaults to 10,000.

Cons

  1. Unlike word tokenizers, which return the start and end offsets of every token, a model-based tokenizer only returns a list of tokens. As we can see from the example above, a model-based tokenizer may modify the original input, and users may get confused by the content change.
  2. A model-based tokenizer may generate new characters. For example, the word OpenSearch is tokenized into ['opens', '##ear', '##ch'], and it is unclear how to reformat these tokens into human-readable text (see the sketch after this list).
  3. May confuse users about how to specify a word tokenizer versus a model-based tokenizer in the API.
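
The reconstruction issue can be illustrated with the same HuggingFace tokenizer (a minimal sketch, assuming the transformers Python package is installed; not part of the proposed implementation):

# Sketch: naively re-joining WordPiece tokens does not reproduce the original text.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/msmarco-distilbert-base-tas-b"
)

tokens = tokenizer.tokenize("OpenSearch")
print(tokens)                                       # ['opens', '##ear', '##ch']
print(" ".join(tokens))                             # 'opens ##ear ##ch'
print(tokenizer.convert_tokens_to_string(tokens))   # 'opensearch' -- original casing is lost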

API

There are a few options for using a model-based tokenizer in the fixed token length algorithm. Please note that they are not mutually exclusive: we can implement one option in the first release and then support the others later.

Option 1

Specify the tokenizer by its pretrained model name.

PUT _ingest/pipeline/text-chunking-ingest-pipeline
{
  "description": "A text chunking ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    }
  ]
}

Pros

  1. Users can use a model-based tokenizer without deploying the text embedding model.

Cons

  1. Does not support tokenizers from user-uploaded models.
  2. Hard for users to remember the full model name.
  3. The tokenizer is not reusable.

Option 2

After deploying a text embedding model, users can assign its model id to the tokenizer.

PUT _ingest/pipeline/text-chunking-ingest-pipeline
{
  "description": "A text chunking ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer_model_id": <model_id for pretrained models>
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    }
  ]
}

Pros

  1. Supports tokenization for any deployed model.

Cons

  1. Users need to deploy the text embedding model.
  2. Introduces a new parameter named tokenizer_model_id. We need to consider how it interacts with the existing tokenizer parameter.
  3. Need to handle invalid models, such as text-image embedding models.

Option 3

Unlike text embedding models, a tokenizer only needs files like tokenizer.json, tokenizer_config.json, and vocab.txt. Following the behavior of registering models in the ml-commons plugin, users can register their tokenizer without the model weights.

POST /_plugins/_ml/tokenizers/_register
{
  "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b"
}


PUT _ingest/pipeline/text-chunking-ingest-pipeline
{
  "description": "A text chunking ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer_model_id": <tokenizer_id for pretrained tokenizers>
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    }
  ]
}

Pros

  1. Users can register the tokenizer alone, which saves the time and space required for a full model deployment.
  2. After registering the tokenizer, users can reuse it across different ingest pipelines.

Cons

  1. Introduces a new API for tokenizer registration, which may carry potential security risks.
  2. If a user needs both the tokenizer and the embedding model, there would be some duplication.

Open questions

  1. What other options could make a model-based tokenizer available without deploying the text embedding model?
dblock changed the title from "[RFC] Model-baed Tokenizer for Text Chunking" to "[RFC] Model-based Tokenizer for Text Chunking" on Jul 8, 2024
dblock (Member) commented Jul 8, 2024

[Catch All Triage, attendees 1, 2, 3, 4, 5, 6, 7]

Thanks for opening this!

reuschling commented

The current approach is using the formula terms × 0.75 ≈ token_limit, as documented at https://opensearch.org/docs/latest/ingest-pipelines/processors/text-chunking/. I think this is valid; with a specified overlap it feels to me that there is no relevant information loss. Specifying the real token limit as token_limit (as it is known from the model description) would of course be much easier. If the tokenization rules of the specified model are known, it is clear where the model-specific gaps of the chunks are. So specifying the model or model ID sounds good to me.

But the output of the chunking process should only rely on the mapping specification of the target field. This can be anything; it is the choice of the user. I would say that specifying the model should only be used to determine the real chunk length, relying on the model-specific tokenizer rules, so that the chunk length can be determined with a real token limit.

Regarding the options, I would prefer option 2, because it enables every deployed model (I use LLM models not in the preconfigured OpenSearch list) and frees me from specifying an extra tokenizer. I currently can't think of a use case where someone would want to chunk according to a specific model and not use it afterward.

Question: What about models that are not uploaded directly into OpenSearch but are accessed through an underlying connector (as in the LLM case)? Are the tokenization rules known there as well?

yuye-aws (Member, Author) commented

> Regarding the options, I would prefer option 2, because it enables every deployed model (I use LLM models not in the preconfigured OpenSearch list) and frees me from specifying an extra tokenizer. I currently can't think of a use case where someone would want to chunk according to a specific model and not use it afterward.
>
> Question: What about models that are not uploaded directly into OpenSearch but are accessed through an underlying connector (as in the LLM case)? Are the tokenization rules known there as well?

If OpenSearch is using a remote model downstream, option 2 is no longer valid unless we deploy another kind of connector in ml-commons. For remote models, we could choose option 3 so that users can enable the tokenizer with the tokenizer files.

yuye-aws (Member, Author) commented

> The current approach is using the formula terms × 0.75 ≈ token_limit, as documented at https://opensearch.org/docs/latest/ingest-pipelines/processors/text-chunking/. I think this is valid; with a specified overlap it feels to me that there is no relevant information loss. Specifying the real token limit as token_limit (as it is known from the model description) would of course be much easier. If the tokenization rules of the specified model are known, it is clear where the model-specific gaps of the chunks are. So specifying the model or model ID sounds good to me.

Although it is a valid estimation, the word tokenizer in OpenSearch may still produce chunks longer than the token limit of the text embedding model. We are considering a model-based tokenizer as an approach that introduces neither information loss nor extra chunks.

reuschling commented

The tokenization inside the chunker is only an internal process, right? It is not necessary to specify this tokenizer on the target field. I'm a bit confused by the pros and cons, where one con concerns the output of the tokenizer, but this output is never seen by the user, or am I wrong?

yuye-aws (Member, Author) commented

> The tokenization inside the chunker is only an internal process, right? It is not necessary to specify this tokenizer on the target field.

You are right. The tokenization is orthogonal to the target field. Ideally, the user should be able to specify any tokenizer on any existing target field.

yuye-aws (Member, Author) commented Jul 11, 2024

> I'm a bit confused by the pros and cons, where one con concerns the output of the tokenizer, but this output is never seen by the user, or am I wrong?

Are you referring to this con?
> A model-based tokenizer may generate new characters. For example, the word OpenSearch is tokenized into ['opens', '##ear', '##ch'], and it is unclear how to reformat these tokens into human-readable text.

It is the reformatting problem. Suppose we are using the fixed token length algorithm and the token limit is set to 1. The ideal output should be either ['opens', 'ear', 'ch'] or ['OpenS', 'ear', 'ch']. We may also need to take other special tokens like [CLS] and [SEP] into account.
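
One possible direction (just a sketch assuming a HuggingFace fast tokenizer; not part of the current proposal) is to keep the character offsets reported by the model tokenizer and slice the original text, which yields the second ideal output above:

# Sketch: recover human-readable chunks by slicing the original text with the
# character offsets returned by a HuggingFace fast tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/msmarco-distilbert-base-tas-b"
)

text = "OpenSearch"
encoding = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)

# Each (start, end) pair points back into the original string,
# so slicing preserves the original casing and characters.
print([text[start:end] for start, end in encoding["offset_mapping"]])
# ['OpenS', 'ear', 'ch']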
