feat: implement text chunking processor with fixed token length and delimiter algorithm #607

Merged

@yuye-aws yuye-aws commented Feb 18, 2024

Description

This PR implements the text chunking processor proposed in the RFC. It provides two algorithms: fixed token length and delimiter. The chunking ingest processor can be used as follows:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "text_chunking": {
          "algorithm": {
            "fixed_token_length": {
              "token_limit": 10,
              "overlap_rate": 0.2,
              "tokenizer": "standard"
            }
          },
          "field_map": {
            "body": "body_chunk"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "body": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
      }
    }
  ]
}

The simulate call then returns the following response:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "body_chunk": [
            "This is an example document to be chunked The document",
            "The document contains a single paragraph two sentences and 24",
            "and 24 tokens by standard tokenizer in OpenSearch"
          ],
          "body": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
        },
        "_ingest": {
          "timestamp": "2024-03-05T09:49:37.131255Z"
        }
      }
    }
  ]
}
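
To illustrate how the three chunks in the response above come about, here is a minimal Java sketch of a fixed token length chunker with overlap. It is not the processor's actual code: the class name, the method names, and the regex-based tokenizer (a rough stand-in for the standard tokenizer) are assumptions made for this example. The overlap is computed with BigDecimal, rounded down, so that 10 tokens at an overlap rate of 0.2 give an overlap of 2 tokens.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and tokenization are assumptions, not the plugin's API.
class FixedTokenLengthSketch {

    // Rough stand-in for the "standard" tokenizer: split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    static List<String> chunk(String text, int tokenLimit, double overlapRate) {
        if (tokenLimit <= 0 || overlapRate < 0 || overlapRate >= 1) {
            throw new IllegalArgumentException("tokenLimit must be positive and overlapRate must be in [0, 1)");
        }
        List<String> tokens = tokenize(text);
        // Number of tokens shared by two consecutive chunks, rounded down.
        int overlap = BigDecimal.valueOf(overlapRate)
            .multiply(BigDecimal.valueOf(tokenLimit))
            .setScale(0, RoundingMode.DOWN)
            .intValue();
        int step = tokenLimit - overlap; // always >= 1 because overlapRate < 1
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start += step) {
            int end = Math.min(start + tokenLimit, tokens.size());
            chunks.add(String.join(" ", tokens.subList(start, end)));
            if (end == tokens.size()) {
                break; // the last chunk has consumed the remaining tokens
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        String body = "This is an example document to be chunked. The document contains a single paragraph, "
            + "two sentences and 24 tokens by standard tokenizer in OpenSearch.";
        chunk(body, 10, 0.2).forEach(System.out::println);
    }
}
```

With token_limit 10 and overlap_rate 0.2, each chunk starts 8 tokens after the previous one, which is why "The document" and "and 24" appear in two consecutive chunks.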

You can refer to the RFC for detailed parameter description.

Use Cases

Text Embedding

After configuring the text_embedding processor and obtaining the model ID, we can chain the chunking processor with the text_embedding processor to obtain an embedding vector for each chunked passage. Here is an example:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "text_chunking": {
          "algorithm": {
            "fixed_token_length": {
              "token_limit": 10,
              "overlap_rate": 0.2,
              "tokenizer": "standard"
            }
          },
          "field_map": {
            "body": "body_chunk"
          }
        }
      },
      {
        "text_embedding": {
          "model_id": "IYMBDo4BwlxmLrDqUr0a",
          "field_map": {
            "body_chunk": "body_chunk_embedding"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "body": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
      }
    }
  ]
}

And we obtain the following results:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "body_chunk": [
            "This is an example document to be chunked The document",
            "The document contains a single paragraph two sentences and 24",
            "and 24 tokens by standard tokenizer in OpenSearch"
          ],
          "body_chunk_embedding": [
            {
              "knn": [...]
            },
            {
              "knn": [...]
            },
            {
              "knn": [...]
            }
          ],
          "body": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
        },
        "_ingest": {
          "timestamp": "2024-03-05T09:49:37.131255Z"
        }
      }
    }
  ]
}

Cascaded Chunking Processors

Users can chain multiple chunking processors together. For example, to split documents by paragraph, a user can apply the delimiter algorithm and set the delimiter parameter to "\n\n". If a paragraph may exceed the token limit, the user can append another chunking processor that uses the fixed token length algorithm. The ingest pipeline for this example can be configured as follows:

PUT _ingest/pipeline/chunking-pipeline
{
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "delimiter": {
            "delimiter": "\n\n"
          }
        },
        "field_map": {
          "body": "body_chunk1"
        }
      }
    },
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 500,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "body_chunk1": "body_chunk2"
        }
      }
    }
  ]
}
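
The data flow of the cascaded pipeline above can be sketched in plain Java as two stages: split by delimiter first, then apply a token limit to every resulting paragraph and flatten the results into one list. This is a sketch under stated assumptions, not the processor's implementation: all names are hypothetical, tokens are split on whitespace, the overlap is given directly as a token count, and keeping the delimiter at the end of each first-stage chunk is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names and behavior details are assumptions.
class CascadedChunkingSketch {

    // Stage 1: split on a literal delimiter (assumption: the delimiter stays attached to its chunk).
    static List<String> splitByDelimiter(String text, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        int next = text.indexOf(delimiter, start);
        while (next != -1) {
            chunks.add(text.substring(start, next + delimiter.length()));
            start = next + delimiter.length();
            next = text.indexOf(delimiter, start);
        }
        if (start < text.length()) {
            chunks.add(text.substring(start));
        }
        return chunks;
    }

    // Stage 2: whitespace tokens, at most tokenLimit per chunk, sharing `overlap` tokens between neighbors.
    static List<String> splitByTokenLimit(String text, int tokenLimit, int overlap) {
        if (overlap < 0 || tokenLimit <= overlap) {
            throw new IllegalArgumentException("require 0 <= overlap < tokenLimit");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] tokens = trimmed.split("\\s+");
        int step = tokenLimit - overlap;
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + tokenLimit, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            if (end == tokens.length) {
                break;
            }
        }
        return chunks;
    }

    // body -> body_chunk1 (paragraphs) -> body_chunk2 (token-limited passages, flattened)
    static List<String> cascade(String body, String delimiter, int tokenLimit, int overlap) {
        List<String> bodyChunk2 = new ArrayList<>();
        for (String paragraph : splitByDelimiter(body, delimiter)) {
            bodyChunk2.addAll(splitByTokenLimit(paragraph, tokenLimit, overlap));
        }
        return bodyChunk2;
    }

    public static void main(String[] args) {
        String body = "one two three\n\nfour five six seven eight";
        cascade(body, "\n\n", 4, 1).forEach(System.out::println);
    }
}
```

Short paragraphs pass through the second stage unchanged, while long ones are split further, so the final field never holds a passage above the token limit.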

Issues Resolved

Implement document chunking processor and fixed token length algorithm

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@yuye-aws
Member Author

For now, this PR is a POC for the RFC. I will mark this PR as ready when we finalize the high level design and add corresponding unit tests and integration tests.

codecov bot commented Feb 18, 2024

Codecov Report

Attention: Patch coverage is 97.89916%, with 5 lines in your changes missing coverage. Please review.

Project coverage is 84.19%. Comparing base (e41fba7) to head (68fef4f).
Report is 2 commits behind head on main.

Files                                                    Patch %   Lines
.../neuralsearch/processor/TextChunkingProcessor.java    96.03%    2 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #607      +/-   ##
============================================
+ Coverage     82.62%   84.19%   +1.56%     
- Complexity      666      743      +77     
============================================
  Files            52       59       +7     
  Lines          2072     2309     +237     
  Branches        334      370      +36     
============================================
+ Hits           1712     1944     +232     
- Misses          212      214       +2     
- Partials        148      151       +3     


@yuye-aws yuye-aws force-pushed the feature/documentChunkingProcessor branch from 30fd0eb to 57a4a20 Compare February 22, 2024 15:21
@yuye-aws
Member Author

Hi @zane-neo! I have modified the PR according to your comments. Feel free to review my code.

@samuel-oci samuel-oci left a comment

Thank you for the draft, @yuye-aws. I would like us to follow the upcoming new feature release process:

  1. Let's make sure all feature spec feedback is collected in the RFC: [RFC] Text chunking design #548.
  2. Let's create a meta issue with the design (I can create one and link it).
  3. We will then move forward with the changes.

@yuye-aws
Member Author

  • Let's create a meta issue with the design (I can create one and link it)

Do you mean the high-level design of the document chunking processor? Is the Interface Design section in the RFC what you are looking for?

Comment on lines +48 to +56
private static final Set<String> WORD_TOKENIZERS = Set.of(
    "standard",
    "letter",
    "lowercase",
    "whitespace",
    "uax_url_email",
    "classic",
    "thai"
);
Collaborator

For now, let's not support any customized tokenizers here, to avoid ones that produce overlapping tokens. We can add a smarter tokenizer check later.
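
As a rough illustration of such a restriction, the check could look like the sketch below. The class name, the validateTokenizer helper, and the error message wording are hypothetical and not taken from the PR; only the set of tokenizer names comes from the reviewed snippet.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; the helper and its message are assumptions.
class TokenizerValidationSketch {

    // Same tokenizer names as in the reviewed WORD_TOKENIZERS set.
    private static final Set<String> WORD_TOKENIZERS = Set.of(
        "standard",
        "letter",
        "lowercase",
        "whitespace",
        "uax_url_email",
        "classic",
        "thai"
    );

    // Hypothetical helper: returns the tokenizer name if it is allowed, otherwise throws.
    static String validateTokenizer(String tokenizer) {
        // Set.of(...) rejects null lookups, so null is handled before calling contains.
        if (tokenizer == null || !WORD_TOKENIZERS.contains(tokenizer)) {
            throw new IllegalArgumentException(
                String.format(
                    Locale.ROOT,
                    "Tokenizer [%s] is not supported. Supported word tokenizers: %s",
                    tokenizer,
                    new TreeSet<>(WORD_TOKENIZERS) // sorted for a stable message
                )
            );
        }
        return tokenizer;
    }

    public static void main(String[] args) {
        System.out.println(validateTokenizer("standard"));
    }
}
```

Rejecting unknown names at configuration time keeps tokenizers that emit overlapping tokens out of the fixed token length algorithm until a smarter check exists.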

Comment on lines 168 to 169
throw new IllegalStateException(
String.format(Locale.ROOT, "%s algorithm encounters exception in tokenization: %s", ALGORITHM_NAME, e.getMessage()),
Collaborator

It is ok to include the original message, but the wording is too simple. We need to explain why this is happening.

@model-collapse model-collapse merged commit eea53aa into opensearch-project:main Mar 18, 2024
60 checks passed
@model-collapse model-collapse added the backport 2.x label (adds an automated workflow to backport the PR to the 2.x branch) Mar 18, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Mar 18, 2024
…elimiter algorithm (#607)

* implement chunking processor and fixed token length

Signed-off-by: yuye-aws <[email protected]>

* initialize node client for document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* initialize document chunking processor with analysis registry

Signed-off-by: yuye-aws <[email protected]>

* chunker factory create with analysis registry

Signed-off-by: yuye-aws <[email protected]>

* implement tokenizer in fixed token length algorithm with analysis registry

Signed-off-by: yuye-aws <[email protected]>

* add max token count parsing logic

Signed-off-by: yuye-aws <[email protected]>

* bug fix for non-existing index

Signed-off-by: yuye-aws <[email protected]>

* change error log

Signed-off-by: yuye-aws <[email protected]>

* implement evenly chunk

Signed-off-by: yuye-aws <[email protected]>

* unit tests for chunker factory

Signed-off-by: yuye-aws <[email protected]>

* unit tests for chunker factory

Signed-off-by: yuye-aws <[email protected]>

* add error message for chunker factory tests

Signed-off-by: yuye-aws <[email protected]>

* resolve comments

Signed-off-by: yuye-aws <[email protected]>

* Revert "implement evenly chunk"

This reverts commit 93dd2f4.

Signed-off-by: yuye-aws <[email protected]>

* add default value logic back

Signed-off-by: yuye-aws <[email protected]>

* implement unit test for fixed token length chunker

Signed-off-by: yuye-aws <[email protected]>

* add test cases in unit test for fixed token length chunker

Signed-off-by: yuye-aws <[email protected]>

* support map type as an input

Signed-off-by: yuye-aws <[email protected]>

* support map type as an input

Signed-off-by: yuye-aws <[email protected]>

* bug fix for map type

Signed-off-by: yuye-aws <[email protected]>

* bug fix for map type

Signed-off-by: yuye-aws <[email protected]>

* bug fix for map type in document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* remove system out println

Signed-off-by: yuye-aws <[email protected]>

* add delimiter chunker

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add UT for delimiter chunker

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add delimiter chunker processor

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add more UTs

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add more UTs

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* basic unit tests for document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* fix tests for getProcessors in neural search

Signed-off-by: yuye-aws <[email protected]>

* add unit tests with string, map and nested map type for document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add unit tests for parameter valdiation in document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add back deleted xml file

Signed-off-by: yuye-aws <[email protected]>

* restore xml file

Signed-off-by: yuye-aws <[email protected]>

* integration tests for document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add back Run_Neural_Search.xml

Signed-off-by: yuye-aws <[email protected]>

* restore Run_Neural_Search.xml

Signed-off-by: yuye-aws <[email protected]>

* add changelog

Signed-off-by: yuye-aws <[email protected]>

* update integration test for cascade processor

Signed-off-by: yuye-aws <[email protected]>

* add max chunk limit

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* remove useless and apply spotless

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* update error message

Signed-off-by: yuye-aws <[email protected]>

* change field UT

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* remove useless and apply spotless

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* change logic of max chunk number

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add max chunk limit into fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* Support list<list<string>> type in embedding and extract validation logic to common class

Signed-off-by: zane-neo <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* fix unit tests for inference processor

Signed-off-by: yuye-aws <[email protected]>

* implement unit tests for unit tests with max_chunk_limit in fixed token length

Signed-off-by: yuye-aws <[email protected]>

* constructor for inference processor

Signed-off-by: yuye-aws <[email protected]>

* use inference processor

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* draft code for extending inference processor with document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* api refactor for document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* remove nested list key for chunking processor

Signed-off-by: yuye-aws <[email protected]>

* remove unused function

Signed-off-by: yuye-aws <[email protected]>

* remove processor validator

Signed-off-by: yuye-aws <[email protected]>

* remove processor validator

Signed-off-by: yuye-aws <[email protected]>

* Revert InferenceProcessor.java

Signed-off-by: Yuye Zhu <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* revert changes in text embedding and sparse encoding processor

Signed-off-by: yuye-aws <[email protected]>

* implement chunk with map in document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add default delimiter value

Signed-off-by: Lu <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* implement max chunk logic in document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add initial value for max chunk limit in document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* bug fix in chunking processor: allow 0 max_chunk_limit

Signed-off-by: yuye-aws <[email protected]>

* implement overlap rate with big decimal

Signed-off-by: yuye-aws <[email protected]>

* update max chunk limit in delimiter

Signed-off-by: yuye-aws <[email protected]>

* update parameter setting for fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* update max chunk limit implementation in chunking processor

Signed-off-by: yuye-aws <[email protected]>

* fix unit tests for fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* spotless apply for document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* initialize current chunk count

Signed-off-by: yuye-aws <[email protected]>

* parameter validation for max chunk limit

Signed-off-by: yuye-aws <[email protected]>

* fix integration tests

Signed-off-by: yuye-aws <[email protected]>

* fix current UT

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* change delimiter UT

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* remove delimiter useless code

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add more UT

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add UT for list inside map

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add UT for list inside map

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* update unit tests for chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add more unit tests for chunking processor

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* add java doc

Signed-off-by: yuye-aws <[email protected]>

* update java doc

Signed-off-by: yuye-aws <[email protected]>

* update java doc

Signed-off-by: yuye-aws <[email protected]>

* fix import order

Signed-off-by: yuye-aws <[email protected]>

* update java doc

Signed-off-by: yuye-aws <[email protected]>

* fix java doc error

Signed-off-by: yuye-aws <[email protected]>

* fix update ut for fixed token length chunker

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* implement chunk count wrapper for max chunk limit

Signed-off-by: yuye-aws <[email protected]>

* rename variable end to nextDelimiterPosition

Signed-off-by: yuye-aws <[email protected]>

* adjust method place

Signed-off-by: yuye-aws <[email protected]>

* update java doc for fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* reanme interface name and fixed token length algorithm name

Signed-off-by: yuye-aws <[email protected]>

* update fixed token length algorithm configuration for integration tests

Signed-off-by: yuye-aws <[email protected]>

* make delimiter member variables static

Signed-off-by: yuye-aws <[email protected]>

* remove redundant set field value in execute method

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* add integration tests with more tokenizers

Signed-off-by: yuye-aws <[email protected]>

* bug fix: unit test failure due to invalid tokenizer

Signed-off-by: yuye-aws <[email protected]>

* bug fix: token concatenation in fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* update chunker interface

Signed-off-by: yuye-aws <[email protected]>

* track chunkCount within function

Signed-off-by: yuye-aws <[email protected]>

* bug fix: allow white space as the delimiter

Signed-off-by: yuye-aws <[email protected]>

* fix fixed length chunker

Signed-off-by: xinyual <[email protected]>

* fix delimiter chunker

Signed-off-by: xinyual <[email protected]>

* fix chunker factory

Signed-off-by: xinyual <[email protected]>

* fix UTs

Signed-off-by: xinyual <[email protected]>

* fix UT and chunker factory

Signed-off-by: xinyual <[email protected]>

* move analysis_registry to non-runtime parameters

Signed-off-by: xinyual <[email protected]>

* fix Uts

Signed-off-by: xinyual <[email protected]>

* avoid java doc change

Signed-off-by: xinyual <[email protected]>

* move validate to commonUtlis

Signed-off-by: xinyual <[email protected]>

* remove useless function

Signed-off-by: xinyual <[email protected]>

* change java doc

Signed-off-by: xinyual <[email protected]>

* fix Document process ut

Signed-off-by: xinyual <[email protected]>

* fixed token length: re-implement with start and end offset

Signed-off-by: yuye-aws <[email protected]>

* update exception message

Signed-off-by: yuye-aws <[email protected]>

* fix document chunking processor IT

Signed-off-by: yuye-aws <[email protected]>

* bug fix: adjust start, end content position in fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* update changelog for 2.x release

Signed-off-by: yuye-aws <[email protected]>

* rename processor

Signed-off-by: yuye-aws <[email protected]>

* update default delimiter to be \n\n

Signed-off-by: yuye-aws <[email protected]>

* remove change log in 3.0 unreleased

Signed-off-by: yuye-aws <[email protected]>

* fix IT failure due to chunking processor rename

Signed-off-by: yuye-aws <[email protected]>

* update javadoc for text chunking processor factory

Signed-off-by: yuye-aws <[email protected]>

* adjust functions in chunker interface

Signed-off-by: yuye-aws <[email protected]>

* move algorithm name definition to concrete chunker class

Signed-off-by: yuye-aws <[email protected]>

* update string formatted message for text chunking processor

Signed-off-by: yuye-aws <[email protected]>

* update string formatted message for chunker factory

Signed-off-by: yuye-aws <[email protected]>

* update string formatted message for chunker parameter validator

Signed-off-by: yuye-aws <[email protected]>

* update java doc for delimiter algorithm

Signed-off-by: yuye-aws <[email protected]>

* support range double in chunker parameter validator

Signed-off-by: yuye-aws <[email protected]>

* update string formatted message for fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* update sneaky throw with text chunking processor it

Signed-off-by: yuye-aws <[email protected]>

* add word tokenizer restriction for fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* update error message for multiple algorithms in text chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add comment in text chunking processor

Signed-off-by: yuye-aws <[email protected]>

* validate max chunk limit with util parameter class

Signed-off-by: yuye-aws <[email protected]>

* update comments

Signed-off-by: yuye-aws <[email protected]>

* update comments

Signed-off-by: yuye-aws <[email protected]>

* update java doc

Signed-off-by: yuye-aws <[email protected]>

* update java doc

Signed-off-by: yuye-aws <[email protected]>

* make parameter final

Signed-off-by: yuye-aws <[email protected]>

* implement a map from chunker name to constuctor function in chunker factory

Signed-off-by: yuye-aws <[email protected]>

* bug fix in chunker factory

Signed-off-by: yuye-aws <[email protected]>

* remove get all chunkers in chunker factory

Signed-off-by: yuye-aws <[email protected]>

* remove type check for parameter check for max token count

Signed-off-by: yuye-aws <[email protected]>

* remove type check for parameter check for analysis registry

Signed-off-by: yuye-aws <[email protected]>

* implement parser and validator

Signed-off-by: yuye-aws <[email protected]>

* update comment

Signed-off-by: yuye-aws <[email protected]>

* provide fixed token length as the default algorithm

Signed-off-by: yuye-aws <[email protected]>

* adjust exception message

Signed-off-by: yuye-aws <[email protected]>

* adjust exception message

Signed-off-by: yuye-aws <[email protected]>

* use object nonnull and require nonnull

Signed-off-by: yuye-aws <[email protected]>

* apply final to ingest document and chunk count

Signed-off-by: yuye-aws <[email protected]>

* merge parameter validator into the parser

Signed-off-by: yuye-aws <[email protected]>

* assign positive default value for max chunk limit

Signed-off-by: yuye-aws <[email protected]>

* validate supported chunker algorithm in text chunking processor

Signed-off-by: yuye-aws <[email protected]>

* update parameter setting of max chunk limit

Signed-off-by: yuye-aws <[email protected]>

* add unit test with non list of string

Signed-off-by: yuye-aws <[email protected]>

* add unit test with null input

Signed-off-by: yuye-aws <[email protected]>

* add unit test for tokenization excpetion in fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* tune method name in text chunking processor unit test

Signed-off-by: yuye-aws <[email protected]>

* tune method name in delimiter algorithm unit test

Signed-off-by: yuye-aws <[email protected]>

* add unit test for overlap rate too small in fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* tune method modifier for all classes

Signed-off-by: yuye-aws <[email protected]>

* tune code

Signed-off-by: yuye-aws <[email protected]>

* tune code

Signed-off-by: yuye-aws <[email protected]>

* tune exception type in parameter parser

Signed-off-by: yuye-aws <[email protected]>

* tune comment

Signed-off-by: yuye-aws <[email protected]>

* tune comment

Signed-off-by: yuye-aws <[email protected]>

* include max chunk limit in both algorithms

Signed-off-by: yuye-aws <[email protected]>

* tune comment

Signed-off-by: yuye-aws <[email protected]>

* allow 0 for max chunk limit

Signed-off-by: yuye-aws <[email protected]>

* update runtime max chunk limit in text chunking processor

Signed-off-by: yuye-aws <[email protected]>

* tune code for chunker

Signed-off-by: yuye-aws <[email protected]>

* implement test for multiple field max chunk limit exceed

Signed-off-by: yuye-aws <[email protected]>

* tune methods name in text chunking proceesor unit tests

Signed-off-by: yuye-aws <[email protected]>

* add unit tests for both algorithms with max chunk limit

Signed-off-by: yuye-aws <[email protected]>

* optimize code

Signed-off-by: yuye-aws <[email protected]>

* extract max chunk limit check to util class

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* fix unit tests

Signed-off-by: yuye-aws <[email protected]>

* bug fix: only update runtime max chunk limit when enabled

Signed-off-by: yuye-aws <[email protected]>

---------

Signed-off-by: yuye-aws <[email protected]>
Signed-off-by: xinyual <[email protected]>
Signed-off-by: zane-neo <[email protected]>
Signed-off-by: Yuye Zhu <[email protected]>
Signed-off-by: Lu <[email protected]>
Co-authored-by: xinyual <[email protected]>
Co-authored-by: zane-neo <[email protected]>
Co-authored-by: Lu <[email protected]>
(cherry picked from commit eea53aa)
zane-neo pushed a commit that referenced this pull request Mar 18, 2024
…en length and delimiter algorithm (#644)

* feat: implement text chunking processor with fixed token length and delimiter algorithm (#607)

Signed-off-by: yuye-aws <[email protected]>

* change delimiter UT

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* remove delimiter useless code

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add more UT

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add UT for list inside map

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add UT for list inside map

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* update unit tests for chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add more unit tests for chunking processor

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* add java doc

Signed-off-by: yuye-aws <[email protected]>

* update java doc

Signed-off-by: yuye-aws <[email protected]>

* update java doc

Signed-off-by: yuye-aws <[email protected]>

* fix import order

Signed-off-by: yuye-aws <[email protected]>

* update java doc

Signed-off-by: yuye-aws <[email protected]>

* fix java doc error

Signed-off-by: yuye-aws <[email protected]>

* fix update ut for fixed token length chunker

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* implement chunk count wrapper for max chunk limit

Signed-off-by: yuye-aws <[email protected]>

* rename variable end to nextDelimiterPosition

Signed-off-by: yuye-aws <[email protected]>

* adjust method place

Signed-off-by: yuye-aws <[email protected]>

* update java doc for fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* rename interface name and fixed token length algorithm name

Signed-off-by: yuye-aws <[email protected]>

* update fixed token length algorithm configuration for integration tests

Signed-off-by: yuye-aws <[email protected]>

* make delimiter member variables static

Signed-off-by: yuye-aws <[email protected]>

* remove redundant set field value in execute method

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* add integration tests with more tokenizers

Signed-off-by: yuye-aws <[email protected]>

* bug fix: unit test failure due to invalid tokenizer

Signed-off-by: yuye-aws <[email protected]>

* bug fix: token concatenation in fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* update chunker interface

Signed-off-by: yuye-aws <[email protected]>

* track chunkCount within function

Signed-off-by: yuye-aws <[email protected]>

* bug fix: allow white space as the delimiter

Signed-off-by: yuye-aws <[email protected]>

* fix fixed length chunker

Signed-off-by: xinyual <[email protected]>

* fix delimiter chunker

Signed-off-by: xinyual <[email protected]>

* fix chunker factory

Signed-off-by: xinyual <[email protected]>

* fix UTs

Signed-off-by: xinyual <[email protected]>

* fix UT and chunker factory

Signed-off-by: xinyual <[email protected]>

* move analysis_registry to non-runtime parameters

Signed-off-by: xinyual <[email protected]>

* fix UTs

Signed-off-by: xinyual <[email protected]>

* avoid java doc change

Signed-off-by: xinyual <[email protected]>

* move validate to commonUtils

Signed-off-by: xinyual <[email protected]>

* remove useless function

Signed-off-by: xinyual <[email protected]>

* change java doc

Signed-off-by: xinyual <[email protected]>

* fix Document process ut

Signed-off-by: xinyual <[email protected]>

* fixed token length: re-implement with start and end offset

Signed-off-by: yuye-aws <[email protected]>

* update exception message

Signed-off-by: yuye-aws <[email protected]>

* fix document chunking processor IT

Signed-off-by: yuye-aws <[email protected]>

* bug fix: adjust start, end content position in fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* update changelog for 2.x release

Signed-off-by: yuye-aws <[email protected]>

* rename processor

Signed-off-by: yuye-aws <[email protected]>

* update default delimiter to be \n\n

Signed-off-by: yuye-aws <[email protected]>

* remove change log in 3.0 unreleased

Signed-off-by: yuye-aws <[email protected]>

* fix IT failure due to chunking processor rename

Signed-off-by: yuye-aws <[email protected]>

* update javadoc for text chunking processor factory

Signed-off-by: yuye-aws <[email protected]>

* adjust functions in chunker interface

Signed-off-by: yuye-aws <[email protected]>

* move algorithm name definition to concrete chunker class

Signed-off-by: yuye-aws <[email protected]>

* update string formatted message for text chunking processor

Signed-off-by: yuye-aws <[email protected]>

* update string formatted message for chunker factory

Signed-off-by: yuye-aws <[email protected]>

* update string formatted message for chunker parameter validator

Signed-off-by: yuye-aws <[email protected]>

* update java doc for delimiter algorithm

Signed-off-by: yuye-aws <[email protected]>

* support range double in chunker parameter validator

Signed-off-by: yuye-aws <[email protected]>

* update string formatted message for fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* update sneaky throw with text chunking processor it

Signed-off-by: yuye-aws <[email protected]>

* add word tokenizer restriction for fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* update error message for multiple algorithms in text chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add comment in text chunking processor

Signed-off-by: yuye-aws <[email protected]>

* validate max chunk limit with util parameter class

Signed-off-by: yuye-aws <[email protected]>

* update comments

Signed-off-by: yuye-aws <[email protected]>

* update comments

Signed-off-by: yuye-aws <[email protected]>

* update java doc

Signed-off-by: yuye-aws <[email protected]>

* update java doc

Signed-off-by: yuye-aws <[email protected]>

* make parameter final

Signed-off-by: yuye-aws <[email protected]>

* implement a map from chunker name to constructor function in chunker factory

Signed-off-by: yuye-aws <[email protected]>

* bug fix in chunker factory

Signed-off-by: yuye-aws <[email protected]>

* remove get all chunkers in chunker factory

Signed-off-by: yuye-aws <[email protected]>

* remove type check for parameter check for max token count

Signed-off-by: yuye-aws <[email protected]>

* remove type check for parameter check for analysis registry

Signed-off-by: yuye-aws <[email protected]>

* implement parser and validator

Signed-off-by: yuye-aws <[email protected]>

* update comment

Signed-off-by: yuye-aws <[email protected]>

* provide fixed token length as the default algorithm

Signed-off-by: yuye-aws <[email protected]>

* adjust exception message

Signed-off-by: yuye-aws <[email protected]>

* adjust exception message

Signed-off-by: yuye-aws <[email protected]>

* use object nonnull and require nonnull

Signed-off-by: yuye-aws <[email protected]>

* apply final to ingest document and chunk count

Signed-off-by: yuye-aws <[email protected]>

* merge parameter validator into the parser

Signed-off-by: yuye-aws <[email protected]>

* assign positive default value for max chunk limit

Signed-off-by: yuye-aws <[email protected]>

* validate supported chunker algorithm in text chunking processor

Signed-off-by: yuye-aws <[email protected]>

* update parameter setting of max chunk limit

Signed-off-by: yuye-aws <[email protected]>

* add unit test with non list of string

Signed-off-by: yuye-aws <[email protected]>

* add unit test with null input

Signed-off-by: yuye-aws <[email protected]>

* add unit test for tokenization exception in fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* tune method name in text chunking processor unit test

Signed-off-by: yuye-aws <[email protected]>

* tune method name in delimiter algorithm unit test

Signed-off-by: yuye-aws <[email protected]>

* add unit test for overlap rate too small in fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* tune method modifier for all classes

Signed-off-by: yuye-aws <[email protected]>

* tune code

Signed-off-by: yuye-aws <[email protected]>

* tune code

Signed-off-by: yuye-aws <[email protected]>

* tune exception type in parameter parser

Signed-off-by: yuye-aws <[email protected]>

* tune comment

Signed-off-by: yuye-aws <[email protected]>

* tune comment

Signed-off-by: yuye-aws <[email protected]>

* include max chunk limit in both algorithms

Signed-off-by: yuye-aws <[email protected]>

* tune comment

Signed-off-by: yuye-aws <[email protected]>

* allow 0 for max chunk limit

Signed-off-by: yuye-aws <[email protected]>

* update runtime max chunk limit in text chunking processor

Signed-off-by: yuye-aws <[email protected]>

* tune code for chunker

Signed-off-by: yuye-aws <[email protected]>

* implement test for multiple field max chunk limit exceed

Signed-off-by: yuye-aws <[email protected]>

* tune method names in text chunking processor unit tests

Signed-off-by: yuye-aws <[email protected]>

* add unit tests for both algorithms with max chunk limit

Signed-off-by: yuye-aws <[email protected]>

* optimize code

Signed-off-by: yuye-aws <[email protected]>

* extract max chunk limit check to util class

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* fix unit tests

Signed-off-by: yuye-aws <[email protected]>

* bug fix: only update runtime max chunk limit when enabled

Signed-off-by: yuye-aws <[email protected]>

---------

Signed-off-by: yuye-aws <[email protected]>
Signed-off-by: xinyual <[email protected]>
Signed-off-by: zane-neo <[email protected]>
Signed-off-by: Yuye Zhu <[email protected]>
Signed-off-by: Lu <[email protected]>
Co-authored-by: xinyual <[email protected]>
Co-authored-by: zane-neo <[email protected]>
Co-authored-by: Lu <[email protected]>
(cherry picked from commit eea53aa)

* bug fix: fix compile error in integration test (#645)

Signed-off-by: yuye-aws <[email protected]>

---------

Signed-off-by: yuye-aws <[email protected]>
Co-authored-by: Yuye Zhu <[email protected]>
@yuye-aws deleted the feature/documentChunkingProcessor branch on March 26, 2024, 02:19
// chunk the object when target key is of leaf type (null, string and list of string)
Object chunkObject = sourceAndMetadataMap.get(originalKey);
List<String> chunkedResult = chunkLeafType(chunkObject, runtimeParameters);
sourceAndMetadataMap.put(String.valueOf(targetKey), chunkedResult);

sourceAndMetadataMap also contains metadata fields such as _index, _routing and _id. If targetKey equals the name of one of these metadata fields, this put will overwrite it, which may cause unexpected behavior.

Member Author

A simple solution is to prohibit any targetKey that starts with "_".
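
A minimal sketch of what such a check might look like, assuming it runs before the chunked result is written into the map. `TargetKeyGuard`, `validateTargetKey` and `putChunks` are hypothetical names for illustration only and are not part of this PR. Note that the prefix rule is deliberately broad: it would also reject user-defined fields that happen to start with "_".

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TargetKeyGuard {

    // Hypothetical helper (not in the PR): reject target keys that could
    // collide with metadata fields such as _index, _routing and _id.
    static void validateTargetKey(String targetKey) {
        if (targetKey == null || targetKey.isEmpty()) {
            throw new IllegalArgumentException("target key must not be null or empty");
        }
        if (targetKey.startsWith("_")) {
            throw new IllegalArgumentException(
                "target key [" + targetKey + "] must not start with '_'"
            );
        }
    }

    // Hypothetical wrapper around the put call quoted in the snippet above:
    // validate first, then write the chunked result.
    static void putChunks(Map<String, Object> sourceAndMetadataMap, String targetKey, List<String> chunkedResult) {
        validateTargetKey(targetKey);
        sourceAndMetadataMap.put(targetKey, chunkedResult);
    }

    public static void main(String[] args) {
        Map<String, Object> sourceAndMetadataMap = new HashMap<>();
        sourceAndMetadataMap.put("_id", "1");

        // A regular target key is written as usual.
        putChunks(sourceAndMetadataMap, "body_chunk", List.of("first chunk", "second chunk"));

        // A key that looks like a metadata field is rejected, so _id is left untouched.
        try {
            putChunks(sourceAndMetadataMap, "_id", List.of("oops"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In practice the check would more likely belong where the field map is parsed, so that a bad pipeline definition fails at creation time rather than per document; that placement is also an assumption here.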

Member Author

Let me check the behavior of other ingestion processors.

Labels
backport 2.x Label will add auto workflow to backport PR to 2.x branch
9 participants