[Backport 2.x] feat: implement text chunking processor with fixed token length and delimiter algorithm #644

Merged
zane-neo merged 2 commits into 2.x from backport/backport-607-to-2.x on Mar 18, 2024

Conversation

opensearch-trigger-bot[bot]
Contributor

Backport eea53aa from #607
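
For readers unfamiliar with the delimiter algorithm named in the title, the following is a minimal sketch of the idea, not the code merged in this PR. The class name `DelimiterChunkSketch` and the method `chunk` are hypothetical. The `\n\n` default and the `nextDelimiterPosition` variable name come from the commit messages below; keeping each delimiter at the end of the chunk it terminates is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the plugin's actual chunker class.
public class DelimiterChunkSketch {
    // Default taken from the commit "update default delimiter to be \n\n".
    public static final String DEFAULT_DELIMITER = "\n\n";

    // Splits content at every occurrence of the delimiter.
    // Assumption: the delimiter stays attached to the chunk it ends.
    public static List<String> chunk(String content, String delimiter) {
        if (delimiter == null || delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        int nextDelimiterPosition = content.indexOf(delimiter, start);
        while (nextDelimiterPosition != -1) {
            int end = nextDelimiterPosition + delimiter.length();
            chunks.add(content.substring(start, end));
            start = end;
            nextDelimiterPosition = content.indexOf(delimiter, start);
        }
        // Whatever follows the last delimiter becomes the final chunk.
        if (start < content.length()) {
            chunks.add(content.substring(start));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> chunks = chunk("first paragraph\n\nsecond paragraph", DEFAULT_DELIMITER);
        System.out.println(chunks.size() + " chunks");
    }
}
```

A single space also works as the delimiter here, which matches the intent of the commit "bug fix: allow white space as the delimiter".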

…elimiter algorithm (#607)

* implement chunking processor and fixed token length

Signed-off-by: yuye-aws <[email protected]>

* initialize node client for document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* initialize document chunking processor with analysis registry

Signed-off-by: yuye-aws <[email protected]>

* chunker factory create with analysis registry

Signed-off-by: yuye-aws <[email protected]>

* implement tokenizer in fixed token length algorithm with analysis registry

Signed-off-by: yuye-aws <[email protected]>

* add max token count parsing logic

Signed-off-by: yuye-aws <[email protected]>

* bug fix for non-existing index

Signed-off-by: yuye-aws <[email protected]>

* change error log

Signed-off-by: yuye-aws <[email protected]>

* implement evenly chunk

Signed-off-by: yuye-aws <[email protected]>

* unit tests for chunker factory

Signed-off-by: yuye-aws <[email protected]>

* unit tests for chunker factory

Signed-off-by: yuye-aws <[email protected]>

* add error message for chunker factory tests

Signed-off-by: yuye-aws <[email protected]>

* resolve comments

Signed-off-by: yuye-aws <[email protected]>

* Revert "implement evenly chunk"

This reverts commit 93dd2f4.

Signed-off-by: yuye-aws <[email protected]>

* add default value logic back

Signed-off-by: yuye-aws <[email protected]>

* implement unit test for fixed token length chunker

Signed-off-by: yuye-aws <[email protected]>

* add test cases in unit test for fixed token length chunker

Signed-off-by: yuye-aws <[email protected]>

* support map type as an input

Signed-off-by: yuye-aws <[email protected]>

* support map type as an input

Signed-off-by: yuye-aws <[email protected]>

* bug fix for map type

Signed-off-by: yuye-aws <[email protected]>

* bug fix for map type

Signed-off-by: yuye-aws <[email protected]>

* bug fix for map type in document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* remove system out println

Signed-off-by: yuye-aws <[email protected]>

* add delimiter chunker

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add UT for delimiter chunker

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add delimiter chunker processor

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add more UTs

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add more UTs

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* basic unit tests for document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* fix tests for getProcessors in neural search

Signed-off-by: yuye-aws <[email protected]>

* add unit tests with string, map and nested map type for document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add unit tests for parameter validation in document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add back deleted xml file

Signed-off-by: yuye-aws <[email protected]>

* restore xml file

Signed-off-by: yuye-aws <[email protected]>

* integration tests for document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add back Run_Neural_Search.xml

Signed-off-by: yuye-aws <[email protected]>

* restore Run_Neural_Search.xml

Signed-off-by: yuye-aws <[email protected]>

* add changelog

Signed-off-by: yuye-aws <[email protected]>

* update integration test for cascade processor

Signed-off-by: yuye-aws <[email protected]>

* add max chunk limit

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* remove useless and apply spotless

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* update error message

Signed-off-by: yuye-aws <[email protected]>

* change field UT

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* remove useless and apply spotless

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* change logic of max chunk number

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add max chunk limit into fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* Support list<list<string>> type in embedding and extract validation logic to common class

Signed-off-by: zane-neo <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* fix unit tests for inference processor

Signed-off-by: yuye-aws <[email protected]>

* implement unit tests for unit tests with max_chunk_limit in fixed token length

Signed-off-by: yuye-aws <[email protected]>

* constructor for inference processor

Signed-off-by: yuye-aws <[email protected]>

* use inference processor

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* draft code for extending inference processor with document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* api refactor for document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* remove nested list key for chunking processor

Signed-off-by: yuye-aws <[email protected]>

* remove unused function

Signed-off-by: yuye-aws <[email protected]>

* remove processor validator

Signed-off-by: yuye-aws <[email protected]>

* remove processor validator

Signed-off-by: yuye-aws <[email protected]>

* Revert InferenceProcessor.java

Signed-off-by: Yuye Zhu <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* revert changes in text embedding and sparse encoding processor

Signed-off-by: yuye-aws <[email protected]>

* implement chunk with map in document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add default delimiter value

Signed-off-by: Lu <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* implement max chunk logic in document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add initial value for max chunk limit in document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* bug fix in chunking processor: allow 0 max_chunk_limit

Signed-off-by: yuye-aws <[email protected]>

* implement overlap rate with big decimal

Signed-off-by: yuye-aws <[email protected]>

* update max chunk limit in delimiter

Signed-off-by: yuye-aws <[email protected]>

* update parameter setting for fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* update max chunk limit implementation in chunking processor

Signed-off-by: yuye-aws <[email protected]>

* fix unit tests for fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* spotless apply for document chunking processor

Signed-off-by: yuye-aws <[email protected]>

* initialize current chunk count

Signed-off-by: yuye-aws <[email protected]>

* parameter validation for max chunk limit

Signed-off-by: yuye-aws <[email protected]>

* fix integration tests

Signed-off-by: yuye-aws <[email protected]>

* fix current UT

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* change delimiter UT

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* remove delimiter useless code

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add more UT

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add UT for list inside map

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* add UT for list inside map

Signed-off-by: xinyual <[email protected]>
Signed-off-by: yuye-aws <[email protected]>

* update unit tests for chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add more unit tests for chunking processor

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* add java doc

Signed-off-by: yuye-aws <[email protected]>

* update java doc

Signed-off-by: yuye-aws <[email protected]>

* update java doc

Signed-off-by: yuye-aws <[email protected]>

* fix import order

Signed-off-by: yuye-aws <[email protected]>

* update java doc

Signed-off-by: yuye-aws <[email protected]>

* fix java doc error

Signed-off-by: yuye-aws <[email protected]>

* fix update ut for fixed token length chunker

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* implement chunk count wrapper for max chunk limit

Signed-off-by: yuye-aws <[email protected]>

* rename variable end to nextDelimiterPosition

Signed-off-by: yuye-aws <[email protected]>

* adjust method place

Signed-off-by: yuye-aws <[email protected]>

* update java doc for fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* rename interface name and fixed token length algorithm name

Signed-off-by: yuye-aws <[email protected]>

* update fixed token length algorithm configuration for integration tests

Signed-off-by: yuye-aws <[email protected]>

* make delimiter member variables static

Signed-off-by: yuye-aws <[email protected]>

* remove redundant set field value in execute method

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* add integration tests with more tokenizers

Signed-off-by: yuye-aws <[email protected]>

* bug fix: unit test failure due to invalid tokenizer

Signed-off-by: yuye-aws <[email protected]>

* bug fix: token concatenation in fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* update chunker interface

Signed-off-by: yuye-aws <[email protected]>

* track chunkCount within function

Signed-off-by: yuye-aws <[email protected]>

* bug fix: allow white space as the delimiter

Signed-off-by: yuye-aws <[email protected]>

* fix fixed length chunker

Signed-off-by: xinyual <[email protected]>

* fix delimiter chunker

Signed-off-by: xinyual <[email protected]>

* fix chunker factory

Signed-off-by: xinyual <[email protected]>

* fix UTs

Signed-off-by: xinyual <[email protected]>

* fix UT and chunker factory

Signed-off-by: xinyual <[email protected]>

* move analysis_registry to non-runtime parameters

Signed-off-by: xinyual <[email protected]>

* fix UTs

Signed-off-by: xinyual <[email protected]>

* avoid java doc change

Signed-off-by: xinyual <[email protected]>

* move validate to commonUtlis

Signed-off-by: xinyual <[email protected]>

* remove useless function

Signed-off-by: xinyual <[email protected]>

* change java doc

Signed-off-by: xinyual <[email protected]>

* fix Document process ut

Signed-off-by: xinyual <[email protected]>

* fixed token length: re-implement with start and end offset

Signed-off-by: yuye-aws <[email protected]>

* update exception message

Signed-off-by: yuye-aws <[email protected]>

* fix document chunking processor IT

Signed-off-by: yuye-aws <[email protected]>

* bug fix: adjust start, end content position in fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* update changelog for 2.x release

Signed-off-by: yuye-aws <[email protected]>

* rename processor

Signed-off-by: yuye-aws <[email protected]>

* update default delimiter to be \n\n

Signed-off-by: yuye-aws <[email protected]>

* remove change log in 3.0 unreleased

Signed-off-by: yuye-aws <[email protected]>

* fix IT failure due to chunking processor rename

Signed-off-by: yuye-aws <[email protected]>

* update javadoc for text chunking processor factory

Signed-off-by: yuye-aws <[email protected]>

* adjust functions in chunker interface

Signed-off-by: yuye-aws <[email protected]>

* move algorithm name definition to concrete chunker class

Signed-off-by: yuye-aws <[email protected]>

* update string formatted message for text chunking processor

Signed-off-by: yuye-aws <[email protected]>

* update string formatted message for chunker factory

Signed-off-by: yuye-aws <[email protected]>

* update string formatted message for chunker parameter validator

Signed-off-by: yuye-aws <[email protected]>

* update java doc for delimiter algorithm

Signed-off-by: yuye-aws <[email protected]>

* support range double in chunker parameter validator

Signed-off-by: yuye-aws <[email protected]>

* update string formatted message for fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* update sneaky throw with text chunking processor it

Signed-off-by: yuye-aws <[email protected]>

* add word tokenizer restriction for fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* update error message for multiple algorithms in text chunking processor

Signed-off-by: yuye-aws <[email protected]>

* add comment in text chunking processor

Signed-off-by: yuye-aws <[email protected]>

* validate max chunk limit with util parameter class

Signed-off-by: yuye-aws <[email protected]>

* update comments

Signed-off-by: yuye-aws <[email protected]>

* update comments

Signed-off-by: yuye-aws <[email protected]>

* update java doc

Signed-off-by: yuye-aws <[email protected]>

* update java doc

Signed-off-by: yuye-aws <[email protected]>

* make parameter final

Signed-off-by: yuye-aws <[email protected]>

* implement a map from chunker name to constructor function in chunker factory

Signed-off-by: yuye-aws <[email protected]>

* bug fix in chunker factory

Signed-off-by: yuye-aws <[email protected]>

* remove get all chunkers in chunker factory

Signed-off-by: yuye-aws <[email protected]>

* remove type check for parameter check for max token count

Signed-off-by: yuye-aws <[email protected]>

* remove type check for parameter check for analysis registry

Signed-off-by: yuye-aws <[email protected]>

* implement parser and validator

Signed-off-by: yuye-aws <[email protected]>

* update comment

Signed-off-by: yuye-aws <[email protected]>

* provide fixed token length as the default algorithm

Signed-off-by: yuye-aws <[email protected]>

* adjust exception message

Signed-off-by: yuye-aws <[email protected]>

* adjust exception message

Signed-off-by: yuye-aws <[email protected]>

* use object nonnull and require nonnull

Signed-off-by: yuye-aws <[email protected]>

* apply final to ingest document and chunk count

Signed-off-by: yuye-aws <[email protected]>

* merge parameter validator into the parser

Signed-off-by: yuye-aws <[email protected]>

* assign positive default value for max chunk limit

Signed-off-by: yuye-aws <[email protected]>

* validate supported chunker algorithm in text chunking processor

Signed-off-by: yuye-aws <[email protected]>

* update parameter setting of max chunk limit

Signed-off-by: yuye-aws <[email protected]>

* add unit test with non list of string

Signed-off-by: yuye-aws <[email protected]>

* add unit test with null input

Signed-off-by: yuye-aws <[email protected]>

* add unit test for tokenization exception in fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* tune method name in text chunking processor unit test

Signed-off-by: yuye-aws <[email protected]>

* tune method name in delimiter algorithm unit test

Signed-off-by: yuye-aws <[email protected]>

* add unit test for overlap rate too small in fixed token length algorithm

Signed-off-by: yuye-aws <[email protected]>

* tune method modifier for all classes

Signed-off-by: yuye-aws <[email protected]>

* tune code

Signed-off-by: yuye-aws <[email protected]>

* tune code

Signed-off-by: yuye-aws <[email protected]>

* tune exception type in parameter parser

Signed-off-by: yuye-aws <[email protected]>

* tune comment

Signed-off-by: yuye-aws <[email protected]>

* tune comment

Signed-off-by: yuye-aws <[email protected]>

* include max chunk limit in both algorithms

Signed-off-by: yuye-aws <[email protected]>

* tune comment

Signed-off-by: yuye-aws <[email protected]>

* allow 0 for max chunk limit

Signed-off-by: yuye-aws <[email protected]>

* update runtime max chunk limit in text chunking processor

Signed-off-by: yuye-aws <[email protected]>

* tune code for chunker

Signed-off-by: yuye-aws <[email protected]>

* implement test for multiple field max chunk limit exceed

Signed-off-by: yuye-aws <[email protected]>

* tune methods name in text chunking processor unit tests

Signed-off-by: yuye-aws <[email protected]>

* add unit tests for both algorithms with max chunk limit

Signed-off-by: yuye-aws <[email protected]>

* optimize code

Signed-off-by: yuye-aws <[email protected]>

* extract max chunk limit check to util class

Signed-off-by: yuye-aws <[email protected]>

* resolve code review comments

Signed-off-by: yuye-aws <[email protected]>

* fix unit tests

Signed-off-by: yuye-aws <[email protected]>

* bug fix: only update runtime max chunk limit when enabled

Signed-off-by: yuye-aws <[email protected]>

---------

Signed-off-by: yuye-aws <[email protected]>
Signed-off-by: xinyual <[email protected]>
Signed-off-by: zane-neo <[email protected]>
Signed-off-by: Yuye Zhu <[email protected]>
Signed-off-by: Lu <[email protected]>
Co-authored-by: xinyual <[email protected]>
Co-authored-by: zane-neo <[email protected]>
Co-authored-by: Lu <[email protected]>
(cherry picked from commit eea53aa)
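
The commit log above refers repeatedly to a token limit, an overlap rate computed with `BigDecimal`, and a max chunk limit. The following is a minimal sketch of how those three parameters could interact, not the merged implementation. Assumptions of this sketch: tokens are split on whitespace (the PR uses a tokenizer from the analysis registry and slices the original text by start and end offsets), the overlap rate is restricted to [0, 0.5], a negative `maxChunkLimit` disables the check, and exceeding the limit throws. The class name `FixedTokenLengthSketch` and its method signature are hypothetical.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the plugin's actual chunker class.
public class FixedTokenLengthSketch {

    public static List<String> chunk(String content, int tokenLimit, double overlapRate, int maxChunkLimit) {
        if (tokenLimit <= 0) {
            throw new IllegalArgumentException("tokenLimit must be positive");
        }
        // Assumed range; it guarantees the window always advances by at least one token.
        if (overlapRate < 0.0 || overlapRate > 0.5) {
            throw new IllegalArgumentException("overlapRate must be within [0, 0.5]");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = content.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        // Whitespace tokenization stands in for a real analyzer.
        String[] tokens = trimmed.split("\\s+");

        // BigDecimal avoids floating-point surprises when flooring rate * limit.
        int overlapTokens = BigDecimal.valueOf(overlapRate)
            .multiply(BigDecimal.valueOf(tokenLimit))
            .setScale(0, RoundingMode.DOWN)
            .intValue();
        int step = tokenLimit - overlapTokens;

        int start = 0;
        while (start < tokens.length) {
            if (maxChunkLimit >= 0 && chunks.size() >= maxChunkLimit) {
                throw new IllegalStateException("chunk count exceeds max chunk limit " + maxChunkLimit);
            }
            int end = Math.min(start + tokenLimit, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            if (end == tokens.length) {
                break; // the last window reached the end of the input
            }
            start += step;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Token limit 4 with overlap rate 0.5: consecutive chunks share 2 tokens.
        System.out.println(chunk("a b c d e f g", 4, 0.5, -1));
    }
}
```

Joining tokens with a single space loses the original spacing and punctuation, which is one reason an offset-based approach like the one named in "fixed token length: re-implement with start and end offset" is preferable in practice.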

codecov bot commented Mar 18, 2024

Codecov Report

Attention: Patch coverage is 97.89916%, with 5 lines in your changes missing coverage. Please review.

Project coverage is 84.77%. Comparing base (7e57f65) to head (49396f2).

Files | Patch % | Lines
.../neuralsearch/processor/TextChunkingProcessor.java | 96.03% | 2 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##                2.x     #644      +/-   ##
============================================
+ Coverage     83.28%   84.77%   +1.48%     
- Complexity      674      751      +77     
============================================
  Files            52       59       +7     
  Lines          2088     2325     +237     
  Branches        338      374      +36     
============================================
+ Hits           1739     1971     +232     
- Misses          196      198       +2     
- Partials        153      156       +3     


@zane-neo zane-neo merged commit c27f712 into 2.x Mar 18, 2024
59 checks passed
@github-actions github-actions bot deleted the backport/backport-607-to-2.x branch March 18, 2024 07:01
3 participants