Add support for custom tokenizers ngram and regex. #3575

Merged: fmassot merged 6 commits into main from fmassot/custom-tokenizers on Jul 4, 2023

@fmassot fmassot commented Jun 25, 2023

Fix #3056 and #3392

This PR adds support for custom analyzers, aka tokenizer + filters.

Doc mapping config with custom tokenizers

doc_mapping:
  tokenizers:
    - name: service_regex
      type: regex
      pattern: "\\w*"
  field_mappings:
    - name: service
      type: text
      tokenizer: service_regex

Endpoint to test a custom tokenizer

POST /api/v1/analyze

with payload

{
  "type": "ngram",
  "min_gram": 3,
  "max_gram": 5,
  "text": "hello"
}
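
To make the ngram payload above concrete, here is a minimal sketch of character n-gram enumeration in plain Rust. It is not the code from this PR: the function name `ngrams` is hypothetical, and the enumeration order (by start position, then by length) is an assumption. It only models text and byte offsets, not token positions.

```rust
/// Returns (offset_from, offset_to, gram) for every character n-gram of
/// `text` whose length in characters lies in `min_gram..=max_gram`.
/// Offsets are byte offsets into `text`.
/// Hypothetical helper for illustration only; not part of Quickwit.
pub fn ngrams(text: &str, min_gram: usize, max_gram: usize) -> Vec<(usize, usize, String)> {
    // Reject empty or inverted ranges up front.
    if min_gram == 0 || min_gram > max_gram {
        return Vec::new();
    }
    // Byte offset of every character boundary, including the end of the string,
    // so that slicing never cuts a multi-byte character in half.
    let mut bounds: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    bounds.push(text.len());
    let num_chars = bounds.len() - 1;

    let mut grams = Vec::new();
    for start in 0..num_chars {
        for len in min_gram..=max_gram {
            let end = start + len;
            if end > num_chars {
                // Longer grams from this start would run past the end too.
                break;
            }
            let (from, to) = (bounds[start], bounds[end]);
            grams.push((from, to, text[from..to].to_string()));
        }
    }
    grams
}
```

With this sketch, `ngrams("hello", 3, 5)` yields the grams hel, hell, hello, ell, ello, llo, in that order.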

Example request:

curl -XPOST http://localhost:7280/api/v1/analyze -H "content-type: application/json" --data '{"type": "regex", "pattern": "\\w+", "text":"hello world"}'

Response:

[
  {
    "offset_from": 0,
    "offset_to": 5,
    "position": 0,
    "position_length": 1,
    "text": "hello"
  },
  {
    "offset_from": 6,
    "offset_to": 11,
    "position": 1,
    "position_length": 1,
    "text": "world"
  }
]
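
The fields in the response above can be illustrated with a small std-only Rust sketch. It is an approximation under stated assumptions, not the PR's implementation: the `Token` struct and `word_tokens` function are hypothetical names, and the regex `\w+` is approximated by "alphanumeric or underscore" so that no regex crate is needed.

```rust
/// One token, mirroring the fields of the response shown above.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
pub struct Token {
    pub offset_from: usize,
    pub offset_to: usize,
    pub position: usize,
    pub position_length: usize,
    pub text: String,
}

/// Splits `text` into maximal runs of word characters.
/// Assumption: `is_alphanumeric() || '_'` stands in for the regex `\w+`.
pub fn word_tokens(text: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    // A trailing sentinel space flushes a run that reaches the end of the text.
    let sentinel = std::iter::once((text.len(), ' '));
    for (idx, ch) in text.char_indices().chain(sentinel) {
        let is_word = ch.is_alphanumeric() || ch == '_';
        match (start, is_word) {
            // A new run of word characters begins here.
            (None, true) => start = Some(idx),
            // The current run ends just before this character.
            (Some(from), false) => {
                let position = tokens.len();
                tokens.push(Token {
                    offset_from: from,
                    offset_to: idx,
                    position,
                    position_length: 1,
                    text: text[from..idx].to_string(),
                });
                start = None;
            }
            _ => {}
        }
    }
    tokens
}
```

Calling `word_tokens("hello world")` produces two tokens whose byte offsets (0 to 5 and 6 to 11) and positions (0 and 1) match the response above.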

Follow-up in PRs to come

  • Add documentation.
  • Add a multilanguage tokenizer.
  • What else?

@fmassot fmassot force-pushed the fmassot/custom-tokenizers branch 4 times, most recently from 3f8026c to ce91eb0, on June 25, 2023 17:36
@fmassot fmassot mentioned this pull request Jun 26, 2023
@fmassot fmassot force-pushed the fmassot/custom-tokenizers branch from 5b23b2d to 0dd5aae on July 3, 2023 22:30
@fmassot fmassot marked this pull request as ready for review July 4, 2023 10:05
@fmassot fmassot requested a review from fulmicoton July 4, 2023 10:06
@fmassot fmassot force-pushed the fmassot/custom-tokenizers branch from 0b03f0c to a6cd4f4 on July 4, 2023 10:18
@@ -38,6 +39,10 @@ pub(crate) use self::field_mapping_entry::{
FieldMappingEntryForSerialization, IndexRecordOptionSchema, QuickwitTextTokenizer,
};
pub(crate) use self::field_mapping_type::FieldMappingType;
pub use self::tokenizer_entry::{
analyze_text, NgramTokenizerOption, RegexTokenizerOption, TokenFilterType, TokenizerConfig,

@fulmicoton fulmicoton Jul 4, 2023

Why are those public? Is it for the REST endpoint?

@fmassot fmassot (author)

Good catch. Only analyze_text, TokenizerConfig, and TokenizerEntry should be public; the others only need to be pub(crate), for the OpenAPI stuff.
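
The distinction discussed in this thread can be sketched with a small, self-contained Rust module. All names here (`doc_mapper`, `NgramOption`, `describe`) are hypothetical stand-ins and not the PR's actual items; the sketch only shows how `pub use` and `pub(crate) use` re-exports differ in reach.

```rust
mod doc_mapper {
    // Private module: its items are reachable only through the re-exports below.
    mod tokenizer_entry {
        #[derive(Debug, Clone, PartialEq)]
        pub struct NgramOption {
            pub min_gram: usize,
            pub max_gram: usize,
        }

        #[derive(Debug, Clone, PartialEq)]
        pub struct TokenizerConfig {
            pub name: String,
            pub ngram: NgramOption,
        }

        /// Renders a one-line summary of a tokenizer config.
        pub fn describe(config: &TokenizerConfig) -> String {
            format!(
                "{}: ngram({}..={})",
                config.name, config.ngram.min_gram, config.ngram.max_gram
            )
        }
    }

    // Re-exported as `pub`: as visible as `doc_mapper` itself, so these items
    // become part of the crate's public API when the module is public.
    pub use self::tokenizer_entry::{describe, TokenizerConfig};

    // Re-exported as `pub(crate)`: nameable anywhere inside this crate,
    // but never by downstream crates.
    pub(crate) use self::tokenizer_entry::NgramOption;
}
```

Inside the same crate, both `doc_mapper::TokenizerConfig` and `doc_mapper::NgramOption` resolve. If `doc_mapper` were a public module of a library, a dependent crate could name only `TokenizerConfig` and `describe`.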

@fulmicoton

I only had minor comments, so I approved.

@fmassot fmassot force-pushed the fmassot/custom-tokenizers branch from debfed6 to bc65117 Compare July 4, 2023 22:16
@fmassot fmassot enabled auto-merge (squash) July 4, 2023 22:28
@fmassot fmassot merged commit 50a6e71 into main Jul 4, 2023
7 checks passed
@fmassot fmassot deleted the fmassot/custom-tokenizers branch July 4, 2023 22:32
Successfully merging this pull request may close these issues.

Support custom tokenizers in index config