
Settings customizing tokenization #3946

Merged: 14 commits merged into main from settings-customizing-tokenization on Aug 10, 2023
Conversation

@ManyTheFish (Member) commented Jul 25, 2023

Pull Request

This pull request allows the user to customize Meilisearch's tokenization by providing specialized settings.

Small documentation

All the new settings can be set and reset like the other index settings by calling the route /indexes/:name/settings.
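For illustration, here is a minimal sketch of that set/reset round trip in Python, taking the separatorTokens route described below as an example. The host, API key, and articles index name are assumptions, not part of this PR:

import requests

MEILI = "http://localhost:7700"                   # assumed local instance
HEADERS = {"Authorization": "Bearer MASTER_KEY"}  # assumed API key

# Set the separator tokens for the "articles" index; Meilisearch answers
# with an enqueued task summary rather than applying the change synchronously.
r = requests.put(
    f"{MEILI}/indexes/articles/settings/separator-tokens",
    headers=HEADERS,
    json=["§", "&sep"],
)
print(r.json())

# Reset the setting back to its default value with DELETE on the same route.
r = requests.delete(
    f"{MEILI}/indexes/articles/settings/separator-tokens",
    headers=HEADERS,
)
print(r.json())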

nonSeparatorTokens

The Meilisearch word segmenter uses a default list of separators to split words. For some specific use cases, however, some of the default separators should not be considered separators; the nonSeparatorTokens setting allows removing tokens from the default separator list.

Request payload: PUT /indexes/articles/settings/non-separator-tokens

["@", "#", "&"]

separatorTokens

Some use cases need additional separators: some are related to a specific way of parsing technical documents, others to encodings found in documents. The separatorTokens setting allows adding tokens to the list of separators.

Request payload: PUT /indexes/articles/settings/separator-tokens

["§", "&sep"]

dictionary

The Meilisearch word segmenter relies on separators and language-based word dictionaries to split words. However, this segmentation is inaccurate on technical or use-case-specific vocabulary (like G/Box for Gear Box) or on proper nouns (like J. R. R. when parsing J. R. R. Tolkien). The dictionary setting allows defining a list of words that will be segmented exactly as written in the list.

Request payload: PUT /indexes/articles/settings/dictionary

["J. R. R.", "J.R.R."]

This last feature synergizes well with the stopWords and synonyms settings, making it possible to segment words and correctly retrieve their synonyms:
Request payload: PATCH /indexes/articles/settings

{
    "dictionary": ["J. R. R.", "J.R.R."],
    "synonyms": {
        "J.R.R.": ["jrr", "J. R. R."],
        "J. R. R.": ["jrr", "J.R.R."],
        "jrr": ["J.R.R.", "J. R. R."]
    }
}
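As an end-to-end sketch of this combination in Python, assuming a local instance, a master key, and an articles index containing a document that mentions J. R. R. Tolkien (all of those are assumptions):

import requests

MEILI = "http://localhost:7700"                   # assumed local instance
HEADERS = {"Authorization": "Bearer MASTER_KEY"}  # assumed API key

# Apply the dictionary and the synonyms together in one settings update.
requests.patch(
    f"{MEILI}/indexes/articles/settings",
    headers=HEADERS,
    json={
        "dictionary": ["J. R. R.", "J.R.R."],
        "synonyms": {
            "J.R.R.": ["jrr", "J. R. R."],
            "J. R. R.": ["jrr", "J.R.R."],
            "jrr": ["J.R.R.", "J. R. R."],
        },
    },
)

# Once the task is processed, "J. R. R." is segmented as a single token, so a
# search for "jrr" can reach documents mentioning "J. R. R. Tolkien" through
# the synonyms.
r = requests.post(
    f"{MEILI}/indexes/articles/search",
    headers=HEADERS,
    json={"q": "jrr"},
)
print(r.json().get("hits"))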

Related specifications:

Try it with Docker

$ docker pull getmeili/meilisearch:prototype-tokenizer-customization-3

Related issues

Fixes #3610
Fixes #3917
Fixes meilisearch/product#468
Fixes meilisearch/product#160
Fixes meilisearch/product#260
Fixes meilisearch/product#381
Fixes meilisearch/product#131
Related to #2879

Fixes #2760

What does this PR do?

  • Add a nonSeparatorTokens setting allowing the removal of tokens from the default separator list
  • Add a separatorTokens setting allowing the addition of tokens to the separator list
  • Add a dictionary setting allowing the segmentation of specific words to be overridden
  • Add a new error code invalid_settings_non_separator_tokens (invalid_request)
  • Add a new error code invalid_settings_separator_tokens (invalid_request)
  • Add a new error code invalid_settings_dictionary (invalid_request)

@tobiasnitsche commented Aug 2, 2023

Hi Maria / ManyTheFish,

thanks for taking a look at this and putting in the effort, I really appreciate it!

Two things came to mind:

  • Would it be possible to just use the PUT /separatorTokens endpoint and overwrite the whole separator list with what's sent over? Then I could set "[]" to disable tokenization entirely.

  • Is it possible to use it only on specific attributes, similar to typoTolerance (https://www.meilisearch.com/docs/learn/configuration/typo_tolerance#disableonattributes)? For example, our article index has one attribute, "serialnumber", that should have tokenization disabled. The typoTolerance setting I mean is sketched below.
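For context, the per-attribute typo setting I'm referring to looks roughly like this (a Python sketch assuming our articles index on a local instance with a master key; there is no tokenization equivalent today):

import requests

MEILI = "http://localhost:7700"                   # assumed local instance
HEADERS = {"Authorization": "Bearer MASTER_KEY"}  # assumed API key

# The existing per-attribute escape hatch for typos: disable typo tolerance on
# the "serialnumber" attribute only. Nothing comparable exists for tokenization.
requests.patch(
    f"{MEILI}/indexes/articles/settings/typo-tolerance",
    headers=HEADERS,
    json={"disableOnAttributes": ["serialnumber"]},
)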

But thanks again for working on this!

@ManyTheFish (Member, Author) commented Aug 7, 2023

Hello @tobiasnitsche,

First of all, this feature is not meant to deactivate tokenization in Meilisearch but to customize the tokenizer's behavior. Deactivating tokenization would put the user fully in charge of it on the indexing side, the settings side, and the search side, meaning that each API would have to accept an array of tokens instead of raw strings.

> Would it be possible to just use the PUT /separatorTokens endpoint and overwrite the whole separator list with what's sent over? Then I could set "[]" to disable tokenization entirely.

No. You could eventually put the exhaustive list of separators in the nonSeparatorTokens list, but that wouldn't deactivate tokenization: it would only stop words from being split on the separators contained in the list, while the other kinds of segmentation and the normalization would remain active.
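To make that concrete, here is a sketch of what such a payload would look like (the separator subset below is illustrative, not the exhaustive default list):

import requests

MEILI = "http://localhost:7700"                   # assumed local instance
HEADERS = {"Authorization": "Bearer MASTER_KEY"}  # assumed API key

# Declaring (a subset of) the default separators as non-separators stops the
# segmenter from splitting on those characters...
requests.put(
    f"{MEILI}/indexes/articles/settings/non-separator-tokens",
    headers=HEADERS,
    json=["-", "/", ".", ",", "@", "#", "&"],     # illustrative subset only
)
# ...but language-based word segmentation and normalization (lowercasing,
# de-accenting, and so on) still run, so tokenization is customized, not
# disabled.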

> Is it possible to use it only on specific attributes, similar to typoTolerance?

Not with this feature. The only thing you can do is use the disableOnWords feature, but it isn't constrained to a specific attribute.

Thank you for your report and your interest in the feature,
sorry if my answer is a bit disappointing,

see you!

@tobiasnitsche commented Aug 7, 2023

Thanks for your honest & detailed feedback on this!

@ManyTheFish ManyTheFish marked this pull request as ready for review August 8, 2023 16:30
@curquiza curquiza added this to the v1.4.0 milestone Aug 8, 2023
@irevoire (Member) left a comment


Nice, there are a lot of tests, thanks!
I left a few questions, but overall, I think we'll be able to merge it in no time.

Review threads on:
meilisearch-types/src/error.rs
milli/src/update/index_documents/extract/mod.rs (outdated)
milli/src/update/settings.rs
@irevoire (Member) left a comment


Thanks!

bors merge

@meili-bors bot (Contributor) commented Aug 10, 2023

Build succeeded:

@meili-bors meili-bors bot merged commit 8084cf2 into main Aug 10, 2023
8 checks passed
@meili-bors meili-bors bot deleted the settings-customizing-tokenization branch August 10, 2023 10:55
@meili-bot meili-bot added the v1.4.0 PRs/issues solved in v1.4.0 released on 2023-09-25 label Sep 26, 2023
@tobiasnitsche commented Sep 26, 2023

> Hello @tobiasnitsche,
>
> First of all, this feature is not meant to deactivate tokenization in Meilisearch...

Hi @ManyTheFish

Unfortunately this feature is not that useful for me, as the separators can only be set per index and not per field. :/

My index (in a nutshell)

articles:

article-number | description
-------------- | ------------------------------
A-12345        | "An article description"
123-B/2023     | "Another article description"

To have a nice search experience, I need to set / disable the tokenizers only on the article number; the description should keep the normal, awesome Meilisearch experience.

Is there any plan to add this feature on a per-field basis? Other settings like maxTypo have the option to be set per field.

@curquiza (Member) commented Sep 26, 2023

Hello @tobiasnitsche,
thanks for your feedback, I've let our product team know.

If you want to detail your use case and talk directly with the product team, you can open a discussion here (or comment on the existing one if there is one).
Just saw you're already present on the related discussion: meilisearch/product#422. It's being discussed with the product team now.

Thanks in advance for your feedback, it's really helpful!

@tobiasnitsche commented Nov 17, 2023

Thank you @curquiza for the encouraging comment.

After some thinking, an "email" field might also be a good use case for disabling tokenization on a per-field basis.

Anyway, I love Meilisearch so far, keep up the good work!

@tobiasnitsche commented

@curquiza @irevoire

Is there any update on this?

Honestly, for me it's a bit embarrassing to tell users that they cannot search for email addresses or article numbers...

Would love to have an update. This tokenization customization does not solve my problem, as I've described a few times...

@curquiza (Member) commented Mar 7, 2024

Hello @tobiasnitsche

Sorry, but this is a closed PR; we would rather avoid discussing too much in it so we don't lose track of information. PRs are only for implementation and technical details, not for product discussions 😊 Let's keep the focus on the existing support threads you're already interacting in:

* https://github.com/orgs/meilisearch/discussions/422
* [Any idea to disable tokenization on specific field? #3380](https://github.com/meilisearch/meilisearch/discussions/3380)

@tobiasnitsche commented Mar 7, 2024

> Hello @tobiasnitsche
>
> Sorry, but this is a closed PR; we would rather avoid discussing too much in it so we don't lose track of information. PRs are only for implementation and technical details, not for product discussions 😊 Let's keep the focus on the existing support threads you're already interacting in:
>
> * https://github.com/orgs/meilisearch/discussions/422
> * [Any idea to disable tokenization on specific field? #3380](https://github.com/meilisearch/meilisearch/discussions/3380)

Understood, let's move it there :-) My detailed description can be found here as well: #3380
