Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an allowlist to the tokenizer builder #132

Closed
ManyTheFish opened this issue Sep 26, 2022 · 2 comments · Fixed by #148
Closed

Add an allowlist to the tokenizer builder #132

ManyTheFish opened this issue Sep 26, 2022 · 2 comments · Fixed by #148
Labels
good first issue Good for newcomers

Comments

@ManyTheFish
Copy link
Member

ManyTheFish commented Sep 26, 2022

Today Charabia detects automatically the Language of the provided text choosing the best tokenization pipeline in consequence.

drawback

Sometimes the detection is not accurate, mainly when the provided text is short, and the user can't choose manually the Languages contained in the provided text.

enhancement

Add a new setting in the TokenizerBuilder forcing the detection to choose in a subset of Languages, and when there are no choices, skip the detection and pick directly the specialized pipeline.
Whatlang, the library used to detect the Language, provides a way to set a subset of Languages that can be detected with the Detector::with_allowlist method.

Technical approach:

  1. add an optional allowlist parameter to the method detect of the Detect trait in detection/mod.rs
  2. add a segment_with_allowlist and a segment_str_with_allowlist with an additional allowlist parameter to the Segment trait in segmenter/mod.rs
  3. add an allowlist method to the TokenizerBuilder struct in tokenizer.rs

The allowlist should be a hashmap of Script -> [Languages]

Files expected to be modified

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

@curquiza curquiza transferred this issue from meilisearch/engine-team Sep 29, 2022
@yenwel
Copy link
Contributor

yenwel commented Oct 5, 2022

I am attempting this one for hacktober

@curquiza
Copy link
Member

curquiza commented Oct 6, 2022

Hello @yenwel

Thanks for your interest in this project 🔥 You are definitely more than welcome to open a PR for this!

FYI, we prefer not assigning people to our issues because sometimes people ask to be assigned and never come back, which discourages the volunteer contributors from opening a PR to fix this issue.
We will accept and merge the first PR that fixes correctly and well implements the issue following our contributing guidelines.

We are looking forward to reviewing your PR 😊

@yenwel yenwel mentioned this issue Oct 7, 2022
3 tasks
@bors bors bot closed this as completed in 3d735c7 Oct 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants