Add boxed token filter to ease the building of TextAnalyzer with a vec of filters #2096
Conversation
@trinity-1686a I'm interested in your opinion on this :)
Codecov Report
@@ Coverage Diff @@
## main #2096 +/- ##
==========================================
+ Coverage 94.39% 94.44% +0.04%
==========================================
Files 321 321
Lines 60612 60666 +54
==========================================
+ Hits 57213 57294 +81
+ Misses 3399 3372 -27
you can do this to make fewer things public (also re-export `BoxTokenFilter` so it's easier to build):
diff --git a/src/tokenizer/mod.rs b/src/tokenizer/mod.rs
index dbc3dd867..046e497fb 100644
--- a/src/tokenizer/mod.rs
+++ b/src/tokenizer/mod.rs
@@ -154,7 +154,7 @@ pub use self::split_compound_words::SplitCompoundWords;
pub use self::stemmer::{Language, Stemmer};
pub use self::stop_word_filter::StopWordFilter;
pub use self::tokenized_string::{PreTokenizedStream, PreTokenizedString};
-pub use self::tokenizer::TextAnalyzer;
+pub use self::tokenizer::{TextAnalyzer, BoxTokenFilter};
pub use self::tokenizer_manager::TokenizerManager;
pub use self::whitespace_tokenizer::WhitespaceTokenizer;
diff --git a/src/tokenizer/tokenizer.rs b/src/tokenizer/tokenizer.rs
index 011978fb2..f04a798ed 100644
--- a/src/tokenizer/tokenizer.rs
+++ b/src/tokenizer/tokenizer.rs
@@ -1,5 +1,3 @@
-use std::ops::Deref;
-
/// The tokenizer module contains all of the tools used to process
/// text in `tantivy`.
use tokenizer_api::{BoxTokenStream, TokenFilter, TokenStream, Tokenizer};
@@ -12,7 +10,7 @@ pub struct TextAnalyzer {
}
/// A boxable `Tokenizer`, with its `TokenStream` type erased.
-pub trait BoxableTokenizer: 'static + Send + Sync {
+trait BoxableTokenizer: 'static + Send + Sync {
/// Creates a boxed token stream for a given `str`.
fn box_token_stream<'a>(&'a mut self, text: &'a str) -> BoxTokenStream<'a>;
/// Clone this tokenizer.
@@ -45,7 +43,7 @@ impl Tokenizer for BoxedTokenizer {
}
/// Trait for the pluggable components of `Tokenizer`s.
-pub trait BoxableTokenFilter: 'static + Send + Sync {
+trait BoxableTokenFilter: 'static + Send + Sync {
/// Wraps a Tokenizer and returns a new one.
fn box_transform(&self, tokenizer: BoxedTokenizer) -> Box<dyn BoxableTokenizer>;
}
@@ -59,14 +57,6 @@ impl<T: TokenFilter> BoxableTokenFilter for T {
pub struct BoxTokenFilter(Box<dyn BoxableTokenFilter>);
-impl Deref for BoxTokenFilter {
- type Target = dyn BoxableTokenFilter;
-
- fn deref(&self) -> &dyn BoxableTokenFilter {
- &*self.0
- }
-}
-
impl<T: TokenFilter> From<T> for BoxTokenFilter {
fn from(tokenizer: T) -> BoxTokenFilter {
BoxTokenFilter(Box::new(tokenizer))
@@ -86,7 +76,7 @@ impl TextAnalyzer {
) -> TextAnalyzer {
let mut boxed_tokenizer = BoxedTokenizer(Box::new(tokenizer));
for filter in boxed_token_filters.into_iter() {
- let filtered_boxed_tokenizer = filter.box_transform(boxed_tokenizer);
+ let filtered_boxed_tokenizer = filter.0.box_transform(boxed_tokenizer);
boxed_tokenizer = BoxedTokenizer(filtered_boxed_tokenizer);
}
TextAnalyzer {
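To illustrate the pattern the diff above settles on, here is a minimal, self-contained sketch. The trait and type names below are toy stand-ins, not tantivy's real API: a blanket `From<T>` impl erases each concrete filter into a `BoxTokenFilter`, so callers can pass a heterogeneous `Vec<BoxTokenFilter>` while the wrapped trait stays private.

```rust
// Toy stand-ins for the tantivy traits; all names here are illustrative.
trait TokenFilter: 'static + Send + Sync {
    fn apply(&self, token: &str) -> String;
}

// Erased wrapper, mirroring `pub struct BoxTokenFilter(Box<dyn BoxableTokenFilter>)`.
struct BoxTokenFilter(Box<dyn TokenFilter>);

// The blanket `From` impl is what lets callers write `filter.into()`.
impl<T: TokenFilter> From<T> for BoxTokenFilter {
    fn from(filter: T) -> BoxTokenFilter {
        BoxTokenFilter(Box::new(filter))
    }
}

struct LowerCaser;
impl TokenFilter for LowerCaser {
    fn apply(&self, token: &str) -> String {
        token.to_lowercase()
    }
}

struct Truncate(usize);
impl TokenFilter for Truncate {
    fn apply(&self, token: &str) -> String {
        token.chars().take(self.0).collect()
    }
}

// Analogous to the loop in the patched constructor that folds the
// filters over the boxed tokenizer.
fn apply_filters(filters: &[BoxTokenFilter], input: &str) -> String {
    filters
        .iter()
        .fold(input.to_string(), |token, filter| filter.0.apply(&token))
}

fn main() {
    let filters: Vec<BoxTokenFilter> = vec![LowerCaser.into(), Truncate(3).into()];
    println!("{}", apply_filters(&filters, "HELLO")); // prints "hel"
}
```

Because the wrapper field is accessed as `filter.0` inside the crate, the `Deref` impl removed by the diff is no longer needed.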
src/tokenizer/tokenizer.rs
/// When creating a `TextAnalyzer` from a `Tokenizer` alone, prefer using
/// `TextAnalyzer::from(tokenizer)`.
/// When creating a `TextAnalyzer` from a `Tokenizer` and a static set of `TokenFilter`,
/// prefer using `TextAnalyzer::builder(tokenizer).filter(token_filter).build()`.
you should explain why it is preferred (i.e. fewer `Box`es, better tokenization perf)
We could remove the intermediate boxes between filters using the same technique as tower does.
Isn't that already implemented via
Yes, but that's for non-boxed token filters. Actually, I don't know if we can avoid boxing the intermediate tokenizers; this is more complex than the tower situation because of the returned associated types and generic associated types.
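To make the GAT problem concrete, here is a minimal, self-contained sketch (toy traits, not tantivy's actual definitions): a trait with a generic associated type is not object-safe, so `Box<dyn Tokenizer>` won't compile, and the workaround is a second, object-safe trait whose method returns an already-boxed stream.

```rust
// Toy version of the issue discussed above; names are illustrative.
trait Tokenizer: 'static {
    // A GAT: the stream type borrows from the tokenizer and the input.
    // This makes the trait not object-safe.
    type TokenStream<'a>: Iterator<Item = &'a str> + 'a
    where
        Self: 'a;
    fn token_stream<'a>(&'a mut self, text: &'a str) -> Self::TokenStream<'a>;
}

// Object-safe companion trait, mirroring `BoxableTokenizer::box_token_stream`
// from the diffs: the associated type is erased behind a Box.
trait BoxableTokenizer: 'static {
    fn box_token_stream<'a>(&'a mut self, text: &'a str)
        -> Box<dyn Iterator<Item = &'a str> + 'a>;
}

// Blanket impl: every statically-typed Tokenizer is boxable.
impl<T: Tokenizer> BoxableTokenizer for T {
    fn box_token_stream<'a>(&'a mut self, text: &'a str)
        -> Box<dyn Iterator<Item = &'a str> + 'a>
    {
        Box::new(self.token_stream(text))
    }
}

struct WhitespaceTokenizer;
impl Tokenizer for WhitespaceTokenizer {
    type TokenStream<'a> = std::str::SplitWhitespace<'a>;
    fn token_stream<'a>(&'a mut self, text: &'a str) -> Self::TokenStream<'a> {
        text.split_whitespace()
    }
}

fn main() {
    // `Box<dyn Tokenizer>` would be rejected; `Box<dyn BoxableTokenizer>` works.
    let mut boxed: Box<dyn BoxableTokenizer> = Box::new(WhitespaceTokenizer);
    let tokens: Vec<&str> = boxed.box_token_stream("hello boxed world").collect();
    println!("{tokens:?}"); // prints ["hello", "boxed", "world"]
}
```

This also shows why tower-style static layering is harder here: each filter would have to thread the borrowing stream type through its own associated types, whereas boxing simply erases them at each layer.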
@guilload I introduced As for the boxes... the first annoying part is getting rid of the GATs. Let's discuss it offline if you see better options.
tokenizer-api/src/lib.rs
@@ -63,6 +63,12 @@ pub trait Tokenizer: 'static + Clone + Send + Sync {
/// Simple wrapper of `Box<dyn TokenStream + 'a>`.
pub struct BoxTokenStream<'a>(Box<dyn TokenStream + 'a>);
This should probably not be a newtype
diff --git a/src/indexer/segment_writer.rs b/src/indexer/segment_writer.rs
index bad8638e5..7fa58e18a 100644
--- a/src/indexer/segment_writer.rs
+++ b/src/indexer/segment_writer.rs
@@ -209,7 +209,7 @@ impl SegmentWriter {
for value in values {
let mut token_stream = match value {
Value::PreTokStr(tok_str) => {
- PreTokenizedStream::from(tok_str.clone()).into()
+ Box::new(PreTokenizedStream::from(tok_str.clone()))
}
Value::Str(ref text) => {
let text_analyzer =
diff --git a/src/query/more_like_this/more_like_this.rs b/src/query/more_like_this/more_like_this.rs
index 994dd96c0..ca86c70b1 100644
--- a/src/query/more_like_this/more_like_this.rs
+++ b/src/query/more_like_this/more_like_this.rs
@@ -4,9 +4,7 @@ use std::collections::{BinaryHeap, HashMap};
use crate::query::bm25::idf;
use crate::query::{BooleanQuery, BoostQuery, Occur, Query, TermQuery};
use crate::schema::{Field, FieldType, IndexRecordOption, Term, Value};
-use crate::tokenizer::{
- BoxTokenStream, FacetTokenizer, PreTokenizedStream, TokenStream, Tokenizer,
-};
+use crate::tokenizer::{FacetTokenizer, PreTokenizedStream, TokenStream, Tokenizer};
use crate::{DocAddress, Result, Searcher, TantivyError};
#[derive(Debug, PartialEq)]
@@ -206,8 +204,7 @@ impl MoreLikeThis {
for value in values {
match value {
Value::PreTokStr(tok_str) => {
- let mut token_stream: BoxTokenStream =
- PreTokenizedStream::from(tok_str.clone()).into();
+ let mut token_stream = PreTokenizedStream::from(tok_str.clone());
token_stream.process(&mut |token| {
if !self.is_noise_word(token.text.clone()) {
let term = Term::from_field_text(field, &token.text);
diff --git a/src/tokenizer/tokenizer.rs b/src/tokenizer/tokenizer.rs
index 29a11b396..d0e3921f7 100644
--- a/src/tokenizer/tokenizer.rs
+++ b/src/tokenizer/tokenizer.rs
@@ -19,7 +19,7 @@ trait BoxableTokenizer: 'static + Send + Sync + DynClone {
impl<T: Tokenizer> BoxableTokenizer for T {
fn box_token_stream<'a>(&'a mut self, text: &'a str) -> BoxTokenStream<'a> {
- self.token_stream(text).into()
+ Box::new(self.token_stream(text))
}
}
diff --git a/tokenizer-api/src/lib.rs b/tokenizer-api/src/lib.rs
index eca0d9566..d0cac2b04 100644
--- a/tokenizer-api/src/lib.rs
+++ b/tokenizer-api/src/lib.rs
@@ -6,7 +6,6 @@
//! Checkout the [tantivy repo](https://github.com/quickwit-oss/tantivy/tree/main/src/tokenizer) for some examples.
use std::borrow::{Borrow, BorrowMut};
-use std::ops::{Deref, DerefMut};
use serde::{Deserialize, Serialize};
@@ -61,35 +60,7 @@ pub trait Tokenizer: 'static + Clone + Send + Sync {
}
/// Simple wrapper of `Box<dyn TokenStream + 'a>`.
-pub struct BoxTokenStream<'a>(Box<dyn TokenStream + 'a>);
-
-impl<'a> From<BoxTokenStream<'a>> for Box<dyn TokenStream + 'a> {
- fn from(token_stream: BoxTokenStream<'a>) -> Self {
- token_stream.0
- }
-}
-
-impl<'a, T> From<T> for BoxTokenStream<'a>
-where T: TokenStream + 'a
-{
- fn from(token_stream: T) -> BoxTokenStream<'a> {
- BoxTokenStream(Box::new(token_stream))
- }
-}
-
-impl<'a> Deref for BoxTokenStream<'a> {
- type Target = dyn TokenStream + 'a;
-
- fn deref(&self) -> &Self::Target {
- &*self.0
- }
-}
-
-impl<'a> DerefMut for BoxTokenStream<'a> {
- fn deref_mut(&mut self) -> &mut Self::Target {
- &mut *self.0
- }
-}
+pub type BoxTokenStream<'a> = Box<dyn TokenStream + 'a>;
impl<'a> TokenStream for Box<dyn TokenStream + 'a> {
fn advance(&mut self) -> bool {
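A self-contained sketch of why the type alias loses nothing over the newtype (toy `TokenStream` trait below, not tantivy's): `Box<dyn Trait>` already auto-derefs for method calls, so the hand-written `Deref`/`DerefMut` impls that the diff deletes were pure boilerplate.

```rust
// Toy trait standing in for tantivy's TokenStream.
trait TokenStream {
    fn advance(&mut self) -> bool;
}

// The alias from the diff, applied to the toy trait.
type BoxTokenStream<'a> = Box<dyn TokenStream + 'a>;

// A stream that yields `n` tokens, then ends.
struct Countdown(u32);
impl TokenStream for Countdown {
    fn advance(&mut self) -> bool {
        if self.0 == 0 {
            return false;
        }
        self.0 -= 1;
        true
    }
}

fn count_tokens(mut stream: BoxTokenStream<'_>) -> u32 {
    // No Deref/DerefMut impls needed: Box auto-derefs for the method call.
    let mut n = 0;
    while stream.advance() {
        n += 1;
    }
    n
}

fn main() {
    let stream: BoxTokenStream<'static> = Box::new(Countdown(3));
    println!("{}", count_tokens(stream)); // prints 3
}
```

The alias also removes the `From<T> for BoxTokenStream` conversions, which is why the call sites above switch from `.into()` to an explicit `Box::new(...)`.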
I would be happy to remove it; it adds a bunch of code and does not really provide any gains.
Let's remove it, and have a separate hotfix that just fixes the visibility
I removed it.

> have a separate hotfix that just fixes the visibility

Do you mean this one? main...addconversion
I really don't like this version. I'm working on a better implementation.
Closing in favor of #2101
All those additional traits to get rid of GATs are a bit confusing for the reader, unfortunately.
Example:
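A self-contained sketch of the builder pattern referenced earlier (`TextAnalyzer::builder(tokenizer).filter(token_filter).build()`). Everything below is a toy mock-up, not tantivy's real API; it just shows the shape of the chained-builder call, using plain `fn` pointers as stand-in filters so the example stays small.

```rust
// Toy builder mirroring the `builder(...).filter(...).build()` shape
// discussed in the review; all names here are illustrative.
struct TextAnalyzer {
    steps: Vec<fn(String) -> String>,
}

struct TextAnalyzerBuilder {
    steps: Vec<fn(String) -> String>,
}

impl TextAnalyzer {
    fn builder() -> TextAnalyzerBuilder {
        TextAnalyzerBuilder { steps: Vec::new() }
    }
    // Run the input through every configured step, in order.
    fn analyze(&self, text: &str) -> String {
        self.steps.iter().fold(text.to_string(), |acc, f| f(acc))
    }
}

impl TextAnalyzerBuilder {
    // Each `filter` call appends a processing step and returns the
    // builder by value, enabling method chaining.
    fn filter(mut self, f: fn(String) -> String) -> Self {
        self.steps.push(f);
        self
    }
    fn build(self) -> TextAnalyzer {
        TextAnalyzer { steps: self.steps }
    }
}

fn main() {
    let analyzer = TextAnalyzer::builder()
        .filter(|s| s.to_lowercase())
        .filter(|s| s.trim().to_string())
        .build();
    println!("{}", analyzer.analyze("  HELLO  ")); // prints "hello"
}
```

In the real PR the builder composes filters with generics instead of a homogeneous `Vec`, which is what avoids the per-filter boxing of the `Vec<BoxTokenFilter>` constructor.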