
Add boxed token filter to ease the building of TextAnalyzer with... #2096

Closed
fmassot wants to merge 5 commits

Conversation


@fmassot fmassot commented Jun 19, 2023

...a vec of filters.

Unfortunately, all the additional traits needed to get rid of GATs are a bit confusing for the reader.

Example:

    #[test]
    fn test_text_analyzer_with_filters_boxed() {
        let mut analyzer = TextAnalyzer::build(
            WhitespaceTokenizer::default(),
            vec![
                BoxTokenFilter::from(AlphaNumOnlyFilter),
                BoxTokenFilter::from(LowerCaser),
                BoxTokenFilter::from(RemoveLongFilter::limit(6)),
            ],
        );
        let mut stream = analyzer.token_stream("- first bullet point");
        assert_eq!(stream.next().unwrap().text, "first");
        assert_eq!(stream.next().unwrap().text, "point");
    }


fmassot commented Jun 19, 2023

@trinity-1686a I'm interested in your opinion on this :)

@codecov-commenter

codecov-commenter commented Jun 19, 2023

Codecov Report

Merging #2096 (b82cd08) into main (3b0cbf8) will increase coverage by 0.04%.
The diff coverage is 92.75%.


@@            Coverage Diff             @@
##             main    #2096      +/-   ##
==========================================
+ Coverage   94.39%   94.44%   +0.04%     
==========================================
  Files         321      321              
  Lines       60612    60666      +54     
==========================================
+ Hits        57213    57294      +81     
+ Misses       3399     3372      -27     
| Impacted Files | Coverage Δ |
|---|---|
| src/core/segment_reader.rs | 90.57% <0.00%> (ø) |
| src/lib.rs | 99.05% <ø> (ø) |
| src/tokenizer/mod.rs | 97.84% <ø> (ø) |
| src/tokenizer/tokenizer.rs | 95.65% <93.54%> (+2.96%) ⬆️ |
| src/directory/ram_directory.rs | 90.50% <100.00%> (ø) |
| src/indexer/segment_updater.rs | 95.44% <100.00%> (ø) |
| src/postings/serializer.rs | 98.95% <100.00%> (ø) |
| tokenizer-api/src/lib.rs | 100.00% <100.00%> (+18.75%) ⬆️ |

... and 9 files with indirect coverage changes


@trinity-1686a trinity-1686a left a comment


You can do this to make fewer things public (and also re-export BoxTokenFilter so it's easier to build):

diff --git a/src/tokenizer/mod.rs b/src/tokenizer/mod.rs
index dbc3dd867..046e497fb 100644
--- a/src/tokenizer/mod.rs
+++ b/src/tokenizer/mod.rs
@@ -154,7 +154,7 @@ pub use self::split_compound_words::SplitCompoundWords;
 pub use self::stemmer::{Language, Stemmer};
 pub use self::stop_word_filter::StopWordFilter;
 pub use self::tokenized_string::{PreTokenizedStream, PreTokenizedString};
-pub use self::tokenizer::TextAnalyzer;
+pub use self::tokenizer::{TextAnalyzer, BoxTokenFilter};
 pub use self::tokenizer_manager::TokenizerManager;
 pub use self::whitespace_tokenizer::WhitespaceTokenizer;
 
diff --git a/src/tokenizer/tokenizer.rs b/src/tokenizer/tokenizer.rs
index 011978fb2..f04a798ed 100644
--- a/src/tokenizer/tokenizer.rs
+++ b/src/tokenizer/tokenizer.rs
@@ -1,5 +1,3 @@
-use std::ops::Deref;
-
 /// The tokenizer module contains all of the tools used to process
 /// text in `tantivy`.
 use tokenizer_api::{BoxTokenStream, TokenFilter, TokenStream, Tokenizer};
@@ -12,7 +10,7 @@ pub struct TextAnalyzer {
 }
 
 /// A boxable `Tokenizer`, with its `TokenStream` type erased.
-pub trait BoxableTokenizer: 'static + Send + Sync {
+trait BoxableTokenizer: 'static + Send + Sync {
     /// Creates a boxed token stream for a given `str`.
     fn box_token_stream<'a>(&'a mut self, text: &'a str) -> BoxTokenStream<'a>;
     /// Clone this tokenizer.
@@ -45,7 +43,7 @@ impl Tokenizer for BoxedTokenizer {
 }
 
 /// Trait for the pluggable components of `Tokenizer`s.
-pub trait BoxableTokenFilter: 'static + Send + Sync {
+trait BoxableTokenFilter: 'static + Send + Sync {
     /// Wraps a Tokenizer and returns a new one.
     fn box_transform(&self, tokenizer: BoxedTokenizer) -> Box<dyn BoxableTokenizer>;
 }
@@ -59,14 +57,6 @@ impl<T: TokenFilter> BoxableTokenFilter for T {
 
 pub struct BoxTokenFilter(Box<dyn BoxableTokenFilter>);
 
-impl Deref for BoxTokenFilter {
-    type Target = dyn BoxableTokenFilter;
-
-    fn deref(&self) -> &dyn BoxableTokenFilter {
-        &*self.0
-    }
-}
-
 impl<T: TokenFilter> From<T> for BoxTokenFilter {
     fn from(tokenizer: T) -> BoxTokenFilter {
         BoxTokenFilter(Box::new(tokenizer))
@@ -86,7 +76,7 @@ impl TextAnalyzer {
     ) -> TextAnalyzer {
         let mut boxed_tokenizer = BoxedTokenizer(Box::new(tokenizer));
         for filter in boxed_token_filters.into_iter() {
-            let filtered_boxed_tokenizer = filter.box_transform(boxed_tokenizer);
+            let filtered_boxed_tokenizer = filter.0.box_transform(boxed_tokenizer);
             boxed_tokenizer = BoxedTokenizer(filtered_boxed_tokenizer);
         }
         TextAnalyzer {

Comment on lines 79 to 82
/// When creating a `TextAnalyzer` from a `Tokenizer` alone, prefer using
/// `TextAnalyzer::from(tokenizer)`.
/// When creating a `TextAnalyzer` from a `Tokenizer` and a static set of `TokenFilter`,
/// prefer using `TextAnalyzer::builder(tokenizer).filter(token_filter).build()`.

You should explain why it is preferred (i.e. fewer Boxes, better tokenization performance).
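To make the cost concrete, here is a minimal standalone sketch (toy `Filter` trait, not tantivy's actual API) of the boxed composition path: each filter in the `Vec` is a separate heap allocation, and every application goes through a vtable, which is what the static builder avoids.

```rust
// Hypothetical toy trait: a "filter" maps a token, possibly dropping it.
trait Filter {
    fn apply(&self, token: &str) -> Option<String>;
}

struct LowerCase;
impl Filter for LowerCase {
    fn apply(&self, t: &str) -> Option<String> {
        Some(t.to_lowercase())
    }
}

struct MaxLen(usize);
impl Filter for MaxLen {
    fn apply(&self, t: &str) -> Option<String> {
        // Drop tokens longer than the limit, like RemoveLongFilter would.
        if t.len() <= self.0 { Some(t.to_string()) } else { None }
    }
}

// Runtime-composed chain: flexible (filters chosen at runtime), but each
// filter is its own Box and each call is a dynamic dispatch.
struct BoxedChain(Vec<Box<dyn Filter>>);

impl BoxedChain {
    fn run(&self, token: &str) -> Option<String> {
        self.0.iter().try_fold(token.to_string(), |t, f| f.apply(&t))
    }
}

fn main() {
    let chain = BoxedChain(vec![Box::new(LowerCase), Box::new(MaxLen(6))]);
    assert_eq!(chain.run("Hello").as_deref(), Some("hello"));
    assert_eq!(chain.run("Unreasonable"), None);
}
```

A generic builder instead bakes the whole chain into one concrete type, so the compiler can inline every filter call.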

@guilload
Member

We could remove the intermediate boxes between filters using the same technique as tower does. See builder (ServiceBuilder) and stacks (Stack).

We could also use dyn_clone to reduce the amount of boilerplate to implement Clone for trait objects. It has zero dependencies.
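For context, the manual boilerplate that dyn_clone would remove looks roughly like this (a standalone sketch with an invented `BoxableFilter` trait, not the PR's actual types): every clonable trait object needs a `clone_box` method plus a `Clone` impl for the boxed form.

```rust
// Without dyn_clone: hand-rolled clone support for a trait object.
trait BoxableFilter: Send + Sync {
    fn name(&self) -> String;
    // Each implementor must return a boxed copy of itself.
    fn clone_box(&self) -> Box<dyn BoxableFilter>;
}

// The boilerplate dyn_clone generates for you.
impl Clone for Box<dyn BoxableFilter> {
    fn clone(&self) -> Self {
        self.clone_box()
    }
}

#[derive(Clone)]
struct Stemmer;

impl BoxableFilter for Stemmer {
    fn name(&self) -> String {
        "stemmer".to_string()
    }
    fn clone_box(&self) -> Box<dyn BoxableFilter> {
        Box::new(self.clone())
    }
}

fn main() {
    let a: Box<dyn BoxableFilter> = Box::new(Stemmer);
    let b = a.clone(); // dispatches through clone_box
    assert_eq!(b.name(), "stemmer");
}
```

With dyn_clone this collapses to declaring the supertrait (`trait BoxableFilter: dyn_clone::DynClone + ...`) and invoking `dyn_clone::clone_trait_object!(BoxableFilter);`.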

@adamreichold
Collaborator

> We could remove the intermediate boxes between filters using the same technique as tower does. See builder (ServiceBuilder) and stacks (Stack).

Isn't that already implemented via TextAnalyzer::builder?

@guilload
Member

Yes, but that's for non-boxed token filters. Actually, I don't know if we can avoid boxing the intermediate tokenizers; this is more complex than the tower situation because of the returned associated types and generic associated types.
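The tower-style stacking mentioned above composes layers at the type level, so the fully built chain is one concrete type with no boxing. A minimal sketch under a deliberately simplified `Layer` trait (not tower's real `Service`/`Layer` signatures, and without the associated types that make tantivy's case harder):

```rust
// A "layer" wraps one value in another, tower-style.
trait Layer<T> {
    type Output;
    fn layer(&self, inner: T) -> Self::Output;
}

// Stack composes two layers; a builder pushes layers by nesting Stacks,
// so the whole chain is known at compile time and nothing is boxed.
struct Stack<A, B>(A, B);

impl<T, A: Layer<T>, B: Layer<A::Output>> Layer<T> for Stack<A, B> {
    type Output = B::Output;
    fn layer(&self, inner: T) -> Self::Output {
        self.1.layer(self.0.layer(inner))
    }
}

// Two toy layers operating on a plain String.
struct Lower;
impl Layer<String> for Lower {
    type Output = String;
    fn layer(&self, s: String) -> String {
        s.to_lowercase()
    }
}

struct Trim;
impl Layer<String> for Trim {
    type Output = String;
    fn layer(&self, s: String) -> String {
        s.trim().to_string()
    }
}

fn main() {
    // Stack<Lower, Trim> is a single concrete type: zero allocations,
    // every call statically dispatched.
    let stack = Stack(Lower, Trim);
    assert_eq!(stack.layer("  HeLLo  ".to_string()), "hello");
}
```

The difficulty in tantivy is that `Tokenizer::token_stream` returns a lifetime-generic associated type, which this toy trait sidesteps entirely.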

@fmassot fmassot force-pushed the fmassot/add-box-tokenizer-and-token-filter branch 3 times, most recently from 385c2a8 to a97c0b8 Compare June 22, 2023 07:06
@fmassot fmassot force-pushed the fmassot/add-box-tokenizer-and-token-filter branch from a97c0b8 to b82cd08 Compare June 22, 2023 07:13

fmassot commented Jun 22, 2023

@guilload I introduced dyn_clone.

As for the boxes... the first annoying part is getting rid of the GATs. Let's discuss it offline if you see better options.

@fmassot fmassot marked this pull request as ready for review June 22, 2023 07:16
@@ -63,6 +63,12 @@ pub trait Tokenizer: 'static + Clone + Send + Sync {
/// Simple wrapper of `Box<dyn TokenStream + 'a>`.
pub struct BoxTokenStream<'a>(Box<dyn TokenStream + 'a>);

@PSeitz PSeitz Jun 23, 2023


This should probably not be a newtype

diff --git a/src/indexer/segment_writer.rs b/src/indexer/segment_writer.rs
index bad8638e5..7fa58e18a 100644
--- a/src/indexer/segment_writer.rs
+++ b/src/indexer/segment_writer.rs
@@ -209,7 +209,7 @@ impl SegmentWriter {
                     for value in values {
                         let mut token_stream = match value {
                             Value::PreTokStr(tok_str) => {
-                                PreTokenizedStream::from(tok_str.clone()).into()
+                                Box::new(PreTokenizedStream::from(tok_str.clone()))
                             }
                             Value::Str(ref text) => {
                                 let text_analyzer =
diff --git a/src/query/more_like_this/more_like_this.rs b/src/query/more_like_this/more_like_this.rs
index 994dd96c0..ca86c70b1 100644
--- a/src/query/more_like_this/more_like_this.rs
+++ b/src/query/more_like_this/more_like_this.rs
@@ -4,9 +4,7 @@ use std::collections::{BinaryHeap, HashMap};
 use crate::query::bm25::idf;
 use crate::query::{BooleanQuery, BoostQuery, Occur, Query, TermQuery};
 use crate::schema::{Field, FieldType, IndexRecordOption, Term, Value};
-use crate::tokenizer::{
-    BoxTokenStream, FacetTokenizer, PreTokenizedStream, TokenStream, Tokenizer,
-};
+use crate::tokenizer::{FacetTokenizer, PreTokenizedStream, TokenStream, Tokenizer};
 use crate::{DocAddress, Result, Searcher, TantivyError};
 
 #[derive(Debug, PartialEq)]
@@ -206,8 +204,7 @@ impl MoreLikeThis {
                 for value in values {
                     match value {
                         Value::PreTokStr(tok_str) => {
-                            let mut token_stream: BoxTokenStream =
-                                PreTokenizedStream::from(tok_str.clone()).into();
+                            let mut token_stream = PreTokenizedStream::from(tok_str.clone());
                             token_stream.process(&mut |token| {
                                 if !self.is_noise_word(token.text.clone()) {
                                     let term = Term::from_field_text(field, &token.text);
diff --git a/src/tokenizer/tokenizer.rs b/src/tokenizer/tokenizer.rs
index 29a11b396..d0e3921f7 100644
--- a/src/tokenizer/tokenizer.rs
+++ b/src/tokenizer/tokenizer.rs
@@ -19,7 +19,7 @@ trait BoxableTokenizer: 'static + Send + Sync + DynClone {
 
 impl<T: Tokenizer> BoxableTokenizer for T {
     fn box_token_stream<'a>(&'a mut self, text: &'a str) -> BoxTokenStream<'a> {
-        self.token_stream(text).into()
+        Box::new(self.token_stream(text))
     }
 }
 
diff --git a/tokenizer-api/src/lib.rs b/tokenizer-api/src/lib.rs
index eca0d9566..d0cac2b04 100644
--- a/tokenizer-api/src/lib.rs
+++ b/tokenizer-api/src/lib.rs
@@ -6,7 +6,6 @@
 //! Checkout the [tantivy repo](https://github.com/quickwit-oss/tantivy/tree/main/src/tokenizer) for some examples.
 
 use std::borrow::{Borrow, BorrowMut};
-use std::ops::{Deref, DerefMut};
 
 use serde::{Deserialize, Serialize};
 
@@ -61,35 +60,7 @@ pub trait Tokenizer: 'static + Clone + Send + Sync {
 }
 
 /// Simple wrapper of `Box<dyn TokenStream + 'a>`.
-pub struct BoxTokenStream<'a>(Box<dyn TokenStream + 'a>);
-
-impl<'a> From<BoxTokenStream<'a>> for Box<dyn TokenStream + 'a> {
-    fn from(token_stream: BoxTokenStream<'a>) -> Self {
-        token_stream.0
-    }
-}
-
-impl<'a, T> From<T> for BoxTokenStream<'a>
-where T: TokenStream + 'a
-{
-    fn from(token_stream: T) -> BoxTokenStream<'a> {
-        BoxTokenStream(Box::new(token_stream))
-    }
-}
-
-impl<'a> Deref for BoxTokenStream<'a> {
-    type Target = dyn TokenStream + 'a;
-
-    fn deref(&self) -> &Self::Target {
-        &*self.0
-    }
-}
-
-impl<'a> DerefMut for BoxTokenStream<'a> {
-    fn deref_mut(&mut self) -> &mut Self::Target {
-        &mut *self.0
-    }
-}
+pub type BoxTokenStream<'a> = Box<dyn TokenStream + 'a>;
 
 impl<'a> TokenStream for Box<dyn TokenStream + 'a> {
     fn advance(&mut self) -> bool {

Contributor Author

I would be happy to remove it; it adds a bunch of code without really providing any gains.

Contributor

Let's remove it, and have a separate hotfix that just fixes the visibility.

Contributor Author

I removed it.

> have a separate hotfix that just fixes the visibility

Do you mean this one? main...addconversion


fmassot commented Jun 23, 2023

I really don't like this version. I'm working on a better implementation.


fmassot commented Jun 26, 2023

Closing in favor of #2101

@fmassot fmassot closed this Jun 26, 2023