
Add boxed token filter to ease the building of TextAnalyzer with... #2096

Closed
fmassot wants to merge 5 commits

Conversation


@fmassot fmassot commented Jun 19, 2023

...a vec of filters.

Unfortunately, all the additional traits needed to get rid of GATs are a bit confusing for the reader.

Example:

    #[test]
    fn test_text_analyzer_with_filters_boxed() {
        let mut analyzer = TextAnalyzer::build(
            WhitespaceTokenizer::default(),
            vec![
                BoxTokenFilter::from(AlphaNumOnlyFilter),
                BoxTokenFilter::from(LowerCaser),
                BoxTokenFilter::from(RemoveLongFilter::limit(6)),
            ],
        );
        let mut stream = analyzer.token_stream("- first bullet point");
        assert_eq!(stream.next().unwrap().text, "first");
        assert_eq!(stream.next().unwrap().text, "point");
    }


fmassot commented Jun 19, 2023

@trinity-1686a I'm interested in your opinion on this :)

@codecov-commenter

codecov-commenter commented Jun 19, 2023

Codecov Report

Merging #2096 (b82cd08) into main (3b0cbf8) will increase coverage by 0.04%.
The diff coverage is 92.75%.


@@            Coverage Diff             @@
##             main    #2096      +/-   ##
==========================================
+ Coverage   94.39%   94.44%   +0.04%     
==========================================
  Files         321      321              
  Lines       60612    60666      +54     
==========================================
+ Hits        57213    57294      +81     
+ Misses       3399     3372      -27     
| Impacted Files | Coverage Δ |
|---|---|
| src/core/segment_reader.rs | 90.57% <0.00%> (ø) |
| src/lib.rs | 99.05% <ø> (ø) |
| src/tokenizer/mod.rs | 97.84% <ø> (ø) |
| src/tokenizer/tokenizer.rs | 95.65% <93.54%> (+2.96%) ⬆️ |
| src/directory/ram_directory.rs | 90.50% <100.00%> (ø) |
| src/indexer/segment_updater.rs | 95.44% <100.00%> (ø) |
| src/postings/serializer.rs | 98.95% <100.00%> (ø) |
| tokenizer-api/src/lib.rs | 100.00% <100.00%> (+18.75%) ⬆️ |

... and 9 files with indirect coverage changes


@trinity-1686a trinity-1686a left a comment


You can do this to make fewer things public (and also re-export BoxTokenFilter so it's easier to build):

diff --git a/src/tokenizer/mod.rs b/src/tokenizer/mod.rs
index dbc3dd867..046e497fb 100644
--- a/src/tokenizer/mod.rs
+++ b/src/tokenizer/mod.rs
@@ -154,7 +154,7 @@ pub use self::split_compound_words::SplitCompoundWords;
 pub use self::stemmer::{Language, Stemmer};
 pub use self::stop_word_filter::StopWordFilter;
 pub use self::tokenized_string::{PreTokenizedStream, PreTokenizedString};
-pub use self::tokenizer::TextAnalyzer;
+pub use self::tokenizer::{TextAnalyzer, BoxTokenFilter};
 pub use self::tokenizer_manager::TokenizerManager;
 pub use self::whitespace_tokenizer::WhitespaceTokenizer;
 
diff --git a/src/tokenizer/tokenizer.rs b/src/tokenizer/tokenizer.rs
index 011978fb2..f04a798ed 100644
--- a/src/tokenizer/tokenizer.rs
+++ b/src/tokenizer/tokenizer.rs
@@ -1,5 +1,3 @@
-use std::ops::Deref;
-
 /// The tokenizer module contains all of the tools used to process
 /// text in `tantivy`.
 use tokenizer_api::{BoxTokenStream, TokenFilter, TokenStream, Tokenizer};
@@ -12,7 +10,7 @@ pub struct TextAnalyzer {
 }
 
 /// A boxable `Tokenizer`, with its `TokenStream` type erased.
-pub trait BoxableTokenizer: 'static + Send + Sync {
+trait BoxableTokenizer: 'static + Send + Sync {
     /// Creates a boxed token stream for a given `str`.
     fn box_token_stream<'a>(&'a mut self, text: &'a str) -> BoxTokenStream<'a>;
     /// Clone this tokenizer.
@@ -45,7 +43,7 @@ impl Tokenizer for BoxedTokenizer {
 }
 
 /// Trait for the pluggable components of `Tokenizer`s.
-pub trait BoxableTokenFilter: 'static + Send + Sync {
+trait BoxableTokenFilter: 'static + Send + Sync {
     /// Wraps a Tokenizer and returns a new one.
     fn box_transform(&self, tokenizer: BoxedTokenizer) -> Box<dyn BoxableTokenizer>;
 }
@@ -59,14 +57,6 @@ impl<T: TokenFilter> BoxableTokenFilter for T {
 
 pub struct BoxTokenFilter(Box<dyn BoxableTokenFilter>);
 
-impl Deref for BoxTokenFilter {
-    type Target = dyn BoxableTokenFilter;
-
-    fn deref(&self) -> &dyn BoxableTokenFilter {
-        &*self.0
-    }
-}
-
 impl<T: TokenFilter> From<T> for BoxTokenFilter {
     fn from(tokenizer: T) -> BoxTokenFilter {
         BoxTokenFilter(Box::new(tokenizer))
@@ -86,7 +76,7 @@ impl TextAnalyzer {
     ) -> TextAnalyzer {
         let mut boxed_tokenizer = BoxedTokenizer(Box::new(tokenizer));
         for filter in boxed_token_filters.into_iter() {
-            let filtered_boxed_tokenizer = filter.box_transform(boxed_tokenizer);
+            let filtered_boxed_tokenizer = filter.0.box_transform(boxed_tokenizer);
             boxed_tokenizer = BoxedTokenizer(filtered_boxed_tokenizer);
         }
         TextAnalyzer {

Comment on lines 79 to 82
/// When creating a `TextAnalyzer` from a `Tokenizer` alone, prefer using
/// `TextAnalyzer::from(tokenizer)`.
/// When creating a `TextAnalyzer` from a `Tokenizer` and a static set of `TokenFilter`,
/// prefer using `TextAnalyzer::builder(tokenizer).filter(token_filter).build()`.

You should explain why it is preferred (i.e. fewer Boxes, better tokenization performance).
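To make the cost concrete, here is a minimal standalone sketch (toy `Filter` trait, not tantivy's actual API) of the boxed composition path: each filter in the `Vec` is a separate heap allocation, and every application goes through a vtable, which is what the static builder avoids.

```rust
// Hypothetical toy trait: a "filter" maps a token, possibly dropping it.
trait Filter {
    fn apply(&self, token: &str) -> Option<String>;
}

struct LowerCase;
impl Filter for LowerCase {
    fn apply(&self, t: &str) -> Option<String> {
        Some(t.to_lowercase())
    }
}

struct MaxLen(usize);
impl Filter for MaxLen {
    fn apply(&self, t: &str) -> Option<String> {
        // Drop tokens longer than the limit, like RemoveLongFilter would.
        if t.len() <= self.0 { Some(t.to_string()) } else { None }
    }
}

// Runtime-composed chain: flexible (filters chosen at runtime), but each
// filter is its own Box and each call is a dynamic dispatch.
struct BoxedChain(Vec<Box<dyn Filter>>);

impl BoxedChain {
    fn run(&self, token: &str) -> Option<String> {
        self.0.iter().try_fold(token.to_string(), |t, f| f.apply(&t))
    }
}

fn main() {
    let chain = BoxedChain(vec![Box::new(LowerCase), Box::new(MaxLen(6))]);
    assert_eq!(chain.run("Hello").as_deref(), Some("hello"));
    assert_eq!(chain.run("Unreasonable"), None);
}
```

A generic builder instead bakes the whole chain into one concrete type, so the compiler can inline every filter call.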

@guilload
Member

We could remove the intermediate boxes between filters using the same technique as tower does. See builder (ServiceBuilder) and stacks (Stack).

We could also use dyn_clone to reduce the amount of boilerplate to implement Clone for trait objects. It has zero dependencies.
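For context, the manual boilerplate that dyn_clone would remove looks roughly like this (a standalone sketch with an invented `BoxableFilter` trait, not the PR's actual types): every clonable trait object needs a `clone_box` method plus a `Clone` impl for the boxed form.

```rust
// Without dyn_clone: hand-rolled clone support for a trait object.
trait BoxableFilter: Send + Sync {
    fn name(&self) -> String;
    // Each implementor must return a boxed copy of itself.
    fn clone_box(&self) -> Box<dyn BoxableFilter>;
}

// The boilerplate dyn_clone generates for you.
impl Clone for Box<dyn BoxableFilter> {
    fn clone(&self) -> Self {
        self.clone_box()
    }
}

#[derive(Clone)]
struct Stemmer;

impl BoxableFilter for Stemmer {
    fn name(&self) -> String {
        "stemmer".to_string()
    }
    fn clone_box(&self) -> Box<dyn BoxableFilter> {
        Box::new(self.clone())
    }
}

fn main() {
    let a: Box<dyn BoxableFilter> = Box::new(Stemmer);
    let b = a.clone(); // dispatches through clone_box
    assert_eq!(b.name(), "stemmer");
}
```

With dyn_clone this collapses to declaring the supertrait (`trait BoxableFilter: dyn_clone::DynClone + ...`) and invoking `dyn_clone::clone_trait_object!(BoxableFilter);`.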

@adamreichold
Collaborator

> We could remove the intermediate boxes between filters using the same technique as tower does. See builder (ServiceBuilder) and stacks (Stack).

Isn't that already implemented via TextAnalyzer::builder?

@guilload
Member

Yes, but that's for non-boxed token filters. Actually, I don't know if we can avoid boxing the intermediate tokenizers; this is more complex than the tower situation because of the returned associated types and generic associated types.
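The tower-style stacking mentioned above composes layers at the type level, so the fully built chain is one concrete type with no boxing. A minimal sketch under a deliberately simplified `Layer` trait (not tower's real `Service`/`Layer` signatures, and without the associated types that make tantivy's case harder):

```rust
// A "layer" wraps one value in another, tower-style.
trait Layer<T> {
    type Output;
    fn layer(&self, inner: T) -> Self::Output;
}

// Stack composes two layers; a builder pushes layers by nesting Stacks,
// so the whole chain is known at compile time and nothing is boxed.
struct Stack<A, B>(A, B);

impl<T, A: Layer<T>, B: Layer<A::Output>> Layer<T> for Stack<A, B> {
    type Output = B::Output;
    fn layer(&self, inner: T) -> Self::Output {
        self.1.layer(self.0.layer(inner))
    }
}

// Two toy layers operating on a plain String.
struct Lower;
impl Layer<String> for Lower {
    type Output = String;
    fn layer(&self, s: String) -> String {
        s.to_lowercase()
    }
}

struct Trim;
impl Layer<String> for Trim {
    type Output = String;
    fn layer(&self, s: String) -> String {
        s.trim().to_string()
    }
}

fn main() {
    // Stack<Lower, Trim> is a single concrete type: zero allocations,
    // every call statically dispatched.
    let stack = Stack(Lower, Trim);
    assert_eq!(stack.layer("  HeLLo  ".to_string()), "hello");
}
```

The difficulty in tantivy is that `Tokenizer::token_stream` returns a lifetime-generic associated type, which this toy trait sidesteps entirely.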

@fmassot fmassot force-pushed the fmassot/add-box-tokenizer-and-token-filter branch 3 times, most recently from 385c2a8 to a97c0b8 Compare June 22, 2023 07:06
@fmassot fmassot force-pushed the fmassot/add-box-tokenizer-and-token-filter branch from a97c0b8 to b82cd08 Compare June 22, 2023 07:13

fmassot commented Jun 22, 2023

@guilload I introduced dyn_clone.

As for the boxes... the first annoying part is getting rid of the GATs. Let's discuss it offline if you see better options.

@fmassot fmassot marked this pull request as ready for review June 22, 2023 07:16
@@ -63,6 +63,12 @@ pub trait Tokenizer: 'static + Clone + Send + Sync {
/// Simple wrapper of `Box<dyn TokenStream + 'a>`.
pub struct BoxTokenStream<'a>(Box<dyn TokenStream + 'a>);

@PSeitz PSeitz Jun 23, 2023


This should probably not be a newtype

diff --git a/src/indexer/segment_writer.rs b/src/indexer/segment_writer.rs
index bad8638e5..7fa58e18a 100644
--- a/src/indexer/segment_writer.rs
+++ b/src/indexer/segment_writer.rs
@@ -209,7 +209,7 @@ impl SegmentWriter {
                     for value in values {
                         let mut token_stream = match value {
                             Value::PreTokStr(tok_str) => {
-                                PreTokenizedStream::from(tok_str.clone()).into()
+                                Box::new(PreTokenizedStream::from(tok_str.clone()))
                             }
                             Value::Str(ref text) => {
                                 let text_analyzer =
diff --git a/src/query/more_like_this/more_like_this.rs b/src/query/more_like_this/more_like_this.rs
index 994dd96c0..ca86c70b1 100644
--- a/src/query/more_like_this/more_like_this.rs
+++ b/src/query/more_like_this/more_like_this.rs
@@ -4,9 +4,7 @@ use std::collections::{BinaryHeap, HashMap};
 use crate::query::bm25::idf;
 use crate::query::{BooleanQuery, BoostQuery, Occur, Query, TermQuery};
 use crate::schema::{Field, FieldType, IndexRecordOption, Term, Value};
-use crate::tokenizer::{
-    BoxTokenStream, FacetTokenizer, PreTokenizedStream, TokenStream, Tokenizer,
-};
+use crate::tokenizer::{FacetTokenizer, PreTokenizedStream, TokenStream, Tokenizer};
 use crate::{DocAddress, Result, Searcher, TantivyError};
 
 #[derive(Debug, PartialEq)]
@@ -206,8 +204,7 @@ impl MoreLikeThis {
                 for value in values {
                     match value {
                         Value::PreTokStr(tok_str) => {
-                            let mut token_stream: BoxTokenStream =
-                                PreTokenizedStream::from(tok_str.clone()).into();
+                            let mut token_stream = PreTokenizedStream::from(tok_str.clone());
                             token_stream.process(&mut |token| {
                                 if !self.is_noise_word(token.text.clone()) {
                                     let term = Term::from_field_text(field, &token.text);
diff --git a/src/tokenizer/tokenizer.rs b/src/tokenizer/tokenizer.rs
index 29a11b396..d0e3921f7 100644
--- a/src/tokenizer/tokenizer.rs
+++ b/src/tokenizer/tokenizer.rs
@@ -19,7 +19,7 @@ trait BoxableTokenizer: 'static + Send + Sync + DynClone {
 
 impl<T: Tokenizer> BoxableTokenizer for T {
     fn box_token_stream<'a>(&'a mut self, text: &'a str) -> BoxTokenStream<'a> {
-        self.token_stream(text).into()
+        Box::new(self.token_stream(text))
     }
 }
 
diff --git a/tokenizer-api/src/lib.rs b/tokenizer-api/src/lib.rs
index eca0d9566..d0cac2b04 100644
--- a/tokenizer-api/src/lib.rs
+++ b/tokenizer-api/src/lib.rs
@@ -6,7 +6,6 @@
 //! Checkout the [tantivy repo](https://github.com/quickwit-oss/tantivy/tree/main/src/tokenizer) for some examples.
 
 use std::borrow::{Borrow, BorrowMut};
-use std::ops::{Deref, DerefMut};
 
 use serde::{Deserialize, Serialize};
 
@@ -61,35 +60,7 @@ pub trait Tokenizer: 'static + Clone + Send + Sync {
 }
 
 /// Simple wrapper of `Box<dyn TokenStream + 'a>`.
-pub struct BoxTokenStream<'a>(Box<dyn TokenStream + 'a>);
-
-impl<'a> From<BoxTokenStream<'a>> for Box<dyn TokenStream + 'a> {
-    fn from(token_stream: BoxTokenStream<'a>) -> Self {
-        token_stream.0
-    }
-}
-
-impl<'a, T> From<T> for BoxTokenStream<'a>
-where T: TokenStream + 'a
-{
-    fn from(token_stream: T) -> BoxTokenStream<'a> {
-        BoxTokenStream(Box::new(token_stream))
-    }
-}
-
-impl<'a> Deref for BoxTokenStream<'a> {
-    type Target = dyn TokenStream + 'a;
-
-    fn deref(&self) -> &Self::Target {
-        &*self.0
-    }
-}
-
-impl<'a> DerefMut for BoxTokenStream<'a> {
-    fn deref_mut(&mut self) -> &mut Self::Target {
-        &mut *self.0
-    }
-}
+pub type BoxTokenStream<'a> = Box<dyn TokenStream + 'a>;
 
 impl<'a> TokenStream for Box<dyn TokenStream + 'a> {
     fn advance(&mut self) -> bool {

Contributor Author

I would be happy to remove it; it adds a bunch of code without really providing any gains.

Contributor

Let's remove it, and have a separate hotfix that just fixes the visibility.

Contributor Author

I removed it.

> have a separate hotfix that just fixes the visibility

Do you mean this one? main...addconversion


fmassot commented Jun 23, 2023

I really don't like this version. I'm working on a better implementation.


fmassot commented Jun 26, 2023

Closing in favor of #2101

@fmassot fmassot closed this Jun 26, 2023