
Change TokenFilter trait to simplify (a bit) the boxing of filters #2101

Closed
wants to merge 8 commits

Conversation

@fmassot (Contributor) commented Jun 25, 2023

I prefer this PR to #2096.

The main difference is that I changed the TokenFilter trait to make it independent of the Tokenizer trait: a TokenFilter should be defined only by its input token stream and its output token stream.

The main benefit over #2096 is that it avoids creating a BoxTokenizer and implementing the Tokenizer trait on it. That does not seem like a substantial benefit, but it is cleaner. For example, in #2096 I tried to bypass BoxTokenizer and implement the Tokenizer trait directly on Box<dyn BoxableTokenizer>, but I got a stack overflow because box_token_stream ended up calling itself recursively, and the compiler did not warn me about it.
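To make the idea concrete, here is a minimal, self-contained sketch of that trait shape: a TokenFilter defined only by its input and output token streams, with the output stream expressed as a generic associated type. All names (Token, TokenStream, LowerCaser, VecTokenStream) are illustrative stand-ins, not tantivy's actual definitions.

```rust
// Illustrative sketch only, NOT tantivy's real API: a TokenFilter trait
// that knows nothing about any Tokenizer trait.

pub struct Token {
    pub text: String,
}

pub trait TokenStream {
    fn advance(&mut self) -> bool;
    fn token(&self) -> &Token;
}

pub trait TokenFilter: Clone {
    // The output stream type depends on the input stream type (a GAT).
    type OutputTokenStream<T: TokenStream>: TokenStream;
    fn filter<T: TokenStream>(&self, token_stream: T) -> Self::OutputTokenStream<T>;
}

// A toy lowercasing filter implementing the trait.
#[derive(Clone)]
pub struct LowerCaser;

pub struct LowerCaserTokenStream<T> {
    tail: T,
    token: Token,
}

impl<T: TokenStream> TokenStream for LowerCaserTokenStream<T> {
    fn advance(&mut self) -> bool {
        if !self.tail.advance() {
            return false;
        }
        self.token.text = self.tail.token().text.to_lowercase();
        true
    }

    fn token(&self) -> &Token {
        &self.token
    }
}

impl TokenFilter for LowerCaser {
    type OutputTokenStream<T: TokenStream> = LowerCaserTokenStream<T>;

    fn filter<T: TokenStream>(&self, token_stream: T) -> Self::OutputTokenStream<T> {
        LowerCaserTokenStream {
            tail: token_stream,
            token: Token { text: String::new() },
        }
    }
}

// A trivial stream over a fixed vector, for demonstration.
pub struct VecTokenStream {
    tokens: Vec<Token>,
    idx: usize,
}

impl TokenStream for VecTokenStream {
    fn advance(&mut self) -> bool {
        self.idx += 1;
        self.idx <= self.tokens.len()
    }

    fn token(&self) -> &Token {
        &self.tokens[self.idx - 1]
    }
}

fn main() {
    let stream = VecTokenStream {
        tokens: vec![
            Token { text: "Hello".to_string() },
            Token { text: "WORLD".to_string() },
        ],
        idx: 0,
    };
    let mut filtered = LowerCaser.filter(stream);
    while filtered.advance() {
        println!("{}", filtered.token().text);
    }
}
```

Because the filter chain is composed through a statically known OutputTokenStream type, a pipeline built this way stays monomorphized and avoids virtual dispatch unless you explicitly box it.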

Bench results:

cargo bench --bench analyzer
    Finished bench [optimized + debuginfo] target(s) in 0.12s
warning: the following packages contain code that will be rejected by a future version of Rust: quick-xml v0.22.0
note: to see what the problems were, use the option `--future-incompat-report`, or run `cargo report future-incompatibilities --id 1`
     Running benches/analyzer.rs (target/release/deps/analyzer-2a32537b8a3b3efa)
Gnuplot not found, using plotters backend
default-tokenize-alice  time:   [874.40 µs 875.34 µs 876.63 µs]
                        change: [-2.0579% -1.8172% -1.5706%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 200 measurements (8.00%)
  8 (4.00%) high mild
  8 (4.00%) high severe

default-dynamic-tokenize-alice
                        time:   [1.1158 ms 1.1165 ms 1.1173 ms]
                        change: [-1.9267% -1.6803% -1.4354%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 200 measurements (4.00%)
  2 (1.00%) high mild
  6 (3.00%) high severe

I also ran the benchmark on tantivy 0.19, before the introduction of GATs, when boxed token streams were used everywhere.

default-tokenize-alice  time:   [1.0371 ms 1.0376 ms 1.0383 ms]

So unfortunately we have a performance regression on the boxed token filters. I tried to change the code to be equivalent to what we had in 0.19, but the bench results did not change.

@fmassot force-pushed the fmassot/add-box-token-filter-and-refactor branch 3 times, most recently from a80bc6b to 5508fa4, June 25, 2023 10:54
@fmassot force-pushed the fmassot/add-box-token-filter-and-refactor branch from 5508fa4 to 2cab111, June 25, 2023 11:21
@codecov-commenter commented:

Codecov Report

Merging #2101 (f6a6b4a) into main (3b0cbf8) will increase coverage by 0.03%.
The diff coverage is 97.02%.


@@            Coverage Diff             @@
##             main    #2101      +/-   ##
==========================================
+ Coverage   94.39%   94.42%   +0.03%     
==========================================
  Files         321      321              
  Lines       60612    60612              
==========================================
+ Hits        57213    57235      +22     
+ Misses       3399     3377      -22     
Impacted Files Coverage Δ
src/core/segment_reader.rs 90.57% <0.00%> (ø)
src/lib.rs 99.05% <ø> (ø)
src/query/more_like_this/more_like_this.rs 65.91% <0.00%> (+0.24%) ⬆️
src/tokenizer/mod.rs 97.84% <ø> (ø)
src/tokenizer/tokenizer.rs 95.55% <98.38%> (+2.87%) ⬆️
src/directory/ram_directory.rs 90.50% <100.00%> (ø)
src/indexer/segment_updater.rs 95.31% <100.00%> (-0.14%) ⬇️
src/indexer/segment_writer.rs 97.87% <100.00%> (ø)
src/postings/serializer.rs 98.95% <100.00%> (ø)
src/tokenizer/alphanum_only.rs 98.43% <100.00%> (+5.58%) ⬆️
... and 7 more

... and 3 files with indirect coverage changes

@fmassot (Contributor, Author) commented Jun 27, 2023

HDFS benchmark

tantivy 0.19:

index-hdfs/index-hdfs-no-commit
                        time:   [329.88 ms 332.48 ms 335.11 ms]
                        thrpt:  [63.766 MiB/s 64.272 MiB/s 64.777 MiB/s]
index-hdfs/index-hdfs-with-commit
                        time:   [473.94 ms 478.03 ms 482.20 ms]
                        thrpt:  [44.315 MiB/s 44.703 MiB/s 45.088 MiB/s]

tantivy main branch, with the default tokenizer built from boxed token filters (mostly equivalent to what was in tantivy 0.19):

index-hdfs/index-hdfs-no-commit
                        time:   [286.20 ms 288.50 ms 290.75 ms]
                        thrpt:  [73.495 MiB/s 74.068 MiB/s 74.665 MiB/s]
index-hdfs/index-hdfs-with-commit
                        time:   [419.36 ms 422.07 ms 424.78 ms]
                        thrpt:  [50.306 MiB/s 50.629 MiB/s 50.956 MiB/s]

tantivy main branch, with the default tokenizer:

index-hdfs/index-hdfs-no-commit
                        time:   [221.96 ms 224.18 ms 226.59 ms]
                        thrpt:  [94.309 MiB/s 95.320 MiB/s 96.276 MiB/s]
index-hdfs/index-hdfs-with-commit
                        time:   [352.42 ms 354.94 ms 357.60 ms]
                        thrpt:  [59.756 MiB/s 60.205 MiB/s 60.635 MiB/s]

@fulmicoton (Collaborator) commented:

If I understand the PR correctly, this attempts to get the best of both worlds: the TextAnalyzerBuilder makes it possible to build a statically linked chain of tokenizers, but it also makes it possible to append a list of dynamic boxed filters. Correct?

You mentioned you are seeing a regression; how do we explain it?

Review comment on the diff:

    TextAnalyzer {
    -    tokenizer: self.tokenizer.box_clone(),
    +    tokenizer: Box::new(tokenizer),
@fulmicoton (Collaborator) commented Jun 29, 2023:

I would have preferred not to have this public constructor, and instead to have an extra TextAnalyzerBuilder::build_with_dynamic_token_filters(token_filters: Vec<BoxTokenFilter>), with some documentation explaining the purpose of the two ways to add filters.
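As a toy, self-contained sketch of what such a builder entry point could look like: filters are collected by the builder, and the final build step appends runtime-chosen boxed filters on top. The types here (a filter as a boxed string-transforming closure) are deliberately simplified placeholders, not tantivy's real TextAnalyzer or BoxTokenFilter.

```rust
// Toy sketch only, NOT tantivy's real types: a builder that accepts both
// statically known filters and dynamically chosen boxed filters at build time.

type BoxTokenFilter = Box<dyn Fn(&str) -> String>;

struct TextAnalyzer {
    filters: Vec<BoxTokenFilter>,
}

struct TextAnalyzerBuilder {
    filters: Vec<BoxTokenFilter>,
}

impl TextAnalyzerBuilder {
    fn new() -> Self {
        Self { filters: Vec::new() }
    }

    // Filter known at compile time; monomorphized at the call site.
    fn filter<F: Fn(&str) -> String + 'static>(mut self, f: F) -> Self {
        self.filters.push(Box::new(f));
        self
    }

    // The suggested method: finalize the analyzer, appending a list of
    // filters chosen at runtime.
    fn build_with_dynamic_token_filters(
        mut self,
        token_filters: Vec<BoxTokenFilter>,
    ) -> TextAnalyzer {
        self.filters.extend(token_filters);
        TextAnalyzer { filters: self.filters }
    }
}

impl TextAnalyzer {
    fn analyze(&self, text: &str) -> String {
        // Apply each filter in order.
        self.filters
            .iter()
            .fold(text.to_string(), |acc, f| f(&acc))
    }
}

fn main() {
    // One static filter, one dynamic filter appended at build time.
    let dynamic: Vec<BoxTokenFilter> = vec![Box::new(|s: &str| s.replace("é", "e"))];
    let analyzer = TextAnalyzerBuilder::new()
        .filter(|s: &str| s.to_lowercase())
        .build_with_dynamic_token_filters(dynamic);
    assert_eq!(analyzer.analyze("CAFÉ"), "cafe");
}
```

Keeping the dynamic list as an explicit build-time argument makes the boxed path visible at the one place it happens, rather than scattering public boxing constructors across the API.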

Review comment on the diff:

        tail: T,
    }

    -impl<'a, T: TokenStream> TokenStream for AsciiFoldingFilterTokenStream<'a, T> {
    +impl<'a, T: TokenStream> TokenStream for AsciiFoldingFilterTokenStream<T> {
Collaborator: clippy...

Review comment on the diff:

    fn filter<T: TokenStream>(&self, token_stream: T) -> Self::OutputTokenStream<T> {
        LowerCaserTokenStream {
            tail: token_stream,
            buffer: String::new(),
Collaborator: You are reverting here the work done by Pascal to reduce the number of allocations we have. Is this necessary?

Collaborator: @PSeitz can you confirm?

Contributor (Author): Yes, indeed :/

On each call to tokenizer.token_stream, filter will be called and will thus do a String::new(). That was not the case before.

Contributor: I'd prefer if we could keep an instance with a buffer in the pipeline, like before.

@fmassot (Contributor, Author) commented Jun 29, 2023:

Yes, of course. I should have fixed it by adding the allocation at the TokenFilter level.
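The buffer-reuse idea under discussion can be sketched in a minimal, self-contained form: the scratch String lives in a long-lived struct and is cleared and reused on each call, instead of a fresh String::new() per call. The names here (LowerCaser, apply) are illustrative, not tantivy's actual code.

```rust
// Minimal sketch of buffer reuse, NOT tantivy's real implementation:
// the allocation happens once, at construction time, and subsequent
// calls reuse the buffer's capacity.

struct LowerCaser {
    buffer: String, // allocated once, when the pipeline is built
}

impl LowerCaser {
    fn new() -> Self {
        Self { buffer: String::new() }
    }

    /// Lowercases `text` into the reused buffer and returns a view of it.
    fn apply(&mut self, text: &str) -> &str {
        self.buffer.clear(); // keeps the existing capacity
        for c in text.chars() {
            self.buffer.extend(c.to_lowercase());
        }
        &self.buffer
    }
}

fn main() {
    let mut lower = LowerCaser::new();
    // Repeated calls reuse the same allocation once capacity has grown.
    assert_eq!(lower.apply("HELLO"), "hello");
    assert_eq!(lower.apply("WORLD"), "world");
}
```

Whether the buffer sits in the filter, the token stream, or a tokenizer-level wrapper is a design choice; the key point is that it must outlive individual token_stream() calls.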

@fmassot (Contributor, Author) commented Jun 30, 2023

closing in favor of #2110

@fmassot closed this Jun 30, 2023