
Change TokenFilter trait to simplify (a bit) the boxing of filters #2101

Closed
wants to merge 8 commits

Conversation

@fmassot (Contributor) commented Jun 25, 2023

I prefer this PR to #2096.

The main difference is that I changed the TokenFilter trait to make it independent of the Tokenizer trait: a TokenFilter should be defined only by its input token stream and its output token stream.

The main benefit over #2096 is that it avoids creating a BoxTokenizer and implementing the Tokenizer trait on it. That does not seem like a substantial benefit, but it is cleaner. For example, in #2096 I tried to bypass BoxTokenizer and implement the Tokenizer trait directly on Box<dyn BoxableTokenizer>, but I got a stack overflow because box_token_stream ended up calling itself recursively, and the compiler did not warn me about it.
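To make the idea concrete, here is a minimal, self-contained sketch of that trait shape: a TokenFilter defined only by its input and output token streams, with the output stream expressed as a generic associated type. All names (Token, TokenStream, LowerCaser, VecTokenStream) are illustrative stand-ins, not tantivy's actual definitions.

```rust
// Illustrative sketch only, NOT tantivy's real API: a TokenFilter trait
// that knows nothing about any Tokenizer trait.

pub struct Token {
    pub text: String,
}

pub trait TokenStream {
    fn advance(&mut self) -> bool;
    fn token(&self) -> &Token;
}

pub trait TokenFilter: Clone {
    // The output stream type depends on the input stream type (a GAT).
    type OutputTokenStream<T: TokenStream>: TokenStream;
    fn filter<T: TokenStream>(&self, token_stream: T) -> Self::OutputTokenStream<T>;
}

// A toy lowercasing filter implementing the trait.
#[derive(Clone)]
pub struct LowerCaser;

pub struct LowerCaserTokenStream<T> {
    tail: T,
    token: Token,
}

impl<T: TokenStream> TokenStream for LowerCaserTokenStream<T> {
    fn advance(&mut self) -> bool {
        if !self.tail.advance() {
            return false;
        }
        self.token.text = self.tail.token().text.to_lowercase();
        true
    }

    fn token(&self) -> &Token {
        &self.token
    }
}

impl TokenFilter for LowerCaser {
    type OutputTokenStream<T: TokenStream> = LowerCaserTokenStream<T>;

    fn filter<T: TokenStream>(&self, token_stream: T) -> Self::OutputTokenStream<T> {
        LowerCaserTokenStream {
            tail: token_stream,
            token: Token { text: String::new() },
        }
    }
}

// A trivial stream over a fixed vector, for demonstration.
pub struct VecTokenStream {
    tokens: Vec<Token>,
    idx: usize,
}

impl TokenStream for VecTokenStream {
    fn advance(&mut self) -> bool {
        self.idx += 1;
        self.idx <= self.tokens.len()
    }

    fn token(&self) -> &Token {
        &self.tokens[self.idx - 1]
    }
}

fn main() {
    let stream = VecTokenStream {
        tokens: vec![
            Token { text: "Hello".to_string() },
            Token { text: "WORLD".to_string() },
        ],
        idx: 0,
    };
    let mut filtered = LowerCaser.filter(stream);
    while filtered.advance() {
        println!("{}", filtered.token().text);
    }
}
```

Because the filter chain is composed through a statically known OutputTokenStream type, a pipeline built this way stays monomorphized and avoids virtual dispatch unless you explicitly box it.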

Bench results:

cargo bench --bench analyzer
    Finished bench [optimized + debuginfo] target(s) in 0.12s
warning: the following packages contain code that will be rejected by a future version of Rust: quick-xml v0.22.0
note: to see what the problems were, use the option `--future-incompat-report`, or run `cargo report future-incompatibilities --id 1`
     Running benches/analyzer.rs (target/release/deps/analyzer-2a32537b8a3b3efa)
Gnuplot not found, using plotters backend
default-tokenize-alice  time:   [874.40 µs 875.34 µs 876.63 µs]
                        change: [-2.0579% -1.8172% -1.5706%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 200 measurements (8.00%)
  8 (4.00%) high mild
  8 (4.00%) high severe

default-dynamic-tokenize-alice
                        time:   [1.1158 ms 1.1165 ms 1.1173 ms]
                        change: [-1.9267% -1.6803% -1.4354%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 200 measurements (4.00%)
  2 (1.00%) high mild
  6 (3.00%) high severe

I also ran the benchmark on tantivy 0.19, before the introduction of GATs, when boxed token streams were used everywhere.

default-tokenize-alice  time:   [1.0371 ms 1.0376 ms 1.0383 ms]

So unfortunately we have a performance regression on the boxed token filters. I tried to change the code to be equivalent to what we had in 0.19, but the bench results did not change.

@fmassot force-pushed the fmassot/add-box-token-filter-and-refactor branch 3 times, most recently from a80bc6b to 5508fa4, June 25, 2023 10:54
@fmassot force-pushed the fmassot/add-box-token-filter-and-refactor branch from 5508fa4 to 2cab111, June 25, 2023 11:21
@codecov-commenter commented:

Codecov Report

Merging #2101 (f6a6b4a) into main (3b0cbf8) will increase coverage by 0.03%.
The diff coverage is 97.02%.


@@            Coverage Diff             @@
##             main    #2101      +/-   ##
==========================================
+ Coverage   94.39%   94.42%   +0.03%     
==========================================
  Files         321      321              
  Lines       60612    60612              
==========================================
+ Hits        57213    57235      +22     
+ Misses       3399     3377      -22     
Impacted Files Coverage Δ
src/core/segment_reader.rs 90.57% <0.00%> (ø)
src/lib.rs 99.05% <ø> (ø)
src/query/more_like_this/more_like_this.rs 65.91% <0.00%> (+0.24%) ⬆️
src/tokenizer/mod.rs 97.84% <ø> (ø)
src/tokenizer/tokenizer.rs 95.55% <98.38%> (+2.87%) ⬆️
src/directory/ram_directory.rs 90.50% <100.00%> (ø)
src/indexer/segment_updater.rs 95.31% <100.00%> (-0.14%) ⬇️
src/indexer/segment_writer.rs 97.87% <100.00%> (ø)
src/postings/serializer.rs 98.95% <100.00%> (ø)
src/tokenizer/alphanum_only.rs 98.43% <100.00%> (+5.58%) ⬆️
... and 7 more

... and 3 files with indirect coverage changes

@fmassot (Contributor, Author) commented Jun 27, 2023

HDFS benchmark

tantivy 0.19:

index-hdfs/index-hdfs-no-commit
                        time:   [329.88 ms 332.48 ms 335.11 ms]
                        thrpt:  [63.766 MiB/s 64.272 MiB/s 64.777 MiB/s]
index-hdfs/index-hdfs-with-commit
                        time:   [473.94 ms 478.03 ms 482.20 ms]
                        thrpt:  [44.315 MiB/s 44.703 MiB/s 45.088 MiB/s]

tantivy main branch, with the default tokenizer built from boxed token filters (mostly equivalent to what was in tantivy 0.19):

index-hdfs/index-hdfs-no-commit
                        time:   [286.20 ms 288.50 ms 290.75 ms]
                        thrpt:  [73.495 MiB/s 74.068 MiB/s 74.665 MiB/s]
index-hdfs/index-hdfs-with-commit
                        time:   [419.36 ms 422.07 ms 424.78 ms]
                        thrpt:  [50.306 MiB/s 50.629 MiB/s 50.956 MiB/s]

tantivy main branch, with the default tokenizer:

index-hdfs/index-hdfs-no-commit
                        time:   [221.96 ms 224.18 ms 226.59 ms]
                        thrpt:  [94.309 MiB/s 95.320 MiB/s 96.276 MiB/s]
index-hdfs/index-hdfs-with-commit
                        time:   [352.42 ms 354.94 ms 357.60 ms]
                        thrpt:  [59.756 MiB/s 60.205 MiB/s 60.635 MiB/s]

@fulmicoton (Collaborator) commented:

If I understand the PR correctly, this attempts to get the best of both worlds: the TextAnalyzerBuilder makes it possible to build a statically linked chain of tokenizers, but it also makes it possible to append a list of dynamic boxed filters. Correct?

You mentioned you are seeing a regression; how do we explain it?

Review comment on the diff:

    TextAnalyzer {
    -    tokenizer: self.tokenizer.box_clone(),
    +    tokenizer: Box::new(tokenizer),
@fulmicoton (Collaborator) commented Jun 29, 2023:

I would have preferred not to have this public constructor, and instead to have an extra TextAnalyzerBuilder::build_with_dynamic_token_filters(token_filters: Vec<BoxTokenFilter>), with some documentation explaining the purpose of the two ways to add filters.
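As a toy, self-contained sketch of what such a builder entry point could look like: filters are collected by the builder, and the final build step appends runtime-chosen boxed filters on top. The types here (a filter as a boxed string-transforming closure) are deliberately simplified placeholders, not tantivy's real TextAnalyzer or BoxTokenFilter.

```rust
// Toy sketch only, NOT tantivy's real types: a builder that accepts both
// statically known filters and dynamically chosen boxed filters at build time.

type BoxTokenFilter = Box<dyn Fn(&str) -> String>;

struct TextAnalyzer {
    filters: Vec<BoxTokenFilter>,
}

struct TextAnalyzerBuilder {
    filters: Vec<BoxTokenFilter>,
}

impl TextAnalyzerBuilder {
    fn new() -> Self {
        Self { filters: Vec::new() }
    }

    // Filter known at compile time; monomorphized at the call site.
    fn filter<F: Fn(&str) -> String + 'static>(mut self, f: F) -> Self {
        self.filters.push(Box::new(f));
        self
    }

    // The suggested method: finalize the analyzer, appending a list of
    // filters chosen at runtime.
    fn build_with_dynamic_token_filters(
        mut self,
        token_filters: Vec<BoxTokenFilter>,
    ) -> TextAnalyzer {
        self.filters.extend(token_filters);
        TextAnalyzer { filters: self.filters }
    }
}

impl TextAnalyzer {
    fn analyze(&self, text: &str) -> String {
        // Apply each filter in order.
        self.filters
            .iter()
            .fold(text.to_string(), |acc, f| f(&acc))
    }
}

fn main() {
    // One static filter, one dynamic filter appended at build time.
    let dynamic: Vec<BoxTokenFilter> = vec![Box::new(|s: &str| s.replace("é", "e"))];
    let analyzer = TextAnalyzerBuilder::new()
        .filter(|s: &str| s.to_lowercase())
        .build_with_dynamic_token_filters(dynamic);
    assert_eq!(analyzer.analyze("CAFÉ"), "cafe");
}
```

Keeping the dynamic list as an explicit build-time argument makes the boxed path visible at the one place it happens, rather than scattering public boxing constructors across the API.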

Review comment on the diff:

        tail: T,
    }

    -impl<'a, T: TokenStream> TokenStream for AsciiFoldingFilterTokenStream<'a, T> {
    +impl<'a, T: TokenStream> TokenStream for AsciiFoldingFilterTokenStream<T> {
Collaborator: clippy...

Review comment on the diff:

    fn filter<T: TokenStream>(&self, token_stream: T) -> Self::OutputTokenStream<T> {
        LowerCaserTokenStream {
            tail: token_stream,
            buffer: String::new(),
Collaborator: You are reverting here the work done by Pascal to reduce the number of allocations we have. Is this necessary?

Collaborator: @PSeitz can you confirm?

Contributor (Author): Yes, indeed :/

On each call to tokenizer.token_stream, filter will be called and will thus do a String::new(). That was not the case before.

Contributor: I'd prefer if we could keep an instance with a buffer in the pipeline, like before.

@fmassot (Contributor, Author) commented Jun 29, 2023:

Yes, of course. I should have fixed it by adding the allocation at the TokenFilter level.
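The buffer-reuse idea under discussion can be sketched in a minimal, self-contained form: the scratch String lives in a long-lived struct and is cleared and reused on each call, instead of a fresh String::new() per call. The names here (LowerCaser, apply) are illustrative, not tantivy's actual code.

```rust
// Minimal sketch of buffer reuse, NOT tantivy's real implementation:
// the allocation happens once, at construction time, and subsequent
// calls reuse the buffer's capacity.

struct LowerCaser {
    buffer: String, // allocated once, when the pipeline is built
}

impl LowerCaser {
    fn new() -> Self {
        Self { buffer: String::new() }
    }

    /// Lowercases `text` into the reused buffer and returns a view of it.
    fn apply(&mut self, text: &str) -> &str {
        self.buffer.clear(); // keeps the existing capacity
        for c in text.chars() {
            self.buffer.extend(c.to_lowercase());
        }
        &self.buffer
    }
}

fn main() {
    let mut lower = LowerCaser::new();
    // Repeated calls reuse the same allocation once capacity has grown.
    assert_eq!(lower.apply("HELLO"), "hello");
    assert_eq!(lower.apply("WORLD"), "world");
}
```

Whether the buffer sits in the filter, the token stream, or a tokenizer-level wrapper is a design choice; the key point is that it must outlive individual token_stream() calls.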

@fmassot (Contributor, Author) commented Jun 30, 2023

closing in favor of #2110

@fmassot closed this Jun 30, 2023